November 2025 Digest
Oct 31, 2025 2:11 pm
Upcoming Courses
On Wednesday, December 10, I will be offering my comprehensive introduction to regression modeling at a steep discount in an effort to raise funds for World Central Kitchen and United Farm Workers. All you have to do to attend is make a donation of 50 USD to either organization and then send me a screenshot of the confirmation. Details about the course and registration process can be found on my website, https://betanalpha.github.io/courses/.
I greatly appreciate any sharing of the course details with your colleagues and communities!
Consulting and Training
Are you, or is someone you know, struggling to develop, implement, or scale a Bayesian analysis compatible with your domain expertise? If so, then I can help.
I can be reached at inquiries@symplectomorphic.com for any questions about potential consulting engagements and training courses.
Recent Writing
I wrote about my company’s logo and a recent experiment with some topologically accurate swag over on Patreon, https://www.patreon.com/posts/logo-history-and-141058753.
I also threw together a Bayesian version of Anscombe’s quartet, https://www.patreon.com/posts/anscombes-142252115.
Probabilistic Modeling Discord
I set up a Discord server dedicated to discussion about (narratively) generative modeling of all kinds. For directions on how to join, see https://www.patreon.com/posts/generative-88674175. Come join the conversation!
Support Me on Patreon
If my work has benefited your own and you have the resources, then consider supporting me on Patreon, https://www.patreon.com/c/betanalpha. Amongst other bonuses, over the past few months my supporters have enjoyed sneak previews of the material for the upcoming courses.
Recent Rants
On Pseudo-Random Number Generator Seeding
I’m seeing some misinformation about pseudo-random number generator best practices going around the internets. Let’s talk about why the pseudo-random number generator seed you use shouldn’t actually have any impact on your results and, consequently, why you can choose whatever seed you damn well please.
Now the exact sequence of values pulled from a pseudo-random number generator very much depends on the chosen seed. The subtlety, however, is that your analysis shouldn’t be all that sensitive to the exact sequence of pseudo-random values generated.
The purpose of a pseudo-random number generator is not to generate arbitrary sequences, but rather sequences with particular ensemble properties. Formally, averages over almost any* sequence should be able to approximate arbitrary expectation values to a certain degree of accuracy.
Because the generated sequences will vary with the initializing seed, so too will the precise value of any empirical average over those sequences. The variation in the empirical averages, however, will be mostly contained within those accuracy limits. In other words, if we take the numerical error into account then the empirical averages will be consistent regardless of the exact sequence we use, and hence regardless of the pseudo-random number generator seed used.
[*To be extremely pedantic, any pseudo-random number generator can generate exceptional sequences with terrible estimation properties, hence the “almost any” caveat above. For a good pseudo-random number generator, however, it is almost impossible to engineer a seed that hits one of those pathological sequences. Moreover, the probability of accidentally hitting a pathological sequence from an arbitrary seed is much smaller than the probability of so many other things that can go wrong.]
So, to summarize, if your analysis is using pseudo-random number generators to estimate expectation values (as it should be) and you are generating long enough sequences to ensure sufficiently small numerical error (as you should be), then any seed dependency will be negligible.
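To make this concrete, here is a minimal sketch in Python, using NumPy and a toy expectation value of my own choosing. Two different seeds yield slightly different Monte Carlo estimates, but both agree with the exact value once the Monte Carlo standard errors are taken into account.

    import numpy as np

    # Toy target: E[X^2] for X ~ normal(1, 2), which is exactly
    # mu^2 + sigma^2 = 1 + 4 = 5.
    exact = 5.0

    def estimate(seed, N=100_000):
        rng = np.random.default_rng(seed)
        x = rng.normal(loc=1.0, scale=2.0, size=N)
        fx = x**2
        # Monte Carlo estimator and its standard error.
        return fx.mean(), fx.std(ddof=1) / np.sqrt(N)

    for seed in [1, 8675309]:
        est, se = estimate(seed)
        print(f"seed={seed}: {est:.3f} +/- {se:.3f} (exact {exact})")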
All of this said, there are absolutely situations where we can empirically observe pseudo-random methods being fragile and very seed dependent. Is that inconsistent with what we’ve discussed so far? No! These situations all come down to bad algorithms and methods, not poor seed-selection practices.
Consider, for example, a method that consumes multiple pseudo-random number generator sequences. A naive implementation of this method might initialize multiple pseudo-random number generators from different seeds. Regardless of how different the seeds might appear to be, this can result in strong correlations across the sequences, and those correlations can introduce awkward behavior in any resulting expectation value estimates.
But the problem here isn’t the seeds; it’s initializing multiple pseudo-random number generators. The properties of the most commonly available pseudo-random number generators are guaranteed only within a single sequence and not across sequences. In order to generate multiple sequences the best practice is to generate one long sequence and then split it into subsequences. Some pseudo-random number generators even allow these subsequences to be generated in parallel.
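As one concrete interface for this, NumPy’s SeedSequence can deterministically spawn independent child streams from a single root seed; here is a minimal sketch of that pattern (the root seed value itself is arbitrary).

    import numpy as np
    from numpy.random import SeedSequence, default_rng

    # One root seed for the entire analysis...
    root = SeedSequence(20251101)

    # ...spawned into statistically independent child streams, instead of
    # initializing separate generators from ad hoc, "different-looking" seeds.
    children = root.spawn(4)
    streams = [default_rng(child) for child in children]

    # Each stream can now be consumed independently, e.g. in parallel workers.
    for k, rng in enumerate(streams):
        print(f"stream {k}: {rng.normal(size=3)}")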
What about something like cross validation, where a data set is randomly and repeatedly split in order to construct some predictive performance score? Well, this is really just a way to estimate a particular predictive expectation value. If the estimator is sufficiently well-behaved then any sequence of splits should result in consistent estimates.
The problem is that these estimators are not always — dare I say not often — well-behaved. Fragile estimator performance is only magnified when not using enough splits, especially when the data set at hand is too small to allow for sufficiently many splits.
In this case some pseudo-random number generator seeds will result in better outputs than others, but in practice we won’t know ahead of time which seeds will result in more favorable fluctuations, and so the choice of seed still has no practical consequence.
Sure, an unscrupulous person can take advantage of cross validation fragility to hunt for a seed that yields better results, but again this is an estimator problem, not a pseudo-random number generator seeding problem.
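Here is a minimal sketch of that fragility, using a made-up linear regression problem and a crude repeated random-split score in pure NumPy; with a small data set and only a few splits, the score scatters noticeably from seed to seed.

    import numpy as np

    # Made-up small regression data set.
    data_rng = np.random.default_rng(0)
    N = 40
    X = np.column_stack([np.ones(N), data_rng.normal(size=N)])
    y = X @ np.array([1.0, 2.0]) + data_rng.normal(scale=1.0, size=N)

    def cv_score(seed, n_splits=5):
        """Average held-out mean squared error over random 50/50 splits."""
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_splits):
            perm = rng.permutation(N)
            train, test = perm[: N // 2], perm[N // 2 :]
            beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            resid = y[test] - X[test] @ beta
            scores.append(np.mean(resid**2))
        return np.mean(scores)

    # Too few splits on too little data: the score varies with the seed.
    for seed in [1, 2, 3, 4, 5]:
        print(f"seed={seed}: score={cv_score(seed):.3f}")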
For a final example let’s consider Markov chain Monte Carlo. Most Markov chain Monte Carlo algorithms rely on pseudo-random number generators to fuel their exploration and, often, also to initialize the starting values. If Markov chain Monte Carlo is well-behaved then the Markov chains will converge and produce consistent estimates regardless of the precise sequence of pseudo-random numbers that were used.
Markov chain Monte Carlo, however, is not always well-behaved. Many Markov chain Monte Carlo algorithms struggle with more complicated target distributions; multi-modality, for instance, is a particularly problematic feature. With a multi-modal target distribution any finite Markov chain might explore only part of the target distribution, resulting in formally inaccurate estimates. When different pseudo-random number generator seeds result in different Markov chain exploration, and hence different Markov chain Monte Carlo estimates, the problem is not the pseudo-random number generator seeds but rather the Markov chain Monte Carlo algorithm itself!
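To demonstrate, here is a toy random walk Metropolis implementation of my own (not any particular library’s sampler) targeting a strongly bimodal mixture of two normal density functions. With a small step size each Markov chain gets stuck in whichever mode it starts near, so different seeds yield wildly different estimates of the mean, which is exactly zero.

    import numpy as np

    def log_target(x):
        # Equal mixture of normal(-10, 1) and normal(+10, 1), up to a
        # constant: strongly bimodal with well-separated modes.
        return np.logaddexp(-0.5 * (x + 10)**2, -0.5 * (x - 10)**2)

    def metropolis(seed, n_iter=10_000, step=1.0):
        rng = np.random.default_rng(seed)
        x = rng.normal(scale=10.0)  # Random initialization.
        samples = np.empty(n_iter)
        for n in range(n_iter):
            prop = x + step * rng.normal()
            # Accept or reject the proposal.
            if np.log(rng.uniform()) < log_target(prop) - log_target(x):
                x = prop
            samples[n] = x
        return samples

    # The exact mean of the target is 0, but each chain sees only one mode.
    for seed in [1, 2, 3, 4]:
        print(f"seed={seed}: estimated mean = {metropolis(seed).mean():+.2f}")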
All of this is to say that any method for choosing a seed is equally adequate, provided that the seed is reported and the resulting pseudo-random number generator output is used properly. A common seed for every analysis is fine. A heuristic for changing the seed from analysis to analysis, say based on the current date or time, is fine.
Incidentally, the same goes for the range of values for a seed. Modern pseudo-random number generator state spaces are so unfathomably large that any two-digit integer is just as uncharacteristic as any six-digit integer.
Adding a little bit of robustness by running an analysis multiple times with different seeds, just to see if the results are consistent, is a great way to identify potential estimator issues (and doesn’t introduce any harm provided you don’t try to do something foolish like average the results together…). That said, it will always be more productive to understand the method you are using and how it can be engineered to ensure strong estimation performance in the first place.
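If it helps, such a multi-seed consistency check might look something like the following sketch, where run_analysis is a hypothetical stand-in for whatever your analysis actually computes.

    import numpy as np

    def run_analysis(seed):
        """Stand-in analysis: returns an estimate and its standard error."""
        rng = np.random.default_rng(seed)
        fx = rng.normal(loc=1.0, scale=2.0, size=50_000)**2
        return fx.mean(), fx.std(ddof=1) / np.sqrt(fx.size)

    results = [run_analysis(seed) for seed in [1, 2, 3, 4, 5]]
    estimates = np.array([est for est, _ in results])
    errors = np.array([se for _, se in results])

    # Consistent behavior: the spread across seeds should be comparable to
    # the reported standard errors. A much larger spread flags estimator
    # problems. Report each run separately; don't average them together.
    print(f"estimates: {np.round(estimates, 3)}")
    print(f"spread across seeds: {estimates.std(ddof=1):.4f} "
          f"vs typical standard error {errors.mean():.4f}")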
If you want to read more then check out Section 2 of my Monte Carlo chapter, https://betanalpha.github.io/assets/case_studies/sampling.html#22_Pseudo-Random_Number_Generators. For even more detail I really like Melissa O’Neill’s writing, https://www.pcg-random.org.