February 2026 Digest
Jan 29, 2026 2:45 am
Upcoming Talk
On Tuesday, February 17th I'll be giving a talk and short workshop on (narratively) generative modeling for Princeton's psychology department, https://psychology.princeton.edu/news-events/2026/michael-betancourt-chief-research-scientist-symplectomorphic-llc. The talk (and maybe the workshop) will be live-streamed for those interested in attending remotely.
If you're impatient then you might be interested to know that the extensive cognitive science case study that I'll be reviewing during the workshop is available right now to my covector+ supporters on Patreon, https://www.patreon.com/c/betanalpha.
Consulting and Training
Are you, or is someone you know, struggling to develop, implement, or scale a Bayesian analysis compatible with your domain expertise? If so then I can help.
I can be reached at [email protected] for any questions about potential consulting engagements and training courses.
Probabilistic Modeling Discord
I set up an open Discord server dedicated to discussion about (narratively) generative modeling of all kinds. For directions on how to join see https://www.patreon.com/posts/generative-88674175. Come join the conversation! Lately we’ve been discussing everything from tacos and ranking models to concussions.
Support Me on Patreon
If my work has benefited your own and you have the resources then consider supporting me on Patreon, https://www.patreon.com/c/betanalpha. Supporters have access to not only new material but also over six years of archived material, which I’ve just recently organized to be easier to navigate.
Recent Rants
On Scaling With Data
You’ve built a probabilistic model for a small data set and computed posterior inferences with Hamiltonian Monte Carlo. Everything works great and, flush with confidence, you throw the model against all of your data only for everything to go to hell. Why can scaling with data be so damn frustrating? Well, let’s explore.
Recall that Hamiltonian Monte Carlo generates Markov transitions by simulating a numerical trajectory. Each leapfrog step in a numerical trajectory requires evaluating the model. Moreover, in a dynamic Hamiltonian Monte Carlo algorithm the number of leapfrog steps varies as the sampler adapts to the local geometry of the posterior distribution.
The overall cost of running Hamiltonian Monte Carlo can be decomposed as
Cost = N_iterations * N_leapfrog_per_iteration * Cost_per_model_eval
Because we configure the number of iterations directly, the first factor is not relevant to scaling considerations. The last two factors, however, can change, often in unexpected ways, as we add more data.
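To build some intuition for how these factors interact, here is a minimal sketch of the cost decomposition above; the particular leapfrog counts and per-evaluation costs are illustrative assumptions, not measurements.

```python
# Hypothetical scaling sketch: total HMC cost as a product of three factors.
# The specific numbers below are illustrative assumptions.

def hmc_cost(n_iterations, n_leapfrog_per_iteration, cost_per_model_eval):
    """Total cost = N_iterations * N_leapfrog_per_iteration * Cost_per_model_eval."""
    return n_iterations * n_leapfrog_per_iteration * cost_per_model_eval

n_iterations = 1000  # configured directly, independent of data size

# Suppose the evaluation cost grows linearly with a tenfold increase in data,
# while a misfitting model also needs twice as many leapfrog steps.
small = hmc_cost(n_iterations, n_leapfrog_per_iteration=50,  cost_per_model_eval=1.0)
large = hmc_cost(n_iterations, n_leapfrog_per_iteration=100, cost_per_model_eval=10.0)

print(large / small)  # → 20.0, i.e. 20x the cost for 10x the data
```

The point of the sketch is that the data-dependent factors multiply: even modest growth in each one compounds into much worse overall scaling.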
Let’s start with the cost of each model evaluation. For the simplest models the evaluation cost will grow linearly with the amount of data we include. For more sophisticated models, however, it can grow much faster. The cost of Gaussian process models, for example, grows cubically with the number of observations at distinct inputs. Similarly, some models require adding more parameters to accommodate new data, with each new parameter adding extra evaluation costs.
The cost of evaluating a model can sometimes be reduced with program optimization or parallelization techniques. Finding these optimization opportunities becomes easier with experience, but careful program analysis and run-time profilers can be extremely helpful.
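To make the Gaussian process example concrete, here is a minimal sketch, assuming a squared-exponential kernel and independent Gaussian measurement variability, of a log-density evaluation whose dominant cost is an N-by-N Cholesky factorization, which scales cubically with the number of inputs.

```python
import numpy as np

def sq_exp_kernel(x, alpha=1.0, rho=1.0):
    """Squared-exponential covariance over inputs x (an illustrative kernel choice)."""
    diffs = x[:, None] - x[None, :]
    return alpha**2 * np.exp(-0.5 * (diffs / rho)**2)

def gp_log_density(y, x, sigma=0.5):
    """Gaussian process marginal log density; the Cholesky step costs O(N^3)."""
    n = len(y)
    cov = sq_exp_kernel(x) + sigma**2 * np.eye(n)
    chol = np.linalg.cholesky(cov)          # O(N^3) -- the bottleneck
    z = np.linalg.solve(chol, y)            # whitened residuals
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (z @ z + log_det + n * np.log(2.0 * np.pi))

x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x)
print(gp_log_density(y, x))
```

Doubling the number of inputs here roughly octuples the cost of every single model evaluation, before any sampler behavior enters the picture.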
What about the number of model evaluations for each iteration? Again for dynamic Hamiltonian Monte Carlo this depends on the local geometry of the posterior distribution. The more well-behaved the local geometry is the shorter trajectories will be, and the fewer times we’ll need to evaluate our model.
The geometry of the posterior distribution is, unsurprisingly, sensitive to the amount of data that we include. Interestingly, this sensitivity can be both beneficial and harmful!
When we’re modeling observations drawn from a homogeneous data generating process — i.e. independent and identically-distributed data — adding more data can improve the posterior geometry. In fact in some cases the geometry can actually improve faster than the cost of model evaluation increases. Overall this results in less computation as we add more data!
In less ideal circumstances, however, the addition of more data can worsen the posterior geometry. When both N_leapfrog_per_iteration and Cost_per_model_eval increase, scaling to larger data sets quickly becomes much more difficult if not outright impractical.
So when does adding more data cause the local geometry to get worse? One of the most common culprits is model misfit.
Because small data sets are not very informative, simple models tend to be more than adequate for extracting meaningful insights. As we add more data, however, we can resolve more and more details about the data generating process that simple models won’t be able to accommodate. This inadequacy often results in models awkwardly contorting themselves in an attempt to emulate those details. In turn these contortions typically manifest as nasty likelihood/posterior geometries which require longer numerical trajectories, and more model evaluations, to explore effectively.
To avoid these contortions we need to make our models more sophisticated as we add new data so that they can keep up with the complexities of the data generating process that we can resolve. This might mean, for example, introducing non-linear terms, adding group-wise variability to some parameters, modeling contamination of the data generating process, and more.
That said, it’s usually not clear what kind of complexities we might encounter as we add new data, and hence how to update a model to scale more effectively. Fortunately there are a few techniques that can help identify potential issues without blowing out all of your computational resources.
Instead of running with one small test data set, for instance, we might run with multiple small test data sets. Does constructing posterior inferences take the same amount of time for each data set, or do some data sets take anomalously longer than the others? Do the posterior inferences from each fit agree with each other? Any discrepancies can help identify heterogeneities and idiosyncratic behaviors that might have to be modeled.
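As a toy illustration of this kind of consistency check, here is a sketch, assuming a conjugate normal model with known measurement variability and simulated data, that compares the posterior means recovered from several small subsets; a subset whose inferences stand well apart from the others would flag a potential heterogeneity.

```python
import numpy as np

rng = np.random.default_rng(8675309)

# Simulated observations; in practice these would be disjoint subsets of real data.
data = rng.normal(loc=2.0, scale=1.0, size=900)
subsets = np.split(data, 3)

def normal_posterior_mean(y, prior_mean=0.0, prior_sd=10.0, obs_sd=1.0):
    """Conjugate normal-normal posterior mean for a location parameter."""
    prec = 1.0 / prior_sd**2 + len(y) / obs_sd**2
    return (prior_mean / prior_sd**2 + np.sum(y) / obs_sd**2) / prec

means = [normal_posterior_mean(y) for y in subsets]
print(means)  # mutually consistent means suggest a homogeneous data generating process
```

With real analyses the comparison would be between full posterior distributions, and the fit times themselves, rather than just point summaries, but the logic is the same.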
When considering larger data sets, it can be helpful to incorporate data incrementally instead of just jumping to the full data all at once. These smaller data sets are faster to fit and allow us to study the computational scaling directly. Does the computational cost grow consistently with each increment? Sudden jumps in computation often indicate that the newly added data isn’t well-fit by the existing model.
Another thing to look out for is the adaptation of the Hamiltonian Monte Carlo sampler. If the posterior geometry is bad enough, or even if some of the Markov chains are initialized in especially problematic neighborhoods, then the adaptation of the step size and inverse metric during warmup can go off the rails. In particular, if the step size in a Markov chain is adapted to extremely small values then each numerical trajectory will explore poorly and add little to no information that might improve the adaptation. This results in a vicious cycle where that Markov chain basically just stalls while burning all the available compute.
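To see why a collapsed step size stalls a chain, consider a back-of-the-envelope sketch; the fixed integration time and the particular step sizes here are illustrative assumptions.

```python
# Back-of-the-envelope: leapfrog steps needed per trajectory for a
# fixed integration time, as the adapted step size shrinks.
integration_time = 1.0

for step_size in [1e-1, 1e-2, 1e-4]:
    n_leapfrog = round(integration_time / step_size)
    print(f"step size {step_size:g}: {n_leapfrog} leapfrog steps per trajectory")
```

In practice dynamic implementations cap the number of steps per trajectory (Stan's maximum treedepth, for example), but a chain with a collapsed step size then saturates that cap on every trajectory, which is just as wasteful.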
Waiting days for a run to finish, only to see failed adaptation of the Hamiltonian Monte Carlo sampler, can be pretty demoralizing. Adding data in increments can help catch adaptation problems faster. So too can keeping an eye on the adaptation status during the run, when possible.
Adaptation problems can often be resolved by improving the model, but sometimes the most productive approach is to engineer custom Markov chain initializations to avoid problematic tail behaviors and ease the burden on the adaptation procedure. This can be helpful, for example, when more extreme model configurations make parts of the model numerically unstable.