April 2025 Digest

Mar 28, 2025 3:15 pm

Upcoming Courses


In August I’ll be teaching four brand-new open-enrollment courses, with modules covering mixture modeling, survival modeling, pairwise comparison modeling, and ordinal modeling: https://www.eventzilla.net/e/advanced-bayesian-modeling-in-stan-2138661763.


Recent Writing


5 stars is better than 4 stars, but can we even define how much better it might be? Modeling ordinal outcomes like ratings is a subtle topic; fortunately I have a new chapter that dives directly into that nuance.


HTML: https://betanalpha.github.io/assets/chapters_html/ordinal_modeling.html

PDF: https://betanalpha.github.io/assets/chapters_pdf/ordinal_modeling.pdf


Without a metric to measure distances, ordinal spaces are less structured than we might initially expect. In particular, modeling a single set of ordinal probabilities is pretty much the same as modeling a single set of categorical probabilities.


Orderings, however, become much more relevant when we want to _perturb_ baseline ordinal probabilities, modeling heterogeneous behaviors while maintaining desired patterns between neighboring ordinal probabilities.


In this chapter I discuss the mathematical foundations of ordinal spaces as well as useful strategies for modeling baseline ordinal probabilities and their structure-preserving perturbations. As is my custom I also go into detail about potential degeneracies while reviewing productive management strategies. Finally I demonstrate all of these concepts with carefully designed exercises.
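As a taste of the general flavor, here is a minimal Python sketch of one common construction, not necessarily the one emphasized in the chapter: baseline ordinal probabilities derived by partitioning a latent logistic density with interior cut points, with a shift of the latent location serving as a structure-preserving perturbation. All of the names and numbers are purely illustrative.

import numpy as np
from scipy.special import expit  # logistic CDF

def ordinal_probs(cut_points, gamma=0.0):
    # Interior cut points partition the latent real line into K intervals;
    # the probability of ordinal k is the latent logistic probability
    # allocated to the k-th interval after shifting by the location gamma.
    bounds = np.concatenate(([-np.inf], cut_points, [np.inf]))
    cdf = expit(bounds - gamma)  # expit(-inf) = 0 and expit(+inf) = 1
    return np.diff(cdf)          # probabilities sum to one by construction

baseline  = ordinal_probs(np.array([-1.0, 0.0, 1.0, 2.5]))       # five categories
perturbed = ordinal_probs(np.array([-1.0, 0.0, 1.0, 2.5]), 0.7)  # shifted latent location

Because every category probability is driven by the same latent shift, the perturbation respects the ordering of the categories instead of scrambling them independently.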


My covector+ supporters on Patreon also have access to a comprehensive livestream review of this new chapter.


Consulting and Training


Are you, or is someone you know, struggling to develop, implement, or scale a Bayesian analysis compatible with your domain expertise? If so then I can help.

I can be reached at inquiries@symplectomorphic.com for any questions about potential consulting engagements and training courses.


Probabilistic Modeling Discord


I have a Discord server dedicated to discussion about (narratively) generative modeling of all kinds, https://discord.gg/nfhv8bAV. Come join the conversation!


Support Me on Patreon


If you would like to support my writing then consider becoming a patron, https://www.patreon.com/betanalpha. This past month I have been sharing drafts on some more technical mathematical topics that I will eventually use when writing about some subtle applied topics.


Recent Rants


On Mathematical Terminology


Historically math has had a bad habit of taking something defined in a particular circumstance and then generalizing it while keeping the exact same terminology, and often the same notation, even when the generalization is not unique. This can be extraordinarily frustrating when trying to understand a reference that is working with a generalization, or a generalization of a generalization, while one is only versed in, or just expecting, the original special case. In particular the overloaded notation/terminology makes it difficult to figure out the precise context without already knowing the historical development of the field.


For example in the case of the factorial function one starts with a map from the positive integers to the positive integers,


 -! : [1, infty) -> [1, infty)

      n    |->  n! = 1 * 2 * ... * (n - 1) * n.


Now there are all kinds of applications that would be much easier to work through if there were a compatible map that worked on all non-negative integers, [0, infty) -> [0, infty).


But there are an infinite number of ways to do this -- having 0 map to any non-negative integer we want defines a different map. Some of these have nicer properties on the generalized input space [0, infty) but none is a unique generalization of !. Instead of calling these generalized operations something else, however, people just pick a convention that's useful to them and then call the resulting generalization ! without any other context.
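To make the non-uniqueness concrete, here is a small Python sketch, with purely illustrative function names, of two extensions that agree with ! on every positive integer but disagree at 0.

def factorial_pos(n):
    # The original map, defined only for positive integers n >= 1.
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def extension_a(n):
    return 1 if n == 0 else factorial_pos(n)  # the conventional choice, 0! = 1

def extension_b(n):
    return 7 if n == 0 else factorial_pos(n)  # an equally well-defined map on [0, infty)

assert all(extension_a(n) == extension_b(n) == factorial_pos(n) for n in range(1, 10))
assert extension_a(0) != extension_b(0)

The conventional choice 0! = 1 is the one that extends the recursion n! = n * (n - 1)! down to n = 1, but nothing in the original definition forces that choice.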


The same thing happens when trying to generalize beyond the integers to real numbers or even complex numbers. Ironically even though the complex numbers are more general than the real numbers they usually come with more implicit structural assumptions, which restrict the possibilities down to a unique generalization of !.
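The canonical example here is the gamma function, which reproduces the factorial on the positive integers up to a shift of argument, Gamma(n + 1) = n!, and which is singled out as the unique generalization only under additional structural assumptions such as log-convexity on the positive reals or suitable boundedness conditions in the complex plane. A quick numerical check of the compatibility using the Python standard library:

import math

# Gamma(n + 1) reproduces n! on the positive integers.
for n in range(1, 10):
    assert math.isclose(math.gamma(n + 1), math.factorial(n))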


Any time you see a controversial/non-unique edge case or generalization it's okay to step back, recognize the well-defined starting case and the possible generalizations, and then use different notations/terminologies for each if that helps you to avoid confusion. I do it all the time!


Summation isn't generalized by integration. Summation is generalized by Riemann integration, which is generalized by Riemann–Stieltjes integration, which is generalized by Lebesgue integration. Or maybe Riemann integration is generalized by integration of differential forms over chain complexes, which is then generalized by Lebesgue integration? 😳


Strive to be better by not abusing notation as if it were mandatory.



On Explaining Inferences


Inferences can rarely, if ever, be written as a series of simple insights/explanations. The problem is that conditioning is not a local probabilistic operation but rather a global one that couples all of the individual parameters together into a complex mess, even for the simplest models.


A narratively generative model distinguishes a particular conditional decomposition of the full Bayesian model into smaller, typically more interpretable pieces. In general we start by breaking the full Bayesian model down into an observational model and a prior model,


 p( theta_1, ..., theta_I, y_1, ..., y_N)

  = p( y_1, ..., y_N | theta_1, ..., theta_I)

  * p( theta_1, ..., theta_I ).


Then we can break these pieces up into even lower-dimensional conditional distributions, each modeling one "step" in the latent data generating process. This compartmentalizes the full model into pieces that are not only simpler mathematically but also easier to interpret. For example interpreting a term like p(theta_3 | theta_8, theta_12) requires considering only three variables, two of which are bound to particular values.
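To make this concrete, here is a minimal Python sketch of a hypothetical narratively generative model, entirely of my own construction for illustration: a population location theta_1, an individual location theta_2 drawn around it, and observations y drawn around theta_2. Each conditional piece is low-dimensional and easy to read on its own.

import numpy as np

rng = np.random.default_rng(8675309)

def sample_joint(N):
    theta_1 = rng.normal(0.0, 3.0)        # p(theta_1)
    theta_2 = rng.normal(theta_1, 1.0)    # p(theta_2 | theta_1)
    y = rng.normal(theta_2, 0.5, size=N)  # p(y_n | theta_2)
    return theta_1, theta_2, y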


Inferences, however, aren't constructed from these individual, interpretable pieces. Instead they are derived from the full Bayesian model all at once,


 p(theta_1, ..., theta_I | tilde{y}_1, ..., tilde{y}_N )

  \propto p(theta_1, ..., theta_I, tilde{y}_1, ..., tilde{y}_N ).


Note: I'm going to assume Bayesian inference here, as it is the one true inference, but the same thing happens with inferences based on likelihood functions or even consistent joint modeling of multiple estimators.
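Continuing the hypothetical sketch above, the unnormalized log posterior density is just the joint log density with the observations fixed to their observed values; every conditional term contributes at the same time, which is exactly why the resulting inferences couple all of the parameters together.

from scipy.stats import norm

def log_joint(theta_1, theta_2, y_obs):
    # log p(theta_1) + log p(theta_2 | theta_1) + sum_n log p(y_obs_n | theta_2)
    return (  norm.logpdf(theta_1, loc=0.0, scale=3.0)
            + norm.logpdf(theta_2, loc=theta_1, scale=1.0)
            + norm.logpdf(y_obs, loc=theta_2, scale=0.5).sum() )

# log p(theta_1, theta_2 | y_obs) = log_joint(theta_1, theta_2, y_obs) + constant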


There's no natural "flow of information" from the data to the parameters. Sure, conditioning on one observation immediately constrains the parameters that inform it. A parameter that informs multiple observations, however, is subject to multiple overlapping constraints that have to be harmonized.


This is why _marginal_ inferences for any individual parameter are so useful. They consistently incorporate all of these complex inferential interactions into a convenient summary. On the other hand this is also why a bunch of marginal inferences do not provide enough information to inform joint inferences....
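As a quick illustration, assuming joint posterior samples arranged as an (S, I) array with one column per parameter, which is a convenience of this sketch and not a universal convention, marginal summaries come from the columns alone while the couplings that joint inferences need live only in the full array.

import numpy as np

rng = np.random.default_rng(1)
cov = [[1.0, 0.9], [0.9, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=4000)  # stand-in for MCMC output

marginal_means = samples.mean(axis=0)
marginal_intervals = np.quantile(samples, [0.05, 0.95], axis=0)

# The strong coupling between the two parameters is invisible in the
# marginal summaries but very much present in the joint samples.
print(np.corrcoef(samples.T)[0, 1])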


Now mathematically we can always decompose the joint posterior into a series of conditional posteriors


 p(theta_1, ..., theta_I | tilde{y}_1, ..., tilde{y}_N )

  =  p(theta_1 | theta_2, ..., theta_I, tilde{y}_1, ..., tilde{y}_N )

   * p(theta_2 | theta_3, ..., theta_I, tilde{y}_1, ..., tilde{y}_N )

   * ....


That said there are I! equally valid conditional decompositions, each of which offers a different "explanation" for the inferred behaviors of any particular parameter.


Moreover it's difficult to characterize these conditional posterior distributions in practice. For example joint MCMC samples can't be transformed into conditional MCMC samples; instead one has to rerun MCMC for each conditional piece!
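A crude Python sketch of the problem, using correlated Gaussian draws as a stand-in for joint MCMC output: conditioning one parameter to a fixed value leaves essentially no joint samples to reuse, so filtering within a small window is at best a sample-starved approximation and in general one really does have to target each conditional density with its own MCMC run.

import numpy as np

rng = np.random.default_rng(2)
cov = [[1.0, 0.9], [0.9, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=4000)

theta_2_fixed = 1.5
window = 0.05
near = samples[np.abs(samples[:, 1] - theta_2_fixed) < window]
print(near.shape[0], "of", samples.shape[0], "joint samples fall in the conditioning window")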


All of this becomes even worse when one wants to make inferences or predictions that generalize beyond the original circumstance of the data, such as "interventions". Firstly, in most systems one cannot simply fix some of the parameters without modifying the behavior of the others. One has to jointly _model_ the intervention along with the original data generating process and then let those complex inferences happen all at once. The magic of Bayes' Theorem is that it provides a way to directly quantify the joint parameter behaviors that are consistent with any observed data. The limitation is that these joint inferences generally don't decompose into smaller, more interpretable bits.


Moreover all of this presumes that the model is actually an adequate representation of the true data generating process. Even mathematically well-defined inferences from inadequate models can exhibit all kinds of counterintuitive behaviors. Trying to "explain" these is less than meaningful.


Ultimately, asking "why" the inferences for a parameter behave the way they do is much less meaningful than examining the assumed model for any past and future data generating processes, the provenance of the observed data, and the inferential/predictive consequences for the behaviors most relevant to stakeholders.


For more on building interpretable models see my chapter on narratively generative modeling, https://betanalpha.github.io/assets/case_studies/generative_modeling.html, and my recent workshop on the topic, https://www.youtube.com/watch?v=92oSUaZggKs. The workshop material in particular includes a few updates.
