January 2025 Digest

Dec 30, 2024 4:33 pm

Recent Writing


Pairwise Comparison Modeling


Comparison may be a thief of joy, but it is an undoubtedly interesting statistical modeling problem. In my latest chapter I go deep into the general problem of modeling comparisons between pairs of abstract items.


HTML: https://betanalpha.github.io/assets/chapters_html/pairwise_comparison_modeling.html

PDF: https://betanalpha.github.io/assets/chapters_pdf/pairwise_comparison_modeling.pdf


Pairwise comparison models show up in a diversity of applications, from predicting the outcomes of sporting events to evaluating student test responses. In this chapter I consider a general approach for building pairwise comparison models and then discuss potential inferential degeneracies and strategies for productively managing them.


Along the way I discuss the special cases of Bradley-Terry, Elo, and item response theory, not to mention a little bit of graph theory. You know, just as a treat. At the end I demonstrate the implementation of these techniques with a series of extremely detailed examples.


2024 Recap


On Patreon I reviewed my new writing from the past year, in case you might have missed anything, https://www.patreon.com/posts/118954156.


Consulting and Training


If you are interested in consulting or training engagements, or even commissioning me to create a presentation on a topic of interest, then don’t hesitate to reach out to me at inquiries@symplectomorphic.com.


Probabilistic Modeling Discord


I have a Discord server dedicated to discussion about (narratively) generative modeling of all kinds, https://discord.gg/QKmrk4hy.


Support Me on Patreon


If you would like to support my writing then consider becoming a patron, https://www.patreon.com/betanalpha. Right now covector+ supporters have early access to a new case study on modeling star rating data.


Recent Rants


On Bayesian Learning


Heuristic appeals to Bayesian reasoning, especially the phrase “updating my prior”, seem like they are becoming increasingly popular these days. Formal Bayesian inference, however, is more than just updating a prior model. 


To be fair Bayesian inference is often presented as the updating of a prior distribution p(theta) by a likelihood function L_{tilde{y}}(theta),


 p(theta | tilde{y} ) propto L_{tilde{y}}(theta) p(theta).


Here the prior model quantifies how consistent each model configuration, theta, is with the available domain expertise while the likelihood function quantifies how consistent each theta is with the observed data, tilde{y}. In other words this Bayesian updating rule quantifies how compatible each model configuration is with both the included domain expertise and observed data.
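

For a concrete, if entirely hypothetical, demonstration of this updating rule consider a conjugate beta-binomial model. A minimal Python sketch, with a prior, observation, and grid entirely of my own choosing, might look something like this.

 import numpy as np
 from scipy import stats

 # Prior model p(theta): domain expertise encoded as a Beta(2, 2) density.
 prior = stats.beta(2, 2)

 # Observed data tilde{y}: 7 successes in N = 10 trials.
 N, y_tilde = 10, 7

 # Likelihood function L_{tilde{y}}(theta) propto p(tilde{y} | theta),
 # evaluated across a grid of model configurations theta.
 thetas = np.linspace(0, 1, 201)
 likelihood = stats.binom(N, thetas).pmf(y_tilde)

 # Bayesian updating rule: unnormalized posterior = likelihood * prior.
 unnorm_posterior = likelihood * prior.pdf(thetas)

 # Normalizing on the grid recovers the conjugate posterior
 # Beta(2 + 7, 2 + 3) up to discretization error; its mean is 9 / 14.
 posterior_grid = unnorm_posterior / unnorm_posterior.sum()
 print((thetas * posterior_grid).sum())                  # approximately 0.643
 print(stats.beta(2 + y_tilde, 2 + N - y_tilde).mean())  # 0.642857...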


This update, however, is not the full story. Bayesian inference is derived from an application of Bayes’ Theorem, and Bayes’ Theorem is actually a much more general statement about the conditioning of joint probability distributions. Formally Bayes’ Theorem states that for a joint probability distribution


 p(y, theta) = p(y | theta) p(theta)


and _any_ tilde{y} we have


 p(theta | tilde{y} ) propto p(tilde{y}, theta) = p(tilde{y} | theta) p(theta).


Now for a given observation tilde{y} the partial evaluation of the observational model defines a likelihood function,


 L_{tilde{y}}(theta) propto p(tilde{y} | theta),


and so we can indeed interpret the left-hand side as an updating of the prior model by the realized likelihood function. At least within the singular context of a particular observation.


The important insight here is that in Bayesian inference one doesn’t construct the likelihood function directly. Rather one _derives_ it from the conditional probability distribution p(y | theta).


In a statistical setting p(y | theta) defines a collection of probability distributions over the observational space, each quantifying a mathematical story of how data could be generated. A realized likelihood function quantifies how consistent these stories are with a particular observation.
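

To make this distinction more tangible, here is a minimal Python sketch of an assumed normal location model, showing p(y | theta) as a collection of densities over y and the realized likelihood function obtained by partial evaluation at a particular observation. The model and numbers are purely illustrative.

 import numpy as np
 from scipy import stats

 def observational_density(y, theta):
     # Evaluate p(y | theta) for an assumed normal(theta, 1) observational model.
     return stats.norm(loc=theta, scale=1).pdf(y)

 # For a fixed theta we get a probability density function over the
 # observational space, one mathematical story of how data could be generated.
 ys = np.linspace(-4, 6, 5)
 print(observational_density(ys, theta=1.0))

 # Partially evaluating at a particular observation tilde{y} instead gives
 # a function of theta alone, the realized likelihood function
 # L_{tilde{y}}(theta) propto p(tilde{y} | theta).
 y_tilde = 1.7
 likelihood = lambda theta: observational_density(y_tilde, theta)
 print(likelihood(np.array([0.0, 1.0, 2.0])))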


Because the full scope of Bayes’ Theorem includes all possible observations a Bayesian model has to consider those possibilities, modeling both p(y | theta) and p(theta), or equivalently p(y, theta), even if we only want to apply it to a particular observation. That said this modeling requirement has some nice benefits.


Firstly because the observational model p(y | theta) directly models the assumed data generating process it is much more interpretable than any realized likelihood function. The answer to “from where did your data come?” directly informs p(y | theta) but only indirectly informs L_{tilde{y}}(theta).


Secondly we can use p(y | theta) to simulate hypothetical observations for any theta. Moreover p(theta) informs which of those simulations are more relevant than others. Together these objects provide a unified foundation for simulation studies, experimental design, and calibration.
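

For example a prior predictive simulation uses both ingredients at once; here is a sketch for the same kind of hypothetical normal location model, now paired with a normal prior of my own choosing.

 import numpy as np

 rng = np.random.default_rng(8675309)
 n_sims = 1000

 # Sample model configurations from the prior model p(theta), here
 # assumed to be normal(0, 2).
 thetas = rng.normal(loc=0, scale=2, size=n_sims)

 # For each configuration simulate a hypothetical observation from the
 # observational model p(y | theta), here assumed to be normal(theta, 1).
 ys = rng.normal(loc=thetas, scale=1)

 # The simulated observations summarize what the full model considers
 # plausible before any data are observed, which is exactly the input
 # needed for simulation studies, experimental design, and calibration.
 print(np.quantile(ys, [0.05, 0.5, 0.95]))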


Any so-called Bayesian analysis that doesn’t directly specify an observational model is specifying one implicitly. Implicit observational models obscure the assumptions, making them difficult to acknowledge let alone critique.


While all of this might sound a bit abstract you might already be doing the right thing. For example most probabilistic programs don’t actually specify a likelihood function directly but rather p(y | theta) and p(theta), or even p(y, theta). Likelihood functions and posterior distributions are then derived by partially evaluating these programs on the observed values of data variables.


One prime example of this is the Stan modeling language. No Stan program defines a likelihood function directly! Instead each program specifies a joint density function p(y, theta) which is then partially evaluated on given values of y to define an unnormalized posterior density function p(theta | tilde{y}). Algorithms like Hamiltonian Monte Carlo then use p(theta | tilde{y}) and its gradients to estimate posterior expectation values, allowing us to extract and communicate useful insights about the posterior distribution.
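

To be clear the following is not Stan code, but the same pattern is easy to sketch in Python: specify a joint log density function and then partially evaluate it on the observed data. The model here is, once again, an illustrative normal location model of my own choosing.

 import numpy as np
 from scipy import stats

 def joint_lpdf(y, theta):
     # log p(y, theta) = log p(y | theta) + log p(theta) for an assumed
     # normal(theta, 1) observational model and normal(0, 2) prior model.
     log_obs = np.sum(stats.norm(theta, 1).logpdf(y))
     log_prior = stats.norm(0, 2).logpdf(theta)
     return log_obs + log_prior

 # Partial evaluation on the observed data defines the unnormalized
 # posterior log density, the object that algorithms like Hamiltonian
 # Monte Carlo actually consume.
 y_tilde = np.array([1.2, 0.4, 2.1])
 unnorm_posterior_lpdf = lambda theta: joint_lpdf(y_tilde, theta)
 print(unnorm_posterior_lpdf(0.0), unnorm_posterior_lpdf(1.0))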


While this thread has focused on Bayesian inference its lessons carry over to frequentist methods as well. Frequentist estimators are not calibrated against a likelihood function but rather an assumed observational model p(y ; theta). A statement like “the estimator hat{theta} is unbiased” implies that


 \int dy p(y; theta) [ hat{theta}(y) - theta ] = 0


for all model configurations theta. In this case we never even construct a likelihood function!
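

As a quick sanity check of what this calibration looks like in practice, here is a Monte Carlo sketch of the unbiasedness condition for the sample mean under an assumed normal observational model; the model, estimator, and numbers are again my own illustration.

 import numpy as np

 rng = np.random.default_rng(42)

 def theta_hat(y):
     # Estimator hat{theta}(y): the sample mean of each simulated data set.
     return np.mean(y, axis=-1)

 # Fix a model configuration and simulate data sets of N observations
 # from the assumed observational model p(y; theta) = normal(theta, 1).
 theta, N, n_sims = 1.5, 10, 100_000
 ys = rng.normal(loc=theta, scale=1, size=(n_sims, N))

 # Monte Carlo estimate of int dy p(y; theta) [ hat{theta}(y) - theta ],
 # which should be zero up to Monte Carlo error. No likelihood function
 # is ever constructed.
 print(np.mean(theta_hat(ys) - theta))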


Likelihood functions absolutely play a role in Bayesian inference, but only an intermediate one. We do not start with a likelihood function nor do we end with one. Instead we start with an observational model that allows us to do much more than just update the prior model for a particular observation. Ultimately simply changing one’s opinion/perspective/belief in the presence of new information is not inherently Bayesian. Bayesian inference is a particular, systematic and coherent method for learning that is powerful but not trivial to implement in practice.
