February 2024 Digest

Jan 30, 2024 8:24 pm

Consulting and Training


If you’re interested in consulting or training engagements, then don’t hesitate to reach out to me at inquiries@symplectomorphic.com.


Upcoming Courses


I’ll be hosting an abbreviated suite of remote courses this summer, https://events.eventzilla.net/e/principled-bayesian-modeling-with-stan-2138610063. The material will cover foundations and a few advanced modeling techniques, including my Gaussian process module that I wasn’t able to offer last year.


Recent Writing


Curious about how probability distributions transform? Ever wondered where the term “marginal probability” comes from? Confused by those “Jacobian” things that people keep talking about? Have I got some writing for you!


In my latest chapter I survey how probability distributions, representations of probability distributions, and probabilistic operations all change when the underlying space is transformed.


HTML: https://betanalpha.github.io/assets/chapters_html/transforming_probability_spaces.html

PDF: https://betanalpha.github.io/assets/chapters_pdf/transforming_probability_spaces.pdf


Along the way we’ll learn why expectation values are sometimes referred to as the “mean of a function” and all of the ways that we can characterize one-dimensional marginal/pushforward distributions.


Support Me on Patreon


If you would like to support my writing then consider becoming a patron, https://www.patreon.com/betanalpha. 


Probabilistic Modeling Discord


I’ve recently started a Discord server dedicated to discussion of (narratively) generative modeling of all kinds, https://discord.gg/frdhnCfHv6.


Recent Rants


On Indifference


The “principle of indifference” is often presented as an intuitively obvious motivation for specifying “non-informative” prior models. Unfortunately that intuition quickly falls apart in many common applications. A long thread about applied probability theory!


Consider a mathematical space X and two well-behaved subsets A_{1}, A_{2} subset X that our domain expertise cannot distinguish. The principle of indifference is often used to argue that a non-informative probability distribution pi should allocate the same probabilities to those two subsets, 

 pi(A_{1}) = pi(A_{2}).


What if, however, the second subset is actually the disjoint union of other well-behaved subsets,

 A_{2} = cup_{i = 3}^{I} A_{i}?

Does our domain expertise distinguish A_{2} from A_{3}, …, A_{I}, or do all of these subsets also seem equivalent?


In other words do we still want our non-informative probability distribution to give

 pi(A_{1}) = pi(A_{2})

or

 pi(A_{1}) = pi(A_{3}) = … = pi(A_{I})?

We cannot have both at the same time!
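
To see the conflict explicitly note that because the A_{i} are disjoint any consistent allocation has to satisfy

 pi(A_{2}) = sum_{i = 3}^{I} pi(A_{i}).

If pi(A_{1}) = pi(A_{3}) = … = pi(A_{I}) then this sum reduces to

 pi(A_{2}) = (I - 2) pi(A_{1}),

which is compatible with pi(A_{1}) = pi(A_{2}) only when pi(A_{1}) = 0 or I = 3.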


This kind of discrepancy comes up often in practice. Someone arguing that the probability the world will end is 50% because “it will happen or it won’t” is allocating probabilities based on large subsets that aggregate many possible behaviors together.


On the other hand those arguing for lower probabilities are implicitly considering smaller subsets that capture more nuanced behaviors. There are many more ways for the world to not end than for it to end.


The underlying problem here is that a probability distribution has to define a _consistent_ allocation to _all_ well-behaved subsets. A probability distribution cannot consistently allocate the same probability to all well-behaved subsets. In order to define a “non-informative” probability distribution we have to choose a collection of subsets that we believe are indistinguishable.
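
For example both the full space X and the empty set are well-behaved subsets, yet consistency requires

 pi(X) = 1, pi(emptyset) = 0,

so these two subsets can never be allocated the same probability.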


On countable spaces, including finite spaces and countably infinite spaces like the integers, every probability distribution is completely specified by the allocations to the atomic subsets that consist of a single element. Probability allocations to all other subsets can then be consistently built up from these atomic allocations.


Once we’ve defined all of the irreducible elements in our space we might then argue that a non-informative probability distribution should not distinguish between the atomic subsets and hence assign them the same probabilities. Note that this immediately implies that subsets that contain more elements will be allocated more probability. Not all allocations are the same!


For finite spaces we can achieve this goal. If our space contains N elements then we can define a “non-informative” probability distribution that allocates probability 1 / N to each element. Indeed this simple setting is often implicitly assumed in introductory examples of the principle of indifference. In practice we still have to define our atomic elements, and different choices can lead to different notions of a “non-informative” probability distribution.
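
As a concrete illustration, here’s a minimal Python sketch of this construction; the space of colors is just a hypothetical example:

space = ["red", "green", "blue", "yellow"]
N = len(space)

# A "non-informative" allocation assigns the same probability to
# every atomic subset.
atomic_probs = {element: 1 / N for element in space}

# Allocations to all other subsets are built up by summing the
# atomic allocations.
def prob(subset):
    return sum(atomic_probs[element] for element in subset)

print(prob(["red"]))                      # 0.25
print(prob(["green", "blue", "yellow"]))  # 0.75; more elements, more probability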


This construction, however, runs into a problem for countably infinite spaces. Allocating the same non-zero probability to an infinite number of elements requires an infinite amount of total probability. By definition the total probability always needs to be one. We can use this construction to define non-informative _measures_, such as the counting measure, but we cannot use it to define non-informative probability distributions. Probability distributions inherently require more information than we might expect!
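
Indeed if every atomic subset were allocated the same probability epsilon > 0 then consistency would require the total probability

 pi(X) = sum_{n = 1}^{infinity} epsilon = infinity,

no matter how small epsilon is, instead of the required pi(X) = 1.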


When working on uncountably infinite spaces, such as the real numbers, the construction of a “non-informative” probability distribution encounters even more problems. On these spaces allocations to atomic subsets no longer constrain all of the allocations to larger subsets. In order to fully define a probability distribution we need to specify the allocations to enough larger subsets that the allocations to every other subset are fixed by the consistency conditions.


For example on a bounded real line with a fixed metric we can construct interval subsets. If the intervals all have the same length then the metric cannot distinguish between them. Taking this as our notion of “ignorance” we could try to define a probability distribution from the requirement that the probabilities allocated to all intervals of the same length are the same,

 pi( [ x_1, x_1 + l ] ) = pi( [x_2, x_2 + l] ).


Conveniently this construction works for bounded real lines, allowing us to define a uniform probability distribution in the context of a particular metric. If we change that metric, however, then we change our notion of equal-length intervals and hence the resulting “uniform” probability distribution.
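
To demonstrate, here’s a quick Monte Carlo sketch in Python; the reparameterization y = x ** 2, which warps the notion of interval length, is just an arbitrary example:

import random

# Draws from a distribution that is uniform with respect to the
# standard metric on [0, 1]: equal-length intervals receive equal
# probabilities.
draws = [random.random() for _ in range(100_000)]

def interval_prob(samples, a, b):
    # Monte Carlo estimate of the probability allocated to [a, b].
    return sum(a <= s <= b for s in samples) / len(samples)

print(interval_prob(draws, 0.0, 0.1))  # approximately 0.10
print(interval_prob(draws, 0.9, 1.0))  # approximately 0.10

# Changing the metric, here by reparameterizing to y = x ** 2,
# changes which intervals count as "equal length".
reparam = [x ** 2 for x in draws]

print(interval_prob(reparam, 0.0, 0.1))  # approximately 0.32
print(interval_prob(reparam, 0.9, 1.0))  # approximately 0.05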


Unfortunately this construction doesn’t work for unbounded real lines. As in the case of countably infinite spaces we end up with too much total probability. We can, however, use this construction to define a uniform measure, which in the case of a real line is known as the Lebesgue measure.
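
In particular the Lebesgue measure allocates

 lambda( [x, x + l] ) = l

to every interval, which respects the indifference between equal-length intervals, but the total measure that it assigns to the entire real line is infinite rather than one.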


To review: the “principle of indifference” is well-defined only in limited contexts. In order to apply it we first need to choose which subsets should be indistinguishable and in general there will be many distinct choices, each with different consequences. Moreover on infinitely large spaces many choices don’t actually lead to well-defined probability distributions.


Mathematically it’s often easier to model indifference with symmetries. Instead of saying that a probability distribution cannot distinguish between two subsets we can say that the probability distribution should be invariant to any transformation that permutes those two subsets.


On finite spaces we can construct uniform probability distributions that are invariant to any permutation of the atomic elements. On bounded real lines we can construct uniform probability distributions that are invariant to the translations compatible with a given metric.
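
Here’s a minimal Python sketch of the finite case; the specific allocations are hypothetical:

import random

# Uniform atomic allocations on a finite space with N elements.
N = 5
uniform = [1 / N for _ in range(N)]

# An arbitrary permutation of the atomic elements leaves the uniform
# allocations unchanged.
perm = list(range(N))
random.shuffle(perm)
assert [uniform[i] for i in perm] == uniform

# A non-uniform allocation, however, is not permutation invariant;
# for example swapping the first two elements changes it.
nonuniform = [0.4, 0.3, 0.1, 0.1, 0.1]
swapped = [nonuniform[1], nonuniform[0]] + nonuniform[2:]
assert swapped != nonuniform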


This perspective is useful theoretically because we can use mathematical tools to identify the natural symmetries of a space, and hence natural candidates for defining “uniform” measures and, when possible, “uniform” probability distributions. We can also use these tools to identify when the desired symmetries are compatible with each other and actually lead to well-defined probability distributions.


But I also think that the symmetry perspective is useful in practice because I find that it better emphasizes the subjective assumptions that underlie every attempt to define “ignorance”, “indifference”, and “non-informative”. In order to state which symmetries are important we have to make these implicit assumptions explicit and shatter any illusion that the resulting objects are objective.
