November 2024 Digest

Oct 28, 2024 2:22 pm

Upcoming Course


On Wednesday, November 6 I will be offering my comprehensive hierarchical modeling course at a steep discount in an effort to raise funds for World Central Kitchen and Doctors Without Borders. All you have to do to attend is make a donation of 50 USD to either organization and then send me a screenshot of the confirmation. Details about the course and the registration process can be found on my website, betanalpha.github.io/courses/.


Thanks to some generous sponsors a limited number of free registrations are also available. Priority will be given to Black, Indigenous, and people of color in high-income countries as well as to those from low- and middle-income countries. For more information see the course page.


Video Recommendation


A member of my probabilistic modeling Discord server found this awesome interview between I.J. Good and (a young) Persi Diaconis that I highly recommend, https://www.youtube.com/watch?v=VQnrLWYMQB4.


In my opinion Good’s perspectives on statistics are highly under-appreciated and remain relevant in contemporary applications. For example he’s one of the only people I’ve seen write about the costs of domain expertise elicitation and the practical consequences.


I particularly liked the discussion of the likelihood principle — “I believe in the likelihood principle given the right statistical model” — and the need for expected utilities to account for the time and energy costs of developing a model.


For those who haven’t read much of Good’s work before, many of his classic papers and essays are collected in “Good Thinking”, which is available as an inexpensive Dover paperback, https://store.doverpublications.com/products/9780486474380. Another strong recommendation from me.


Recent Code


I recently updated my Markov chain Monte Carlo analysis and diagnostic tools, https://github.com/betanalpha/mcmc_diagnostics. New features include the ability to overlay expectand pushforward histograms, an ensemble quantile estimator, an implicit subset probability estimator, and a general expectand pushforward evaluation function.


Besides the code itself the GitHub repository also includes HTML and PDF files that document and demonstrate the functionality of the tools. If you have an opportunity to use the code then all feedback is encouraged and welcome!


The code nominally supports RStan, PyStan2, and PyStan3, but by re-implementing just a few functions it can be easily extended to any Markov chain Monte Carlo code in R or Python.
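

As a rough illustration of what such an extension might involve, here is a hypothetical Python adapter, not part of the repository itself, that reshapes generic sampler output into per-expectand arrays indexed by chain and iteration. The exact format that the tools expect is documented in the repository, so treat the layout below as an assumption made purely for demonstration.

import numpy as np

def collect_expectand_vals(chains):
    """Hypothetical adapter: reshape generic sampler output into a
    dictionary mapping each expectand name to a
    (number of chains) x (number of iterations) array."""
    # `chains` is assumed to be a list of dictionaries, one per Markov
    # chain, each mapping names to one-dimensional arrays of sampled values.
    names = chains[0].keys()
    return { name: np.vstack([ np.asarray(chain[name]) for chain in chains ])
             for name in names }

# Example usage with two fake chains of 1000 iterations each.
rng = np.random.default_rng(8675309)
fake_chains = [ {'mu': rng.normal(size=1000), 'tau': rng.lognormal(size=1000)}
                for _ in range(2) ]
expectand_vals = collect_expectand_vals(fake_chains)
print(expectand_vals['mu'].shape)  # (2, 1000)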


Recent(-ish) Writing 


Over the past few months I’ve released a series of analyses that demonstrate how the elegant philosophy of Bayesian inference can actually be applied beyond the idealized examples in introductory texts and tutorials.


The first analysis is based on analyses that are common in marketing, https://betanalpha.github.io/assets/chapters_html/customer_conversion.html. There’s even an accompanying video where I introduce generative modeling techniques, with a discussion of "causal" modeling, and then step through the analysis, https://www.youtube.com/watch?v=92oSUaZggKs.


Next we have a relatively simple analysis of tree growth that demonstrates the application of Bayesian inference in ecology using real data, https://betanalpha.github.io/assets/chapters_html/tree_diameter_growth_analysis.html. This analysis was used in an interactive workshop that I led this year.


In the third analysis I investigate the fairness of the warped die that I purchased for company swag, https://betanalpha.github.io/assets/chapters_html/die_fairness.html. Come for the application of Bayesian decision theory, stay for tabletop role-playing games and way more simplex geometry than you ever could have wanted.


Finally we have an analysis of retro video game speed-running, https://betanalpha.github.io/assets/chapters_html/racing.html. Here I use precise race times, rather than just winners or placements, to infer player skills and then use those inferences to inform rankings and subtle, counterfactual predictions.


PDF versions of these pieces and a wealth of other writing on probability and statistics are also available on my website, https://betanalpha.github.io/writing/.


Recently Available Talk


Curious about stepping beyond the confines of regression techniques but not sure how to proceed? Earlier this year I live-streamed a talk where I discussed the limitations of regression methods and how we can transcend those limitations with narratively generative modeling, all in the context of an explicit marketing example.


You can find a link to the livestream as well as a copy of the slides on Patreon, https://www.patreon.com/posts/98613437. This and other freely-available online talks can also be found on my website, https://betanalpha.github.io/speaking/.


Consulting and Training


If you are interested in consulting or training engagements, or even in commissioning me to create a presentation on a topic of interest, then don’t hesitate to reach out to me at inquiries@symplectomorphic.com.


Probabilistic Modeling Discord


I have a Discord server dedicated to discussion about (narratively) generative modeling of all kinds, https://discord.gg/BW9SnSZf.


Support Me on Patreon


If you would like to support my writing then consider becoming a patron, https://www.patreon.com/betanalpha.


Recent Rants


On Non-parametric Modeling


I hate to disappoint anyone but “non-parametric” and/or “asymptotic” are not synonyms for “I don’t have to check my modeling assumptions”. Indeed non-parametric and asymptotic arguments don’t often mix well in the actual world of finite data. Let’s discuss why.


Firstly for anyone who might not be already familiar with the terms, to what exactly do “non-parametric” and “asymptotic” refer?


In general “non-parametrics” refers to models of functional behavior, which can be applied to anything from the baseline function in a regression model to entire probability mass functions.


In a parametric model of functional behavior we start with a parametric family of functions, f(x; theta). Then any probabilistic model over the parameters, p(theta | alpha), induces a probabilistic model over function outputs by pushing the parameter model forward along the family,

  p(f(x) = z | alpha) = int d theta p(theta | alpha) delta( z - f(x; theta) ).
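

In practice this induced pushforward distribution is usually explored by simulation rather than by evaluating the integral directly. The following sketch uses a made-up quadratic family and normal parameter models, chosen purely for illustration, to approximate p(f(x) | alpha) at a fixed input by sampling parameters and evaluating the corresponding function outputs.

import numpy as np

# Hypothetical parametric family of functions, f(x; theta), with
# theta = (a, b) chosen purely for illustration.
def f(x, a, b):
    return a * x + b * x**2

# Probabilistic model over the parameters, p(theta | alpha), here
# independent normals with made-up hyperparameters alpha.
rng = np.random.default_rng(5838299)
S = 10000
a_samples = rng.normal(loc=1.0, scale=0.5, size=S)
b_samples = rng.normal(loc=0.0, scale=0.25, size=S)

# Pushing the parameter samples forward through the family yields
# samples from the induced model over function outputs at a fixed x.
x = 2.0
f_samples = f(x, a_samples, b_samples)

# Summarize the induced pushforward distribution p(f(x) | alpha).
print(np.mean(f_samples), np.percentile(f_samples, [10, 50, 90]))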


On the other hand non-parametric models of functional behavior specify a probabilistic model over function outputs, p(f(x) | alpha), directly. That model might be consistent with an intermediate parametric family of functions but it doesn’t have to be.


Note the inherent confusion in the terminology: “non-parametric” models do indeed feature parameters, they’re just not parameters that configure some explicit family of functions.


One of the potential advantages of non-parametric functional models is that they can be extremely flexible, capturing far more functional behaviors than any explicit parametric model that one might be able to construct.


So, what about asymptotics? Well asymptotics is the consideration of inferences in the presence of arbitrarily many, perfectly repeated observations from a fixed data generating process.  


The general argument is that if a model contains the true data generating process then in the limit of infinitely many observations inferences derived from that model will always converge to that truth, no matter what observations we happen to make.


In particular in this case we don’t lose much by characterizing our inferences with point estimates or intervals derived from information at a single point.


There are many reasons why asymptotic arguments have limited applicability in practice. For example in practice it’s rare that we can perfectly repeat a measurement a few times with the system being studied held static, let alone arbitrarily often.


One of the weakest aspects of asymptotic arguments, however, is the need for the model to contain the true data generating process. A linear model cannot converge to a non-linear truth no matter how many perfectly replicated observations we have.


This latter challenge is one reason why more and more people have been reaching for non-parametrics. The theoretical promise is that the flexibility of non-parametric functional models allows for the construction of probabilistic models that are sufficiently complex to capture even sophisticated true data generating processes and allow asymptotic arguments to be viable.


Of course in practice we can never actually achieve the asymptotic limit. The practical relevance of asymptotic arguments is not the limit itself but rather how well the asymptotic behaviors can approximate inferences derived from only a finite number of observations.


From this perspective the flexibility of non-parametric models is a double-edged sword. In general the more flexible a model is the more of those repeated observations we need to constrain its behaviors and allow the asymptotic approximations to be reasonably accurate.


Without enough data undesired functional behaviors in the non-parametric model will tend to overwhelm the more reasonable ones. This results in nasty likelihood functions that are extremely sensitive to minute details of the observed data and are nowhere near well-characterized by point estimates.


In other words while non-parametric models might make the asymptotic limit more robust they also push that limit further away from practical relevance!


Unsurprisingly these problems only worsen when the flexibility of the model is limited by components other than the functional behavior being captured by the non-parametric model. For example no matter how flexible the baseline function is, a regression model will always be limited by the assumption of no confounding behaviors.


None of this is to say that non-parametric models are not useful. Some non-parametric models are sufficiently interpretable that, with care, we can suppress undesired functional behaviors by construction, allowing smaller data sets to constrain the remaining, more reasonable functional behaviors.


This is, for example, one of the powerful features of Gaussian processes. For much more discussion on this point see https://betanalpha.github.io/assets/case_studies/gaussian_processes.html.
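

To make the idea concrete, here is a minimal sketch, assuming a squared-exponential covariance function with made-up hyperparameters, of how the hyperparameters of a Gaussian process prior constrain the functional behaviors of its draws; the marginal standard deviation bounds the overall variation while the length scale suppresses rapidly-varying behaviors.

import numpy as np

def se_cov(x, alpha, rho, jitter=1e-8):
    # Squared-exponential covariance matrix over the inputs x with
    # marginal standard deviation alpha and length scale rho.
    diffs = x[:, None] - x[None, :]
    K = alpha**2 * np.exp(-0.5 * (diffs / rho)**2)
    return K + jitter * np.eye(len(x))

rng = np.random.default_rng(9958284)
x = np.linspace(0, 10, 101)

# A long length scale suppresses rapidly-varying functional behaviors
# a priori, giving smooth prior draws.
K_smooth = se_cov(x, alpha=1.0, rho=3.0)
smooth_draws = rng.multivariate_normal(np.zeros(len(x)), K_smooth, size=5)

# A short length scale admits much wigglier prior draws.
K_wiggly = se_cov(x, alpha=1.0, rho=0.3)
wiggly_draws = rng.multivariate_normal(np.zeros(len(x)), K_wiggly, size=5)

# Compare the typical change between neighboring inputs.
print(np.mean(np.abs(np.diff(smooth_draws, axis=1))),
      np.mean(np.abs(np.diff(wiggly_draws, axis=1))))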


That said this kind of regularization is unfortunately still a modeling assumption, one that we have to acknowledge, take responsibility for, and validate instead of hand-waving away.


On Walking The Frequentist Walk


Maybe it’s asking for a bit too much, but I think it’s reasonable for those dismissing Bayesian methods in favor of frequentist methods to actually be using proper frequentist methods. Buckle in for a long thread about walking the frequentist walk.


Frequentist methods are based on _estimators_. An estimator is just a function that maps data to some numerical output, which may or may not be associable with some meaningful property of the system being studied, and hence may or may not actually “estimate” it. For example an estimator might output points in, or subsets of, the relevant model configuration space. A useful estimator would then output values close to the true model configuration, or subsets that contain the true model configuration. Estimators can also output values or subsets of meaningful quantities derived from the model configuration space. Again the utility of any given estimator is how well those outputs match, or “estimate” in some sense, the desired behavior.


Separating generic estimators from useful estimators is exactly where the “frequentist” in frequentist methodology comes into play. A frequentist calibration studies the output of an estimator across a range of potential observations, comparing those outputs to desired behaviors.


Formally we need to complement an estimator 

  hat{theta} : Y -> Z 

with an observational model p(y; theta) and a loss function 

  L : Z x Theta -> R

that quantifies how poor the output z in Z is when theta in Theta is the true model configuration.


Before we collect any data we don’t know what y will be. If we assume a true theta, however, then we can quantify the potential observations with the observational model p(y; theta). In particular we can compute an expected loss

  bar{L}(theta) = int dy p(y; theta) L( hat{theta}(y), theta ).


In practice we also don’t know what the true theta actually is, but we can scan across all possible theta to derive a _worst case_ expected loss,

  L* = max_{theta in Theta} bar{L}(theta).


If the worst case loss is acceptable in a given application then the estimator will provide good enough estimates no matter what data we observe! No sarcasm intended, this is a practically useful result. That said any claims of estimator adequacy rely on our computing the worst case loss accurately _and_ the true model configuration actually being in the assumed observational model. Both conditions are difficult to validate in practice.


Note also that once we embrace a strict frequentist interpretation of probability theory this kind of calculation is about all we can do to quantify estimator performance. In particular we are not allowed to consider any kind of average summary of bar{L}(theta)! Averaging is an operation in probability theory, and in the frequentist philosophy one cannot apply probability theory to spaces of “fixed” quantities like a model configuration space.
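

For concreteness, the kind of averaged summary that this rules out would be a prior-weighted expected loss along the lines of

  r = int d theta pi(theta) bar{L}(theta),

which treats the model configuration space as something that can carry a probability distribution pi(theta), an operation available only once we step outside of a strict frequentist interpretation.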


If you haven’t seen this full construction before then it might appear esoteric, but it is in fact the basis of all frequentist estimation methods. Combine a point estimator with a squared error loss function, for example, and you get mean squared error. Similarly combine a subset estimator with an exclusion loss function and you get coverage.
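

As a concrete demonstration of the full construction, the following sketch pairs a simple normal observational model, with a made-up sample size, with the textbook interval estimator and the exclusion loss. It estimates the expected loss by simulation on a grid of candidate theta values and then reports the worst case across the grid; one minus that worst case expected loss is the familiar worst-case coverage.

import numpy as np

# Observational model p(y; theta): N independent normal observations
# with unknown mean theta and known unit standard deviation.
N = 25
rng = np.random.default_rng(2938475)

# Subset estimator: the textbook 95% interval centered on the sample mean.
def interval_estimator(y):
    center = np.mean(y)
    half_width = 1.96 / np.sqrt(len(y))
    return center - half_width, center + half_width

# Exclusion loss: 1 if the interval misses the true theta, 0 otherwise.
def exclusion_loss(interval, theta):
    lower, upper = interval
    return float(not (lower <= theta <= upper))

# Estimate the expected loss bar{L}(theta) by simulation on a grid of theta.
thetas = np.linspace(-3, 3, 13)
S = 10000
expected_losses = []
for theta in thetas:
    losses = [ exclusion_loss(interval_estimator(rng.normal(theta, 1, N)), theta)
               for _ in range(S) ]
    expected_losses.append(np.mean(losses))

# Worst case expected loss across the scanned model configurations.
L_star = max(expected_losses)
print(f"Worst case expected exclusion loss: {L_star:.3f}")
print(f"Worst case coverage: {1 - L_star:.3f}")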


Because a frequentist calibration depends on an assumed observational model, not to mention the dangerous assumption that said observational model is detailed enough to exactly contain the true data generating process, the performance of an estimator is not universal. An estimator might be adequate or outright powerful in some circumstances and then terrible in others. Indeed one of the most common abuses of frequentist methods is the assumption that expected performance in one circumstance, especially an over-simplified one designed to facilitate the necessary calculations, generalizes to any other circumstance.


Instead of designing and calibrating their own estimators, people fall into the habit of treating pre-built estimators as black boxes, applying them in analyses without ever considering whether or not they will actually estimate anything well enough. When people can pick and choose amongst a whole selection of black boxes the reliability of the resulting analyses becomes even more suspect.


Personally I’m not a fan of limiting inference to estimators in the first place. I prefer an informative posterior distribution that allows me to consistently extract all kinds of insights _at the same time_. Specifically I prefer to propagate uncertainty everywhere instead of bottlenecking inferences through the lossy output of an estimator, allowing for inferences that are not only more informative but also more robust.


That said I am not against a good calibration! When dealing with a prospective analysis where the data has not yet been observed studying the range of potential posterior behaviors is extremely useful. For example when designing an experiment one might tune the, well, experimental design until the undesired posterior behaviors are rare if not absent entirely.


Frequentist calibrations, however, are too rigid for my tastes. In addition to using more qualitative performance criteria I want to be able to use domain expertise to constrain the range of model configurations I consider instead of having to resort to worst case behavior.


Fortunately all of these features can be accommodated in a Bayesian calibration, where we study how inferential outcomes vary across an entire prior predictive distribution. As a bonus all of the operations in a Bayesian calibration are probabilistic, which greatly facilitates practical implementations even with complicated models.
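

A minimal sketch of such a Bayesian calibration, assuming a simple conjugate normal model with made-up prior hyperparameters so that each posterior is available in closed form: sample a model configuration from the prior, simulate data from the corresponding observational model, construct the resulting posterior, and record whatever inferential outcomes are of interest, here whether a central posterior interval captures the simulated truth.

import numpy as np

rng = np.random.default_rng(4838282)

# Conjugate normal model with known unit observation noise and a
# made-up normal prior over the location parameter theta.
prior_mu, prior_sigma = 0.0, 2.0
N = 10
R = 1000  # number of prior predictive replications

captured = []
for _ in range(R):
    # Sample a model configuration from the prior and then data from
    # the corresponding observational model.
    theta = rng.normal(prior_mu, prior_sigma)
    y = rng.normal(theta, 1.0, size=N)

    # Closed-form conjugate posterior for theta given y.
    post_var = 1.0 / (1.0 / prior_sigma**2 + N)
    post_mu = post_var * (prior_mu / prior_sigma**2 + np.sum(y))
    post_sd = np.sqrt(post_var)

    # Record an inferential outcome: does the central 90% posterior
    # interval capture the simulated truth?
    lower = post_mu - 1.6449 * post_sd
    upper = post_mu + 1.6449 * post_sd
    captured.append(lower <= theta <= upper)

# Across prior predictive replications the capture frequency should be
# close to the nominal 90%.
print(f"Prior predictive capture frequency: {np.mean(captured):.3f}")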


We can even apply a Bayesian calibration to an estimator if we really wanted to! The sky is the limit when our mathematical tools are not constrained by philosophical restrictions and we have the full power of probability theory at our disposal.


For much more on frequentist estimators see Section 2 of https://betanalpha.github.io/assets/case_studies/modeling_and_inference.html#2_frequentist_inference.  I also write about Bayesian calibration in Section 3.3 of that same chapter, https://betanalpha.github.io/assets/case_studies/modeling_and_inference.html#33_bayesian_calibration.
