May 2024 Digest
Apr 29, 2024 2:09 pm
Consulting and Training
If you are interested in consulting or training engagements then don’t hesitate to reach out to me at inquiries@symplectomorphic.com.
Upcoming Courses
Last chance to register for the first modules of my 2024 remote courses, https://events.eventzilla.net/e/principled-bayesian-modeling-with-stan-2138610063!
Upcoming Office Hours
On Tuesday May 28th at 1:30 PM EDT I’ll be hosting an office hours live stream where I’ll field questions on any statistics topic, including but not limited to visualization, model building, model checking, model implementation, and statistical computation. For more details on the live stream, including how to ask questions, see https://www.patreon.com/posts/103241539.
Recent Writing
Did you forget your keys? Did you leave your oven on? Did you include all of the observations in your analysis?! In my latest chapter I discuss how to model observations obscured by a selection process.
HTML: https://betanalpha.github.io/assets/chapters_html/modeling_selection.html
PDF: https://betanalpha.github.io/assets/chapters_pdf/modeling_selection.pdf
Models with a selection process introduce nasty integrals which are often difficult, if not impossible, to evaluate in practice. Consequently they are often referred to as “intractable” models.
In this chapter I show how these intractable integrals arise and discuss some of the, sadly frustrating, strategies for approximating these integrals, and hence implementing these models in practice.
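To give a rough sense of where these integrals come from, here is a minimal Python sketch, not taken from the chapter, that pairs a hypothetical normal observational model with a hypothetical logistic selection function; the normalization of the selected density requires an integral with no closed form, which one-dimensional quadrature can still handle.

import numpy as np
from scipy import stats, integrate

# Hypothetical ingredients (not the chapter's example): a normal
# observational model and a logistic selection function that keeps
# larger values with higher probability.
mu, sigma = 1.0, 2.0

def selection_prob(y):
    return stats.logistic.cdf(y, loc=0.0, scale=0.5)

# The density of the selected observations requires the normalization
#   Z(mu, sigma) = integral of S(y) * normal(y | mu, sigma) over all y,
# which has no closed form for this selection function.
Z, _ = integrate.quad(
    lambda y: selection_prob(y) * stats.norm.pdf(y, mu, sigma),
    -np.inf, np.inf
)

def selected_log_density(y):
    # log p(y | mu, sigma, selected)
    return (np.log(selection_prob(y))
            + stats.norm.logpdf(y, mu, sigma)
            - np.log(Z))

print(Z, selected_log_density(2.0))

One-dimensional quadrature like this is cheap; the trouble starts when the same normalization has to be computed over a many-dimensional observation space, which is where the “intractable” label comes from.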
Unlike many of my pieces this one doesn’t end with a satisfying conclusion. Selection models in more than a few dimensions are just fundamentally difficult, no matter how often they arise in real analyses. You can’t always get what you want and what not.
Support Me on Patreon
If you would like to support my writing then consider becoming a patron, https://www.patreon.com/betanalpha. Right now covector+ supporters have early access to a new case study (and another in the not too distant future).
Probabilistic Modeling Discord
I’ve recently started a Discord server dedicated to discussions about (narratively) generative modeling of all kinds, https://discord.gg/W2QVJaV6.
Recent Rants
On Walking Backwards From Regression
The topic of uncertain or incomplete covariate observations comes up every now and then in social media threads, but I don't have any long-form writing on it at the moment. That said, the basic concepts are pretty straightforward.
Technically a full regression model is specified by the joint probability distribution
p(y, x | theta) = p(y | x, theta) p(x | theta).
Assuming no confounders this becomes
p(y, x | eta, gamma) = p(y | x, eta) p(x | gamma).
If all we care about is predicting the missing y accompanying an observed x then we can drop the marginal covariate model p(x | gamma) and focus entirely on the conditional variate model p(y | x, eta). With some assumptions about the form of p(y | x, eta) we get regression/curve fitting.
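For concreteness, here is a minimal Python sketch, with a hypothetical linear conditional model and a hypothetical normal covariate model, of how the joint density factors into these two pieces; predicting y at an observed x touches only the conditional term.

from scipy import stats

def conditional_log_density(y, x, eta):
    # p(y | x, eta): hypothetical linear model with normal noise,
    # eta = (alpha, beta, sigma).
    alpha, beta, sigma = eta
    return stats.norm.logpdf(y, alpha + beta * x, sigma)

def marginal_log_density(x, gamma):
    # p(x | gamma): hypothetical normal covariate model, gamma = (mu, tau).
    mu, tau = gamma
    return stats.norm.logpdf(x, mu, tau)

def joint_log_density(y, x, eta, gamma):
    # p(y, x | eta, gamma) = p(y | x, eta) * p(x | gamma)
    return conditional_log_density(y, x, eta) + marginal_log_density(x, gamma)

# Predicting y at an observed x uses only the conditional model;
# the marginal covariate model drops out entirely.
eta, gamma = (0.5, 1.2, 0.3), (0.0, 1.0)
x_new = 0.8
print(joint_log_density(1.3, x_new, eta, gamma), eta[0] + eta[1] * x_new)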
If we are interested in anything else then we have to go back to the joint model and confront p(x | gamma). For example, if the x are not observed "perfectly" then we need to model the observations with a second data generating process and integrate it into the joint model.
p(y_obs, x_obs, x | eta, gamma, lambda)
= p(y_obs | x, eta)
* p(x_obs | x, lambda)
* p(x | gamma).
Because the variate model conditions on the latent x, and not on x_obs, the marginal covariate model p(x | gamma) can no longer be safely ignored.
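Concretely, and under the same kind of hypothetical normal assumptions as above, this joint density might be sketched in Python as a log density over the latent x, which now has to be inferred along with the parameters.

from scipy import stats

def joint_log_density(y_obs, x_obs, x, eta, gamma, lam):
    # p(y_obs, x_obs, x | eta, gamma, lambda)
    alpha, beta, sigma = eta  # conditional variate model parameters
    mu, tau = gamma           # marginal covariate model parameters
    # lam: hypothetical measurement noise scale for the covariate observation
    return (stats.norm.logpdf(y_obs, alpha + beta * x, sigma)  # p(y_obs | x, eta)
            + stats.norm.logpdf(x_obs, x, lam)                 # p(x_obs | x, lambda)
            + stats.norm.logpdf(x, mu, tau))                   # p(x | gamma)

print(joint_log_density(y_obs=1.1, x_obs=0.9, x=0.8,
                        eta=(0.5, 1.2, 0.3), gamma=(0.0, 1.0), lam=0.1))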
Similarly if x is multidimensional and we observe only some components, x = (x_obs, x_miss), then we have to consider the model
p(y_obs, x_obs, x_miss | eta, gamma)
= p(y_obs | x_obs, x_miss, eta)
* p(x_obs | x_miss, gamma)
* p(x_miss | gamma).
In particular _all_ imputation models are implicitly of this form, although the actual assumptions made about p(x_obs | x_miss) are often obscured and in many cases also mathematically inconsistent. This is why some imputation algorithms may not “converge”.
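As a sketch of what that implicit structure looks like, again with hypothetical normal assumptions for every component, the missing components become latent quantities in the joint log density.

from scipy import stats

def joint_log_density(y_obs, x_obs, x_miss, eta, gamma):
    # p(y_obs, x_obs, x_miss | eta, gamma)
    alpha, beta_obs, beta_miss, sigma = eta  # conditional variate model parameters
    mu_obs, mu_miss, tau = gamma             # hypothetical covariate model parameters
    return (
        # p(y_obs | x_obs, x_miss, eta)
        stats.norm.logpdf(y_obs, alpha + beta_obs * x_obs + beta_miss * x_miss, sigma)
        # p(x_obs | x_miss, gamma): hypothetical coupling between the components
        + stats.norm.logpdf(x_obs, mu_obs + 0.5 * x_miss, tau)
        # p(x_miss | gamma)
        + stats.norm.logpdf(x_miss, mu_miss, tau))

# Any imputation of x_miss is implicitly an inference under some such joint
# model; marginalizing x_miss out of this density averages the regression
# over the assumed covariate model instead of plugging in a single guess.
print(joint_log_density(y_obs=1.4, x_obs=0.7, x_miss=0.2,
                        eta=(0.5, 1.0, -0.8, 0.3), gamma=(0.0, 0.0, 1.0)))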
Again the basic model structure is straightforward. Having to model p(x), let alone p(x_obs | x) or p(x_obs | x_miss), however, can seem overwhelming compared to regression and its many simplifying assumptions.
This illusory gap only gets worse when we consider confounding, where parameters shared between the different model components couple the behavior of the conditional variate model p(y | x, theta) and the marginal covariate model p(x | theta) together.
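A minimal sketch of that coupling, with a hypothetical scale parameter shared between the two components, so that observations of both y and x inform it.

from scipy import stats

def joint_log_density(y, x, alpha, beta, s):
    # The shared scale s appears in both the conditional variate model and
    # the marginal covariate model, so the two components no longer decouple:
    # covariate observations inform the regression noise, and vice versa.
    return (stats.norm.logpdf(y, alpha + beta * x, s)  # p(y | x, theta)
            + stats.norm.logpdf(x, 0.0, s))            # p(x | theta)

print(joint_log_density(y=1.2, x=0.4, alpha=0.5, beta=1.1, s=0.8))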
In my opinion this is one reason why teaching regression early is more deleterious than beneficial. It locks people into a fragile simplicity which then makes it too easy to dismiss more elaborate models as unnecessarily complex, regardless of the actual application details.
On Data Locomotion
Data does not drive anything. Inferences -- including numerical summaries, visualizations, mathematical quantifications of uncertainty, etc -- drive insights, understanding, and decisions, and those inferences are informed by data _only in the context of modeling assumptions_.
Data does not speak for itself. Without an assumed context for its provenance, data is just meaningless numbers. Explicit or implicit modeling assumptions establish that context, allowing very talkative inferences to take form.
You can empirically center data all you want, but without enveloping that core with modeling assumptions you can't connect it to meaningful consequences. Only by contrasting those assumptions against the observed data can we realize threads of insight from data.
Data is important, but we cannot utilize it without modeling assumptions. If you don't think that you're making assumptions then you're just relying on implicit assumptions, and implicit assumptions are unverifiable assumptions.
It's only by acknowledging, communicating, and critiquing our assumptions that we can separate anecdote from insight.