There is a certain kind of question that appears, sooner or later, in computational biology.
“Yes, but what is the p-value?”
It arrives with the confidence of a lab staple. Like a pipette, or a freezer box, or the belief that every figure can be improved by adding one more panel. The p-value is familiar, compact, and socially legible. It gives the room something to point at. It has the tidy moral clarity of a traffic light: significant, not significant; proceed, do not proceed; publish, perhaps; perish, regrettably.
The problem is not that p-values are useless. They are not. The problem is that they are often asked to perform duties for which they were never trained.
A p-value does not tell us whether a hypothesis is true. It does not tell us whether an effect is large. It does not tell us whether a model is useful, whether a feature is biologically meaningful, whether an association will replicate, or whether the result deserves a place in the Discussion section with its shoes on. The American Statistical Association had to say this out loud, which is one of those facts about science that is both useful and faintly humiliating [1].
This becomes especially odd in large computational settings.
In a wet-lab experiment with modest sample size, a p-value may feel like a scarce object. In high-dimensional machine learning, it is often not scarce at all. Give a model enough edges, genes, voxels, cells, regions, ligand–receptor pairs, windows, folds, and derived features, and many things will become statistically incompatible with a null model, sometimes because they matter, sometimes because the sample size is enormous, sometimes because the observations are not as independent as the table politely suggests.
A p-value of 1 × 10⁻³⁰ may be impressive in the same way that a freight train is impressive. One should still ask where it is going.
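To make the freight train concrete, here is a small simulated sketch, assuming only numpy and scipy; the sample size and effect size are invented for illustration:

```python
import numpy as np
from scipy import stats

# Invented numbers: a negligible effect measured in an enormous sample.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)   # true correlation of roughly 0.02

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.1e}")
# Expect something like r ≈ 0.02 with p far below 1e-30: decisively
# "significant", while explaining a few hundredths of a percent of the variance.
```

The train is enormous. The cargo is a correlation of 0.02.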
In machine learning, the useful questions are often less ceremonial. How large is the effect? Does the model generalize? What happens under a different split? What happens under a different random seed? Does the effect survive a different cohort, a different atlas, a different preprocessing decision? Is the performance better than a simple baseline, or merely more expensive? Is the model learning biology, or is it learning the shape of the dataset’s accidents?
These questions do not fit as neatly into one number, which is partly why they are less popular.
Then there is multiple testing, that great administrative office of modern omics. False discovery rate correction is not the enemy. It is a reasonable response to an unreasonable number of opportunities to fool oneself. But in very high-dimensional settings, it can behave less like a scalpel and more like a municipal snowplow. When there are tens of thousands or millions of tests, weak but real associations can fail to survive correction, especially when effects are small, samples are limited, measurements are noisy, or hypotheses are correlated in ways the correction method only partly understands [2].
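The snowplow can be simulated. A rough sketch, assuming numpy, scipy, and statsmodels, with invented numbers: a few hundred weak but genuine effects buried among fifty thousand nulls, tested on a modest cohort:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_tests, n_per_group, n_real = 50_000, 20, 200

# Two groups of 20 samples; the first 200 features carry a real 0.5 SD shift.
group_a = rng.normal(size=(n_tests, n_per_group))
group_b = rng.normal(size=(n_tests, n_per_group))
group_b[:n_real] += 0.5

pvals = stats.ttest_ind(group_a, group_b, axis=1).pvalue
rejected = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]

print(f"nominally significant real effects (p < 0.05): {(pvals[:n_real] < 0.05).sum()} / {n_real}")
print(f"real effects surviving BH correction: {rejected[:n_real].sum()} / {n_real}")
# Expect a fair number of the genuine effects to clear an uncorrected 0.05,
# and almost none to survive correction across 50,000 tests.
```

The plow does not ask which effects were real. It asks only how much snow there is.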
This does not mean one should throw away FDR. That is the sort of thing a desperate person says at 2 a.m. while staring at a heatmap. It means one should stop pretending that “survives correction” and “biologically true” are synonyms, or that “does not survive correction” and “meaningless” are the same thing.
Biology is not usually kind enough to arrange itself into one decisive test.
This is especially true in human neurodegeneration. A familiar criticism of computational work is that it shows correlation, not causation. This criticism is often correct, and also, depending on the tone in which it is delivered, a little theatrical. Yes, cross-sectional post-mortem data cannot establish temporal causality. Yes, associations between gene expression, pathology, imaging, and cognition are not causal mechanisms by default. Yes, confounding exists. One is tempted to ask whether anyone thought the opposite.
The difficulty is not that computational researchers have forgotten causality. The difficulty is that humans are inconvenient experimental systems.
We cannot longitudinally sample living human brain tissue, region by region, cell type by cell type, every six months, while also collecting perfect imaging, cognition, pathology, proteomics, and life history, and then intervene on ligand–receptor pathways to see what happens. Ethics, biology, death, cost, and reality have conspired against this design.
So we work with partial views.
A cohort gives us cognition over time. Another gives us post-mortem tissue. Another gives us imaging. Another gives us cell atlases from a few donors. Another gives us structural connectivity. We align, harmonize, model, validate where possible, and interpret with restraint. The result is not causation in the clean experimental sense. But neither is it nothing. It is structured evidence under constraint.
The real sin is not correlation. The real sin is pretending correlation is more than it is, or less than it is, depending on which direction helps the argument.
There are, to be fair, many things in computational biology that deserve suspicion. Data leakage is one of them. Kapoor and Narayanan’s work on leakage in machine-learning-based science found that leakage has affected hundreds of papers across many fields, sometimes producing wildly overoptimistic results [3]. In biomedical machine learning, leakage has a particular talent for dressing itself as good performance. The same subject appears in train and test. Feature selection is done before splitting. Normalization quietly borrows information from the test set. Hyperparameters are tuned until the validation fold begins to resemble a collaborator.
The model then performs beautifully, which is to say, suspiciously.
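The classic offence is feature selection before splitting. A minimal sketch, assuming scikit-learn and purely synthetic noise, of the leaky version next to the version where selection stays inside the cross-validation pipeline:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: 60 "subjects", 5,000 features, labels with no real signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))
y = rng.integers(0, 2, size=60)

# Leaky: pick the "best" features using every sample, then cross-validate.
keep = SelectKBest(f_classif, k=20).fit(X, y).get_support()
leaky = cross_val_score(LogisticRegression(max_iter=1000),
                        X[:, keep], y, cv=5, scoring="roc_auc").mean()

# Clean: selection lives inside the pipeline and sees only training folds.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC ≈ {leaky:.2f}, clean AUC ≈ {clean:.2f}")
# On noise, the leaky estimate tends to land well above chance;
# the clean one hovers around 0.5, which is the honest answer.
```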
Random seeds are another small domestic matter that becomes less small when ignored. Stochastic training, random initialization, data shuffling, sampling, and augmentation can change performance and feature importance. Recent work has explicitly warned that random seeds can substantially influence machine-learning-based causal-effect estimation [4]. Yet many papers still report a single run, as though the seed were not part of the experiment but merely the weather.
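Treating the seed as part of the experiment is not expensive. A sketch, assuming scikit-learn and a synthetic stand-in dataset, of reporting the spread across seeds rather than a single run:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A synthetic stand-in for a small, noisy biomedical dataset.
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           random_state=0)

scores = []
for seed in range(10):
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

print(f"AUC over 10 seeds: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
# The exact numbers are beside the point; the spread belongs in the paper,
# not in the drawer.
```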
Then come the baselines. The baseline is the dull friend who ruins the dinner party by asking whether the transformer actually helped. A good baseline is not glamorous. It does not arrive with a diagram of attention heads and latent structure. It asks only whether logistic regression, ridge regression, random forests, XGBoost, PCA, PLS, k-nearest neighbours, or even a mean predictor did most of the work already.
This is an excellent question and therefore often inconvenient.
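Asking it costs a few lines. A sketch, assuming scikit-learn and a stand-in dataset, of the dull friend's question in code:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A stand-in dataset; the habit matters more than the data.
X, y = load_breast_cancer(return_X_y=True)

baselines = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}
for name, model in baselines.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC ≈ {auc:.3f}")
# Whatever the elaborate model reports should be read against these numbers,
# computed on the same splits, before anyone mentions attention heads.
```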
The broader problem is cultural. Traditional biology wants significance. Machine learning wants performance. Statistics wants assumptions. Causal inference wants identification. Clinical science wants generalization. Software engineering wants a repository that can be run by someone other than the person who wrote it during a fever dream. Computational biology sits in the middle, trying to satisfy all of them, usually with a dataset that is too small in subjects, too large in features, too expensive to repeat, and too messy to explain without several supplementary figures.
This is not a reason to be careless. It is the reason to be more careful.
For machine-learning work in biology, the little asterisk at the end of the model should not be a p-value alone. It should include the split, the baseline, the leakage checks, the seed variance, the code, the data provenance, the effect size, the uncertainty, the external validation if available, and a sober account of what the model cannot prove.
A p-value is not a passport.
It is, at best, one document in the folder.