There is a particular kind of paper title that now arrives wearing a very large hat.

It has a foundation model in it. It has biology in it. It has, often enough, a diagram suggesting that genes, cells, patients, perturbations, pathways, and perhaps the better part of nature itself have been persuaded into a single latent space. The model is large. The pretraining corpus is larger. The abstract is confident. The benchmark table is, as benchmark tables tend to be, doing several things at once.

This is not to say that foundation models in biology are useless. That would be much too easy, and almost certainly wrong. Geneformer and scGPT, for example, are serious attempts to learn from tens of millions of single-cell transcriptomes, and the motivation is not absurd: biology is full of repeated structure, context dependence, and incomplete labels [1]. If a model could learn something general from such data, one would be unwise to object merely because the method is fashionable.

The trouble is that fashion has a way of arriving before tailoring.

A model trained on millions of cells may still fail to do the thing one quietly hoped it had learned. In perturbation prediction, a recent Nature Methods study compared several foundation and deep-learning models against deliberately simple baselines and found that none outperformed those baselines [2]. In zero-shot single-cell clustering, Geneformer and scGPT have been reported to perform poorly against conventional methods such as highly variable gene selection, Harmony, and scVI, and in some cases against random-weight versions of themselves [3]. Another benchmark found that single-cell foundation-model embeddings offered no consistent advantage for perturbation-effect prediction, particularly when the data shifted underfoot, as biological data have a nasty habit of doing [4].

There is something bracing about this. Not because failure is enjoyable, although in method development it is at least traditional, but because it returns the field to the old and useful question: compared with what?

Compared with a linear model. Compared with a carefully tuned baseline. Compared with a classifier that knows nothing of grand biological grammar but does know the train-test split. Compared with a representation so simple it arrives without parameters and, embarrassingly for the more elaborate machinery, performs rather well [5].
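
To make "compared with what" concrete, here is roughly what the dull end of that list looks like, in hypothetical Python: a perturbation "model" that predicts nothing more than the average expression shift seen across training perturbations, evaluated with the same metric and split one would apply to anything larger. The array names, shapes, and synthetic data are illustrative assumptions, not any benchmark's actual interface.

```python
import numpy as np

def mean_shift_baseline(control_expr, perturbed_expr_train):
    """Deliberately simple baseline: predict every held-out perturbation
    response as control expression plus the average shift observed
    across the training perturbations."""
    # control_expr: (n_genes,) mean expression of unperturbed cells
    # perturbed_expr_train: (n_train_perturbations, n_genes) mean expression
    #   per training perturbation
    mean_shift = (perturbed_expr_train - control_expr).mean(axis=0)
    return control_expr + mean_shift

def evaluate(pred_shift, observed_shift):
    """Pearson correlation of predicted vs. observed shift from control,
    the kind of metric perturbation benchmarks typically report."""
    return np.corrcoef(pred_shift, observed_shift)[0, 1]

# Hypothetical data: 2,000 genes, 50 training perturbations, one held out.
rng = np.random.default_rng(0)
control = rng.normal(size=2000)
train_responses = control + rng.normal(scale=0.3, size=(50, 2000))
held_out_response = control + rng.normal(scale=0.3, size=2000)

baseline_pred = mean_shift_baseline(control, train_responses)
print("baseline r:", evaluate(baseline_pred - control, held_out_response - control))
```

The point is not that this baseline is good science. The point is that any foundation model claiming to predict perturbation responses should be reported next to something this cheap, under the same split and the same metric.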

This is where the foundation-model story in biology differs from the one in natural language. A sentence has grammar in a way that a cell does not. A tissue is not a paragraph. A gene is not a word simply because both can be embedded. Biology has syntax, perhaps, but it is not one syntax. It is spatial, temporal, developmental, technical, pathological, and frequently under-sampled. It changes with dissociation protocol, batch, species, brain region, disease stage, post-mortem interval, and the unrecorded habits of whoever handled the tissue.

The model, naturally, is not told all of this. It is given the data.

This is why I distrust complexity that has not earned its keep. A transformer is not a sacrament. Attention is not interpretation. Pretraining is not biological understanding. A large model can memorize nuisance structure, learn the wrong invariances, improve a metric for reasons unrelated to mechanism, or become a very expensive way of finding the same signal that a simpler method would have found with less ceremony.

The genomic foundation-model literature is probably the more honest version of the story. DNA foundation models can perform well, especially on some human-genome classification tasks. But comprehensive benchmarking also shows unevenness: underperformance against simple CNNs on several multispecies and epigenetic tasks, trade-offs against specialized models, and unresolved interpretability problems [6]. That is not a verdict of uselessness. It is a verdict of specificity.
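
For concreteness, the "simple CNN" that keeps turning up as the awkward comparator in those benchmarks is often not much more than the following sketch: one-hot DNA in, class logits out. The filter count and kernel size here are arbitrary illustrative choices, not the settings of any particular study.

```python
import torch
import torch.nn as nn

class SimpleDNACNN(nn.Module):
    """A small convolutional baseline of the kind genomic benchmarks
    compare foundation models against."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels=4, out_channels=64, kernel_size=8),  # A/C/G/T channels
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over sequence length
            nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, sequence_length), one-hot encoded DNA
        return self.net(x)

# Hypothetical batch: 8 sequences of length 500.
logits = SimpleDNACNN()(torch.randn(8, 4, 500))
print(logits.shape)  # torch.Size([8, 2])
```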

And specificity is what biology keeps asking for.

The useful biological model is not the one with the most general name. It is the one that survives the boring questions. What is the baseline? What is the split? What leaks? What shifts? What does the model learn when labels are removed? Does it generalize to another cohort, another tissue, another platform, another lab? Does its interpretation remain stable when the preprocessing changes? Does the biological story still hold when the fashionable architecture is replaced by something dull?
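
Some of the boring questions can be asked in a dozen lines. A sketch, with made-up arrays standing in for real data: whatever the representation, foundation-model embedding or plain PCA, hold out whole donors rather than random cells, so the reported number measures generalization to a new cohort rather than leakage within an old one. GroupKFold and the variable names here stand in for whatever the actual study uses.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical setup: one embedding vector per cell, a label per cell,
# and a donor ID per cell. Holding out whole donors asks the question
# a random split quietly avoids.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 32))          # cell embeddings (from any model)
y = rng.integers(0, 2, size=3000)        # e.g. disease vs. control label
donors = rng.integers(0, 12, size=3000)  # cohort / donor membership

aucs = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=donors):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print("donor-held-out AUC:", np.mean(aucs))
```

If the number survives this split, and the next lab's cohort, the representation has begun to earn its keep.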

Dull is underrated. Dull is where many truths are kept.

I do not dislike biological foundation models because they are ambitious. I dislike the ambient presumption that scale itself is a kind of evidence. In biology, scale is often just a larger surface on which confounding can write. The model must still be made accountable to the tissue, the assay, the cohort, the baseline, and the question.

There is a good version of this future. It is not anti-foundation-model. It is anti-ornament. It uses large pretrained models where they reveal structure that simpler approaches miss, where the evaluation is severe, where the representation transfers under genuine distribution shift, and where interpretation does not collapse into heatmaps with better typography.

The rule is simple enough.

Build the large model, if the problem asks for one.
But make it earn the room it occupies.
— § —