Our Take
The field has spent five years optimizing algorithms while overlooking two problems: correlational public datasets cannot answer causal questions, and animal models are poor predictors of human fibrosis outcomes.
Why it matters
As FDA signals openness to replacing animal testing with human models and organoids, companies that can generate reproducible, causal human data at scale will own the translation gap between prediction and clinical success. This matters now because the regulatory window is opening.
Do this week
Drug discovery teams: audit your training datasets for causal signal versus correlation, and if you're relying on animal models for fibrosis or liver toxicity, map out a 12-month transition to primary human tissue experiments paired with sequencing.
Ochre Bio and Lexogen built the bottleneck that matters
Ochre Bio, a biotech company focused on liver disease, partnered with Lexogen (an RNA sequencing specialist) to generate what both companies describe as one of the world's largest human liver functional genomics datasets. The dataset centers on large-scale gene perturbation experiments in primary human hepatocytes across multiple donors and disease states (company-reported).
Rather than treat disease as a uniform phenomenon, the team deliberately introduced donor variability into the experimental design. They also chose to work with actual human tissue: cultured diseased liver tissue in Asia and organ perfusion systems in New York that keep whole human livers alive. The goal was not to generate quantity, but to map causal relationships—what happens when specific genes are knocked out in multiple human liver donors under different disease conditions.
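To make the design concrete, here is a minimal sketch of why knocking out the same gene in several donors under several disease states is useful: each donor serves as its own control, so large baseline differences between donors cancel out of the estimated effect. Everything in it is hypothetical, including the donor names, the readout values, and the `paired_effects` helper. It is not Ochre Bio's or Lexogen's pipeline or data.

```python
from statistics import fmean

# Hypothetical fibrosis-marker readouts, for illustration only.
# Key: (donor, disease_state) -> (control_readout, knockout_readout)
readouts = {
    ("donor_A", "healthy"): (10.0, 9.0),
    ("donor_A", "fibrotic"): (30.0, 21.0),
    ("donor_B", "healthy"): (14.0, 13.0),
    ("donor_B", "fibrotic"): (42.0, 33.0),
    ("donor_C", "healthy"): (8.0, 7.0),
    ("donor_C", "fibrotic"): (24.0, 15.0),
}

def paired_effects(readouts):
    """Average the within-donor change (knockout minus control) per disease state."""
    deltas = {}
    for (_donor, state), (control, knockout) in readouts.items():
        deltas.setdefault(state, []).append(knockout - control)
    return {state: fmean(values) for state, values in deltas.items()}

effects = paired_effects(readouts)
for state, effect in effects.items():
    print(f"{state}: mean knockout effect = {effect:+.1f}")
```

In this toy data the fibrotic baselines range from 24 to 42 across donors, yet the within-donor knockout effect is the same in each. A pooled comparison that ignored donor identity would have to detect that effect against the much larger donor-to-donor spread.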
"We are obsessed with complexity," said Quin Wills, CEO of Ochre Bio. "We do not believe that you can appropriately model very complex processes like fibrosis in very simple cell culture models or inappropriate animal models."
The partnership illustrates a deliberate choice: the sequencing partner is not interchangeable. Lexogen's role was to ensure the dataset remained clean, comparable across samples, and fit for computational interpretation. Data quality, experimental reproducibility, and workflow consistency determined whether the final dataset would be usable for machine learning.
Algorithms without causal data are pattern-matching, not drug discovery
The field has built consensus around AI's potential in early drug discovery: hypothesis generation, target prioritization, and screening have all benefited from machine learning. But a hard gap remains between computational prediction and clinical translation. Ochre Bio and others are arguing that the bottleneck is not algorithmic sophistication—it is the quality and provenance of the biological data itself.
Many public biological datasets are fundamentally correlational. An AI model trained on correlation can identify statistical associations humans miss, but it cannot answer the causal questions drug discovery demands: if a gene is modified, what follows? If a protein is inhibited, does disease regress? Animal models, the historical gold standard for these questions, often fail to predict human outcomes. Liver fibrosis is a clear example: mouse models do not reliably predict fibrosis regression in humans.
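The difference between correlation and causation here can be shown with a small simulation. The sketch below is a toy model under loudly stated assumptions: a hidden factor (labeled `inflammation`) drives both a gene's expression and a fibrosis score, while the gene itself has no effect on fibrosis. The variable names, noise levels, and the `simulate` function are invented for illustration and do not describe any real dataset.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def simulate(n, knock_out_gene=False, seed=0):
    """Toy model: a hidden confounder drives both gene expression and fibrosis."""
    rng = random.Random(seed)
    gene, fibrosis = [], []
    for _ in range(n):
        inflammation = rng.gauss(0, 1)  # hidden confounder (assumption)
        # Knockout forces expression to zero regardless of the confounder.
        g = 0.0 if knock_out_gene else inflammation + rng.gauss(0, 0.5)
        # Fibrosis depends only on the confounder, never on the gene.
        f = inflammation + rng.gauss(0, 0.5)
        gene.append(g)
        fibrosis.append(f)
    return gene, fibrosis

gene_obs, fib_obs = simulate(5000)
_, fib_ko = simulate(5000, knock_out_gene=True)

observational_r = pearson(gene_obs, fib_obs)
knockout_effect = statistics.fmean(fib_ko) - statistics.fmean(fib_obs)

print(f"observational correlation: {observational_r:.2f}")
print(f"mean fibrosis change after knockout: {knockout_effect:+.3f}")
```

In the observational data the gene looks strongly associated with fibrosis, and a model trained on it would likely rank the gene as a target. The simulated knockout shows essentially no change in fibrosis, which is the kind of answer only perturbation data can give.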
In April 2025, the FDA announced a plan to reduce and potentially replace some animal testing requirements with new methodologies, including AI-based computational models, cell lines, and organoids. That regulatory shift creates both pressure and opportunity: the field will need human-derived data to replace preclinical animal studies, but generating that data reproducibly at scale remains hard.
The Ochre-Lexogen collaboration suggests a structural shift in competitive advantage. If algorithms become more accessible (and they are), proprietary biological data and the ability to generate it reproducibly may become the defensible moat. Companies and partnerships that tightly integrate experimental biology, RNA sequencing, quality control, and computational analysis will have leverage. Those that treat these steps as separate silos will not.
Verify the causal signal in your training data
If your drug discovery workflow relies on public datasets or simplified cellular models, ask whether your training data can answer causal questions or only correlational ones. If you are using animal models for fibrosis, toxicity, or any complex tissue-level process where human translation has historically failed, map a path toward primary human tissue experiments with paired transcriptomics.
The regulatory environment is moving in this direction. Being early on the human-data path will matter when animal model waivers become standard rather than exceptional.