Pharma pools proprietary data to train AI models without sharing secrets

Federated learning moves from proof-of-concept to industry adoption

In 2025, Eli Lilly launched TuneLab, a platform offering free access to the company's proprietary AI and machine learning models, which were trained on what Lilly calls "over a billion dollars in data" accumulated across decades of internal research. Partners who use the models are expected to contribute their own datasets to improve them. As of the Bio-IT World Conference 2026 keynote in Boston, the platform had attracted more than 75 partners operating across three continents and dozens of countries (company-reported).

In a separate announcement at the conference, Eli Lilly and Collaborative Drug Discovery (CDD) said they would integrate TuneLab into CDD Vault's core and AI modules, expanding the federated network's reach into CDD's customer base.

The technical foundation for these efforts is federated learning, a method that trains AI models on data stored in each partner's own secure environment, so proprietary data never leaves the organization. José-Tomás Prieto, PhD, director of AI programs at Apheris (which provides federated learning infrastructure), emphasized that implementation at scale requires "engineering rigor, data preparation without centralization, and enterprise-level deployment strategies," not plug-and-play tooling.
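For readers who want to see the mechanics, the training pattern described above can be sketched in a few lines of Python. This is a minimal illustration of federated averaging on a toy linear model, not how TuneLab or Apheris implement it. The function names, the synthetic data, and the learning-rate settings are all assumptions made for the example. The point it shows is that each partner trains locally and only model parameters, never raw records, are sent back for averaging.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """One partner's training pass on its private (x, y) pairs.

    Only the updated parameters are returned; the data stays local.
    """
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y  # prediction error on one record
            w -= lr * err * x
            b -= lr * err
    return w, b

def federated_average(updates, sizes):
    """Combine partner parameters, weighting each by its dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return w, b

def make_partner_data(n, x_lo, x_hi, rng):
    """Hypothetical private dataset: y = 2x + 1 plus small noise."""
    xs = [rng.uniform(x_lo, x_hi) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.01)) for x in xs]

def train(rounds=100, seed=0):
    rng = random.Random(seed)
    # Three partners, each covering a different slice of the input range.
    partners = [
        make_partner_data(40, 0, 1, rng),
        make_partner_data(60, 1, 2, rng),
        make_partner_data(20, -1, 0, rng),
    ]
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Each partner starts from the shared model and trains on its own data.
        updates = [local_update(global_weights, d) for d in partners]
        # The coordinator sees parameters and dataset sizes only.
        global_weights = federated_average(updates, [len(d) for d in partners])
    return global_weights

if __name__ == "__main__":
    w, b = train()
    print(f"shared model: w={w:.3f}, b={b:.3f}")
```

In this toy setup no single partner covers the full input range, yet the shared model recovers the underlying relationship from all three. Production systems add pieces this sketch omits, such as secure aggregation, access controls, and the data harmonization work discussed later in the article.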

Public data is biased; private data is proprietary. Federated learning splits the difference

AlphaFold and OpenFold, its open-source reimplementation, achieved landmark improvements in protein structure prediction but relied on public datasets. That reliance creates a structural problem: both models are accurate for targets close to their training data, but accuracy drops sharply for targets distant from it. Public datasets also skew toward well-characterized proteins and lack the diversity and quality of internal pharma datasets.

"You cannot model your way out of a data problem, and you can't buy this data either," Prieto said. Federated learning addresses this by allowing companies to pool data without centralizing it, training shared models on a much broader and higher-quality signal while keeping IP intact.

The other critical element is fine-tuning. Arman Zaribafiyan, PhD, head of strategic alliances at SandboxAQ, noted that foundation models achieve strong benchmark results but "fail to generalize to real drug discovery use cases." Fine-tuning models on project-specific data dramatically improves predictive accuracy. This means the payoff from federated platforms depends less on raw model size and more on how well partners can customize models to their own targets and workflows.

Build your federated learning capability now, not after your first failed deployment

Implementing federated learning at scale requires upfront investment in data harmonization and security review. Each partner company brings its own data standards, firewall rules, and compute constraints. Prieto noted that "data preparation without centralization is a new paradigm": each organization must harmonize its internal data to match the training setups used by other nodes in the network, a kind of work in which IT and data governance teams rarely have experience.

For biotech and pharma decision-makers, the strategic decision is not whether to adopt federated learning but which network to join and how early. TuneLab is currently focused on small molecules and antibody development, with additional models planned. Apheris supports the AI Structural Biology Network, which brings together several top-20 biopharma companies. Both require commitments to share data and contribute to model improvement.

The gap between benchmark performance and real-world drug discovery performance means selecting the right federated platform is now a competitive advantage. Companies that delay entry will train on data that other network members have already contributed, reducing the strategic value of joining.
