Scaling AI-Driven Drug Nominations from 250 to 7,000 Compounds

Context

Early 2023 presented a defining challenge: Montai’s AI models could predict activity across millions of compounds, but manual library creation processes were bottlenecked at ~250 compounds per program. The central question wasn’t whether AI could generate predictions — it was whether we could build a scalable system that maintained scientific rigor while expanding the search space 20×.

Facts:

Baseline: Hundreds of compounds, chosen manually for screening from within the existing library
Stakes: Scale 10× to 100× per program, enabled by bioactivity ML models
Environment: Early-stage biotech, unproven concept
My role: First data science/product hire, architected pipeline

The challenge

How do you architect a multi-objective decision system that provides an optimal starting point for drug discovery funnels and is understandable to all the key decision-makers at the organization?

This was a multi-faceted problem: ML scientists wanted maximum chemical diversity, medicinal chemists needed synthetic feasibility, and program leads needed to ensure they didn’t waste their team’s time on a batch of inactive compounds. Each stakeholder brought valid constraints, so the pipeline needed to serve all groups while delivering high-quality recommendations that would bolster confidence in the AI approach.

The solution

I owned:

End-to-end ‘nomination’ pipeline architecture: Our team constructed an orchestrated SQL pipeline that enabled fully traceable, human-interpretable nominations
ML integration strategy (model outputs → analytics)
Data quality framework for predictions
Phased rollout strategy (MVP → scale → quality)

I influenced:

Model selection criteria (with Duminda Ranasinghe, Lead ML Scientist)
Nomination criteria per program (with Jake Ombach, Computational Biologist)
External data partnerships (XtalPi, vendor libraries)

Decision Frame

Problem statement:

Build a data pipeline that scales compound nominations 20× while maintaining or improving experimental hit rates, constrained by:

Unproven AI prediction quality
Limited ML engineering capacity (small team)
Need for rapid learning cycles (not perfect first try)

Options considered:

Option A: Conservative scale (500-1,000 nominations)

Pros: Lower risk, manageable if quality issues emerge
Cons: May miss opportunities in vast chemical space
Risk: Underutilize AI capability, slow learning

Option B: Aggressive scale (10,000+ immediately)

Pros: Maximum coverage of chemical space
Cons: Drowning in low-quality candidates, scientist overload
Risk: Lost trust if hit rate crashes, wasted experimental capacity

Option C: Phased scaling with quality gates

Pros: Learn at each phase, adjust criteria, build confidence
Cons: More complex coordination, requires patience
Risk: Slower initial progress

Decision: Chose Option C because it let us sequence the work in phases (facts from archaeology, p.20-21):

Phase 1 (2023): Baseline nomination (prove concept with known compounds)
Phase 2 (2024): Expand to 1,000+ (maximize learning via generative + commercial)
Phase 3 (late 2024): Quality filters (diversity, confidence thresholds)

Sequencing logic: Prove → scale → refine (not perfect upfront)
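
To make the Phase 3 idea concrete, here is a minimal sketch of a quality gate that combines a confidence threshold with a diversity cap and records a readable reason for every decision. It is illustrative only: the production pipeline was SQL, and the names used here (Candidate, scaffold, apply_quality_gates, the 0.7 threshold, the cap of 2 per scaffold, the CMP-00x IDs) are assumptions, not the actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    compound_id: str
    scaffold: str      # hypothetical chemotype/cluster label used for diversity
    confidence: float  # hypothetical model confidence in [0, 1]


def apply_quality_gates(candidates, min_confidence=0.7, max_per_scaffold=2):
    """Return (nominated, audit_log); each decision carries a plain-text reason."""
    nominated, audit_log = [], []
    per_scaffold = {}
    # Rank by confidence so the diversity cap keeps the strongest members of each scaffold.
    for c in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        if c.confidence < min_confidence:
            reason = f"rejected: confidence {c.confidence:.2f} < {min_confidence:.2f}"
        elif per_scaffold.get(c.scaffold, 0) >= max_per_scaffold:
            reason = f"rejected: scaffold {c.scaffold} already has {max_per_scaffold} nominations"
        else:
            per_scaffold[c.scaffold] = per_scaffold.get(c.scaffold, 0) + 1
            nominated.append(c)
            reason = "nominated"
        audit_log.append((c.compound_id, reason))
    return nominated, audit_log


# Made-up example data, not real compounds or scores.
candidates = [
    Candidate("CMP-001", "A", 0.92),
    Candidate("CMP-002", "A", 0.88),
    Candidate("CMP-003", "A", 0.81),
    Candidate("CMP-004", "B", 0.75),
    Candidate("CMP-005", "C", 0.40),
]
nominated, audit_log = apply_quality_gates(candidates)
for compound_id, reason in audit_log:
    print(compound_id, reason)
```

The audit log is the point of the sketch: a chemist or program lead can see why any compound was or was not nominated, which is the traceability property the pipeline was built around.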

Constraints:

Pipeline latency: Manual processes took 2-3 days → needed automation
Data volume: 10M compounds (2023) → 258M predictions (2024)
Team capacity: Solo data lead initially, growing to 5-6 by 2024

Outcome

Primary outcome:

Scaled from 250 → 6,500+ nominations per program (26× increase) while improving hit-to-lead rates:

TNFR1: 27% hit-to-lead (vs ~5% baseline)
Multiple programs advanced to lead optimization faster

The significance: This wasn’t just volume scaling — quality improved alongside throughput. By implementing phased quality gates and leveraging diverse data sources (generative models, commercial libraries, XtalPi partnerships), we validated that AI-driven discovery could outperform manual selection. This became Montai’s core operational advantage and a key narrative for fundraising.

Metrics:

Nomination throughput: 250 → 5,000-7,000 per program (Q4 2024)
Hit-to-lead conversion: 5% → 15-30% range
Pipeline latency: 2-3 days → same-day updates
Model predictions: 10M → 258M compounds scored
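
The multipliers quoted in this case study are simple ratios of the figures above. A small sketch (the helper name fold_change is hypothetical) shows how the 20× target, the 26× headline, and the Q4 2024 range relate to the 250-compound baseline:

```python
def fold_change(before, after):
    """Ratio of the new value to the baseline value."""
    return after / before


baseline_nominations = 250

# Nomination throughput: low end, headline figure, and high end of the range.
for after in (5_000, 6_500, 7_000):
    print(f"{baseline_nominations} -> {after}: {fold_change(baseline_nominations, after):.1f}x")

# Model predictions scored: 10M -> 258M.
predictions_fold = fold_change(10_000_000, 258_000_000)
print(f"predictions: {predictions_fold:.1f}x")

# Hit-to-lead conversion in percentage points: 5% -> 15-30%.
hit_to_lead_low = fold_change(5, 15)
hit_to_lead_high = fold_change(5, 30)
print(f"hit-to-lead: {hit_to_lead_low:.0f}x to {hit_to_lead_high:.0f}x")
```

Read this way, the 5,000-7,000 range corresponds to 20× to 28× the baseline, with the 6,500+ headline at 26×.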

Guardrails maintained:

Nomination quality didn’t degrade with scale (improved filters offset volume)
Scientist time per compound review stayed manageable (self-service dashboards)
Data infrastructure handled 10× load increase without major incidents (until STAT6 - separate case)

Second-order effects:

Enabled XtalPi partnership analysis (build vs buy decision)
Created reusable pipeline for future programs (MARS, cACN v2/v3)
Demonstrated industrialized discovery to investors (fundraising narrative)

Limitations acknowledged:

Diminishing returns emerged at ~5K nominations (quality > quantity phase needed)
Generated Anthrologs initially unusable (separate failure/pivot story)
Pipeline automation never fully complete (some manual triggers remained)

Reflection

What I’d do differently:

Looking back at the archaeology of this work, three decisions stand out as ones I would revisit:

Start with tighter nomination criteria earlier (wasted effort on low-probability compounds in Phase 2)
Invest in monitoring infrastructure upfront (reactive vs proactive on data quality)
Engage chemists more in generative model development (synthetic feasibility blind spot)

The first two were classic “move fast and learn” tradeoffs that I would still defend in hindsight: we needed the volume data to understand quality needs. The third was a genuine miss: treating synthetic feasibility as a post-generation filter rather than baking it into model training cost us months.

What this taught me about decision-making:

Three principles emerged that I’ve since applied consistently:

Phased rollouts with learning gates beat perfect upfront design — you can’t architect your way out of uncertainty, you build to learn
Volume metrics mislead without quality tracking — nomination count was a vanity metric until we paired it with hit-to-lead conversion
Stakeholder confidence requires visible iteration — scientists trusted the pipeline because they saw us adjust criteria based on their feedback, not because the first version was perfect

How this informs future decisions:

These lessons directly shaped my approach to subsequent projects:

Always define success criteria per phase before executing — the Learning Agenda framework codified this
Build quality frameworks alongside feature development, not after crises — the STAT6 incident reinforced this
Balance exploration (maximize learning) with exploitation (optimize known strategies) — this framing now guides my portfolio thinking

Factual Evidence Citations:

Drug Project Overviews.xlsx (nomination counts per program)
2024 Mid-Year Review (pipeline automation goals)
Quantitative Outcomes Inventory p.30 (metrics table)
Product Strategy case study p.20-21 (phasing logic)