
TxGemma: Efficient and Agentic LLMs for Therapeutics

Eric Wang, Samuel Schmidgall, Paul F. Jaeger, Fan Zhang, Rory Pilgrim, Yossi Matias, Joelle K. Barral, David Fleet, Shekoofeh Azizi
Google DeepMind, Google Research
arXiv.org (2025)
Agent · Reasoning · Benchmark · Pretraining

📝 Paper Summary

LLMs for Chemistry & Biology · Agentic AI for Scientific Discovery
TxGemma is a suite of efficient generalist LLMs and agentic systems fine-tuned on diverse therapeutic data to unify property prediction, reasoning, and external tool usage for drug development.
Core Problem
Therapeutic development relies on fragmented, costly experimental procedures or specialized narrow models, while existing generalist LLMs lack the domain-specific precision and up-to-date knowledge required for drug discovery.
Why it matters:
  • High attrition rates and costs in drug development require efficient prioritization of candidates early in the pipeline
  • Current tools are split between specialized models (accurate, but narrow black boxes) and general LLMs (conversational, but prone to hallucinating chemical properties)
  • Scientists need systems that can not only predict properties but also explain mechanistic reasoning and orchestrate complex multi-step workflows (e.g., retrieving data, transforming structures)
Concrete Example: When asked whether a specific molecule crosses the blood-brain barrier (BBB), a standard LLM might refuse to answer or hallucinate an answer from general text. TxGemma-Chat correctly predicts 'crosses the BBB' and gives mechanistic reasoning based on lipophilicity and molecular weight, both derived directly from the molecule's SMILES string.
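To make the example concrete, a property query of this kind can be posed as a multiple-choice prompt built around the SMILES string. The sketch below is illustrative only: the template wording, the function names `build_bbb_prompt` and `parse_choice`, and the answer labels are assumptions, not the prompt format used in the paper.

```python
import re

# Hypothetical answer labels for a binary BBB-permeability question.
LABELS = {"A": "does not cross the BBB", "B": "crosses the BBB"}


def build_bbb_prompt(smiles: str) -> str:
    """Format a multiple-choice BBB question around a SMILES string.

    The template text is an assumption made for illustration.
    """
    return (
        "Instructions: Answer the following question about drug properties.\n"
        "Question: Given a drug SMILES string, predict whether it\n"
        "(A) does not cross the BBB (B) crosses the BBB\n"
        f"Drug SMILES: {smiles}\n"
        "Answer:"
    )


def parse_choice(completion: str) -> str:
    """Map the first standalone 'A' or 'B' in a model completion to its label."""
    match = re.search(r"\b([AB])\b", completion)
    if match is None:
        raise ValueError(f"no answer choice found in: {completion!r}")
    return LABELS[match.group(1)]


prompt = build_bbb_prompt("CCO")  # ethanol, used here only as a short SMILES
print(prompt)
print(parse_choice("Answer: (B)"))
```

The prompt would be sent to whichever model is being queried; only the formatting and answer parsing are shown here.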
Key Novelty
TxGemma & Agentic-Tx
  • Fine-tunes Gemma-2 (2B, 9B, 27B) with instruction tuning on 66 therapeutic tasks from the Therapeutics Data Commons (TDC) to create robust property predictors (TxGemma-Predict)
  • Combines therapeutic data with general instruction data to create conversational models (TxGemma-Chat) that can reason about molecular structures
  • Wraps these models in an agentic system (Agentic-Tx) built on the ReAct framework, which autonomously calls tools (toxicity predictors, PubMed search, gene databases) to solve complex multi-step problems
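The ReAct pattern mentioned above alternates model-generated reasoning and tool calls with tool observations until the model emits a final answer. The following is a minimal sketch of that loop, not the Agentic-Tx implementation: the `Action: name[arg]` text format, the scripted stand-in for the LLM, and the placeholder `toxicity` tool are all assumptions chosen so the example runs without any external service.

```python
import re
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def react_loop(llm: Callable[[str], str], tools: Dict[str, Tool],
               question: str, max_steps: int = 5) -> str:
    """Run a think/act/observe loop until the model gives a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # model writes a Thought plus an Action or Final Answer
        transcript += step + "\n"

        final = re.search(r"Final Answer:\s*(.*)", step)
        if final:
            return final.group(1).strip()

        action = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if action is None:
            transcript += "Observation: could not parse an action.\n"
            continue

        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"  # fed back on the next step
    return "No answer within step budget."


def make_scripted_llm(steps: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: replays fixed responses in order."""
    remaining = iter(steps)
    return lambda transcript: next(remaining)


calls: List[str] = []


def toxicity_tool(smiles: str) -> str:
    """Placeholder tool; a real system would call a trained predictor."""
    calls.append(smiles)
    return "low predicted toxicity (placeholder value)"


llm = make_scripted_llm([
    "Thought: I should check toxicity first.\nAction: toxicity[CCO]",
    "Thought: The tool reports low risk.\nFinal Answer: likely non-toxic",
])
answer = react_loop(llm, {"toxicity": toxicity_tool}, "Is CCO toxic?")
print(answer)
```

Swapping the scripted function for a real model call and registering further tools (for example a literature search) leaves the loop itself unchanged, which is the main appeal of the pattern.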
Evaluation Highlights
  • TxGemma-27B-Predict outperforms or matches the state-of-the-art generalist model (Tx-LLM) on 64 out of 66 therapeutic tasks
  • Agentic-Tx (Gemini 2.5-Pro) achieves 84.5% on ChemBench-Mini, outperforming o3-mini (high) by 2.4% and GPT-4o by 12.5%
  • Agentic-Tx achieves 20.1% on Humanity's Last Exam (Chemistry & Biology), a 52.3% relative improvement over the previous best model, o3-mini (high)
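Because the Humanity's Last Exam result is reported as a relative improvement, it helps to see what the two stated figures imply together. The sketch below back-computes the baseline score from 20.1% and 52.3%; the baseline is derived from those two numbers and is not quoted from the paper, and the function names are illustrative.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def implied_baseline(new: float, rel_pct: float) -> float:
    """Baseline score implied by a new score and a relative improvement (percent)."""
    return new / (1.0 + rel_pct / 100.0)


baseline = implied_baseline(20.1, 52.3)
print(round(baseline, 1))         # → 13.2 (implied o3-mini (high) score, in percent)
print(round(20.1 - baseline, 1))  # → 6.9 (absolute gap, in percentage points)
```

A 52.3% relative gain therefore corresponds to roughly 7 percentage points in absolute terms, which is worth keeping in mind when comparing it with the ChemBench-Mini figures.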
Breakthrough Assessment
9/10
A significant leap in domain-specific agents. Agentic-Tx achieves SOTA on very hard benchmarks such as Humanity's Last Exam, and the open-weights model suite unifies high-performance property prediction with conversational reasoning.