TxGemma: Efficient and Agentic LLMs for Therapeutics

📝 Paper Summary

LLMs for Chemistry & Biology · Agentic AI for Scientific Discovery

TxGemma is a suite of efficient generalist LLMs and agentic systems fine-tuned on diverse therapeutic data to unify property prediction, reasoning, and external tool usage for drug development.

Core Problem

Therapeutic development relies on fragmented, costly experimental procedures or specialized narrow models, while existing generalist LLMs lack the domain-specific precision and up-to-date knowledge required for drug discovery.

Why it matters:

High attrition rates and costs in drug development require efficient prioritization of candidates early in the pipeline
Current tools fall into two camps: specialized models (accurate, but narrow black boxes) and general LLMs (conversational, but prone to hallucinating chemical properties)
Scientists need systems that can not only predict properties but also explain mechanistic reasoning and orchestrate complex multi-step workflows (e.g., retrieving data, transforming structures)

Concrete Example: Asked whether a specific molecule crosses the blood-brain barrier (BBB), a standard LLM might refuse or hallucinate an answer from general text. TxGemma-Chat correctly predicts 'crosses the BBB' and provides mechanistic reasoning based on lipophilicity and molecular weight, both inferred directly from the SMILES structure.

Key Novelty

TxGemma & Agentic-Tx

Fine-tunes Gemma-2 (2B, 9B, 27B) on 66 therapeutic tasks from the Therapeutics Data Commons (TDC) using instruction tuning to create robust property predictors (TxGemma-Predict)
Combines therapeutic data with general instruction data to create conversational models (TxGemma-Chat) that can reason about molecular structures
Wraps these models in an agentic system (Agentic-Tx) using the ReAct framework, allowing it to autonomously use tools (toxicity predictors, PubMed search, gene databases) to solve complex multi-step problems

Evaluation Highlights

TxGemma-27B-Predict outperforms or matches the state-of-the-art generalist model (Tx-LLM) on 64 out of 66 therapeutic tasks
Agentic-Tx (Gemini 2.5-Pro) achieves 84.5% on ChemBench-Mini, outperforming o3-mini (high) by 2.4% and GPT-4o by 12.5%
Agentic-Tx achieves 20.1% on Humanity's Last Exam (Chemistry & Biology), a 52.3% relative improvement over the previous best model, o3-mini (high)

Breakthrough Assessment

9/10

Significant leap in domain-specific agents. Achieves SOTA on very hard benchmarks (Humanity's Last Exam) and unifies high-performance property prediction with conversational reasoning in an open-weights model suite.

⚙️ Technical Details

Problem Definition

Setting: Multi-task therapeutic instruction tuning and agentic reasoning

Inputs: Natural language instructions combined with biochemical entities (SMILES, amino acid sequences, nucleotide sequences, disease names)

Outputs: Predicted properties (classification/regression), generated molecules (SMILES), or reasoned explanations

Pipeline Flow

User Query → Agentic-Tx (Router/Controller)
Decision: Answer directly OR Invoke Tool
Tool Execution (TxGemma-Predict, PubMed, etc.) → Observation
Reasoning Loop (ReAct) → Final Answer
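
The flow above can be sketched as a small ReAct-style loop. This is a minimal illustration, not the Agentic-Tx implementation: `toxicity_tool`, `scripted_controller`, and `react_loop` are hypothetical names, the tool is a stub standing in for a TxGemma-Predict call, and the controller is a hard-coded function where the real system would prompt an LLM.

```python
from typing import Callable, Dict, List, Tuple

def toxicity_tool(smiles: str) -> str:
    # Hypothetical stub; a real tool would query a property predictor.
    return f"predicted non-toxic for {smiles}"

# Hypothetical tool registry keyed by tool name.
TOOLS: Dict[str, Callable[[str], str]] = {"toxicity": toxicity_tool}

def scripted_controller(history: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM controller: returns (action, argument).

    It calls the toxicity tool once, then answers from the observation.
    """
    if not any(line.startswith("Observation:") for line in history):
        return "toxicity", "CCO"
    return "final", "The molecule is predicted to be non-toxic."

def react_loop(
    query: str,
    controller: Callable[[List[str]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate between controller decisions and tool observations."""
    history = [f"Query: {query}"]
    for _ in range(max_steps):
        action, arg = controller(history)
        if action == "final":
            # Decision: answer directly.
            return arg, history
        if action not in tools:
            history.append(f"Observation: unknown tool {action!r}")
            continue
        # Decision: invoke a tool and record what it returned.
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {tools[action](arg)}")
    return "No answer within step budget.", history

answer, trace = react_loop("Is ethanol toxic?", scripted_controller, TOOLS)
```

The `max_steps` cap is a design choice of this sketch to guarantee termination; the summary does not say how Agentic-Tx bounds its loop.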

System Modules

Agentic-Tx Controller

Central agent utilizing ReAct to plan, reason, and call tools

Model or implementation: Gemini 2.5 (and variants Gemini 2.0/1.5)

TxGemma-Predict

Specialized property prediction (e.g., toxicity, binding affinity)

Model or implementation: Gemma-2 (2B, 9B, 27B) fine-tuned on TDC

TxGemma-Chat

Conversational reasoning and explanation of molecular properties

Model or implementation: Gemma-2 (9B, 27B) fine-tuned on mixed TDC + General Chat data

Novel Architectural Elements

Hierarchical integration where specialized fine-tuned LLMs (TxGemma-Predict) serve as distinct tools for a generalist agent (Agentic-Tx)
Unified instruction-tuning format treating 66 diverse therapeutic tasks (small molecules, proteins, nucleic acids) as a single text-to-text problem
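
To make the text-to-text framing concrete, the sketch below renders a classification task and a regression task into the same prompt/target shape. The field names, the template, and both examples are assumptions for illustration; the summary does not give the exact TxGemma prompt format, and the regression target is a made-up placeholder value.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Example:
    instruction: str  # what the model is asked to do
    context: str      # background on the task
    question: str     # includes the entity (SMILES, sequence, ...)
    answer: str       # target rendered as text, even for regression

def to_text_pair(ex: Example) -> Tuple[str, str]:
    """Render any task as a (prompt, target) pair of plain strings."""
    prompt = (
        f"Instructions: {ex.instruction}\n"
        f"Context: {ex.context}\n"
        f"Question: {ex.question}\n"
        "Answer:"
    )
    return prompt, ex.answer

# Hypothetical classification example (caffeine SMILES).
bbb = Example(
    instruction="Answer the following question about drug properties.",
    context="The blood-brain barrier limits which drugs reach the brain.",
    question="Does CN1C=NC2=C1C(=O)N(C(=O)N2C)C cross the BBB? (A) no (B) yes",
    answer="(B)",
)

# Hypothetical regression example; the numeric target is a placeholder.
lipo = Example(
    instruction="Answer the following question about drug properties.",
    context="Lipophilicity affects absorption and distribution.",
    question="Predict the lipophilicity of CCO.",
    answer="0.42",
)

pairs = [to_text_pair(bbb), to_text_pair(lipo)]
```

Because both targets are strings, a single next-token objective can cover classification, regression, and generation tasks.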

Modeling

Base Model: Gemma-2 (2B, 9B, 27B parameters)

Training Method: Full fine-tuning (SFT)

Objective Functions:

Purpose: Standard language modeling loss.

Formally: Next-token prediction cross-entropy loss.
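
Written out, the standard next-token cross-entropy takes the form below. The notation is an assumption of this sketch: $x$ is the prompt, $y_1, \dots, y_T$ are the target tokens, and $\theta$ the model parameters. Whether the loss is computed over answer tokens only or over the full sequence is not stated in this summary.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```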

Training Data:

7,080,338 training examples from TDC
Conversational models used a mixture of 30% therapeutic data and 70% general instruction-tuning data
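
One way to realize a 30/70 mixture is to pick the source of each training example at random. The sketch below assumes the ratio applies per example (the summary does not say whether it is per example or per token), and `sample_mixture` and the toy data pools are hypothetical.

```python
import random
from typing import List

def sample_mixture(
    therapeutic: List[str],
    general: List[str],
    n: int,
    p_therapeutic: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Draw n examples, each from the therapeutic pool with probability
    p_therapeutic and from the general pool otherwise."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        source = therapeutic if rng.random() < p_therapeutic else general
        batch.append(rng.choice(source))
    return batch

# Toy pools standing in for the real datasets.
therapeutic_pool = ["tx-0", "tx-1", "tx-2"]
general_pool = ["chat-0", "chat-1", "chat-2"]

batch = sample_mixture(therapeutic_pool, general_pool, n=10_000)
fraction_tx = sum(item.startswith("tx-") for item in batch) / len(batch)
```

With enough draws, `fraction_tx` lands close to 0.3.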

Key Hyperparameters:

epochs: 12
total_tokens: 67 billion
hardware: 256 TPUv4 chips
replication: 8-way data replication
sharding: 4-way sequence sharding, 4-way model sharding

Compute: Median inference time for Agentic-Tx tools is 0.55 seconds (fastest 0.15s, slowest 28.2s)

Comparison to Prior Work

vs. Tx-LLM: TxGemma adds conversational/reasoning capabilities (Chat variants) and is released as open weights, whereas Tx-LLM is built on the closed PaLM-2
vs. MolE/LlaSMol: TxGemma covers a broader range of modalities (proteins, nucleic acids, cell lines) beyond just small molecules
vs. o1/o3-mini: Agentic-Tx integrates domain-specific tools (TxGemma-Predict) to ground reasoning, reducing hallucination in specialized tasks

Limitations

TxGemma-Chat models show reduced performance on pure regression tasks compared to Predict models (10-11% drop)
Conversational fine-tuning slightly degrades general knowledge performance (MMLU) compared to base Gemma-2
Agentic-Tx relies on closed-source Gemini models for the controller, limiting full open reproduction of the agentic system
Hallucinated reasoning is still possible with poorly constructed prompts (though chain-of-thought prompting reduces it)

Reproducibility

TxGemma models, trained on commercially licensed data, are released as open models on Hugging Face. No code repository is explicitly linked in the text. Training data is derived from the public Therapeutics Data Commons (TDC). Agentic-Tx relies on Gemini 2.5, which is available only through a closed-source API.

📊 Experiments & Results

Evaluation Setup

Comprehensive benchmarking on the Therapeutics Data Commons (TDC) and on reasoning benchmarks

Benchmarks:

Therapeutics Data Commons (TDC): 66 tasks (classification, regression, generation) across diverse entities
ChemBench: chemistry reasoning (Mini and Preference subsets)
Humanity's Last Exam (HLE): very hard multi-disciplinary reasoning (Chemistry & Biology subset)
GPQA Diamond: graduate-level scientific QA (Chemistry subset)

Metrics:

AUROC
Spearman Correlation
MAE (Mean Absolute Error)
Accuracy
Statistical methodology: Wilcoxon signed-rank test for aggregate model comparison; bootstrapping (1000 samples) for confidence intervals.
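
A bootstrap confidence interval with 1000 resamples can be sketched as follows. This is an illustration under assumptions: the percentile method, the 95% level, and the use of MAE as the resampled metric are choices of this sketch, and `bootstrap_ci` is a hypothetical helper rather than the authors' code.

```python
import random
from typing import Callable, Sequence, Tuple

def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between paired values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    metric: Callable[[Sequence[float], Sequence[float]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap: resample pairs with replacement, recompute
    the metric each time, and read off the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data for illustration only.
truth = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = [1.5, 1.5, 3.0, 4.5, 4.0]
low, high = bootstrap_ci(truth, preds, mae)
```

Resampling prediction/label pairs together, rather than each list independently, keeps every prediction matched to its own label.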

Key Results

TxGemma-27B-Predict demonstrates broad superiority over previous generalist models on the TDC benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
TDC (66 tasks aggregated)	Win Count vs Tx-LLM M	0	45	+45

TxGemma competes favorably with specialized models designed for specific tasks (e.g., MolE for small molecules).

Benchmark	Metric	Baseline	This Paper	Δ
TDC (Small Molecule Tasks)	Win/Tie Count	0	15	+15
Caco2 Wang (Pharmacokinetics)	MAE (lower is better)	0.329	0.401	+0.072
HIA Hou (Pharmacokinetics)	AUROC	0.984	0.988	+0.004

Agentic-Tx achieves state-of-the-art performance on difficult reasoning benchmarks, significantly outperforming top proprietary models.

Benchmark	Metric	Baseline	This Paper	Δ
Humanity's Last Exam (Chem & Bio)	Accuracy	13.2	20.1	+6.9
ChemBench (Mini)	Accuracy	82.5	84.5	+2.0
GPQA Diamond (Chemistry)	Accuracy	62.0	81.7	+19.7
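
The Δ column reports absolute differences, while the highlights quote relative gains. The short check below, with `relative_gain` as a hypothetical helper, shows how the two relate using the HLE figures (13.2 to 20.1) and the ChemBench figures (82.5 to 84.5) from this summary.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Humanity's Last Exam (Chem & Bio): +6.9 points absolute.
hle_gain = relative_gain(13.2, 20.1)

# ChemBench: +2.0 points absolute.
chembench_gain = relative_gain(82.5, 84.5)

print(f"HLE relative gain: {hle_gain:.1f}%")
print(f"ChemBench relative gain: {chembench_gain:.1f}%")
```

Rounded to one decimal place, these come out to 52.3% and 2.4%, matching the relative figures quoted for o3-mini (high) in the Evaluation Highlights.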

Main Takeaways

TxGemma-Predict effectively unifies 66 therapeutic tasks, outperforming previous generalists (Tx-LLM) and matching specialists, proving the viability of a single model for diverse biological entities.
Agentic-Tx demonstrates that wrapping LLMs with domain-specific tools (TxGemma-Predict) and search capabilities significantly boosts performance on hard reasoning tasks (HLE, GPQA) compared to raw LLMs (even o1/o3-mini).
There is a trade-off between conversational ability and regression precision: TxGemma-Chat loses ~10% performance on property prediction vs. TxGemma-Predict but gains the ability to explain reasoning.
Data efficiency: Fine-tuning TxGemma on downstream tasks (e.g., Clinical Trial Adverse Events) requires less data than fine-tuning base models to reach comparable performance.

📚 Prerequisite Knowledge

Prerequisites

Machine Learning in Drug Discovery (molecular representations)
Large Language Model Fine-tuning (Instruction Tuning)
Agentic Frameworks (ReAct)

Key Terms

SMILES: Simplified Molecular Input Line Entry System—a text notation for representing chemical structures

TDC: Therapeutics Data Commons—a large benchmark suite of datasets for drug discovery tasks

ReAct: Reason+Act—a paradigm where agents generate reasoning traces and task-specific actions (tool calls) in an interleaved manner

IC50: Half maximal inhibitory concentration—a measure of the potency of a substance in inhibiting a specific biological or biochemical function

AUROC: Area Under the Receiver Operating Characteristic curve—a performance metric for classification problems at various threshold settings

MMLU: Massive Multitask Language Understanding—a benchmark designed to measure knowledge acquired during pretraining

MAE: Mean Absolute Error—a measure of errors between paired observations expressing the same phenomenon

Fingerprints: Bit-vector representations of molecular structure (e.g., Morgan fingerprints) used for similarity search
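
As an illustration of fingerprint-based similarity, the sketch below compares two fingerprints represented as sets of "on" bit indices. Tanimoto similarity is a common choice for this and is an assumption here, since the summary does not name the measure; the bit indices are made up, and real Morgan fingerprints would come from a cheminformatics toolkit.

```python
from typing import Set

def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits over total on-bits."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / len(a | b)

# Made-up on-bit indices for two hypothetical molecules.
fp_query = {1, 2, 3}
fp_candidate = {2, 3, 4}

similarity = tanimoto(fp_query, fp_candidate)  # 2 shared / 4 total = 0.5
```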

Instruction Tuning: Fine-tuning a pre-trained language model on a collection of formatted datasets (instruction, input, output) to improve its ability to follow tasks

Agentic-Tx: The proposed system that uses an LLM as a controller to orchestrate tools (like property predictors and web search) to solve complex therapeutic problems