Generative Representational Instruction Tuning

📝 Paper Summary

Modularized RAG pipeline Representation Learning Instruction Tuning

GRIT unifies generative and embedding capabilities into a single LLM by distinguishing tasks via instructions, achieving state-of-the-art performance on both without losing efficiency.

Core Problem

Current language models excel at either generation or embedding but not both; using generative models for embeddings yields poor performance, while embedding models lack generative capabilities.

Why it matters:

RAG pipelines currently require separate models for retrieval and generation, doubling memory overhead and complicating infrastructure
Using generative model hidden states for embeddings without specific tuning leads to poor retrieval performance
Separate endpoints for generation and embedding increase load balancing and storage complexity for API providers

Concrete Example: In a standard RAG setup, a user query must be processed by an embedding model to find context, then both query and context are processed by a separate generative model. This requires loading two large models and prevents caching computations between the retrieval and generation steps.

Key Novelty

Generative Representational Instruction Tuning (GRIT)

Trains a single LLM on both generative tasks (predict next token) and embedding tasks (contrastive loss on hidden states) simultaneously using distinguishing instructions
Allows the same model weights to act as a dense retriever (via embedding instructions) and a generator (via generative instructions), enabling caching of internal states between steps

Architecture

The unified training format for GRIT, showing how different instructions trigger different processing modes (Representation vs. Generation) within the same batch.

Evaluation Highlights

GritLM 7B sets a new state-of-the-art on the Massive Text Embedding Benchmark (MTEB) among open models (score 66.8), outperforming larger models like Llama 2 70B used for embeddings
Outperforms Llama 2 70B on generative tasks by >20% while matching embedding-only baselines
Speeds up RAG inference by >60% for long documents by caching shared computations between the retrieval and generation phases

Breakthrough Assessment

9/10

Successfully unifies two distinct paradigms (generation and embedding) into one model with SOTA results on both. significantly simplifies RAG architecture and improves efficiency.

⚙️ Technical Details

Problem Definition

Setting: Multi-task learning where a single model parameters $\theta$ must solve both generative tasks (causal language modeling) and embedding tasks (semantic vector representation)

Inputs: Natural language text sequence $x$ with an instruction prefix

Outputs: Either a generated text sequence (for generative tasks) or a vector representation $v \in \mathbb{R}^d$ (for embedding tasks)

Pipeline Flow

Input Processing (Instruction Formatting)
Unified Transformer Backbone
Task-Specific Head (Pooling for Embedding / LM Head for Generation)

System Modules

Instruction Formatter

Applies specific format tags to distinguish tasks

Model or implementation: Rule-based

Transformer Backbone

Processes tokens to produce hidden states

Model or implementation: Mistral 7B or Mixtral 8x7B

Embedding Head (Task Execution)

Aggregates hidden states into a single vector representation

Model or implementation: Mean Pooling

Generation Head (Task Execution)

Predicts next token probabilities

Model or implementation: Linear Layer (LM Head)

Novel Architectural Elements

Unified architecture capable of switching between Causal Attention (for generation) and Bidirectional Attention (for embedding) based on task instructions within the same model weights

Modeling

Base Model: Mistral 7B and Mixtral 8x7B

Training Method: Multi-objective fine-tuning

Objective Functions:

Purpose: Optimize embedding quality by pulling relevant pairs close and pushing negatives apart.

Formally: Contrastive loss with in-batch negatives: $L_{Rep} = -\frac{1}{M} \sum_{i=1}^M \log \frac{e^{\sigma(f(q_i), f(d_i)) / \tau}}{\sum_{j \in B} e^{\sigma(f(q_i), f(d_j)) / \tau}}$
Purpose: Optimize text generation capability.

Formally: Causal Language Modeling loss: $L_{Gen} = -\frac{1}{N} \sum_{i=1}^N \log P(x_i | x_{<i})$
Purpose: Joint optimization.

Formally: $L = \lambda_{Rep} L_{Rep} + \lambda_{Gen} L_{Gen}$

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (7B or 8x7B)

Training Data:

Embedding data: E5 dataset + S2ORC (scientific papers)
Generative data: Tülu 2 dataset (filtered)
Batch size: 2048 (embedding) / 256 (generative) for 7B model

Key Hyperparameters:

embedding_batch_size: 2048 (7B), 256 (8x7B)
generative_batch_size: 256
training_steps: 1253
+ 3 more
precision: BF16 (Mixed Precision)
pooling: Mean pooling of final hidden states (weighted)
max_seq_len_embedding: 512 (eval), up to 2048 (train)

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. SGPT: GRIT retains generative capabilities while SGPT typically loses them or requires adapters
vs. E5: GRIT is a unified model for both tasks; E5 is embedding-only
vs. Llama 2 (base): GRIT adds effective embedding capabilities lacking in base Llama 2
+ 1 more
vs. RetroMAE [not cited in paper]: RetroMAE uses auto-encoding for retrieval pre-training; GRIT uses instruction tuning on a generative base

Limitations

Higher computational cost during fine-tuning compared to single-objective training due to dual forward/backward passes
Embedding dimensionality (4096) is 4x larger than standard baselines (e.g., BGE Large's 1024), increasing storage costs
Performance on very long contexts (>512 tokens) for embeddings is not thoroughly benchmarked despite model capability

Reproducibility

Code: https://github.com/ContextualAI/gritlm

Code, models, and data scripts are publicly available at https://github.com/ContextualAI/gritlm. The paper details hyperparameters, loss configurations, and data sources (E5, Tülu 2) sufficient for replication.

📊 Experiments & Results

Evaluation Setup

Evaluation on massive embedding benchmark and standard generative tasks

Benchmarks:

MTEB (Embedding (Classification, Clustering, Retrieval, Reranking, etc.))
MMLU (Generative (Knowledge/Reasoning))
AlpacaEval (Generative (Instruction Following))
HumanEvalSynthesize (Generative (Code Generation))

Metrics:

MTEB Average Score
MMLU Accuracy
AlpacaEval Win Rate
GSM8k Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GritLM achieves state-of-the-art embedding performance among open models while maintaining strong generative capabilities.
MTEB	Average Score	64.2	66.8	+2.6
MTEB	Average Score	35.6	66.8	+31.2
MMLU	Accuracy (5-shot)	60.9	60.4	-0.5
GSM8k	Accuracy (8-shot)	51.1	57.5	+6.4
Average (Generative Tasks)	Average Score	44.6	50.1	+5.5
Ablations show that unified training matches separate training performance and bidirectional attention is crucial for embeddings.
MTEB	Average Score	66.5	66.8	+0.3
MTEB	Average Score	63.7	65.6	+1.9

Experiment Figures

Scatter plot comparing Open Generative Average vs. MTEB Average for various models (GritLM, Llama, Mistral, BGE, etc.).

Conceptual diagram of GRIT unifying Generative Instruction Tuning and Representational Instruction Tuning.

Main Takeaways

Unified models (GRIT) match the performance of specialized embedding-only and generative-only models, effectively enabling a 'free lunch' in terms of capabilities.
Bidirectional attention is critical for high-quality embeddings from decoder-only LLMs, outperforming purely causal attention.
RAG inference latency can be reduced by >60% for long documents because the unified model allows caching the encoded context (avoiding re-computation by a separate generator).
Generative performance of a base model is a better predictor of final finetuned embedding quality than its initial zero-shot embedding performance.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
Contrastive Learning (InfoNCE loss)
Instruction Tuning
Retrieval-Augmented Generation (RAG)

Key Terms

GRIT: Generative Representational Instruction Tuning—a method to train LLMs for both text generation and embedding capabilities simultaneously

MTEB: Massive Text Embedding Benchmark—a comprehensive suite of datasets for evaluating text embedding models

in-batch negatives: A contrastive learning technique where other samples in the same training batch serve as negative examples for a given query-document pair

bidirectional attention: Attention mechanism where tokens can attend to both past and future tokens (unlike causal attention which only looks back)

causal attention: Attention mechanism where tokens can only attend to previous tokens, standard in generative models like GPT

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

BF16: Bfloat16—a floating-point format that preserves the dynamic range of 32-bit floats but with lower precision, used to speed up training

Bi-Encoder: An architecture where query and document are encoded separately into vectors, allowing fast retrieval via dot product

Cross-Encoder: An architecture where query and document are processed together by the model to output a relevance score, more accurate but computationally expensive

KTO: Kahneman-Tversky Optimization—an alignment tuning method for language models based on human utility functions