DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

📝 Paper Summary

Financial Large Language Models, Temporal Generalization, Backtesting Validity

DatedGPT is a series of language models trained on annually partitioned data to ensure predictions rely only on historically available information, preventing lookahead bias in financial backtesting.

Core Problem

Standard LLMs trained on internet-scale data have 'lookahead bias'—they effectively know the future outcomes of historical events (e.g., the 2008 crash), invalidating their use in financial backtesting and temporal forecasting.

Why it matters:

Financial backtesting requires strictly uncontaminated predictions; models exposed to future data yield falsely optimistic performance
Evaluating reasoning vs. memorization is impossible if the model has already 'seen' the event outcome during pretraining
Current LLMs lack explicit temporal cutoffs in their pretraining data, making them unsuitable for simulating historical decision-making

Concrete Example: If a model predicts the S&P 500 crash of September 29, 2008, after Congress rejected a bailout, it may be recalling the historical fact from its 2024 training data rather than analyzing the news sentiment available at that moment.

Key Novelty

Time-Aware Pretraining with Strict Annual Cutoffs

Training a separate model from scratch for each year (2013–2024), where the training data is strictly filtered to exclude any web pages crawled after that specific year
Curating instruction-tuning datasets that are also temporally filtered (using an LLM teacher to classify and remove time-sensitive future information) to maintain the cutoff integrity

Evaluation Highlights

Achieves an average score of up to 42.7 on general language understanding benchmarks (HellaSwag, PIQA, etc.), competitive with similarly sized models like TinyLlama-1.1B
Qualitative analysis demonstrates 'perplexity reversal'—the model's perplexity increases sharply on text from years after its cutoff, confirming it has not learned future information
Successfully filters time-sensitive instruction data (e.g., removing requests for TV scripts released after the cutoff) using Llama-3.3-70B-Instruct as a judge

Breakthrough Assessment

7/10

Addresses a critical methodological flaw in financial AI (lookahead bias) with a rigorous, brute-force engineering approach (12 separate models). While architecturally standard, the dataset curation contribution is significant for the finance domain.

⚙️ Technical Details

Problem Definition

Setting: Temporally grounded prediction in which the model M_t is trained only on the dataset D_t = {d | timestamp(d) <= t}, i.e., documents timestamped at or before the cutoff t

Inputs: Historical context (e.g., news headlines, earnings transcripts) available at time t

Outputs: Prediction of future outcome (e.g., stock return direction) without access to information from time > t
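
The set definition D_t = {d | timestamp(d) <= t} can be illustrated with a minimal Python sketch. It assumes that the cutoff t is the end of a calendar year and that each document carries a crawl timestamp; the `Document` class and `build_dated_corpus` function are hypothetical names for illustration, not the paper's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    text: str
    crawl_time: datetime  # assumed proxy for when the content became available


def build_dated_corpus(docs, cutoff_year):
    """Return D_t: every document crawled on or before the last instant of cutoff_year."""
    # Anything crawled before Jan 1 of the following year is inside the cutoff.
    first_excluded = datetime(cutoff_year + 1, 1, 1, tzinfo=timezone.utc)
    return [d for d in docs if d.crawl_time < first_excluded]


docs = [
    Document("a", datetime(2013, 6, 1, tzinfo=timezone.utc)),
    Document("b", datetime(2014, 12, 31, 23, 59, tzinfo=timezone.utc)),
    Document("c", datetime(2015, 1, 1, tzinfo=timezone.utc)),
]
d_2013 = build_dated_corpus(docs, 2013)
d_2014 = build_dated_corpus(docs, 2014)
```

Under this definition the corpora are nested: every document in D_2013 also appears in D_2014, and nothing crawled in 2015 enters either.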

Pipeline Flow

Time-Aware Data Filtering (Pretraining & Instruction Data)
Model Pretraining (Year-Specific Base Models)
Instruction Tuning (Year-Specific Chat Models)

System Modules

Data Curator (Input Processing)

Filters FineWeb-Edu based on crawl timestamps to create 12 distinct datasets (2013–2024)

Model or implementation: Script-based filtering

Instruction Filter (Input Processing)

Removes time-sensitive queries (e.g., TV show scripts) from general instruction datasets to prevent leakage

Model or implementation: Llama-3.3-70B-Instruct (as classifier)
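
A rough sketch of how such a judge-based filter could be wired up is shown here. The judge is modeled as a plain callable so that any classifier can be plugged in; the toy `year_mention_judge` is a stand-in heuristic, not the Llama-3.3-70B-Instruct prompt used in the paper, and the `instruction`/`output` field names are assumptions.

```python
import re
from typing import Callable, Dict, List

# A judge returns True when an example depends on post-cutoff information.
Judge = Callable[[str, int], bool]


def filter_instructions(examples: List[Dict[str, str]],
                        cutoff_year: int,
                        judge: Judge) -> List[Dict[str, str]]:
    """Keep only the examples the judge does not flag as leaking future information."""
    kept = []
    for ex in examples:
        text = ex["instruction"] + "\n" + ex["output"]
        if not judge(text, cutoff_year):
            kept.append(ex)
    return kept


def year_mention_judge(text: str, cutoff_year: int) -> bool:
    """Toy stand-in: flag text that mentions a four-digit year after the cutoff."""
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return any(y > cutoff_year for y in years)


examples = [
    {"instruction": "Explain compound interest.", "output": "Interest on interest."},
    {"instruction": "Summarize the 2019 season finale.", "output": "It ends with..."},
]
kept_2015 = filter_instructions(examples, 2015, year_mention_judge)
```

A keyword heuristic like this misses implicit references (for example, a show title with no year), which is presumably why the paper relies on an LLM classifier instead.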

DatedGPT-Base

Predicts next token based strictly on historical knowledge up to year T

Model or implementation: 1.3B parameter Transformer (Llama/GPT-2 architecture)

Novel Architectural Elements

Temporal Partitioning Architecture: A series of 12 distinct models rather than a single model, each creating a hard boundary for knowledge availability
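
With a series of year-specific models, a backtest has to decide which model to query for each prediction date. The sketch below takes a conservative reading: since a model with cutoff year Y may contain pages crawled through the end of Y, it picks the latest cutoff year strictly before the prediction date's year. This selection rule and the function name `select_cutoff_year` are assumptions for illustration, not a procedure stated in the paper.

```python
from datetime import date

# Cutoff years listed in the summary: 2013 through 2024 inclusive.
AVAILABLE_CUTOFF_YEARS = tuple(range(2013, 2025))


def select_cutoff_year(prediction_date, available=AVAILABLE_CUTOFF_YEARS):
    """Return the latest cutoff year whose data window closed before prediction_date's year."""
    candidates = [y for y in available if y < prediction_date.year]
    if not candidates:
        raise ValueError(f"no model has a cutoff before {prediction_date.year}")
    return max(candidates)


model_year = select_cutoff_year(date(2016, 5, 1))  # uses the 2015 model
```

Querying the 2016 model for a May 2016 event would be unsafe under this reading, because its corpus could include pages crawled later in 2016.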

Modeling

Base Model: Custom 1.3B parameter model (Llama/GPT-2 style architecture)

Training Method: Pretraining from scratch for each cutoff year, followed by Supervised Fine-Tuning (Instruction Tuning)

Training Data:

Pretraining: ~100B tokens per model year (FineWeb-Edu filtered by crawl date)
Instruction: Mixed general-domain (filtered) and finance-specific (news/earnings) data (~6,000 finance examples per year)
12 separate datasets for 12 models (2013, 2014, ..., 2024)

Key Hyperparameters:

parameter_count: 1.3B
pretraining_iterations: 25,000
pretraining_tokens: ~100 billion
fine_tuning_epochs: 3
fine_tuning_warmup: 10% linear warmup
scheduler: Cosine learning rate
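
The fine-tuning schedule listed above (10% linear warmup, then cosine decay) can be sketched as a single function of the step index. The peak learning rate, the floor `min_lr`, and the total step count are not given in this summary, so the values used here are placeholders.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Placeholder values: 1,000 steps and a peak of 1e-4 are assumptions.
schedule = [lr_at_step(s, total_steps=1000, peak_lr=1e-4) for s in range(1000)]
```

The rate rises for the first 100 steps, peaks at `peak_lr`, and then decays smoothly toward `min_lr` by the final step.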

Compute: 2,000 GPU hours on NVIDIA A100 GPUs per pretraining round (approx 24,000 total for 12 models)
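
The compute and token figures above can be cross-checked with a few lines of arithmetic. The totals follow directly from the reported numbers; the per-iteration and per-GPU-hour figures are derived here and are not reported in the summary, and the variable names are illustrative.

```python
NUM_MODELS = 12                 # one model per cutoff year, 2013 through 2024
GPU_HOURS_PER_MODEL = 2_000     # reported per pretraining round
PRETRAIN_TOKENS = 100e9         # reported as approximately 100 billion
PRETRAIN_ITERATIONS = 25_000    # reported

total_gpu_hours = NUM_MODELS * GPU_HOURS_PER_MODEL            # 24,000
tokens_per_iteration = PRETRAIN_TOKENS / PRETRAIN_ITERATIONS  # 4 million (derived)
tokens_per_gpu_hour = PRETRAIN_TOKENS / GPU_HOURS_PER_MODEL   # 50 million (derived)

print(total_gpu_hours, int(tokens_per_iteration), int(tokens_per_gpu_hour))
```

The implied batch of roughly 4 million tokens per iteration is only as precise as the approximate 100B-token figure.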

Comparison to Prior Work

vs. Standard LLMs: DatedGPT physically separates training runs by year to guarantee zero leakage of future events, whereas standard LLMs mix all data
vs. Retraining from scratch (General): Most retraining focuses on *updating* knowledge; DatedGPT focuses on *limiting* knowledge to specific historical snapshots
vs. BloombergGPT [not cited in paper]: BloombergGPT is a single large finance model; DatedGPT is a *series* of smaller models designed specifically for backtesting validity

Limitations

Relies on crawl timestamps, which may not reflect the true content creation date (e.g., a 1990s article crawled in 2015 is included in the 2015 model)
Smaller model scale (1.3B parameters) limits reasoning capabilities compared to larger frontier models
Computationally expensive to maintain as it requires training a distinct model from scratch for every new year

Reproducibility

Code: http://www.datedgpt.com

Code and model checkpoints are not yet released (promised upon paper acceptance). A web demo is available at www.datedgpt.com. The dataset relies on FineWeb-Edu (public) and on Bloomberg news and earnings transcripts (proprietary; commercial access is usually required).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on general language benchmarks and qualitative analysis of perplexity on future news

Benchmarks:

HellaSwag (Commonsense reasoning)
MMLU (Multitask academic knowledge)
IFEval (Instruction following)
Bloomberg News Headlines (Memorization/Perplexity probing) [New]

Metrics:

Accuracy
Perplexity
Average Score
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Training loss curves for the pretraining stage

Main Takeaways

DatedGPT models achieve an average score of up to 42.7 across standard benchmarks, showing that strict temporal filtering does not destroy general language capabilities compared to similarly sized models.
Models demonstrate 'perplexity reversal': they show low perplexity (high familiarity) with data prior to their cutoff and high perplexity (surprise) for data after their cutoff, confirming the effectiveness of the lookahead bias prevention.
Performance is consistent across different cutoff years (40.1 to 42.7 average), suggesting that the volume of data per year (~100B tokens) is sufficient for stable model performance regardless of the specific year.
The instruction-tuning stage uses a teacher-student approach (Llama-3.3-70B teacher) to create safe, date-aware instruction sets, ensuring the model doesn't learn future knowledge during fine-tuning.
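
The perplexity pattern described in the takeaways above can be checked with a small helper. The sketch assumes per-token natural-log probabilities are already available from the model and compares mean perplexity on text from before and after the cutoff; the function names and the per-year numbers are made up for illustration and are not results from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def post_cutoff_gap(ppl_by_year, cutoff_year):
    """Mean perplexity after the cutoff minus mean perplexity up to and including it."""
    before = [p for y, p in ppl_by_year.items() if y <= cutoff_year]
    after = [p for y, p in ppl_by_year.items() if y > cutoff_year]
    if not before or not after:
        raise ValueError("need at least one year on each side of the cutoff")
    return sum(after) / len(after) - sum(before) / len(before)


# Hypothetical per-year headline perplexities for a model with a 2016 cutoff.
hypothetical = {2015: 20.0, 2016: 22.0, 2017: 30.0, 2018: 34.0}
gap = post_cutoff_gap(hypothetical, cutoff_year=2016)
```

A positive gap is consistent with the model being less familiar with post-cutoff text, though it does not by itself prove the absence of leakage.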

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) pretraining
Financial backtesting principles (lookahead bias)
Instruction tuning

Key Terms

Lookahead Bias: The error of using information in a prediction model that would not have been known or available during the period being simulated

Perplexity: A measurement of how well a probability model predicts a sample; in this context, high perplexity on future text indicates the model hasn't 'seen' it yet

FineWeb-Edu: A high-quality dataset derived from Common Crawl web data, filtered for educational content, used here as the base pretraining corpus

Backtesting: The process of testing a predictive model on historical data to estimate its performance

Instruction Tuning: Fine-tuning a pretrained model on a dataset of (instruction, output) pairs to improve its ability to follow user commands

Crawl Date: The timestamp indicating when a web page was archived by a crawler; used here as a proxy for the information's availability date