
DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

Yutong Yan, Raphael Tang, Zhenyu Gao, Wenxi Jiang, Yao Lu
The Chinese University of Hong Kong, CUHK Business School, University College London, Centre for Artificial Intelligence
arXiv (2026)
Pretraining Factuality Benchmark

πŸ“ Paper Summary

Financial Large Language Models · Temporal Generalization · Backtesting Validity
DatedGPT is a series of language models trained on annually partitioned data to ensure predictions rely only on historically available information, preventing lookahead bias in financial backtesting.
Core Problem
Standard LLMs trained on internet-scale data suffer from 'lookahead bias': they effectively know the future outcomes of historical events (e.g., the 2008 crash), invalidating their use in financial backtesting and temporal forecasting.
Why it matters:
  • Financial backtesting requires strictly uncontaminated predictions; models exposed to future data yield falsely optimistic performance
  • Evaluating reasoning vs. memorization is impossible if the model has already 'seen' the event outcome during pretraining
  • Current LLMs lack explicit temporal cutoffs in their pretraining data, making them unsuitable for simulating historical decision-making
Concrete Example: If a model predicts the S&P 500 crash of September 29, 2008, after Congress rejected a bailout, it may be recalling the historical fact from its 2024 training data rather than analyzing the news sentiment available at that moment.
Key Novelty
Time-Aware Pretraining with Strict Annual Cutoffs
  • Training a separate model from scratch for each year (2013–2024), where the training data is strictly filtered to exclude any web pages crawled after that specific year
  • Curating instruction-tuning datasets that are also temporally filtered (using an LLM teacher to classify and remove time-sensitive future information) to maintain the cutoff integrity
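The annual-cutoff partitioning described above can be sketched as a simple filter over crawl dates. This is a minimal illustration, not the paper's pipeline; the record fields ("text", "crawl_date") and the helper names are hypothetical.

```python
from datetime import date

def partition_by_cutoff(docs, cutoff_year):
    """Keep only documents crawled in or before cutoff_year (strict cutoff)."""
    return [d for d in docs if d["crawl_date"].year <= cutoff_year]

def build_annual_corpora(docs, years=range(2013, 2025)):
    """One strictly filtered corpus per model year (2013-2024), one model each."""
    return {y: partition_by_cutoff(docs, y) for y in years}

# Toy corpus with illustrative crawl dates
docs = [
    {"text": "pre-crisis news", "crawl_date": date(2013, 5, 1)},
    {"text": "mid-decade blog", "crawl_date": date(2018, 3, 9)},
    {"text": "post-2020 page", "crawl_date": date(2023, 11, 2)},
]
corpora = build_annual_corpora(docs)
# The 2013 corpus sees only the first document; the 2024 corpus sees all three.
```

The key property is that each year's model is trained only on its own partition, so no gradient update ever touches text crawled after its cutoff.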
Evaluation Highlights
  • Achieves average scores of up to 42.7 on general language understanding benchmarks (HellaSwag, PIQA, etc.), competitive with similarly sized models such as TinyLlama-1.1B
  • Qualitative analysis demonstrates 'perplexity reversal': the model's perplexity rises sharply on text from years after its cutoff, confirming it has not learned future information
  • Successfully filters time-sensitive instruction data (e.g., removing requests for TV scripts released after the cutoff) using Llama-3.3-70B-Instruct as a judge
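The 'perplexity reversal' check above reduces to comparing perplexity, exp of the negative mean token log-probability, on pre- versus post-cutoff text. A minimal sketch, with hypothetical per-token log-probs standing in for real model outputs:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs from a cutoff-2015 model:
lp_pre = [-1.2, -0.9, -1.1]    # text written before the cutoff
lp_post = [-3.5, -4.0, -3.8]   # text written after the cutoff

# Reversal: the model is far more surprised by post-cutoff text.
assert perplexity(lp_post) > perplexity(lp_pre)
```

In practice this comparison would be run across the model series, checking that each year's model shows the jump exactly at its own cutoff.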
Breakthrough Assessment
7/10
Addresses a critical methodological flaw in financial AI (lookahead bias) with a rigorous, brute-force engineering approach (12 separate models). While architecturally standard, the dataset curation contribution is significant for the finance domain.