InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiao-wen Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhen Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, et al.
Shanghai AI Laboratory, SenseTime Group, The Chinese University of Hong Kong, Fudan University
arXiv.org (2024)
Topics: Pretraining, RL, Memory, Benchmark

📝 Paper Summary

Keywords: Large Language Model, Pre-training, Long-context modeling, RLHF, Alignment
InternLM2 is an open-source LLM series (1.8B to 20B parameters) that achieves strong performance and a 200k-token context window through a multi-stage training pipeline combining long-context pre-training with a novel Conditional Online RLHF alignment strategy.
Core Problem
Replicating the capabilities of proprietary models like GPT-4 in open-source remains difficult due to challenges in data processing, scaling context length efficiently, and aligning models with conflicting human preferences.
Why it matters:
  • Many open-source models struggle with long-context tasks critical for RAG and agents
  • Existing technical reports often omit crucial details on pre-training data preparation and long-context extension strategies
  • Standard RLHF often suffers from reward hacking and difficulty reconciling diverse human preference distributions
Concrete Example: In standard RLHF, a model might learn to hack the reward by generating safe but unhelpful responses, or fail to balance conflicting preferences (e.g., creativity vs. precision). InternLM2 addresses this with Conditional Online RLHF, which manages these conflicts explicitly.
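The core idea of a conditional reward model can be sketched as a single scorer that is steered by a condition-specific system prompt, one per preference dimension. The sketch below is illustrative only, assuming a generic text-to-scalar scorer; the condition prompts, function names, and toy scorer are hypothetical, not the paper's implementation.

```python
# Minimal sketch of a conditional reward model in the spirit of COOL RLHF:
# one reward model scores (prompt, response) pairs under different preference
# "conditions" (e.g., helpfulness vs. safety) by prepending a condition-specific
# system prompt. All names here are illustrative.

CONDITION_PROMPTS = {
    "helpful": "Judge the response by how helpful and informative it is.",
    "safety": "Judge the response by how safe and harmless it is.",
}

def conditional_reward(score_fn, condition, prompt, response):
    """Score a response under a specific preference condition.

    score_fn stands in for the learned reward model: it maps a single
    text string to a scalar score.
    """
    if condition not in CONDITION_PROMPTS:
        raise ValueError(f"unknown condition: {condition}")
    conditioned_input = (
        CONDITION_PROMPTS[condition] + "\n" + prompt + "\n" + response
    )
    return score_fn(conditioned_input)

# Toy scorer: rewards refusals under "safety", penalizes them under "helpful",
# mimicking how one model can encode two conflicting preference distributions.
def toy_score(text):
    refused = "I cannot help" in text
    if "safe and harmless" in text:   # safety condition was prepended
        return 1.0 if refused else 0.0
    return 0.0 if refused else 1.0

print(conditional_reward(toy_score, "safety", "How do I pick a lock?",
                         "I cannot help with that."))   # refusal rewarded
print(conditional_reward(toy_score, "helpful", "How do I pick a lock?",
                         "I cannot help with that."))   # refusal penalized
```

The same response receives opposite scores depending on the condition, which is exactly the conflict a single unconditional reward model cannot express.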
Key Novelty
Conditional Online RLHF (COOL RLHF) and 200k Context Extension
  • Introduces COOL RLHF (Conditional Online RLHF), using a conditional reward model to reconcile conflicting preferences and multi-round PPO to mitigate reward hacking
  • Implements a progressive long-context training strategy: starting with 4k context, transitioning to high-quality 32k-context data, and using positional encoding extrapolation to reach 200k-context inference capability
  • Develops InternEvo, a training framework optimizing 4D parallelism (data, tensor, sequence, pipeline) to achieve high Model FLOPs Utilization at scale
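For RoPE-based models, positional-encoding extrapolation of the kind described above is commonly achieved by enlarging the rotary base so rotation frequencies slow down and distant positions stop aliasing. A minimal sketch of that mechanism follows; the specific base values are illustrative assumptions, not quoted from the report.

```python
def rope_inv_freq(head_dim, base):
    """Inverse rotation frequencies for Rotary Position Embedding (RoPE).

    Each dimension pair (2i, 2i+1) rotates at angular frequency
    base**(-2i/head_dim). Increasing `base` slows the low-frequency
    components, the common recipe for extending context length.
    """
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Illustrative bases: a typical short-context setting vs. a much larger
# base of the kind used for long-context extension.
short_ctx = rope_inv_freq(128, 10_000)
long_ctx = rope_inv_freq(128, 1_000_000)

# At position 200,000, the slowest frequency accumulates far less phase
# under the larger base, so long-range positions remain distinguishable.
pos = 200_000
print("short-context base, radians at 200k:", short_ctx[-1] * pos)
print("long-context base,  radians at 200k:", long_ctx[-1] * pos)
```

The first frequency is always 1.0 regardless of base (local positions behave identically); only the slow tail stretches, which is why the extension preserves short-range behavior while unlocking long-range attention.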
Evaluation Highlights
  • Nearly perfect performance identifying 'needles' in the 200k 'Needle-in-a-Haystack' test
  • InternEvo framework achieves 88% Model FLOPs Utilization (MFU) when training 7B models with 256k sequence length, compared to ~65% for DeepSpeed-Ulysses
  • InternLM2 outperforms predecessors and comparable open-source models (LLaMA, Qwen, Mistral) across comprehensive evaluations on 30 benchmarks
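The 'Needle-in-a-Haystack' protocol mentioned above hides a single fact at varying depths inside a long distractor context and checks whether the model can retrieve it. A minimal harness can be sketched as below; the toy "model" is a hypothetical string search standing in for an actual long-context LLM, and sizes are in characters rather than tokens.

```python
def build_haystack(filler, needle, total_chars, depth_pct):
    """Insert `needle` at a relative depth inside repeated filler text."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth_pct / 100)
    return body[:pos] + needle + body[pos:]

def evaluate(model_answer_fn, needle_fact, depths):
    """Return pass/fail per depth: does the model surface the needle?"""
    results = {}
    for depth in depths:
        context = build_haystack(
            "The quick brown fox jumps over the lazy dog. ",
            needle_fact,
            total_chars=200_000,  # 200k-character stand-in for 200k tokens
            depth_pct=depth,
        )
        answer = model_answer_fn(context, "What is the secret code?")
        # Pass if the answer contains the needle's payload (here, the code).
        results[depth] = needle_fact.split()[-1].strip(".") in answer
    return results

# Toy "model": exact substring search, standing in for a long-context LLM.
def toy_model(context, question):
    idx = context.find("secret code is")
    return context[idx:idx + 30] if idx != -1 else "not found"

print(evaluate(toy_model, "The secret code is 7421.", [0, 25, 50, 75, 100]))
```

A real evaluation sweeps both context length and needle depth and plots a pass/fail grid; "nearly perfect" performance means the grid is green at almost every (length, depth) cell up to 200k.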
Breakthrough Assessment
8/10
Strong contribution to open-source LLMs by providing a full report on the 200k context extension and a novel RLHF approach (COOL RLHF), supported by solid infrastructure optimization results.