Evaluation Setup
Synthetic linear regression task: the model must predict the ground-truth weight vector w* given a context of (x, y) pairs.
Benchmarks:
- Synthetic Linear Regression (Parameter Recovery / In-Context Learning) [New]
Metrics:
- Evaluation Loss (MSE between the predicted weight vector and the ground truth w*)
- Statistical methodology: Not explicitly reported in the paper
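The setup above can be sketched in code. This is a minimal illustration, not the paper's implementation: all function names are hypothetical, and it assumes standard Gaussian contexts and noiseless labels y = ⟨w*, x⟩.

```python
import numpy as np

def sample_task(d=8, n=16, rng=None):
    """Sample one synthetic linear-regression task (hypothetical setup):
    a ground-truth weight vector w* and n context pairs (x, y) with y = <w*, x>."""
    rng = rng or np.random.default_rng(0)
    w_star = rng.standard_normal(d)
    X = rng.standard_normal((n, d))     # context inputs, one per row
    y = X @ w_star                      # noiseless labels
    return X, y, w_star

def eval_loss(w_pred, w_star):
    """Evaluation metric: MSE between predicted and true weight vectors."""
    return float(np.mean((w_pred - w_star) ** 2))

X, y, w_star = sample_task()
zero_loss = eval_loss(np.zeros_like(w_star), w_star)  # loss of a trivial zero predictor
```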
Key Results
Note: all entries are theoretical bounds establishing the separation between standard transformers and CoT transformers.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Synthetic Linear Regression (no CoT) | MSE of w* | Θ(d^2/n) | Θ(d^2/n) | 0 |
| Synthetic Linear Regression (with CoT) | MSE of w* | Θ(d^2/n) | O(1/poly(d)) | significant decrease |
Main Takeaways
- One-layer transformers without CoT are theoretically incapable of recovering the ground-truth weight vector w* when n ≈ d; they can implement only a single Gradient Descent step.
- CoT prompting allows the same architecture to perform multi-step Gradient Descent, reducing error from Θ(d^2/n) to near zero (O(1/poly(d))).
- The training dynamics (Gradient Flow) naturally find a solution where the attention weights implement GD updates.
- Trained CoT transformers generalize to Out-Of-Distribution (OOD) covariance matrices, provided they are well-conditioned.
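The one-step vs. multi-step gap in the takeaways can be illustrated numerically. The sketch below runs gradient descent on the least-squares loss from w = 0: one step (the no-CoT regime) leaves substantial error, while unrolling many steps (the CoT regime) drives the error toward zero on a well-conditioned, noiseless task. This is an assumption-laden toy simulation of the mechanism, not the paper's construction.

```python
import numpy as np

def gd_recover(X, y, steps, lr=None):
    """Run `steps` of gradient descent on L(w) = (1/2n)||Xw - y||^2 from w = 0.
    One step mimics a no-CoT one-layer transformer; many steps mimic CoT unrolling."""
    n, d = X.shape
    if lr is None:
        # stable step size: inverse of the largest eigenvalue of the empirical covariance
        lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y)) / n
    return w

rng = np.random.default_rng(0)
d, n = 8, 32
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_star

one_step_mse = np.mean((gd_recover(X, y, 1) - w_star) ** 2)
multi_step_mse = np.mean((gd_recover(X, y, 200) - w_star) ** 2)
```

With n > d and a well-conditioned empirical covariance, the multi-step error contracts geometrically toward zero, mirroring the Θ(d^2/n) vs. O(1/poly(d)) separation in the table above.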