| Benchmark | Metric | Baseline (older benchmarks) | This Paper (CausalProbe-2024) | Δ |
|---|---|---|---|---|
| Performance drop on fresh data: models score substantially lower on the new CausalProbe-2024 benchmark than on older benchmarks (COPA, e-CARE), supporting the hypothesis that high performance on the older tasks is due to memorization. | ||||
| CausalProbe-2024 Hard | Accuracy | 99.0 | 70.0 | -29.0 |
| CausalProbe-2024 Hard | Accuracy | 85.0 | 50.0 | -35.0 |
| Min-K% Prob analysis confirms that CausalProbe-2024 is 'fresher' (less likely to be in the training data) than older benchmarks. | ||||
| CausalProbe-2024 vs Older | Min-K% Prob | High | Low | Negative |
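
The Min-K% Prob comparison in the last row can be illustrated with a minimal sketch. It assumes the common formulation of the score: the mean log-probability of the k% least likely tokens in a text, where a lower score suggests the text is less likely to have appeared in training data. The function name `min_k_percent_prob`, the default `k=0.2`, and the toy log-probabilities are all hypothetical; in practice the per-token log-probabilities would come from the language model under study, and this is not necessarily the exact procedure used in the paper.

```python
import math
from typing import Sequence


def min_k_percent_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens.

    Assumption: token_logprobs holds one log-probability per token,
    as assigned by the language model being probed.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in the interval (0, 1]")
    # Keep at least one token so the mean is always defined.
    n_lowest = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_lowest]
    return sum(lowest) / n_lowest


# Toy values (made up for illustration, not taken from the paper).
# A text the model has likely seen: even its least likely tokens are fairly probable.
familiar_text = [-0.1] * 8 + [-0.5, -0.7]
# A fresh text: a few tokens are very surprising to the model.
fresh_text = [-0.1] * 8 + [-4.0, -6.0]

familiar_score = min_k_percent_prob(familiar_text, k=0.2)
fresh_score = min_k_percent_prob(fresh_text, k=0.2)
```

Under these toy inputs the fresh text receives the lower score, which matches the direction reported in the table: High for older benchmarks, Low for CausalProbe-2024.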