Evaluation Setup
Zero-shot or few-shot prompting on domain-specific datasets in Finance, Law, and STEM.
Benchmarks:
- CAIL2018: legal judgment prediction (Chinese)
- FinNA: financial news analysis (Chinese)
- MATH: mathematics problems (English)
- GaoKao: college entrance exam questions (Chinese)
- MMLU: multi-task language understanding (English)
Metrics:
- Accuracy (Exact Match or equivalent)
- F1 score
- Rouge-L
- Statistical methodology: Not explicitly reported in the paper
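The metrics above can be made concrete with a minimal sketch. This is illustrative only, not the paper's actual scoring code: exact-match accuracy over prediction/reference pairs, and a basic ROUGE-L F1 computed from the longest common subsequence of whitespace tokens.

```python
# Illustrative metric sketch (assumed implementation, not from the paper):
# exact-match accuracy and a minimal ROUGE-L F1 via longest common subsequence.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference string."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def _lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F-score (beta = 1) over whitespace-split tokens."""
    p, r = prediction.split(), reference.split()
    lcs = _lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Note that production evaluations typically use tokenizers suited to the language (e.g. character-level for Chinese) rather than whitespace splitting.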
Key Results
| Benchmark         | Metric         | Baseline | This Paper | Δ      |
|-------------------|----------------|----------|------------|--------|
| Legal (CAIL2018)  | Accuracy/Score | 34.00    | 49.30      | +15.30 |
| Legal (CAIL2018)  | Accuracy/Score | 30.60    | 38.10      | +7.50  |
| Legal             | Accuracy       | 45.00    | 50.10      | +5.10  |
| Legal             | Accuracy       | 56.10    | 61.30      | +5.20  |

Consistent improvements observed across different model scales on the Legal dataset.
Main Takeaways
- Re-TASK consistently outperforms standard CoT and other prompting baselines across diverse domains (Law, Finance, STEM).
- The framework effectively scales, providing performance benefits to both smaller (8B) and larger (110B) models.
- Improvements are particularly notable in domain-specific tasks where specialized knowledge and specific procedural skills are required, validating the hypothesis that CoT fails due to capability gaps.