Evaluation Setup
Long-form generation on open-domain and information-seeking prompts, evaluated against Wikipedia knowledge
Benchmarks:
- LongFact-Concepts (long-form factuality generation, Concepts subset)
- FactScore-Bio (Biography generation)
Metrics:
- Factual Precision (percentage of supported atomic statements)
- Factual F1 (harmonic mean of precision and recall@K)
- Response Length (number of atomic statements)
- GPT-4 Win-rate (helpfulness assessment)
Statistical methodology: not explicitly reported in the paper
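The precision and F1 metrics above can be sketched in a few lines. This is an illustrative sketch, not the paper's evaluation code: it assumes recall@K is the number of supported atomic statements divided by a target count K and capped at 1, which is one common formulation for long-form factuality. The function names and the example counts are hypothetical.

```python
def factual_precision(supported: int, unsupported: int) -> float:
    """Fraction of atomic statements that are supported by the knowledge source."""
    total = supported + unsupported
    return supported / total if total else 0.0


def recall_at_k(supported: int, k: int) -> float:
    """Assumed definition: supported statements relative to a target count K, capped at 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(supported / k, 1.0)


def factual_f1_at_k(supported: int, unsupported: int, k: int) -> float:
    """Harmonic mean of factual precision and recall@K (0.0 when both are zero)."""
    p = factual_precision(supported, unsupported)
    r = recall_at_k(supported, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical counts: a long response versus a short, cautious one, with K = 40.
long_response = factual_f1_at_k(supported=30, unsupported=10, k=40)
short_response = factual_f1_at_k(supported=9, unsupported=1, k=40)
print(round(long_response, 2))   # 0.75
print(round(short_response, 2))  # 0.36
```

In this made-up example the short response has higher precision (0.90 versus 0.75) but a much lower F1, which is why F1 is reported alongside precision and response length.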
Key Results
Main results on LongFact-Concepts showing FactAlign improves both factuality (F1) and helpfulness compared to the base model and standard alignment baselines.
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LongFact-Concepts | Factual F1 | 36.2 | 41.1 | +4.9 |
| LongFact-Concepts | Factual F1 | 39.5 | 41.1 | +1.6 |
| LongFact-Concepts | Factual Precision | 69.1 | 73.2 | +4.1 |
| FactScore-Bio | Factual Precision | 74.7 | 83.5 | +8.8 |
| FactScore-Bio | Number of Facts (Recall proxy) | 50.1 | 53.2 | +3.1 |

Ablation studies demonstrating the specific contribution of the sentence-level fKTO loss.
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LongFact-Concepts | Factual F1 | 39.6 | 41.1 | +1.5 |
Main Takeaways
- FactAlign improves factual F1 by encouraging the model to generate more correct facts rather than just shortening responses to minimize errors (a common failure mode of precision-only optimization).
- Fine-grained (sentence-level) alignment via fKTO provides a more informative training signal than coarse (response-level) alignment, because it lets the model distinguish factual from non-factual parts of a single response.
- The method maintains or improves general helpfulness (GPT-4 win rate) while improving factuality, mitigating the 'alignment tax' in which factuality-tuned models become terse or unhelpful.
- Iterative training is effective: the model improves by training on its own high-quality generations filtered by the factuality evaluator.
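The difference between response-level and sentence-level signals in the takeaways above can be illustrated with a toy labeling sketch. This is not the paper's fKTO loss. It only shows what information each granularity keeps. The per-sentence support flags, the function names, and the majority-threshold rule for the response-level label are all assumptions made for illustration.

```python
from typing import List


def response_level_labels(sentence_supported: List[bool], threshold: float = 0.5) -> List[bool]:
    """Coarse signal: one label for the whole response, applied to every sentence.

    The threshold rule is a hypothetical stand-in for any response-level criterion.
    """
    if not sentence_supported:
        return []
    fraction_supported = sum(sentence_supported) / len(sentence_supported)
    label = fraction_supported >= threshold
    return [label] * len(sentence_supported)


def sentence_level_labels(sentence_supported: List[bool]) -> List[bool]:
    """Fine-grained signal: each sentence keeps its own label."""
    return list(sentence_supported)


# Hypothetical evaluator output for a four-sentence response with one unsupported sentence.
flags = [True, True, False, True]
print(response_level_labels(flags))  # [True, True, True, True]
print(sentence_level_labels(flags))  # [True, True, False, True]
```

Under the coarse labeling the unsupported third sentence is treated as desirable along with the rest of the response, whereas the sentence-level labeling keeps it marked as undesirable.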