
Modifying Large Language Model Post-Training for Diverse Creative Writing

John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, Max Kreminski
New York University
arXiv.org (2025)
RL Benchmark

📝 Paper Summary

LLM Post-training · Creative Writing · Generation
Diversified DPO and ORPO incorporate a 'deviation' metric into the training objective to encourage language models to generate semantically and stylistically diverse creative writing without sacrificing quality.
Core Problem
Post-training methods like DPO and RLHF improve quality but often cause 'mode collapse,' reducing the diversity of outputs, which is critical for creative tasks that admit many valid answers.
Why it matters:
  • Creative writing tasks (e.g., story generation) have no single correct answer and require divergent thinking
  • Current LLMs produce homogeneous content, limiting their utility as creative assistants
  • Existing diversification methods like high-temperature sampling often degrade coherence and quality (quality-diversity trade-off)
Concrete Example: For the prompt 'write a story about a dog on the moon,' standard models might repeatedly generate similar stories about the dog's adventure. A diverse model should produce varied narratives, such as the dog's lonely life, a scientific report, or a fantasy encounter, while maintaining high writing quality.
Key Novelty
Deviation-Weighted Preference Optimization (DDPO / DORPO)
  • Calculates 'deviation' for each training sample: how much it differs (semantically or stylistically) from other valid responses to the same prompt
  • Incorporates this deviation into the DPO/ORPO loss function as a per-sample weight
  • Forces the model to learn from rare, high-quality instances rather than converging on the 'average' winning response
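The weighting idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes deviation is computed as mean cosine distance between response embeddings for the same prompt, and the function names (`deviation_weights`, `ddpo_loss`) and the normalization choice are ours.

```python
import numpy as np

def deviation_weights(embeddings: np.ndarray) -> np.ndarray:
    """Deviation of each response = mean cosine distance to the OTHER
    responses to the same prompt, rescaled so the rarest response
    gets weight 1.0. `embeddings` is (n_responses, dim)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                              # pairwise cosine similarity
    n = sim.shape[0]
    dev = (1.0 - sim).sum(axis=1) / (n - 1)    # self-distance is 0, so divide by n-1
    return dev / dev.max()

def ddpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
              dev_w, beta=0.1):
    """Standard DPO loss, with each preference pair's term scaled by
    the chosen response's deviation weight (the deviation-weighting
    idea, sketched)."""
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -np.log(1.0 / (1.0 + np.exp(-logits)))   # -log sigmoid
    return float((dev_w * per_pair).mean())
```

With uniform weights this reduces to ordinary DPO; the weighting only changes how much each winning response pulls the model toward it, so unusual but high-quality responses are not averaged away.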
Evaluation Highlights
  • Achieves semantic diversity on par with human-created 'Gold' datasets (r/WritingPrompts) while maintaining quality
  • Outperforms existing instruction-tuned models (GPT-4o, Claude-3.5-Sonnet, DeepSeek-R1) in output diversity
  • Maintains writing quality (as scored by a reward model trained on Reddit upvotes, 'reddit-reward') comparable to the best instruction-tuned baselines like GPT-4o and DeepSeek-R1
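Semantic diversity results like those above are commonly operationalized as mean pairwise embedding distance among a model's outputs for the same prompt; the sketch below shows that metric under that assumption (the exact embedding model and aggregation used in the paper may differ).

```python
import numpy as np

def semantic_diversity(output_embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance among outputs for one prompt.
    0 = all outputs semantically identical; higher = more diverse.
    `output_embeddings` is (n_outputs, dim)."""
    e = output_embeddings / np.linalg.norm(output_embeddings, axis=1, keepdims=True)
    sim = e @ e.T
    n = sim.shape[0]
    off_diag = sim.sum() - np.trace(sim)       # drop self-similarities
    return float(1.0 - off_diag / (n * (n - 1)))
```

Comparing this score across models on the same prompts (and against the human r/WritingPrompts responses) gives the kind of diversity ranking reported in the highlights.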
Breakthrough Assessment
7/10
Proposes a simple but effective modification to standard post-training objectives (DPO/ORPO) that addresses a known limitation (diversity) in creative generation. Results show it breaks the usual quality-diversity trade-off.