Modifying Large Language Model Post-Training for Diverse Creative Writing

📝 Paper Summary

LLM Post-training · Creative Writing Generation

Diversified DPO and ORPO incorporate a 'deviation' metric into the training objective to encourage language models to generate semantically and stylistically diverse creative writing without sacrificing quality.

Core Problem

Post-training methods like DPO and RLHF improve quality but often cause 'mode collapse': output diversity shrinks, even though diversity is critical for creative tasks with multiple valid answers.

Why it matters:

Creative writing tasks (e.g., story generation) have no single correct answer and require divergent thinking
Current LLMs produce homogeneous content, limiting their utility as creative assistants
Existing diversification methods like high-temperature sampling often degrade coherence and quality (quality-diversity trade-off)

Concrete Example: For the prompt 'write a story about a dog on the moon,' standard models might repeatedly generate similar stories about the dog's adventure. A diverse model should produce varied narratives, such as the dog's lonely life, a scientific report, or a fantasy encounter, while maintaining high writing quality.

Key Novelty

Deviation-Weighted Preference Optimization (DDPO / DORPO)

Calculates 'deviation' for each training sample: how much it differs (semantically or stylistically) from other valid responses to the same prompt
Incorporates this deviation into the DPO/ORPO loss function as a per-sample weight
Forces the model to learn from rare, high-quality instances rather than converging on the 'average' winning response
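
The deviation computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each response has already been mapped to an embedding vector (semantic or style), uses cosine distance, and takes the plain mean over the other responses to the same prompt. The function names are hypothetical, and any normalization of the resulting scores before they are used as loss weights is not covered here.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def deviations(embeddings):
    """Mean distance from each response to all other responses of one prompt.

    Hypothetical helper: `embeddings` holds one vector per response.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two responses to compare")
    result = []
    for i, emb in enumerate(embeddings):
        dists = [cosine_distance(emb, other)
                 for j, other in enumerate(embeddings) if j != i]
        result.append(sum(dists) / len(dists))
    return result

# Toy 2-d "embeddings": two identical responses and one outlier.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
dev = deviations(toy)  # [0.5, 0.5, 1.0]: the outlier gets the largest weight
```

In this toy case the rare response receives twice the weight of the two near-duplicates, which is the behavior the weighting is meant to produce.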

Evaluation Highlights

Achieves semantic diversity on par with human-created 'Gold' datasets (r/WritingPrompts) while maintaining quality
Outperforms existing instruction-tuned models (GPT-4o, Claude-3.5-Sonnet, DeepSeek-R1) in output diversity
Maintains writing quality ('reddit-reward') comparable to the best instruction-tuned baselines like GPT-4o and DeepSeek-R1

Breakthrough Assessment

7/10

Proposes a simple but effective modification to standard post-training objectives (DPO/ORPO) that addresses a known limitation (diversity) in creative generation. Reported results show higher diversity at comparable quality, which eases the usual quality-diversity trade-off.

⚙️ Technical Details

Problem Definition

Setting: Creative writing generation where multiple valid outputs y exist for a prompt x

Inputs: Writing prompt x

Outputs: Diverse set of creative text responses {y_i}

Pipeline Flow

Input Prompt -> Diversity-Tuned LLM -> Generated Text

System Modules

Diversity-Tuned LLM

Generate creative text based on input prompt

Model or implementation: Llama-3.1-8B or Mistral-7B-v0.3 (fine-tuned with DDPO/DORPO)

Novel Architectural Elements

Training Objective Modification: The loss function for DPO/ORPO is weighted by the 'deviation' scalar of the winning response, derived from embedding distances (semantic or style) relative to other candidate responses.

Modeling

Base Model: Llama-3.1-8B and Mistral-7B-v0.3

Training Method: Diversified Direct Preference Optimization (DDPO) and Diversified Odds Ratio Preference Optimization (DORPO)

Objective Functions:

Purpose: Weight the standard DPO loss by the uniqueness (deviation δ_yw) of the winning response.

Formally: L_DDPO = -δ_yw * log σ( β * log(π(yw|x)/π_ref(yw|x)) - β * log(π(yl|x)/π_ref(yl|x)) )

Purpose: Weight the ORPO loss (SFT log-likelihood term + odds-ratio term) by the uniqueness (deviation δ_yw) of the winning response.

Formally: L_DORPO = δ_yw * (L_SFT + λ * L_OR)
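
The two objectives can be illustrated per training pair with scalar log-probabilities. This is a hedged sketch rather than the paper's code: the function and argument names are hypothetical, the DDPO inputs are assumed to be summed sequence log-probabilities under the policy and the reference model, and the DORPO inputs are assumed to be length-normalized (average per-token) log-probabilities, as in standard ORPO. The defaults for beta and lambda are the values listed under Key Hyperparameters.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def ddpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
              deviation, beta=0.1):
    """Deviation-weighted DPO loss for one (winner, loser) pair."""
    # beta * [log(pi(yw)/ref(yw)) - log(pi(yl)/ref(yl))]
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    return -deviation * log_sigmoid(margin)

def log_odds(logp):
    """log(p / (1 - p)) for a log-probability logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def dorpo_loss(avg_logp_w, avg_logp_l, deviation, lam=0.25):
    """Deviation-weighted ORPO loss: delta * (L_SFT + lambda * L_OR)."""
    sft = -avg_logp_w  # negative log-likelihood of the winner
    odds_ratio = -log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    return deviation * (sft + lam * odds_ratio)

# A pair whose winner is rare (deviation 1.0) contributes twice the loss,
# and hence twice the gradient, of the same pair with deviation 0.5.
rare = ddpo_loss(-10.0, -12.0, -10.5, -11.5, deviation=1.0)
common = ddpo_loss(-10.0, -12.0, -10.5, -11.5, deviation=0.5)
```

Because the deviation enters as a multiplicative scalar, setting it to 1 for every sample recovers the unweighted DPO and ORPO losses.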

Adaptation: LoRA (rank=128, alpha=256)

Training Data:

r/WritingPrompts dataset
421,330 train prompt-response pairs
45,868 test prompt-response pairs

Key Hyperparameters:

learning_rate: 5e-6 (linear schedule)
batch_size: 2 (mostly)
beta: 0.1 (for DPO/DDPO)
lambda: 0.25 (for ORPO/DORPO)
epochs: 3 (DPO), 4 (ORPO)
lora_rank: 128
lora_alpha: 256

Compute: Six NVIDIA H100 SXM GPUs, bfloat16 precision

Comparison to Prior Work

vs. DivPO: Integrates diversity directly into the loss function via deviation weighting rather than just filtering the dataset
vs. Temperature Sampling: Optimizes the model weights for diversity rather than relying on stochastic decoding which often degrades quality
vs. Prompting (GPT-4o-iter): Modifies the model's internal alignment rather than relying on iterative context stuffing [not cited in paper]

Limitations

Requires datasets with multiple responses per prompt (>3) to calculate deviation meaningfully
Quality may drop if there are too few instances per prompt during training
Relies on embedding models (Jina, Style-Embedding) which may not perfectly capture all nuances of creative diversity
Reward signal (Reddit upvotes) is noisy and subjective

Reproducibility

Code: https://github.com/mj-storytelling/DiversityTuning

Dataset is available on HuggingFace (euclaise/WritingPrompts_preferences). Evaluation uses Jina embeddings (v3) and Style-Embedding.

📊 Experiments & Results

Evaluation Setup

Generate 4 responses per prompt for 1000 test prompts; measure pairwise distance and predicted quality

Benchmarks:

r/WritingPrompts (Open-ended creative story generation)

Metrics:

reddit-reward (Predicted Upvote Score)
Semantic Diversity (Mean Pairwise Cosine Distance of Jina Embeddings)
Style Diversity (Mean Pairwise Cosine Distance of Style Embeddings)
Statistical methodology: Not explicitly reported in the paper
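
The two diversity metrics share one computation: the mean cosine distance over all unordered pairs of generated responses for a prompt, with only the embedding model changing. The sketch below assumes the embeddings have already been computed and uses toy vectors in their place. The function name is hypothetical, and how the paper averages the per-prompt scores over the test prompts is not shown.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs of responses."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Four generations for one prompt, as in the evaluation setup (toy vectors).
collapsed = [[1.0, 0.0]] * 4                               # identical outputs
varied = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # two clusters

score_collapsed = mean_pairwise_distance(collapsed)
score_varied = mean_pairwise_distance(varied)
```

A mode-collapsed model that returns the same story four times scores 0, while the two-cluster example scores 4/6 because four of its six pairs are orthogonal.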

Experiment Figures

Scatter plot of Writing Quality (y-axis) vs. Diversity (x-axis) for various models

Main Takeaways

Diversified DPO (DDPO) and ORPO (DORPO) successfully increase output diversity compared to their standard counterparts while maintaining similar quality levels.
The 'DDPO-both' variant (mixing semantic and style deviation) achieves the best balance, reaching diversity levels comparable to human-written text.
Existing instruction-tuned models (GPT-4o, Claude-3.5-Sonnet) cluster in a high-quality but low-diversity region, consistent with an 'alignment tax' on creativity.
The approach works across both base models (Llama-3.1-8B and Mistral-7B-v0.3), though Llama-3.1-8B generally performed better.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) post-training
Familiarity with Reinforcement Learning from Human Feedback (RLHF) concepts
Basic knowledge of vector embeddings and cosine distance

Key Terms

DPO: Direct Preference Optimization—a stable alternative to RLHF that optimizes a policy directly on preference pairs without an explicit reward model

ORPO: Odds Ratio Preference Optimization—a post-training method that uses odds ratios to contrast winning and losing responses without a reference model

SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs

deviation: A metric defined in this paper as the expected distance (dissimilarity) between a specific response and all other responses to the same prompt

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank adapter matrices added to them

Gold: The original human-written responses from the dataset, used as a reference for natural diversity

reddit-reward: A quality score predicted by a reward model trained on upvotes from the r/WritingPrompts dataset