Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Generative Reward Modeling · Creative Writing

Writing-Zero enables effective reinforcement learning for subjective writing tasks by combining a pairwise generative reward model with a bootstrapped algorithm that uses self-generated responses as dynamic references.

Core Problem

Non-verifiable tasks like creative writing lack objective ground-truth answers, forcing reliance on scalar reward models that are prone to reward hacking and poor generalization.

Why it matters:

Standard RLHF often leads to 'length bias' where models generate verbose, vacuous content just to satisfy the reward model
Current RLVR success is limited to math/code; extending it to subjective domains is necessary for comprehensive LLM development
Scalar reward models fail to capture nuanced human preferences compared to comparative/pairwise assessments

Concrete Example: In creative writing, models trained with standard scalar rewards often exhibit 'over-explanation,' appending lengthy, redundant justifications of how they met user requirements to the end of a response, even when the actual content is poor.

Key Novelty

Writing-Zero (GenRM + BRPO)

Use a Pairwise Generative Reward Model (GenRM) that produces text critiques and scores for response pairs, converting subjective quality into pseudo-verifiable binary signals
Introduce Bootstrapped Relative Policy Optimization (BRPO), where the model compares its outputs against a randomly selected 'peer' from the same batch (bootstrap) rather than a fixed external baseline

Architecture

Comparison of GRPO (Group Relative Policy Optimization) and BRPO (Bootstrapped Relative Policy Optimization) architectures.

Evaluation Highlights

Writing-Zero improves base model performance on WritingBench from 6.89 to 8.29, without supervised fine-tuning
Reduces reward hacking significantly: 'mean explanation length' drops from 417 tokens (scalar reward baseline) to 58 tokens (Writing-Zero)
The Pairwise Writing GenRM outperforms Claude-3.5-Sonnet on RewardBench (87.4% vs 84.2%) despite being trained primarily on Chinese data

Breakthrough Assessment

8/10

Successfully extends the 'Zero' (pure RL) paradigm from reasoning to creative writing. The bootstrap mechanism for relative policy optimization is a clever solution to the lack of ground truth.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement learning for non-verifiable generation tasks (e.g., creative writing) where no objective ground truth exists.

Inputs: Natural language prompt q

Outputs: Generated text response o

Pipeline Flow

Group Rollout: Policy generates G responses
Reference Selection: One response o_ref is randomly selected from G
Pairwise Evaluation: GenRM compares all other responses against o_ref
Advantage Calculation: Binary win/loss becomes the advantage
Policy Update: Update weights to maximize expected advantage
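
The reference-selection, pairwise-evaluation, and advantage steps above can be sketched for a single prompt as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `brpo_advantages`, the `judge` callable, and `toy_judge` are hypothetical names, and the toy judge merely stands in for the Pairwise Writing GenRM. Rollout and the policy update are left out.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical judge signature: takes (candidate, reference) and returns
# a score for each, in that order. In the paper this role is played by the GenRM.
Judge = Callable[[str, str], Tuple[float, float]]

def brpo_advantages(
    responses: List[str], judge: Judge, rng: random.Random
) -> Tuple[int, List[Tuple[int, int]]]:
    """Pick one response as the bootstrapped reference and score the rest against it.

    Returns (reference index, [(response index, advantage), ...]).
    """
    ref_idx = rng.randrange(len(responses))  # Reference Selection
    ref = responses[ref_idx]
    advantages = []
    for i, resp in enumerate(responses):
        if i == ref_idx:
            continue  # the reference is not compared with itself
        score_resp, score_ref = judge(resp, ref)  # Pairwise Evaluation
        # Advantage Calculation: +1 on a strict win, otherwise -1
        advantages.append((i, 1 if score_resp > score_ref else -1))
    return ref_idx, advantages

def toy_judge(candidate: str, reference: str) -> Tuple[float, float]:
    """Toy stand-in for the GenRM: scores a text by its number of distinct words."""
    return float(len(set(candidate.split()))), float(len(set(reference.split())))

group = ["a b c", "a a a", "a b", "a b c d"]  # pretend these are G = 4 rollouts
ref_idx, adv = brpo_advantages(group, toy_judge, random.Random(0))
```

The resulting `adv` list would then feed the policy update in place of a scalar reward.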

System Modules

Policy Model

Generate a group of candidate responses for a given prompt

Model or implementation: Qwen3-32B-Base (or SFT variant)

Pairwise Writing GenRM (Evaluation)

Evaluate pairs of responses to determine which is better based on writing principles

Model or implementation: Qwen3-32B-Base fine-tuned on preference data

Bootstrap Reference Selector (Evaluation)

Dynamically select a reference response from the current generation group

Model or implementation: Random Selection logic

Novel Architectural Elements

Bootstrapped reference mechanism: Replacing fixed reference models or group means with a dynamically sampled peer from the current batch for relative comparison

Modeling

Base Model: Qwen3-32B-Base

Training Method: Bootstrapped Relative Policy Optimization (BRPO)

Objective Functions:

Purpose: Maximize expected reward while constraining deviation from old policy.

Formally: Standard PPO-style clipped surrogate objective.

Purpose: Calculate advantage using pairwise preference against a bootstrapped reference.

Formally: A_i = 1 if Score(o_i) > Score(o_ref), else -1.
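
For reference, the two objective entries above can be written out as below. The clipped surrogate is the textbook PPO form. The clip range ε, the index set 𝓘 of responses compared against the reference, and the sequence-level probability ratio are assumptions of this sketch, and any KL penalty the paper may use is omitted.

```latex
r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\qquad
A_i =
\begin{cases}
  +1 & \text{if } \mathrm{Score}(o_i) > \mathrm{Score}(o_{\mathrm{ref}}) \\
  -1 & \text{otherwise}
\end{cases}

J(\theta) = \mathbb{E}_{q,\,\{o_i\}}
\left[
  \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}}
  \min\!\Big( r_i(\theta)\, A_i,\;
  \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i \Big)
\right]
```

Because A_i takes only the values +1 and -1, the clipping simply caps how far the ratio can move above 1+ε for wins or below 1-ε for losses.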

Training Data:

GenRM: ~10K pairwise preferences filtered from 200K in-house data
Policy: In-house unsupervised queries

Key Hyperparameters:

learning_rate: 1e-6
dynamic_sampling_threshold: 0.6 (filters out a query if more than 60% of the group beats the reference)
temperature: 1.0
top_p: 1.0
margin_threshold: 2 (for GenRM score margin)
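
A literal reading of the `dynamic_sampling_threshold` entry above can be sketched as a small filter. This is an assumption-laden illustration: `keep_query` is a hypothetical helper, it takes the ±1 advantages of one group as input, and it does not model `margin_threshold`, since the summary does not say how the margin is applied.

```python
from typing import Sequence

def keep_query(advantages: Sequence[int], threshold: float = 0.6) -> bool:
    """Return False when more than `threshold` of the compared responses beat the reference.

    `advantages` holds +1 (beat the reference) or -1 (did not) for each compared response.
    An empty group is dropped as well, since it carries no training signal.
    """
    if not advantages:
        return False
    win_fraction = sum(1 for a in advantages if a == 1) / len(advantages)
    return win_fraction <= threshold

kept = keep_query([1, -1, -1, 1])    # 50% wins: at or below the threshold, query is kept
dropped = keep_query([1, 1, 1, -1])  # 75% wins: above the threshold, query is filtered out
```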

Compute: Trained using the vLLM engine; GenRM training is noted as computationally expensive because of the high drop rate (95%) in dynamic sampling

Comparison to Prior Work

vs. GRPO: BRPO uses pairwise comparison against a dynamic peer reference instead of scalar group normalization
vs. Standard RLHF: Uses GenRM and verifiability-like binary signals instead of a scalar reward model
vs. DeepSeek-R1-Zero: Targets non-verifiable (subjective) tasks rather than verifiable (math/code) tasks
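
The contrast with GRPO in the list above can be made concrete with a small sketch. The function names are hypothetical, the GRPO version uses the common mean/standard-deviation normalization (implementations differ on details such as population versus sample deviation), and the per-response `scores` fed to the BRPO version stand in for GenRM pairwise judgments.

```python
import statistics
from typing import List

def grpo_style_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style: normalize scalar rewards by the group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def brpo_style_advantages(scores: List[float], ref_idx: int) -> List[int]:
    """BRPO-style: binary win/loss against one peer, skipping the reference itself."""
    ref = scores[ref_idx]
    return [1 if s > ref else -1 for i, s in enumerate(scores) if i != ref_idx]

scores = [1.0, 2.0, 3.0]
grpo_adv = grpo_style_advantages(scores)          # continuous values centered on zero
brpo_adv = brpo_style_advantages(scores, ref_idx=1)  # [-1, 1]
```

The GRPO-style values depend on the reward scale, whereas the BRPO-style values depend only on which side of the reference each response falls.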

Limitations

GenRM training was halted prematurely due to high computational costs from dynamic sampling (95% drop rate)
Eval RM is trained on in-house data, limiting external reproducibility of the specific metric gains
Reliance on internal proprietary datasets for both preference training and RL prompts

Reproducibility

Code is not provided. Training relies on internal datasets (200K pairwise preferences, internal writing testsets). The base model, Qwen3-32B-Base, is public. GenRM training details (cold start, dynamic sampling) are described, but the weights are not released.

📊 Experiments & Results

Evaluation Setup

Evaluation of both the Reward Model (on benchmarks) and the Policy Model (on writing tasks).

Benchmarks:

WritingBench (Generative Writing)
RewardBench (Reward Model Evaluation)
Writing Testset (In-house diverse user queries) [New]

Metrics:

Win Rate / Score (via Eval RM)
Response Length
Explanation Length (redundancy metric)
Accuracy (for Reward Model)

Statistical methodology: Human evaluation ratios reported (Win:Tie:Loss)

Key Results

Policy model performance: Writing-Zero (RL from base) outperforms the base model and scalar-reward baselines, showing the effectiveness of BRPO.

Benchmark	Metric	Baseline	This Paper	Δ
WritingBench	Score	6.89 (base model)	8.29	+1.40
Writing Testset	Score (Eval RM)	1.23	3.84	+2.61
Internal Test Set	Mean Explanation Length (tokens)	417 (scalar reward baseline)	58	-359

Reward Model performance: The Pairwise Writing GenRM achieves strong results on standard benchmarks, validating its use as a training signal.

Benchmark	Metric	Baseline	This Paper	Δ
RewardBench	Accuracy (%)	84.2 (Claude-3.5-Sonnet)	87.4	+3.2

Main Takeaways

RLVR can be successfully adapted to non-verifiable tasks by treating pairwise GenRM judgments as verifiable signals
Writing-Zero (RL applied directly to the pretrained base model, with no SFT stage) is viable for creative writing, achieving competitive results without SFT
Pairwise GenRMs are significantly more robust to reward hacking (length bias, over-explanation) than scalar reward models
Test-time scaling (voting@n) with GenRM further improves policy performance (Writing-Zero score improves from 8.29 to 8.35 with voting@2)

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Reward Modeling

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards are determined by objective checks (e.g., code compilation, math answers)

GenRM: Generative Reward Model—a model that evaluates responses by generating a textual critique and score rather than just outputting a scalar value

BRPO: Bootstrapped Relative Policy Optimization—the proposed algorithm that uses a randomly selected response from the current batch as a temporary reference for advantage estimation

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs to estimate advantages without a value function

SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs

Reward Hacking: When an RL agent exploits flaws in the reward model (e.g., by writing longer text) to maximize score without improving actual quality

Bootstrapping: In this context, using the model's own current outputs as a reference point for evaluation, rather than external data

Writing-Zero: The specific model variant trained from a base model using BRPO without prior supervised fine-tuning

Voting@n: A test-time scaling technique where the reward model evaluates multiple permutations or samples to determine the final score
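
One way to read the Voting@n entry above is as a majority vote over repeated pairwise judgments with the presentation order swapped between calls. The sketch below is only that reading, not the paper's procedure: `vote_at_n` and `biased_judge` are hypothetical, and the toy judge deliberately favors whichever text is shown first to illustrate why swapping positions can help.

```python
from typing import Callable, Tuple

# Hypothetical judge: returns (score of first text, score of second text).
Judge = Callable[[str, str], Tuple[float, float]]

def vote_at_n(a: str, b: str, judge: Judge, n: int = 2) -> int:
    """Return +1 if `a` wins the majority of n judgments, -1 if `b` does, 0 on a tie."""
    wins_a = wins_b = 0
    for k in range(n):
        if k % 2 == 0:
            score_a, score_b = judge(a, b)  # a shown first
        else:
            score_b, score_a = judge(b, a)  # positions swapped
        if score_a > score_b:
            wins_a += 1
        elif score_b > score_a:
            wins_b += 1
    return (wins_a > wins_b) - (wins_a < wins_b)

def biased_judge(first: str, second: str) -> Tuple[float, float]:
    """Toy judge: scores by length, with a small bonus for the first position."""
    return len(first) + 0.5, float(len(second))

clear_result = vote_at_n("abc", "abcd", biased_judge, n=2)  # longer text wins both orders
tied_result = vote_at_n("abc", "xyz", biased_judge, n=2)    # position bias cancels out
```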