Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Synthetic Data Generation · LLM Post-training

GoldenGoose converts unstructured text into verifiable multiple-choice fill-in-the-middle tasks, allowing reinforcement learning to scale on reasoning-rich but previously unverifiable data like textbooks and web scrapes.

Core Problem

Scaling RLVR is bottlenecked by the scarcity of data with automatically verifiable ground truth (like math/code), causing models to saturate or regress when trained for too long on limited datasets.

Why it matters:

Current RLVR relies on expensive human-authored problems or limited handcrafted environments, excluding vast amounts of reasoning-rich text (e.g., medical diagnoses, theorem proofs) that are hard to verify.
Models like DeepSeek-R1 and Qwen-4B saturate early (plateauing after ~300 steps) when restricted to existing verifiable datasets.
Specialized domains like cybersecurity have almost no verifiable training data, preventing the application of modern reasoning RL techniques.

Concrete Example: In cybersecurity, a raw web scrape describing a vulnerability exploit has no ground-truth answer to check against, so standard RLVR cannot use it. GoldenGoose masks a key span in the text (e.g., the exploit step 'buffer overflow'), generates plausible wrong options, and rewards the model for selecting the original span, turning unverifiable text into a verifiable task.

Key Novelty

GoldenGoose (Fill-in-the-Middle MCQ Synthesis)

Transforms any raw text into a Multiple-Choice Question (MCQ) by identifying a key reasoning span, masking it, and treating the original text as the ground truth.
Uses a teacher LLM to generate diverse, plausible 'distractors' (incorrect options) matching the style of the masked span to ensure the task requires genuine reasoning, not just elimination.
Enables the use of 'unverifiable' corpora—like unstructured science textbooks or forum discussions—for verifiable reinforcement learning by validating against the original text.

Architecture

The GoldenGoose data synthesis pipeline transforming raw text into RLVR tasks.

Evaluation Highlights

+3.48% absolute gain in STEM benchmarks for ProRL-1.5B-v2 (a heavily RL-trained model), reviving it from saturation where previous data yielded only +0.13%.
+4.44% absolute gain across 3 cybersecurity benchmarks for Qwen-4B-Instruct using only 100 RL steps on GooseReason-Cyber, establishing a new SOTA.
Outperforms the domain-specialized Llama-Primus-Instruct (which used extensive pre/post-training) on cybersecurity tasks, despite using a general-purpose base model.

Breakthrough Assessment

8/10

Simple yet highly effective method to unlock unlimited training data for RLVR, addressing the critical data bottleneck. Demonstrated robust gains in both general reasoning and specialized domains.

⚙️ Technical Details

Problem Definition

Setting: Synthesizing verifiable RL tasks from an unverifiable source text corpus S

Inputs: Raw source text S (e.g., textbook passage, code snippet, web scrape)

Outputs: A multiple-choice question Q consisting of a masked context S_mask, a set of options including the ground truth t and distractors D, and a label for the correct option.
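
The task format above can be sketched as a small data structure plus a reward check. This is a minimal illustration, not the paper's code: the names `MCQTask`, `build_task`, and `reward`, the literal `[MASK]` placeholder, the lettered options, and the binary 0/1 reward are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQTask:
    masked_context: str   # S_mask: source text with the span replaced
    options: list[str]    # ground truth t shuffled together with distractors D
    label: str            # letter of the correct option, e.g. "B"


def build_task(source: str, span: str, distractors: list[str],
               rng: random.Random) -> MCQTask:
    """Mask the first occurrence of `span` in `source` and assemble the options."""
    if span not in source:
        raise ValueError("span must occur verbatim in the source text")
    masked = source.replace(span, "[MASK]", 1)
    options = [span] + list(distractors)
    rng.shuffle(options)  # hide the position of the ground truth
    label = chr(ord("A") + options.index(span))
    return MCQTask(masked, options, label)


def reward(task: MCQTask, predicted_letter: str) -> float:
    """Verifiable reward: 1.0 if the chosen letter matches the label, else 0.0."""
    return 1.0 if predicted_letter.strip().upper() == task.label else 0.0
```

Because the label is derived from the original span, the reward needs no human grader or external verifier: checking an answer is a string comparison.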

Modeling

Base Model: Qwen-4B-Instruct and ProRL-1.5B-v2 (derived from R1-Distill-Qwen-1.5B)

Training Method: ProRL recipe (variant of GRPO with clipped objective and decoupled advantage normalization)

Objective Functions:

Purpose: Optimize policy to maximize verifiable reward.

Formally: Clipped GRPO objective with group-wise mean subtraction and batch-level standardization.
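
To make the advantage computation concrete, here is a minimal sketch of group-wise mean subtraction followed by batch-level standardization, as named above. It is one plausible reading of that description, not the paper's implementation: the function names, the epsilon, the population standard deviation, and the symmetric clip range of 0.2 are assumptions, and the clipped term is the standard PPO-style surrogate shown for a single token.

```python
import math


def decoupled_advantages(group_rewards: list[list[float]],
                         eps: float = 1e-6) -> list[list[float]]:
    """Subtract each group's mean reward, then standardize by the batch-wide std."""
    centered = []
    for rewards in group_rewards:  # one group = rollouts for one prompt
        mean = sum(rewards) / len(rewards)
        centered.append([r - mean for r in rewards])

    flat = [a for group in centered for a in group]
    batch_mean = sum(flat) / len(flat)
    batch_std = math.sqrt(sum((a - batch_mean) ** 2 for a in flat) / len(flat))
    return [[(a - batch_mean) / (batch_std + eps) for a in group]
            for group in centered]


def clipped_surrogate(ratio: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Standardizing over the whole batch rather than per group means a group whose rewards barely differ is not scaled up to unit variance, which is one common motivation for decoupling the two steps.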

Training Data:

GooseReason-0.7M: 0.7 million tasks synthesized from AoPS-Instruct (Math), MegaScience (Textbooks), and rStar-Coder (Code).
GooseReason-Cyber: 180K tasks synthesized from FineWeb cybersecurity scrapes.
Synthesis Pipeline: (1) Prompt Teacher LLM (GPT-5) to identify reasoning span t in source S. (2) Replace t with [MASK]. (3) Generate k plausible distractors D. (4) Filter easy samples (where student succeeds 16/16 times).
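
The difficulty filter in step (4) can be sketched as follows. This is an illustrative outline only: `student_answer` is a hypothetical stand-in for sampling the student policy, and the default of 16 rollouts is taken from the "16/16" criterion above. The teacher-LLM steps (1) to (3) are not shown because their prompts and API are not described in this summary.

```python
from typing import Callable


def keep_task(task_id: str, label: str,
              student_answer: Callable[[str], str],
              n_rollouts: int = 16) -> bool:
    """Return False for 'easy' tasks, i.e. the student is correct on every rollout."""
    n_correct = sum(student_answer(task_id) == label for _ in range(n_rollouts))
    return n_correct < n_rollouts


def filter_easy(tasks: dict[str, str],
                student_answer: Callable[[str], str],
                n_rollouts: int = 16) -> dict[str, str]:
    """Keep only tasks (id -> correct letter) that the student does not always solve."""
    return {tid: lab for tid, lab in tasks.items()
            if keep_task(tid, lab, student_answer, n_rollouts)}


# Toy usage: a "student" that always answers "A" solves t1 every time, so t1 is dropped.
toy_tasks = {"t1": "A", "t2": "C"}
kept = filter_easy(toy_tasks, lambda task_id: "A")
```

A task solved on every rollout gives identical rewards within its group, so its group-centered advantages are all zero and it contributes no learning signal; dropping it saves rollout compute.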

Key Hyperparameters:

training_steps_cyber: 100
training_steps_scratch: 200
additional_training_hours: 1100 H100 GPU hours (for ProRL-1.5B-v2 continuation)

Compute: Up to 1,100 H100 GPU hours of RL training on the synthesized data (the ProRL-1.5B-v2 continuation run).

Comparison to Prior Work

vs. RLVE: RLVE relies on manual environment design limited to formal logic/math; GoldenGoose scales to unstructured domains (biology, cyber) using raw text.
vs. Llama-Primus-Instruct: Primus relies on SFT/Continued Pre-training; GoldenGoose enables RLVR on the same noisy web data, achieving higher gains with less compute.
vs. Standard FIM [not cited in paper]: Standard FIM predicts tokens; GoldenGoose formulates it as a reasoning-heavy MCQ task with adversarial distractors to force verification.

Limitations

Depends on a strong teacher model (e.g., GPT-5) to generate high-quality distractors and identify reasoning spans.
MCQ format may allow models to use elimination strategies rather than pure generation if distractors are not high quality.
Requires filtering of 'easy' tasks where the answer is obvious from context without reasoning.

Reproducibility

Synthesis prompt templates provided in Appendix A. Source corpora (FineWeb, AoPS, MegaScience) are public. Code availability is not explicitly provided in the text.

📊 Experiments & Results

Evaluation Setup

Reinforcement Learning fine-tuning on synthesized data, evaluated on diverse reasoning benchmarks.

Benchmarks:

AIME 2024/2025 (Competition Math)
GPQA Diamond (Graduate-Level Science QA)
HumanEvalPlus (Code Generation)
CTI-Bench (Cyber Threat Intelligence)

Metrics:

Pass@1 Accuracy
Statistical methodology: Not explicitly reported in the paper
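
For reference, Pass@1 accuracy can be computed as the per-problem fraction of correct samples, averaged over problems. The sketch below assumes each problem is sampled `n_samples` times; the summary does not state how many samples per problem the paper uses, and the function name is illustrative.

```python
def pass_at_1(correct_counts: list[int], n_samples: int) -> float:
    """Mean over problems of (correct samples / total samples per problem)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not correct_counts:
        raise ValueError("need at least one problem")
    return sum(c / n_samples for c in correct_counts) / len(correct_counts)


# Three problems, 4 samples each, with 4, 0, and 2 correct samples respectively.
score = pass_at_1([4, 0, 2], n_samples=4)
```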

Key Results

Reviving saturated models: absolute accuracy gains (percentage points) on ProRL-1.5B-v2 after continued training on the original ProRL data (Baseline) vs. GooseReason (This Paper).

Benchmark	Metric	Baseline	This Paper	Δ
STEM (Avg)	Accuracy gain	+0.13	+3.48	+3.35
Math (Avg)	Accuracy gain	+0.63	+2.71	+2.08

Cybersecurity domain adaptation: Training Qwen-4B-Instruct on GooseReason-Cyber.

Benchmark	Metric	Baseline	This Paper	Δ
Cybersecurity Avg (3 benchmarks)	Accuracy	Not reported in the paper	Not reported in the paper	+4.44
Cybersecurity Avg (3 benchmarks)	Gain over Base	1.44	4.44	+3.00

Experiment Figures

Performance trajectories of Qwen-4B-Instruct during RL training on different data mixtures.

Comparison of 'effective' vs 'stale' examples in ProRL data vs GooseReason.

Main Takeaways

Data saturation is a critical bottleneck for stronger models; Qwen-4B plateaus after only 300 steps on existing data, while GooseReason enables continuous gains.
The 'Fill-in-the-Middle MCQ' trick successfully transfers reasoning skills to open-ended tasks (Math, Code) despite the training data being multiple-choice.
Synthesizing data from raw internet text (GooseReason-Cyber) outperforms domain-specialized pre-training (Llama-Primus) for specialized domains.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Language Model Post-training (SFT, RL)
Fill-in-the-Middle (FIM) objectives

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using outcomes that can be automatically checked (e.g., code compilation, correct math answer).

Fill-in-the-Middle (FIM): A training objective where the model predicts a missing span of text given the surrounding prefix and suffix.

Distractors: Incorrect options in a multiple-choice question designed to be plausible enough to confuse a model that isn't reasoning correctly.

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt, used here as the training recipe.

FineWeb: A large-scale, high-quality dataset of web text used for LLM pre-training.

ProRL: Prolonged Reinforcement Learning, an RL training recipe (and resulting model series) built on a GRPO variant and focused on extended training with verifiable rewards.

AoPS: Art of Problem Solving—a forum and curriculum for high-difficulty mathematics, often used as a source for reasoning data.