
UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

Yile Liu, Yixian Liu, Zong-Rui Li, Yufei Huang, Xinhua Feng, Zhichao Hu, Jinglu Hu, Jia-Xin Yan, Fengzong Lian, Yuhong Liu
Hunyuan, Tencent, Waseda University
arXiv.org (2026)
Reasoning, RL, Benchmark

📝 Paper Summary

Synthetic Data Generation, Reinforcement Learning with Verifiable Rewards (RLVR), General Reasoning
UltraLogic enhances general reasoning by synthesizing diverse, difficulty-calibrated data via code-logic decoupling and optimizing models using a Bipolar Float Reward that penalizes partial logical flaws.
Core Problem
General-purpose reasoning lacks the large-scale, high-quality, and difficulty-calibrated training data available for math or code; furthermore, standard binary RL rewards are too sparse to guide models through complex logic.
Why it matters:
  • Current RLVR successes are limited to domains with automatic verification (math/code), leaving general reasoning bottlenecked by data scarcity
  • Existing reasoning datasets lack controllable difficulty calibration, making it hard to keep training problems within the model's 'Zone of Proximal Development' for efficient training
  • Binary (0/1) rewards fail to distinguish between 'fundamentally wrong' and 'partially correct' reasoning, slowing down convergence
Concrete Example: In a complex logic puzzle, a binary reward treats a completely hallucinated answer and an answer with a single minor step error exactly the same (Reward=0), failing to provide the model with granular feedback on its partial progress.
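The contrast in the concrete example above can be sketched in a few lines. This is only an illustration under stated assumptions: the summary does not give the paper's reward formula, so `graded_bipolar_reward`, its per-step correctness input, and the mapping to [-1, 1] are all hypothetical.

```python
def binary_reward(step_correct: list[bool]) -> float:
    """Standard pass/fail reward: 1.0 only if every step is correct."""
    return 1.0 if step_correct and all(step_correct) else 0.0


def graded_bipolar_reward(step_correct: list[bool]) -> float:
    """Hypothetical graded reward in [-1, 1] (an assumption, not the paper's formula).

    A fully correct answer scores +1.0. Any flawed answer scores below zero,
    moving closer to 0 as more of its steps are correct.
    """
    if not step_correct:
        return -1.0
    fraction = sum(step_correct) / len(step_correct)
    return 1.0 if fraction == 1.0 else fraction - 1.0


# Two failing answers to a 10-step puzzle.
hallucinated = [False] * 10          # nothing is right
one_slip = [True] * 9 + [False]      # a single minor step error

print(binary_reward(hallucinated), binary_reward(one_slip))
print(graded_bipolar_reward(hallucinated), graded_bipolar_reward(one_slip))
```

Under the binary reward both answers are indistinguishable, while the graded variant separates them, which is the kind of denser signal the summary attributes to the Bipolar Float Reward.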
Key Novelty
Code-based Solving Framework & Bipolar Float Reward
  • Decouples logical cores (Python code) from natural language (templates) to programmatically generate an effectively unlimited supply of verifiable reasoning problems
  • Implements an automated 'Difficulty Control Module' that tunes code parameters until model success rates match a calibrated 1-10 scale
  • Introduces Bipolar Float Reward (BFR) to provide graded, potentially negative feedback for logical flaws, offering denser signals than binary pass/fail
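A minimal sketch of the first two ideas above, assuming a toy arithmetic-chain task. The task, the templates, the names `logic_core`, `render`, and `calibrate`, and the stubbed success-rate function are all hypothetical; the summary does not describe the paper's actual tasks or calibration procedure at this level of detail.

```python
import random

# Natural-language surface forms, kept separate from the logic (hypothetical).
TEMPLATES = [
    "Start with {start}. Then {steps}. What is the final value?",
    "A counter begins at {start}. Apply these operations in order: {steps}. Report the result.",
]


def logic_core(num_steps: int, rng: random.Random):
    """Generate puzzle parameters and compute the ground-truth answer in code."""
    start = rng.randint(1, 9)
    ops = [(rng.choice(["add", "multiply by"]), rng.randint(2, 5))
           for _ in range(num_steps)]
    value = start
    for name, operand in ops:
        value = value + operand if name == "add" else value * operand
    return {"start": start, "ops": ops}, value


def render(params: dict, rng: random.Random) -> str:
    """Fill a randomly chosen template with the generated parameters."""
    steps = ", then ".join(f"{name} {operand}" for name, operand in params["ops"])
    return rng.choice(TEMPLATES).format(start=params["start"], steps=steps)


def stub_success_rate(num_steps: int) -> float:
    """Stand-in for measuring a real model's pass rate.

    Assumption: success decays geometrically with the number of steps.
    """
    return 0.85 ** num_steps


def calibrate(target_rate: float, measure=stub_success_rate, max_steps: int = 50) -> int:
    """Raise the difficulty parameter until the measured rate drops to the target."""
    for num_steps in range(1, max_steps + 1):
        if measure(num_steps) <= target_rate:
            return num_steps
    return max_steps


rng = random.Random(0)
num_steps = calibrate(target_rate=0.5)
params, answer = logic_core(num_steps, rng)
print(render(params, rng))
print("ground truth:", answer)
```

Because the answer is computed by the same code that generates the parameters, every rendered problem comes with a verifiable ground truth, and the difficulty knob (`num_steps` here) can be tuned against a measured success rate rather than set by hand.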
Architecture
Figure 1: The UltraLogic Code-based Solving Framework architecture and workflow
Breakthrough Assessment
7/10
Addresses the critical bottleneck of data scarcity in general reasoning with a scalable, verifiable synthesis pipeline. The difficulty calibration loop is a strong methodological contribution.