DeepSeek AI, a prominent player in the large language model arena, has recently published a research paper detailing a new technique aimed at enhancing the scalability of general reward models (GRMs) during the inference phase. Simultaneously, the company has hinted at the imminent arrival of its next-generation model, R2, building anticipation within the AI community.
The paper, titled “Inference-Time Scaling for Generalist Reward Modeling,” introduces a novel method that allows GRMs to optimize reward generation by dynamically producing principles and critiques. This is achieved through rejection fine-tuning and rule-based online reinforcement learning [1-1].
This development comes at a time when the paradigm for scaling LLMs is shifting from the pre-training stage to post-training, particularly the inference phase, following the emergence of models like OpenAI’s o1. This approach leverages more reinforcement learning (compute spent during training) and more extensive “thinking time” (compute spent at inference) to continually improve model performance. Notably, o1 generates a lengthy internal chain of thought before responding to users, refining its reasoning process, exploring different strategies, and identifying its own errors.
DeepSeek’s own R1 series of models has further validated the potential of pure reinforcement learning training (without relying on supervised fine-tuning) to achieve significant leaps in LLM reasoning capabilities.
The fundamental “next token prediction” mechanism of LLMs, while endowing them with vast knowledge, provides little capacity for deep planning or for anticipating long-term outcomes, leaving models prone to short-sighted decisions. Reinforcement learning serves as a crucial complement, providing LLMs with an “internal world model”: it lets them simulate the potential outcomes of different reasoning paths, evaluate the quality of those paths, and select superior solutions, ultimately enabling more systematic long-term planning. The synergy between LLMs and RL is increasingly recognized as key to solving complex problems.
Wu Yi, an assistant professor at Tsinghua’s Institute for Interdisciplinary Information Sciences (IIIS), described the relationship between LLMs and reinforcement learning as “multiplicative” in a recent podcast. While reinforcement learning excels at decision-making, it inherently lacks understanding; building that understanding relies on pre-trained models, on top of which reinforcement learning can then further optimize decision-making capabilities. This “multiplicative relationship” suggests that only when a strong foundation of understanding, memory, and logical reasoning is built during pre-training can reinforcement learning fully unlock its potential to create a complete intelligent agent [1-2].
A comprehensive survey paper titled “Reinforcement Learning Enhanced LLMs: A Survey” outlines the typical three-step process of using RL to train LLMs:
- Reward Model Training: Before fine-tuning, a reward model (or reward function) is trained to approximate human preferences and evaluate different LLM outputs.
- Preference-Based Fine-Tuning: In each fine-tuning iteration, the large language model generates multiple responses to a given instruction, and each response is scored using the trained reward model.
- Policy Optimization: Reinforcement learning optimization techniques are used to update the model’s weights based on the preference scores, aiming to improve response generation.
Integrating reinforcement learning allows large language models to dynamically adjust based on varying preference scores, moving beyond the limitations of a single, pre-determined answer.
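To make the survey’s three steps concrete, here is a minimal, self-contained toy sketch in Python. It is not the survey’s or DeepSeek’s actual training code: the reward model, the canned responses, and the preference-weighted update are all invented placeholders that only illustrate how reward scores flow from step 1 into the update of step 3.

```python
import random

# Toy reward model standing in for a learned preference model (step 1):
# it simply prefers longer responses, plus a little noise.
def reward_model(prompt, response):
    return len(response) + random.uniform(-1.0, 1.0)

# Toy "policy": a distribution over a few canned responses.
responses = ["short", "a medium-length reply", "a much longer, detailed reply"]
probs = [1 / 3] * 3

for step in range(200):
    # Step 2: generate several responses for the instruction and score each one.
    scored = [(i, reward_model("some instruction", responses[i])) for i in range(3)]
    baseline = sum(r for _, r in scored) / len(scored)
    # Step 3: a crude preference-weighted update shifts probability mass toward
    # responses that score above the baseline.
    for i, r in scored:
        probs[i] = max(1e-3, probs[i] + 0.01 * (r - baseline))
    total = sum(probs)
    probs = [p / total for p in probs]

print([round(p, 2) for p in probs])  # mass drifts toward the highest-reward reply
```

In a real pipeline, step 3 would update the model’s weights with an RL algorithm such as PPO rather than nudging a table of probabilities; the sketch only shows the direction of the feedback loop.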
DeepSeek’s SPCT: Addressing the Scaling Challenges of RL for LLMs
Although reinforcement learning in post-training has proven to be a breakthrough for enhancing LLM performance, the algorithms themselves still have significant room for improvement, and the “Scaling Laws” of reinforcement learning remain in their nascent stages.
Unlike traditional scaling laws that focus on increasing data and compute to improve model performance, the scaling laws for reinforcement learning are influenced by more complex factors, including sample throughput, model parameter size, and the complexity of the training environment.
A major hurdle in scaling reinforcement learning is reward sparsity. The reward model is therefore a critical component: generating accurate reward signals is paramount, and achieving both generalization and continuity in reward models is a key focus.
DeepSeek and Tsinghua researchers addressed this challenge in their recent work by exploring the scalability and generalization of reward models at inference time. Their proposed Self-Principled Critique Tuning (SPCT) method aims to improve the scalability of general reward modeling during inference.
The SPCT approach involves two key stages (see the toy sketch after this list):
- Rejection Fine-Tuning: This serves as a cold start, enabling the GRM to adapt to generating principles and critiques in the correct format and type.
- Rule-Based Online RL: This stage further optimizes the generation of principles and critiques.
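A minimal toy sketch of these two stages follows, under the simplifying assumption that a generated critique can be reduced to a list of pointwise scores, so no real model is needed. The functions `generate_critique`, `rejection_fine_tuning_data`, and `rule_based_reward`, and the tiny dataset, are all invented for illustration and are not the paper’s implementation.

```python
import random

# Toy stand-in for a GRM "critique": a list of pointwise scores (1-10),
# one per candidate response, so the sketch runs without a real model.
def generate_critique(query, responses):
    return [random.uniform(1, 10) for _ in responses]

def rejection_fine_tuning_data(dataset, n_samples=4):
    """Stage 1 (cold start): sample several critiques per example and keep only
    those whose top-scored response matches the labelled best one; the kept
    traces would then be used for supervised fine-tuning."""
    accepted = []
    for query, responses, best_idx in dataset:
        for _ in range(n_samples):
            scores = generate_critique(query, responses)
            if scores.index(max(scores)) == best_idx:
                accepted.append((query, responses, scores))
    return accepted

def rule_based_reward(scores, best_idx):
    """Stage 2: the online-RL rule rewards a critique (+1) when it ranks the
    labelled best response highest, and penalises it (-1) otherwise."""
    return 1.0 if scores.index(max(scores)) == best_idx else -1.0

# Tiny labelled preference dataset: (query, candidate responses, index of best).
dataset = [("q1", ["a", "b", "c"], 2), ("q2", ["a", "b"], 0)]
sft_data = rejection_fine_tuning_data(dataset)
rl_rewards = [rule_based_reward(generate_critique(q, rs), i) for q, rs, i in dataset]
print(len(sft_data), rl_rewards)
```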
To achieve effective inference-time scaling, the researchers employ parallel sampling to make better use of compute. By sampling multiple times, DeepSeek-GRM can generate different sets of principles and critiques and select the final reward through voting. Furthermore, a meta reward model (Meta RM) is trained to guide the voting process, further enhancing scaling performance. The Meta RM is a pointwise scalar reward model designed to identify the correctness of the principles and critiques generated by DeepSeek-GRM.
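The voting procedure and Meta RM guidance can also be sketched in a few lines, again with random placeholders standing in for the real models: each parallel sample yields one set of pointwise scores for the candidate responses, the stand-in Meta RM keeps only the top-rated samples, and the surviving scores are summed per candidate to pick a winner. The names `grm_sample` and `meta_rm` and the `k_samples`/`k_meta` parameters are illustrative assumptions, not the paper’s API.

```python
import random

# One parallel GRM sample: a set of pointwise scores (1-10), one per candidate.
def grm_sample(candidates):
    return [random.uniform(1, 10) for _ in candidates]

# Stand-in Meta RM: a scalar quality estimate for a sampled set of critiques.
def meta_rm(sample_scores):
    return random.random()

def guided_vote(candidates, k_samples=8, k_meta=4):
    # Parallel sampling: draw several independent principle/critique samples.
    samples = [grm_sample(candidates) for _ in range(k_samples)]
    # Meta RM guided voting: keep only the k_meta samples it rates highest.
    kept = sorted(samples, key=meta_rm, reverse=True)[:k_meta]
    # Voting: sum the kept scores per candidate and return the best one.
    totals = [sum(sample[i] for sample in kept) for i in range(len(candidates))]
    return totals.index(max(totals)), totals

best, totals = guided_vote(["response A", "response B", "response C"])
print(best, [round(t, 1) for t in totals])
```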
Experimental results demonstrated that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on multiple comprehensive RM benchmarks without significant domain bias.
Looking Ahead: DeepSeek R2 on the Horizon
While the research paper focuses on advancements in reward modeling and inference-time scaling, the mention of DeepSeek’s R1 series, and the progression it implies, suggests that the company is actively developing its next-generation model, R2. Given DeepSeek’s emphasis on pure reinforcement learning for enhancing reasoning, it is highly anticipated that R2 will incorporate and build upon the insights gained from this latest research on scalable reward models.
The AI community will be keenly watching for further announcements regarding DeepSeek R2, eager to see how the company leverages its innovative approaches to reinforcement learning and inference optimization to push the boundaries of large language model capabilities. The focus on scalable reward models hints at a potential emphasis on even more sophisticated self-evaluation and improvement mechanisms within their next flagship model.
The paper “Inference-Time Scaling for Generalist Reward Modeling” is on arXiv.