LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

¹Key Laboratory of New Generation Artificial Intelligence Technology and
Its Interdisciplinary Applications (Southeast University), Ministry of Education, China

²The Chinese University of Hong Kong ³Fudan University ⁴Ant Group
*Equal Contribution. Corresponding Author
[Figure: Overview of the LMM-R1 framework]

LMM-R1 enhances reasoning in compact 3B-parameter Large Multimodal Models through a novel two-stage framework: Foundational Reasoning Enhancement followed by Multimodal Generalization Training.

Abstract

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter models, whose limited capacity constrains both reasoning and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining.

To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains.

Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines on multimodal and text-only benchmarks, respectively, with a 3.63% gain on complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses the need for costly, high-quality multimodal training data.

Motivation

Recent advances in Large Language Models have shown promising results in reasoning tasks. However, extending these capabilities to multimodal domains presents unique challenges, particularly for compact 3B-parameter Large Multimodal Models (LMMs). These models face two critical limitations:

Data Limitations

Rule-based RL requires uniquely verifiable answers for accurate rewards. However, multimodal tasks often involve answer ambiguity (e.g., image descriptions, visual QA). Additionally, while perception-focused data is abundant, complex reasoning examples are limited, potentially leading to insufficient reasoning capabilities.
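To make this concrete, a rule-based reward usually reduces to a deterministic check of the model's final answer against a uniquely verifiable ground truth; ambiguous outputs such as free-form image descriptions simply cannot be scored this way. A minimal Python sketch (the function name and extraction heuristic are ours, not the paper's exact implementation):

import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the response's final answer matches the
    verifiable ground truth. This only works when the task admits a
    unique answer, which is exactly the data constraint noted above."""
    # Crude extraction heuristic: take the last number in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth.strip() else 0.0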

Weak Foundational Reasoning

Models trained on multimodal data often show degraded performance on text-only tasks. Some LMMs that use Chain-of-Thought even degrade on multimodal benchmarks, a problem amplified in smaller 3B-parameter architectures by their limited capacity.

To address these challenges, we propose LMM-R1, a two-stage rule-based RL framework that first strengthens foundational reasoning abilities using text-only data before generalizing to multimodal domains. This approach overcomes the architectural constraints of 3B LMMs while avoiding the need for extensive multimodal training data.

LMM-R1: Two-Stage Training Framework

Foundational Reasoning Enhancement (FRE)

The FRE stage focuses on enhancing the model's foundational reasoning capabilities through two approaches:

  • Text-Only Enhancement: Uses large-scale, high-quality, verifiable text-only data for rule-based RL training to develop strong reasoning foundations (a reward sketch follows this list)
  • Foundational Skills: Builds core reasoning abilities that serve as a basis for multimodal learning
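R1-style rule-based RL commonly composes a format reward (the response follows a <think>…</think><answer>…</answer> template) with an accuracy reward on the extracted answer. A hedged sketch of how the FRE reward could be assembled; the tag template and the 0.1/0.9 weights are assumptions for illustration, not the paper's verified recipe:

import re

TEMPLATE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)

def fre_reward(response: str, ground_truth: str) -> float:
    """Composite rule-based reward for text-only RL: a small bonus for
    emitting the expected reasoning/answer template plus a dominant
    bonus when the extracted answer matches the ground truth."""
    match = TEMPLATE.match(response.strip())
    format_reward = 1.0 if match else 0.0
    answer = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0
    # Accuracy dominates so the policy cannot game the format bonus alone.
    return 0.1 * format_reward + 0.9 * accuracy_reward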

Multimodal Generalization Training (MGT)

The MGT stage extends reasoning capabilities across diverse multimodal domains:

  • General Domain: Includes geometric reasoning and perception-reasoning balanced tasks across 20+ datasets
  • Agent Domain: Focuses on sequential decision-making tasks like Sokoban planning and football game scenarios

Results

We evaluate LMM-R1 on a variety of multimodal and text-only benchmarks.

LMM-R1 achieves significant performance improvements across both multimodal and text-only reasoning benchmarks. Our two-stage approach demonstrates remarkable effectiveness: the Foundational Reasoning Enhancement (FRE) stage using text-only data improves reasoning capabilities by 4.29% on text-only tasks, while the subsequent Multimodal Generalization Training (MGT) stage successfully transfers these enhanced reasoning abilities to multimodal contexts. The MGT-PerceReason model achieves a 4.83% average improvement on multimodal benchmarks compared to the baseline, with particularly strong gains on reasoning-intensive tasks. Notably, our approach effectively addresses the typical trade-off between reasoning and perception capabilities, enabling simultaneous improvement in both areas without the need for extensive high-quality multimodal training data.

Result tables

Foundational Reasoning Enhancement

Text-Only Enhancement (FRE-Text) 📚

FRE-Text demonstrates significant improvements in reasoning capabilities:

  • 📈 4.29% overall enhancement in text-only performance
  • 👑 Strong transfer to multimodal reasoning: 5.34% gain on OlympiadBench
  • 🎯 Text-only data alone improves multimodal reasoning performance by 3.23%

Multimodal Enhancement (FRE-Multi) 🖼️

FRE-Multi shows strong visual capabilities but with reasoning trade-offs:

  • 👁️ 7.36% improvement on MM-Star visual tasks and 3.5% gain on MathVista general multimodal tasks
  • ⚠️ Slight decline in pure reasoning tasks

Multimodal Generalization Training

We continue rule-based RL training in three distinct domains: geometry-focused visual reasoning, perception-reasoning balanced tasks, and agent-based sequential decision-making. The resulting models demonstrate strong generalization across these different multimodal scenarios.

Geometry Domain (MGT-Geo) 📐

MGT-Geo demonstrates exceptional performance in geometry-specific tasks, showing strong generalization capabilities:

[Figure: MGT-Geo results]
  • 📊 3.35% improvement on MathVision geometry tasks across Analytic, Combinatorial, Metric, and Solid geometry
  • 🔄 2.97% improvement on MathVerse geometry problems across categories from Text-Dominant to Vision-Only
  • ⚡ 11.68% gain in vision-only geometric reasoning compared to the FRE-Text baseline
  • 🎯 Significant improvements in both perception and reasoning capabilities for geometry-specific tasks

Perception-Reasoning Balanced Domain (MGT-PerceReason) 🔍

MGT-PerceReason demonstrates balanced improvements across diverse multimodal tasks:

  • 📈 1.6% average improvement across all multimodal benchmarks
  • 👁️ 1.8% gain on MathVista visual understanding tasks
  • 💡 2.88% enhancement on MM-Star general perception tasks
  • 🔍 Maintains strong reasoning capabilities while improving visual perception

Agent Domain (MGT-Sokoban) 🤖

[Figure: Agent domain evaluation results]

[Figure: Sokoban planning demonstration]

Visualization of MGT-Sokoban solving a complex puzzle requiring multi-step planning

MGT-Sokoban demonstrates remarkable capabilities in sequential decision-making and planning:

  • 🎮 47.51% success rate on Sokoban tasks, outperforming the baseline by 5.56% (a rule-based success check is sketched after this list)
  • 🌟 18.99% success rate on unseen Football environment scenarios
  • 🎲 Strong zero-shot transfer to novel agent environments
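For the agent domain, a rule-based reward can be computed by replaying the model's predicted move sequence in a simulator and checking episode success. A self-contained Python sketch for Sokoban; the grid legend and the U/D/L/R move encoding are our assumptions rather than the paper's exact setup:

DIRS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def sokoban_success_reward(grid: list[str], moves: str) -> float:
    """Replay moves on a small Sokoban grid and return 1.0 only when
    every box ends on a target. Legend: '#' wall, 'P' player,
    'B' box, 'T' target, '.' floor."""
    walls, boxes, targets, player = set(), set(), set(), None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "B":
                boxes.add((r, c))
            elif ch == "T":
                targets.add((r, c))
            elif ch == "P":
                player = (r, c)
    for m in moves:
        if m not in DIRS:
            continue  # ignore tokens that are not moves
        dr, dc = DIRS[m]
        step = (player[0] + dr, player[1] + dc)
        if step in walls:
            continue  # bumped a wall: state unchanged
        if step in boxes:
            push = (step[0] + dr, step[1] + dc)
            if push in walls or push in boxes:
                continue  # box is blocked: state unchanged
            boxes.remove(step)
            boxes.add(push)
        player = step
    return 1.0 if boxes == targets else 0.0

# One push to the right solves this corridor puzzle:
assert sokoban_success_reward(["#####", "#PBT#", "#####"], "R") == 1.0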


Model Output Examples

The following examples demonstrate how our two-stage training approach enhances reasoning capabilities across different types of problems. These examples highlight the differences in reasoning patterns between the baseline model and our enhanced models.

Question

How many positive integers b have the property that log_b 729 is a positive integer?
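For reference, this problem has a short exact solution (our addition, independent of the model outputs below): since $729 = 3^6$, asking for $\log_b 729 = n$ with $n$ a positive integer means $b^n = 3^6$, so $b = 3^{6/n}$ is a positive integer exactly when $n \mid 6$. The choices $n \in \{1, 2, 3, 6\}$ give $b \in \{729, 27, 9, 3\}$, so there are $4$ such bases.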

Baseline (Qwen2.5-VL)

[Model response not shown]

FRE-Text

[Model response not shown]

Question

What is the median number of points scored by the team per game?

Baseline (Qwen2.5-VL)

[Model response not shown]

FRE-Multi

[Model response not shown]

Question

How many vehicles in the image have wheels?

Baseline (Qwen2.5-VL)

[Model response not shown]

FRE-Multi

[Model response not shown]

Question

What is the purpose of the left lane in the picture?

Choices:

(A) To show the results of immunofluorescent labeling

(B) To indicate the upper layer of synovial membranes

(C) To show the magnification of the image

(D) To display the results of immunohistochemistry

Baseline (Qwen2.5-VL)

[Model response not shown]

FRE-Multi

[Model response not shown]

These examples highlight the distinct reasoning patterns that emerge from our different training approaches. The text-only trained model (FRE-Text) demonstrates significantly more detailed and thorough reasoning processes with:

  • Explicit step-by-step mathematical derivations
  • Comprehensive exploration of problem-solving approaches
  • Detailed explanations of underlying concepts and principles
  • Systematic verification of answers through multiple methods

In contrast, the multimodal trained model (FRE-Multi) exhibits more concise reasoning that prioritizes efficiency:

  • More direct identification of relevant visual elements
  • Streamlined reasoning processes with fewer intermediate steps
  • Greater focus on perceptual details rather than abstract reasoning
  • Simplified explanations that get to the answer more quickly

This pattern aligns with our research findings: text-only rule-based RL training significantly enhances reasoning depth and thoroughness, while multimodal training optimizes for efficient visual perception at the cost of detailed reasoning. Our two-stage approach successfully leverages these complementary strengths, enabling models to maintain strong reasoning capabilities while effectively processing visual information.

BibTeX

@article{peng2025lmmr1,
  title={LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL},
  author={Peng, Yingzhe and Zhang, Gongrui and Zhang, Miaosen and You, Zhiyuan and Liu, Jie and Zhu, Qipeng and Yang, Kai and Xu, Xingzhong and Geng, Xin and Yang, Xu},
  journal={arXiv preprint},
  year={2025}
}