Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining.
To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL; the MGT stage then generalizes these reasoning capabilities to multimodal domains.
Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
Recent advances in Large Language Models have shown promising results in reasoning tasks. However, extending these capabilities to multimodal domains presents unique challenges, particularly for compact 3B-parameter Large Multimodal Models (LMMs). These models face two critical limitations:
Data limitations. Rule-based RL requires uniquely verifiable answers to compute accurate rewards, but multimodal tasks often involve answer ambiguity (e.g., image descriptions, visual QA). Additionally, while perception-focused data is abundant, complex reasoning examples are scarce, which can leave reasoning capabilities under-trained.
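To make the verifiability requirement concrete, below is a minimal sketch of how a rule-based reward can be computed when an answer is uniquely verifiable. The \boxed{} answer convention, the regex extraction, and the small format credit are our illustrative assumptions, not necessarily the exact reward used in LMM-R1.

```python
import re

# Illustrative rule-based reward for uniquely verifiable answers.
# The \boxed{...} convention and the 0.1 format credit are assumptions.

def extract_answer(response: str) -> str | None:
    """Pull the final answer out of a \\boxed{...} span, if present."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Exact match with the verifiable answer earns full reward;
    a parseable but wrong answer earns only a small format credit."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0          # unparseable output: no reward signal
    if predicted == ground_truth.strip():
        return 1.0          # correct, uniquely verifiable answer
    return 0.1              # well-formatted but incorrect

print(rule_based_reward(r"Since 729 = 3^6, the count is \boxed{4}.", "4"))  # 1.0
```

This also makes the data limitation concrete: open-ended outputs such as image descriptions have no single ground truth to feed into an exact-match check, so they cannot be scored this way.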
Degraded foundational reasoning. Models trained on multimodal data often show degraded performance on text-only tasks, and some LMMs using Chain-of-Thought reasoning even regress on multimodal benchmarks, a problem that is amplified in smaller 3B-parameter architectures due to their limited capacity.
To address these challenges, we propose LMM-R1, a two-stage rule-based RL framework that first strengthens foundational reasoning abilities using text-only data before generalizing to multimodal domains. This approach overcomes the architectural constraints of 3B LMMs while avoiding the need for extensive multimodal training data.
The FRE stage focuses on enhancing the model's foundational reasoning capabilities through two approaches: FRE-Text, which applies rule-based RL to text-only reasoning data with verifiable answers, and FRE-Multi, a comparison variant trained with rule-based RL on multimodal data.
The MGT stage extends these reasoning capabilities to diverse multimodal domains: geometry-focused visual reasoning (MGT-Geo), perception-reasoning balanced tasks (MGT-PerceReason), and agent-based sequential decision making (MGT-Sokoban).
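To make the two-stage recipe concrete, here is a minimal sketch of how FRE and MGT could be composed. The data loaders and the train_with_rl loop are trivial stand-ins we introduce so the sketch runs end to end; they are illustrative, not the actual LMM-R1 training code.

```python
# Hypothetical outline of the two-stage LMM-R1 recipe.

def load_text_reasoning_data():
    # Stage-1 data: text-only problems with uniquely verifiable answers.
    return [{"prompt": "Compute 3^6.", "answer": "729"}]

def load_multimodal_data():
    # Stage-2 data: multimodal tasks (geometry, perception-reasoning, agent-style).
    return [{"prompt": "<image> How many boxes reach their targets?", "answer": "1"}]

def train_with_rl(model, dataset, reward_fn):
    # Stand-in for a PPO-style rule-based RL loop: sample a response per prompt,
    # score it with the verifiable reward, and (in a real trainer) update the policy.
    for example in dataset:
        response = model(example["prompt"])
        _ = reward_fn(response, example["answer"])
    return model

def lmm_r1_pipeline(base_model, reward_fn):
    # Stage 1: Foundational Reasoning Enhancement (FRE) on text-only data.
    fre_model = train_with_rl(base_model, load_text_reasoning_data(), reward_fn)
    # Stage 2: Multimodal Generalization Training (MGT) on multimodal domains.
    return train_with_rl(fre_model, load_multimodal_data(), reward_fn)

# Toy usage: a dummy "model" and an exact-match reward.
def dummy_model(prompt):
    return "729"

def exact_match(response, answer):
    return 1.0 if response.strip() == answer else 0.0

lmm_r1_pipeline(dummy_model, exact_match)
```

The key design choice the sketch captures is that the same rule-based reward drives both stages; only the data distribution changes between FRE and MGT.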
We evaluate LMM-R1 on a variety of multimodal and text-only benchmarks.
LMM-R1 achieves significant performance improvements across both multimodal and text-only reasoning benchmarks. Our two-stage approach demonstrates remarkable effectiveness: the Foundational Reasoning Enhancement (FRE) stage using text-only data improves reasoning capabilities by 4.29% on text-only tasks, while the subsequent Multimodal Generalization Training (MGT) stage successfully transfers these enhanced reasoning abilities to multimodal contexts. The MGT-PerceReason model achieves a 4.83% average improvement on multimodal benchmarks compared to the baseline, with particularly strong gains on reasoning-intensive tasks. Notably, our approach effectively addresses the typical trade-off between reasoning and perception capabilities, enabling simultaneous improvement in both areas without the need for extensive high-quality multimodal training data.
FRE-Text demonstrates significant improvements in reasoning capabilities, raising average text-only benchmark performance by 4.29% over the baseline.
FRE-Multi shows strong visual capabilities but trades off some of the reasoning gains.
We continue rule-based RL training in three distinct domains: geometry-focused visual reasoning, perception-reasoning balanced tasks, and agent-based sequential decision making, demonstrating strong generalization capabilities across different multimodal scenarios.
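For the agent-based domain, the reward can remain rule-based because episode success is mechanically checkable. The sketch below is a hypothetical illustration: a tiny Sokoban-style simulator verifies whether a predicted move sequence pushes every box onto a target. The grid encoding, move set, and binary reward are our assumptions, not the exact environment used in LMM-R1.

```python
# Hypothetical sketch of a verifiable, rule-based reward for the agent domain.

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def simulate_sokoban(walls, boxes, player, moves):
    """Apply a move sequence; return the final positions of the boxes."""
    boxes = set(boxes)
    r, c = player
    for m in moves:
        dr, dc = MOVES[m]
        nr, nc = r + dr, c + dc
        if (nr, nc) in walls:
            continue                      # blocked by a wall
        if (nr, nc) in boxes:
            br, bc = nr + dr, nc + dc
            if (br, bc) in walls or (br, bc) in boxes:
                continue                  # box cannot be pushed
            boxes.remove((nr, nc))
            boxes.add((br, bc))
        r, c = nr, nc
    return boxes

def agent_reward(walls, boxes, targets, player, predicted_moves):
    """Binary, rule-verifiable reward: 1.0 iff every box ends on a target."""
    final_boxes = simulate_sokoban(walls, boxes, player, predicted_moves)
    return 1.0 if final_boxes == set(targets) else 0.0

# Toy instance: a 1x5 corridor with one box and one target.
walls = {(0, -1), (0, 5)}
print(agent_reward(walls, {(0, 2)}, {(0, 3)}, (0, 0), list("RR")))  # 1.0
```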
MGT-Geo demonstrates exceptional performance on geometry-specific tasks, showing strong generalization capabilities.
MGT-PerceReason demonstrates balanced improvements across diverse multimodal tasks, achieving a 4.83% average gain over the baseline on multimodal benchmarks.
Figure: Visualization of MGT-Sokoban solving a complex puzzle requiring multi-step planning.
MGT-Sokoban demonstrates remarkable capabilities in sequential decision-making and planning, including the 3.63% gain on complex Football Game tasks.
The following examples demonstrate how our two-stage training approach enhances reasoning capabilities across different types of problems. These examples highlight the differences in reasoning patterns between the baseline model and our enhanced models.
How many positive integers b have the property that log_b 729 is a positive integer?
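For reference, the arithmetic behind this example works out as follows; this is a worked solution we supply for context, not a reproduction of any model's output.

```latex
% Worked solution: since 729 = 3^6, the base b must be an integer power of 3.
\[
\log_b 729 = n \in \mathbb{Z}^{+}
\;\Longleftrightarrow\;
b^{n} = 729 = 3^{6}
\;\Longleftrightarrow\;
b = 3^{6/n}, \quad n \mid 6 .
\]
\[
n \in \{1, 2, 3, 6\} \;\Rightarrow\; b \in \{729, 27, 9, 3\},
\]
% so exactly 4 positive integers b have the required property.
```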
What is the median number of points scored by the team per game?
How many vehicles in the image have wheels?
Question: What is the purpose of the left lane in the picture?
Choices:
(A) To show the results of immunofluorescent labeling
(B) To indicate the upper layer of synovial membranes
(C) To show the magnification of the image
(D) To display the results of immunohistochemistry
These examples highlight the distinct reasoning patterns that emerge from our different training approaches. The text-only trained model (FRE-Text) demonstrates a significantly more detailed and thorough reasoning process.
In contrast, the multimodal trained model (FRE-Multi) exhibits more concise reasoning that prioritizes efficiency.
This pattern aligns with our research findings: text-only rule-based RL training significantly enhances reasoning depth and thoroughness, while multimodal training optimizes for efficient visual perception at the cost of detailed reasoning. Our two-stage approach successfully leverages these complementary strengths, enabling models to maintain strong reasoning capabilities while effectively processing visual information.
@article{peng2025lmmr1,
  title={LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL},
  author={Peng, Yingzhe and Zhang, Gongrui and Zhang, Miaosen and You, Zhiyuan and Liu, Jie and Zhu, Qipeng and Yang, Kai and Xu, Xingzhong and Geng, Xin and Yang, Xu},
  journal={arXiv preprint},
  year={2025}
}