TL;DR: The performance gains of vision tool-use RL are dominated by intrinsic capability improvements; on the tool side, training mainly reduces tool-induced harm rather than improving tool-based correction of intrinsic failures.
Vision tool-use RL equips vision-language models with explicit visual operators (e.g., crop-and-zoom) and trains them to invoke tools during reasoning. Empirically, this yields sizeable performance gains on multimodal benchmarks. However, a central question remains insufficiently understood: what does vision tool-use RL actually learn?
Performance improvements may arise from three distinct sources: (1) strengthening the model's intrinsic capability (better perception even without tools); (2) improving tool use itself (better when-to-call decisions and execution quality); (3) reducing tool-induced side effects (fewer harmful calls, less schema interference). Existing evaluations typically report only end-to-end tool-available accuracy, which conflates these sources and hinders mechanistic attribution of the gains.
We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine analysis framework to disentangle intrinsic capability changes from tool-induced effects. MED quantifies how much performance change comes from intrinsic improvement, decomposes tool-induced effects into gain and harm components, and diagnoses the underlying mechanisms driving these changes.
Reinforcement learning (RL)-based post-training has recently been extended to multimodal settings, where vision-language models (VLMs) are equipped with visual operators such as crop-and-zoom to enable interactive perception. While this paradigm achieves strong performance gains on multimodal benchmarks, it remains unclear whether these gains are driven by improvements in tool use or by changes in intrinsic capability.
The MED framework provides a systematic answer by disentangling intrinsic capability changes from tool-induced effects, decomposing the tool-induced performance difference into gain and harm terms, and probing the mechanisms driving their evolution through checkpoint-level analysis.
The MED framework provides a coarse-to-fine analysis of vision tool-use reinforcement learning through three sequential steps:
We measure progress as the change in accuracy from the initial checkpoint. The drift is defined as: $$f_{\mathrm{wo}}(t) = \mathrm{Acc}_{\mathrm{wo}}(t) - \mathrm{Acc}_{\mathrm{wo}}(0), \quad f_{\mathrm{w}}(t) = \mathrm{Acc}_{\mathrm{w}}(t) - \mathrm{Acc}_{\mathrm{w}}(0)$$ where \(f_{\mathrm{wo}}(t)\) measures intrinsic capability change (tool-free), and \(f_{\mathrm{w}}(t)\) measures end-to-end change when tool use is available.
We define the tool-induced performance gap at checkpoint \(t\): $$G(t) \triangleq \mathrm{Acc}_{\mathrm{w}}(t) - \mathrm{Acc}_{\mathrm{wo}}(t)$$ The evolution of this gap gives the tool-induced drift: \(\Delta_{\mathrm{tool}}(t) \triangleq G(t) - G(0)\). This yields an additive decomposition of tool-available drift: $$\underbrace{f_{\mathrm{w}}(t)}_{\text{Tool-available drift}} = \underbrace{f_{\mathrm{wo}}(t)}_{\text{Intrinsic drift}} + \underbrace{\Delta_{\mathrm{tool}}(t)}_{\text{Tool-induced drift}}$$
To summarize contributions over training, we measure the cumulative magnitude of each drift component: $$|B_{\mathrm{wo}}| = \int_{0}^{T} | f_{\mathrm{wo}}(t) |\,dt, \quad |B_{\Delta \mathrm{tool}}| = \int_{0}^{T} | \Delta_{\mathrm{tool}}(t) |\,dt$$ The tool contribution ratio is the fraction of total drift magnitude attributed to tool effects: $$S_{\mathrm{tool}} = \frac{|B_{\Delta \mathrm{tool}}|}{|B_{\mathrm{wo}}| + |B_{\Delta \mathrm{tool}}|}$$ When \(S_{\mathrm{tool}} \approx 0\), intrinsic drift dominates; when \(S_{\mathrm{tool}} \approx 1\), tool-induced drift dominates.
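As a concrete illustration of the Measure step, here is a minimal Python sketch (not the authors' code) assuming accuracies are logged per checkpoint; the cumulative drift integrals are approximated with the trapezoidal rule over checkpoint index, and all names are illustrative.

```python
import numpy as np

def measure_drift(acc_wo, acc_w):
    """acc_wo / acc_w: per-checkpoint accuracy without / with the tool available."""
    acc_wo = np.asarray(acc_wo, dtype=float)
    acc_w = np.asarray(acc_w, dtype=float)
    f_wo = acc_wo - acc_wo[0]                                # intrinsic drift f_wo(t)
    f_w = acc_w - acc_w[0]                                   # tool-available drift f_w(t)
    delta_tool = (acc_w - acc_wo) - (acc_w[0] - acc_wo[0])   # tool-induced drift Delta_tool(t)
    assert np.allclose(f_w, f_wo + delta_tool)               # additive decomposition holds
    B_wo = np.trapz(np.abs(f_wo))                            # |B_wo|
    B_tool = np.trapz(np.abs(delta_tool))                    # |B_Delta_tool|
    s_tool = B_tool / (B_wo + B_tool + 1e-12)                # tool contribution ratio S_tool
    return f_wo, f_w, delta_tool, s_tool

# Hypothetical checkpoint accuracies:
# f_wo, f_w, d_tool, s = measure_drift([0.52, 0.55, 0.58], [0.54, 0.58, 0.60])
```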
While \(S_{\text{tool}}\) quantifies the overall tool-induced drift magnitude, it does not explain the underlying dynamics. To gain deeper understanding, we decompose the performance gap \(G(t)\) based on the model's intrinsic capability.
Partitioning via Intrinsic Capability: At each checkpoint \(t\), intrinsic performance partitions the task set into two disjoint subsets: the failure set \(\mathcal{D}_{\text{fail}}(t)\) (where the model fails without tools) and the success set \(\mathcal{D}_{\text{succ}}(t)\) (where it succeeds). This defines the potential for improvement on \(\mathcal{D}_{\text{fail}}\) versus regression on \(\mathcal{D}_{\text{succ}}\).
By conditioning on tool usage (\(c\): calling the tool; \(\checkmark\)/\(\times\): correct/incorrect prediction), we obtain a four-term decomposition of \(G(t)\):
$$\begin{aligned} G(t) = & \underbrace{P(\mathcal D_{\text{fail}}) P(c \mid \mathcal D_{\text{fail}}) P(\checkmark \mid c, \mathcal D_{\text{fail}})}_{\text{Term 1: Call Gain}} \\ & + \underbrace{P(\mathcal D_{\text{fail}}) P(\neg c \mid \mathcal D_{\text{fail}}) P(\checkmark \mid \neg c, \mathcal D_{\text{fail}})}_{\text{Term 2: Schema Gain}} \\ & - \underbrace{P(\mathcal D_{\text{succ}}) P(c \mid \mathcal D_{\text{succ}}) P(\times \mid c, \mathcal D_{\text{succ}})}_{\text{Term 3: Call Harm}} \\ & - \underbrace{P(\mathcal D_{\text{succ}}) P(\neg c \mid \mathcal D_{\text{succ}}) P(\times \mid \neg c, \mathcal D_{\text{succ}})}_{\text{Term 4: Schema Harm}} \end{aligned}$$
By isolating Gross Gain (Terms 1+2) from Gross Harm (Terms 3+4), we can distinguish skill acquisition from spurious shifts, determining whether drift is Gain-dominant (emerging utility) or Harm-dominant (suppressed by interference).
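The decomposition can be computed directly from per-item outcomes. Below is a hedged sketch assuming each item at a checkpoint is summarized by three booleans: correct without tools, whether the tool was called in the tool-available run, and correct with tools available; the `Record` layout and names are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass

@dataclass
class Record:
    intrinsic_ok: bool  # correct without tools -> D_succ; otherwise D_fail
    called: bool        # tool invoked in the tool-available rollout
    tool_ok: bool       # correct in the tool-available rollout

def decompose_gap(records):
    """Four-term decomposition of G(t) = Acc_w(t) - Acc_wo(t) from per-item records."""
    n = len(records)
    def frac(pred):
        return sum(pred(r) for r in records) / n
    call_gain   = frac(lambda r: not r.intrinsic_ok and r.called and r.tool_ok)      # Term 1
    schema_gain = frac(lambda r: not r.intrinsic_ok and not r.called and r.tool_ok)  # Term 2
    call_harm   = frac(lambda r: r.intrinsic_ok and r.called and not r.tool_ok)      # Term 3
    schema_harm = frac(lambda r: r.intrinsic_ok and not r.called and not r.tool_ok)  # Term 4
    gap = call_gain + schema_gain - call_harm - schema_harm   # equals Acc_w - Acc_wo
    return dict(call_gain=call_gain, schema_gain=schema_gain,
                call_harm=call_harm, schema_harm=schema_harm, gap=gap)
```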
While the 4-term decomposition explains what changes, it does not explain why: each term is a product of three probabilities, so the specific cause of a temporal shift remains ambiguous. For instance, a decline in Call Gain could result from a shrinking failure set, a lower calling probability, or degraded execution quality.
To pinpoint the root cause, we decompose each term into three factors: $$\text{Term}(\mathcal{D}, a, o) = \underbrace{P(\mathcal{D})}_{\text{Mass}} \cdot \underbrace{P(a \mid \mathcal{D})}_{\text{Policy}} \cdot \underbrace{P(o \mid a, \mathcal{D})}_{\text{Quality}}$$
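Continuing the illustrative `Record` format above, here is a sketch of this factorization for one term (Call Gain); the same pattern applies to the other three terms.

```python
def factor_call_gain(records):
    """Return (Mass, Policy, Quality); their product is Term 1 (Call Gain)."""
    fail = [r for r in records if not r.intrinsic_ok]               # D_fail(t)
    mass = len(fail) / len(records)                                 # P(D_fail)
    called = [r for r in fail if r.called]
    policy = len(called) / max(len(fail), 1)                        # P(c | D_fail)
    quality = sum(r.tool_ok for r in called) / max(len(called), 1)  # P(correct | c, D_fail)
    return mass, policy, quality
```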
Diagnostic Insights: This factorization uncovers two critical training dynamics: (1) Intrinsic-Tool Trade-off: As intrinsic capability improves, \(\mathcal{D}_{\text{fail}}\) shrinks, limiting the upper bound of Call Gain even if Quality improves; (2) Policy-Quality Decoupling: We distinguish learning to attempt (Policy) from learning to succeed (Quality).
Three Key Observations:
Detailed Analysis:
Root Cause Analysis:
Because \(\mathcal{D}_{\text{fail}}(t)\) shifts in difficulty over training, we control for this by evaluating on: (i) a fixed initial failure cohort \(\mathcal{D}_{\text{fail}}(0)\), and (ii) persistent failures that remain unsolved. We find that quality gains do not extend to the hardest remaining failures.
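As an illustration of this cohort control, the sketch below assumes per-checkpoint results keyed by item id (a hypothetical data layout) and tracks execution quality on the fixed initial failure cohort versus the persistent failures.

```python
def cohort_quality(per_ckpt):
    """per_ckpt: list of {item_id: Record}, one dict per checkpoint (assumed layout)."""
    fail_sets = [{i for i, r in ck.items() if not r.intrinsic_ok} for ck in per_ckpt]
    fail0 = fail_sets[0]                         # fixed initial failure cohort D_fail(0)
    persistent = set.intersection(*fail_sets)    # never solved without tools at any checkpoint
    def quality(ck, cohort):
        called = [r for i, r in ck.items() if i in cohort and r.called]
        return sum(r.tool_ok for r in called) / max(len(called), 1)
    return [(quality(ck, fail0), quality(ck, persistent)) for ck in per_ckpt]
```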
Synthesizing the above findings, we answer the central question. Contrary to the ideal of tool mastery, current vision tool-use RL learns a more conservative policy: it suppresses harmful tool calls on items the model can already solve, rather than learning to convert intrinsic failures into tool-assisted successes.
Ultimately, the model learns to safely coexist with the tool rather than master it.
In this work, we present a systematic analysis of what vision tool-use RL actually learns. By disentangling intrinsic capability drift from tool-induced effects, and further decomposing tool utility into gain, harm, and their underlying mechanisms, we show that performance improvements are dominated by intrinsic learning rather than by tool-induced effects.
Across models and benchmarks, vision tool-use RL mainly reduces tool-induced harm, while showing limited improvement in tool contribution. Overall, vision tool-use RL learns a conservative policy for VLMs that makes tool availability less harmful, but does not reliably extend tool utility beyond the intrinsic hard core.
Limitations: The analysis focuses on a single vision tool (crop-and-zoom); more complex tools or multi-tool settings may exhibit different dynamics. We analyze outcome-only RL with sparse rewards; tool-aware reward shaping or additional supervision may produce stronger execution learning. Future work could incorporate efficiency, tool-use traces, and more interpretability metrics.
@article{ma2026does,
title={What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom},
author={Ma, Yan and Zhang, Weiyu and Li, Tianle and Du, Linge and Shen, Xuyang and Liu, Pengfei},
journal={arXiv preprint arXiv:2602.01334},
year={2026}
}
Website template adapted from V*: Guided Visual Search.