What Does Vision Tool-Use Reinforcement Learning Really Learn?
Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom


TL;DR: Vision tool-use RL improves performance mainly through intrinsic capability gains and by reducing tool-induced harm; it does not significantly improve tool-based correction of intrinsic failures.

Introduction

Vision tool-use RL equips vision-language models with explicit visual operators (e.g., crop-and-zoom) and trains them to invoke tools during reasoning. Empirically, this yields sizeable performance gains on multimodal benchmarks. However, a central question remains insufficiently understood: what does vision tool-use RL actually learn?

Performance improvements may arise from three distinct sources: (1) strengthening the model's intrinsic capability (better perception even without tools); (2) improving tool use itself (better when-to-call decisions and execution quality); and (3) reducing tool-induced side effects (fewer harmful calls, less schema interference). Existing evaluations typically report only end-to-end tool-available accuracy, which makes it difficult to attribute the gains to any one of these sources.

We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine analysis framework to disentangle intrinsic capability changes from tool-induced effects. MED quantifies how much performance change comes from intrinsic improvement, decomposes tool-induced effects into gain and harm components, and diagnoses the underlying mechanisms driving these changes.

Overview

Reinforcement learning (RL)-based post-training has recently been extended to multimodal settings, where vision-language models (VLMs) are equipped with visual operators such as crop-and-zoom to enable interactive perception. While this paradigm achieves strong performance gains on multimodal benchmarks, it remains unclear whether these gains are driven by improvements in tool use or by changes in the model's intrinsic capability.

The MED framework provides a systematic answer by disentangling intrinsic capability changes from tool-induced effects, decomposing the tool-induced performance difference into gain and harm terms, and probing the mechanisms driving their evolution through checkpoint-level analysis.

The MED Framework

The MED framework provides a coarse-to-fine analysis of vision tool-use reinforcement learning through three sequential steps:

Figure: The MED framework. Measure quantifies drift components, Explain decomposes them into gain and harm terms, and Diagnose factorizes each term into Mass × Policy × Quality factors.

1. Measure: Quantifying Drift Components

We measure progress as the change in accuracy from the initial checkpoint. The drift is defined as: $$f_{\mathrm{wo}}(t) = \mathrm{Acc}_{\mathrm{wo}}(t) - \mathrm{Acc}_{\mathrm{wo}}(0), \quad f_{\mathrm{w}}(t) = \mathrm{Acc}_{\mathrm{w}}(t) - \mathrm{Acc}_{\mathrm{w}}(0)$$ where \(f_{\mathrm{wo}}(t)\) measures intrinsic capability change (tool-free), and \(f_{\mathrm{w}}(t)\) measures end-to-end change when tool use is available.

We define the tool-induced performance gap at checkpoint \(t\): $$G(t) \triangleq \mathrm{Acc}_{\mathrm{w}}(t) - \mathrm{Acc}_{\mathrm{wo}}(t)$$ The evolution of this gap gives the tool-induced drift: \(\Delta_{\mathrm{tool}}(t) \triangleq G(t) - G(0)\). This yields an additive decomposition of tool-available drift: $$\underbrace{f_{\mathrm{w}}(t)}_{\text{Tool-available drift}} = \underbrace{f_{\mathrm{wo}}(t)}_{\text{Intrinsic drift}} + \underbrace{\Delta_{\mathrm{tool}}(t)}_{\text{Tool-induced drift}}$$

To summarize contributions over training, we measure the cumulative magnitude of each drift component: $$|B_{\mathrm{wo}}| = \int_{0}^{T} | f_{\mathrm{wo}}(t) |\,dt, \quad |B_{\Delta \mathrm{tool}}| = \int_{0}^{T} | \Delta_{\mathrm{tool}}(t) |\,dt$$ The tool contribution ratio is the fraction of total drift magnitude attributed to tool effects: $$S_{\mathrm{tool}} = \frac{|B_{\Delta \mathrm{tool}}|}{|B_{\mathrm{wo}}| + |B_{\Delta \mathrm{tool}}|}$$ When \(S_{\mathrm{tool}} \approx 0\), intrinsic drift dominates; when \(S_{\mathrm{tool}} \approx 1\), tool-induced drift dominates.
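The Measure step can be summarized in a few lines of code. The sketch below is illustrative only (not the released implementation): it computes the drift components and the tool contribution ratio from per-checkpoint accuracies, with the function and variable names and the trapezoidal approximation of the integrals chosen here for exposition.

```python
# A minimal sketch of the Measure step (illustrative, not the released code).
# Inputs: per-checkpoint accuracies with and without the tool, plus the
# training step of each checkpoint.
import numpy as np

def cum_magnitude(values, steps):
    """Trapezoidal integral of |values| over training steps."""
    y = np.abs(np.asarray(values, dtype=float))
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(steps)))

def measure_drift(acc_wo, acc_w, steps):
    acc_wo, acc_w, steps = (np.asarray(a, dtype=float) for a in (acc_wo, acc_w, steps))
    f_wo = acc_wo - acc_wo[0]                                # intrinsic drift
    f_w = acc_w - acc_w[0]                                   # tool-available drift
    delta_tool = (acc_w - acc_wo) - (acc_w[0] - acc_wo[0])   # G(t) - G(0)
    B_wo = cum_magnitude(f_wo, steps)                        # |B_wo|
    B_tool = cum_magnitude(delta_tool, steps)                # |B_dtool|
    s_tool = B_tool / (B_wo + B_tool)                        # tool contribution ratio
    return f_wo, f_w, delta_tool, s_tool
```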

2. Explain: 4-Term Decomposition

While \(S_{\text{tool}}\) quantifies the overall tool-induced drift magnitude, it does not explain the underlying dynamics. To gain deeper understanding, we decompose the performance gap \(G(t)\) based on the model's intrinsic capability.

Partitioning via Intrinsic Capability: At each checkpoint \(t\), intrinsic performance partitions the task set into two disjoint subsets: the failure set \(\mathcal{D}_{\text{fail}}(t)\) (where the model fails without tools) and the success set \(\mathcal{D}_{\text{succ}}(t)\) (where it succeeds). This defines the potential for improvement on \(\mathcal{D}_{\text{fail}}\) versus regression on \(\mathcal{D}_{\text{succ}}\).

By conditioning on tool usage (\(c\): calling the tool; \(\checkmark\)/\(\times\): correct/incorrect prediction), we obtain a four-term decomposition of \(G(t)\):

$$\begin{aligned} G(t) = & \underbrace{P(\mathcal D_{\text{fail}}) P(c \mid \mathcal D_{\text{fail}}) P(\checkmark \mid c, \mathcal D_{\text{fail}})}_{\text{Term 1: Call Gain}} \\ & + \underbrace{P(\mathcal D_{\text{fail}}) P(\neg c \mid \mathcal D_{\text{fail}}) P(\checkmark \mid \neg c, \mathcal D_{\text{fail}})}_{\text{Term 2: Schema Gain}} \\ & - \underbrace{P(\mathcal D_{\text{succ}}) P(c \mid \mathcal D_{\text{succ}}) P(\times \mid c, \mathcal D_{\text{succ}})}_{\text{Term 3: Call Harm}} \\ & - \underbrace{P(\mathcal D_{\text{succ}}) P(\neg c \mid \mathcal D_{\text{succ}}) P(\times \mid \neg c, \mathcal D_{\text{succ}})}_{\text{Term 4: Schema Harm}} \end{aligned}$$

By isolating Gross Gain (Terms 1+2) from Gross Harm (Terms 3+4), we can distinguish skill acquisition from spurious shifts, determining whether drift is Gain-dominant (emerging utility) or Harm-dominant (suppressed by interference).
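As a rough illustration, the four terms can be estimated by conditioning per-example evaluation records on intrinsic success, tool calling, and the tool-available outcome. The sketch below is a hedged example; the record fields correct_wo, called_tool, and correct_w are hypothetical names, not the authors' data schema.

```python
# Hedged sketch of the Explain step: estimate the four terms of G(t) from
# per-example records. Each record is assumed to contain three booleans:
#   correct_wo  - correct without the tool (defines D_fail / D_succ)
#   called_tool - whether the model invoked the tool when it was available
#   correct_w   - correct in the tool-available run
def explain_gap(records):
    n = len(records)
    terms = {"call_gain": 0, "schema_gain": 0, "call_harm": 0, "schema_harm": 0}
    for r in records:
        if not r["correct_wo"]:                               # failure set D_fail
            if r["called_tool"] and r["correct_w"]:
                terms["call_gain"] += 1                       # Term 1
            elif not r["called_tool"] and r["correct_w"]:
                terms["schema_gain"] += 1                     # Term 2
        else:                                                 # success set D_succ
            if r["called_tool"] and not r["correct_w"]:
                terms["call_harm"] += 1                       # Term 3
            elif not r["called_tool"] and not r["correct_w"]:
                terms["schema_harm"] += 1                     # Term 4
    probs = {k: v / n for k, v in terms.items()}
    # Sanity check: G(t) = (T1 + T2) - (T3 + T4) equals Acc_w - Acc_wo.
    probs["net_gap"] = (probs["call_gain"] + probs["schema_gain"]
                        - probs["call_harm"] - probs["schema_harm"])
    return probs
```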

3. Diagnose: Factor Analysis

While the 4-term decomposition explains what changes, each term is still a product of three probabilities, so the specific cause of a temporal shift remains unclear. For instance, a decline in Call Gain could result from a shrinking failure set, a lower calling probability, or degraded execution quality.

To pinpoint the root cause, we decompose each term into three factors: $$\text{Term}(\mathcal{D}, a, o) = \underbrace{P(\mathcal{D})}_{\text{Mass}} \cdot \underbrace{P(a \mid \mathcal{D})}_{\text{Policy}} \cdot \underbrace{P(o \mid a, \mathcal{D})}_{\text{Quality}}$$

Diagnostic Insights: This factorization uncovers two critical training dynamics: (1) Intrinsic-Tool Trade-off: As intrinsic capability improves, \(\mathcal{D}_{\text{fail}}\) shrinks, limiting the upper bound of Call Gain even if Quality improves; (2) Policy-Quality Decoupling: We distinguish learning to attempt (Policy) from learning to succeed (Quality).
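Continuing the same hypothetical record schema as above, each term can then be split into its three factors. The helper below is a sketch of that factorization, with the set, action, and outcome predicates passed in as functions.

```python
# Hedged sketch of the Diagnose step: factorize one term into Mass x Policy x Quality.
def diagnose_term(records, in_set, action, outcome):
    """in_set/action/outcome are predicates on a record, e.g. for Call Gain:
       in_set  = lambda r: not r["correct_wo"]   # D_fail
       action  = lambda r: r["called_tool"]      # c
       outcome = lambda r: r["correct_w"]        # correct prediction
    """
    n = len(records)
    subset = [r for r in records if in_set(r)]
    acted = [r for r in subset if action(r)]
    hit = [r for r in acted if outcome(r)]
    mass = len(subset) / n if n else 0.0                     # P(D)
    policy = len(acted) / len(subset) if subset else 0.0     # P(a | D)
    quality = len(hit) / len(acted) if acted else 0.0        # P(o | a, D)
    return mass, policy, quality, mass * policy * quality    # last value = the term
```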

Understanding the Results

MEASURE: The Dominance of Intrinsic Drift

Figure: Quantifying intrinsic and tool-induced drift. Learning dynamics are aggregated across six benchmarks (VStar, HR-Bench 4k/8k, VisualProbe Easy/Medium/Hard), evaluated every 80 gradient steps (21 checkpoints). The grey area (\(|B_{\mathrm{wo}}|\)) is the cumulative magnitude of intrinsic drift (\(f_{\mathrm{wo}}\)). The colored area is the magnitude of tool-induced drift (\(\Delta_{\mathrm{tool}}\)): green indicates positive relative gain (\(f_{\mathrm{w}} > f_{\mathrm{wo}}\)), while red indicates negative relative drift (\(f_{\mathrm{w}} < f_{\mathrm{wo}}\)). Color intensity corresponds to the tool call rate. The top progress bar shows the tool contribution ratio (\(S_{\mathrm{tool}}\)), i.e., the proportion of total drift magnitude attributed to tool effects.

Three Key Observations:

  1. Intrinsic drift dominates overall performance change. Contrary to common intuition, most performance gains are driven by improvements in intrinsic capability. The tool contribution ratio \(S_{\text{tool}}\) remains low (0.30 for Qwen2.5-VL and 0.22 for Qwen3-VL), indicating that at least 70% of the learning progress stems from intrinsic capability, independent of tool access.
  2. Relative drift diverges across initialization regimes. Models with no prior tool training (Qwen2.5-VL) show positive relative gain (\(f_{\mathrm{w}} > f_{\mathrm{wo}}\), green area), while models with prior tool training (Qwen3-VL) exhibit negative relative drift (\(f_{\mathrm{wo}} > f_{\mathrm{w}}\), red area). This does not imply forgetting, but rather a shift in reliance: the tool becomes less critical as intrinsic capability expands.
  3. Absolute performance improves monotonically. Despite the negative relative drift of Qwen3-VL, both \(\mathrm{Acc}_{\mathrm{w}}\) and \(\mathrm{Acc}_{\mathrm{wo}}\) increase consistently throughout training.

EXPLAIN: Mitigating Harm Rather Than Maximizing Gain

Figure: Decomposition of the tool-induced performance gap \(G(t)\), averaged across six benchmarks. The net gap \(G(t)\) (yellow diamonds) is broken down into Gross Gain (green; T1+T2) and Gross Harm (red; T3+T4). Gross Gain consists of Call Gain (T1; intrinsic failures corrected via tool execution) and Schema Gain (T2; schema-only recovery without tool calls). Gross Harm consists of Call Harm (T3; intrinsic successes flipped to errors after tool calls) and Schema Harm (T4; errors induced by the tool schema without calls). Key observation: Gross Gain stagnates while Gross Harm decreases consistently, indicating that RL primarily reduces tool-induced harm rather than maximizing tool-based correction.


DIAGNOSE: Suppressing Errors Rather Than Enhancing Correction

Figure: Factor decomposition of tool-induced effects. The temporal evolution of the four terms is shown, each factorized into Mass, Policy, and Quality. In each subplot, the thick line shows the term value (left axis), and thin lines show its factors (right axis): Mass (grey, \(P(\mathcal{D})\)), Policy (blue, \(P(a\mid\mathcal{D})\)), and Quality (orange, \(P(o\mid a,\mathcal{D})\)). Key findings: limited failure correction (Call-Gain quality shows little improvement on current failures), reduced breakage (Call-Harm quality decreases), and schema-interference mitigation (Schema Harm decreases).


Robustness to the Moving Failure Set

Figure: Robustness to the moving failure set. Call-Gain quality \(P(\checkmark \mid c, \mathcal{D}_{\text{fail}})\) is evaluated under three failure-set definitions: the current failure set \(\mathcal{D}_{\text{fail}}(t)\) (Dynamic), the fixed initial cohort \(\mathcal{D}_{\text{fail}}(0)\) (Fixed), and persistent failures \(\mathcal{D}_{\text{fail}}(0)\cap \mathcal{D}_{\text{fail}}(t)\) (Persistent). Improvement is observed on the fixed cohort but remains limited on the current and persistent failure sets.

Because \(\mathcal{D}_{\text{fail}}(t)\) shifts in difficulty over training, we control for this by evaluating on: (i) a fixed initial failure cohort \(\mathcal{D}_{\text{fail}}(0)\), and (ii) persistent failures that remain unsolved. We find that quality gains do not extend to the hardest remaining failures.
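Under the same hypothetical record schema as the sketches above, the snippet below illustrates how the three failure-set definitions can be constructed and how Call-Gain quality is evaluated on each; records are assumed to be keyed by example id at checkpoints 0 and t.

```python
# Hedged sketch of the robustness check: Call-Gain quality P(correct | call, D_fail)
# under the Dynamic, Fixed, and Persistent failure-set definitions.
def call_gain_quality(records, ids):
    called = [i for i in ids if records[i]["called_tool"]]
    if not called:
        return float("nan")
    return sum(records[i]["correct_w"] for i in called) / len(called)

def quality_under_failure_sets(records_0, records_t):
    """records_0 / records_t: dicts mapping example id -> record at checkpoint 0 / t."""
    fail_0 = {i for i, r in records_0.items() if not r["correct_wo"]}   # fixed initial cohort
    fail_t = {i for i, r in records_t.items() if not r["correct_wo"]}   # current failures
    persistent = fail_0 & fail_t                                        # still unsolved at t
    return {
        "dynamic": call_gain_quality(records_t, fail_t),
        "fixed": call_gain_quality(records_t, fail_0),
        "persistent": call_gain_quality(records_t, persistent),
    }
```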

What Does Vision Tool-Use RL Really Learn?

Synthesizing the above findings, we answer the central question. Contrary to the ideal of tool mastery, current vision tool-use RL learns a more conservative policy:

  1. Limited Contribution: Tool-induced effects remain a minor component (~20-30%) of overall improvement. While tool access contributes to performance, its effect is limited compared to intrinsic improvements.
  2. Interference Management: The model reduces Gross Harm by suppressing execution errors (\(P(\times \mid c, \mathcal{D}_{\text{succ}}) \downarrow\)) and mitigating schema distraction.
  3. Limited Failure Correction on Hard Cases: Call-Gain quality \(P(\checkmark \mid c,\mathcal{D}_{\text{fail}})\) shows little improvement on the current failure set and on persistent failures, indicating no strengthening of tool-based correction on instances that remain unsolved without tools.

Ultimately, the model learns to safely coexist with the tool rather than master it.

Conclusion

In this work, we present a systematic analysis of what vision tool-use RL actually learns. By disentangling intrinsic capability drift from tool-induced effects, and further decomposing tool utility into gain, harm, and their underlying mechanisms, we show that performance improvements are dominated by intrinsic learning rather than by tool-induced effects.

Across models and benchmarks, vision tool-use RL mainly reduces tool-induced harm while showing limited improvement in tool-based gain. Overall, it teaches VLMs a conservative policy that makes tool availability less harmful, but does not reliably extend tool utility to the hard core of instances that remain unsolved without tools.

Limitations: The analysis focuses on a single vision tool (crop-and-zoom); more complex tools or multi-tool settings may exhibit different dynamics. We analyze outcome-only RL with sparse rewards; tool-aware reward shaping or additional supervision may produce stronger execution learning. Future work could incorporate efficiency metrics, analysis of tool-use traces, and further interpretability measures.

Citation

@article{ma2026does,
  title={What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom},
  author={Ma, Yan and Zhang, Weiyu and Li, Tianle and Du, Linge and Shen, Xuyang and Liu, Pengfei},
  journal={arXiv preprint arXiv:2602.01334},
  year={2026}
}

Website template adapted from V*: Guided Visual Search.