TL;DR: The performance gains of vision tool-use RL are dominated by intrinsic capability improvements; on the tool side, training mainly reduces tool-induced harm rather than improving tool-based correction of intrinsic failures.
Vision tool-use RL equips vision-language models with explicit visual operators (e.g., crop-and-zoom) and trains them to invoke tools during reasoning. Empirically, this yields sizeable performance gains on multimodal benchmarks. However, a central question remains insufficiently understood: what does vision tool-use RL actually learn?
Performance improvements may arise from three distinct sources: (1) strengthening the model's intrinsic capability (better perception even without tools); (2) improving tool use itself (better when-to-call decisions and execution quality); (3) reducing tool-induced side effects (fewer harmful calls, less schema interference). Existing evaluations typically report only end-to-end tool-available accuracy, which conflates these sources and hinders mechanistic attribution of the gains.
We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine analysis framework to disentangle intrinsic capability changes from tool-induced effects. MED quantifies how much performance change comes from intrinsic improvement, decomposes tool-induced effects into gain and harm components, and diagnoses the underlying mechanisms driving these changes.
Reinforcement learning (RL)-based post-training has recently been extended to multimodal settings, where vision-language models (VLMs) are equipped with visual operators such as crop-and-zoom to enable interactive perception. While this paradigm achieves strong performance gains on multimodal benchmarks, it remains unclear whether these gains are driven by improvements in tool use or by changes in intrinsic capability.
The MED framework provides a systematic answer by disentangling intrinsic capability changes from tool-induced effects, decomposing the tool-induced performance difference into gain and harm terms, and probing the mechanisms driving their evolution through checkpoint-level analysis.
The MED framework provides a coarse-to-fine analysis of vision tool-use reinforcement learning through three sequential steps:
We measure progress as the change in accuracy from the initial checkpoint. The drift is defined as: $$f_{\mathrm{wo}}(t) = \mathrm{Acc}_{\mathrm{wo}}(t) - \mathrm{Acc}_{\mathrm{wo}}(0), \quad f_{\mathrm{w}}(t) = \mathrm{Acc}_{\mathrm{w}}(t) - \mathrm{Acc}_{\mathrm{w}}(0)$$ where \(f_{\mathrm{wo}}(t)\) measures intrinsic capability change (tool-free), and \(f_{\mathrm{w}}(t)\) measures end-to-end change when tool use is available.
We define the tool-induced performance gap at checkpoint \(t\): $$G(t) \triangleq \mathrm{Acc}_{\mathrm{w}}(t) - \mathrm{Acc}_{\mathrm{wo}}(t)$$ The evolution of this gap gives the tool-induced drift: \(\Delta_{\mathrm{tool}}(t) \triangleq G(t) - G(0)\). This yields an additive decomposition of tool-available drift: $$\underbrace{f_{\mathrm{w}}(t)}_{\text{Tool-available drift}} = \underbrace{f_{\mathrm{wo}}(t)}_{\text{Intrinsic drift}} + \underbrace{\Delta_{\mathrm{tool}}(t)}_{\text{Tool-induced drift}}$$
To summarize contributions over training, we measure the cumulative magnitude of each drift component: $$|B_{\mathrm{wo}}| = \int_{0}^{T} | f_{\mathrm{wo}}(t) |\,dt, \quad |B_{\Delta \mathrm{tool}}| = \int_{0}^{T} | \Delta_{\mathrm{tool}}(t) |\,dt$$ The tool contribution ratio is the fraction of total drift magnitude attributed to tool effects: $$S_{\mathrm{tool}} = \frac{|B_{\Delta \mathrm{tool}}|}{|B_{\mathrm{wo}}| + |B_{\Delta \mathrm{tool}}|}$$ When \(S_{\mathrm{tool}} \approx 0\), intrinsic drift dominates; when \(S_{\mathrm{tool}} \approx 1\), tool-induced drift dominates.
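As a concrete illustration of the Measure step, here is a minimal Python sketch (not the authors' code) assuming accuracies are logged per checkpoint; the cumulative drift integrals are approximated with the trapezoidal rule over checkpoint index, and all names are illustrative.

```python
import numpy as np

def measure_drift(acc_wo, acc_w):
    """acc_wo / acc_w: per-checkpoint accuracy without / with the tool available."""
    acc_wo = np.asarray(acc_wo, dtype=float)
    acc_w = np.asarray(acc_w, dtype=float)
    f_wo = acc_wo - acc_wo[0]                                # intrinsic drift f_wo(t)
    f_w = acc_w - acc_w[0]                                   # tool-available drift f_w(t)
    delta_tool = (acc_w - acc_wo) - (acc_w[0] - acc_wo[0])   # tool-induced drift Delta_tool(t)
    assert np.allclose(f_w, f_wo + delta_tool)               # additive decomposition holds
    B_wo = np.trapz(np.abs(f_wo))                            # |B_wo|
    B_tool = np.trapz(np.abs(delta_tool))                    # |B_Delta_tool|
    s_tool = B_tool / (B_wo + B_tool + 1e-12)                # tool contribution ratio S_tool
    return f_wo, f_w, delta_tool, s_tool

# Hypothetical checkpoint accuracies:
# f_wo, f_w, d_tool, s = measure_drift([0.52, 0.55, 0.58], [0.54, 0.58, 0.60])
```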
While \(S_{\text{tool}}\) quantifies the overall tool-induced drift magnitude, it does not explain the underlying dynamics. To gain deeper understanding, we decompose the performance gap \(G(t)\) based on the model's intrinsic capability.
Partitioning via Intrinsic Capability: At each checkpoint \(t\), intrinsic performance partitions the task set into two disjoint subsets: the failure set \(\mathcal{D}_{\text{fail}}(t)\) (where the model fails without tools) and the success set \(\mathcal{D}_{\text{succ}}(t)\) (where it succeeds). This defines the potential for improvement on \(\mathcal{D}_{\text{fail}}\) versus regression on \(\mathcal{D}_{\text{succ}}\).
By conditioning on tool usage (\(c\): calling the tool; \(\checkmark\)/\(\times\): correct/incorrect prediction), we obtain a four-term decomposition of \(G(t)\):
$$\begin{aligned} G(t) = & \underbrace{P(\mathcal D_{\text{fail}}) P(c \mid \mathcal D_{\text{fail}}) P(\checkmark \mid c, \mathcal D_{\text{fail}})}_{\text{Term 1: Call Gain}} \\ & + \underbrace{P(\mathcal D_{\text{fail}}) P(\neg c \mid \mathcal D_{\text{fail}}) P(\checkmark \mid \neg c, \mathcal D_{\text{fail}})}_{\text{Term 2: Schema Gain}} \\ & - \underbrace{P(\mathcal D_{\text{succ}}) P(c \mid \mathcal D_{\text{succ}}) P(\times \mid c, \mathcal D_{\text{succ}})}_{\text{Term 3: Call Harm}} \\ & - \underbrace{P(\mathcal D_{\text{succ}}) P(\neg c \mid \mathcal D_{\text{succ}}) P(\times \mid \neg c, \mathcal D_{\text{succ}})}_{\text{Term 4: Schema Harm}} \end{aligned}$$
By isolating Gross Gain (Terms 1+2) from Gross Harm (Terms 3+4), we can distinguish skill acquisition from spurious shifts, determining whether drift is Gain-dominant (emerging utility) or Harm-dominant (suppressed by interference).
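The decomposition can be computed directly from per-item outcomes. Below is a hedged sketch assuming each item at a checkpoint is summarized by three booleans: correct without tools, whether the tool was called in the tool-available run, and correct with tools available; the `Record` layout and names are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass

@dataclass
class Record:
    intrinsic_ok: bool  # correct without tools -> D_succ; otherwise D_fail
    called: bool        # tool invoked in the tool-available rollout
    tool_ok: bool       # correct in the tool-available rollout

def decompose_gap(records):
    """Four-term decomposition of G(t) = Acc_w(t) - Acc_wo(t) from per-item records."""
    n = len(records)
    def frac(pred):
        return sum(pred(r) for r in records) / n
    call_gain   = frac(lambda r: not r.intrinsic_ok and r.called and r.tool_ok)      # Term 1
    schema_gain = frac(lambda r: not r.intrinsic_ok and not r.called and r.tool_ok)  # Term 2
    call_harm   = frac(lambda r: r.intrinsic_ok and r.called and not r.tool_ok)      # Term 3
    schema_harm = frac(lambda r: r.intrinsic_ok and not r.called and not r.tool_ok)  # Term 4
    gap = call_gain + schema_gain - call_harm - schema_harm   # equals Acc_w - Acc_wo
    return dict(call_gain=call_gain, schema_gain=schema_gain,
                call_harm=call_harm, schema_harm=schema_harm, gap=gap)
```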
While the 4-term decomposition explains what changes, it does not explain why: each term is a product of three probabilities, so the specific cause of a temporal shift remains ambiguous. For instance, a decline in Call Gain could result from a shrinking failure set, a lower calling probability, or degraded execution quality.
To pinpoint the root cause, we decompose each term into three factors: $$\text{Term}(\mathcal{D}, a, o) = \underbrace{P(\mathcal{D})}_{\text{Mass}} \cdot \underbrace{P(a \mid \mathcal{D})}_{\text{Policy}} \cdot \underbrace{P(o \mid a, \mathcal{D})}_{\text{Quality}}$$
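Continuing the illustrative `Record` format above, here is a sketch of this factorization for one term (Call Gain); the same pattern applies to the other three terms.

```python
def factor_call_gain(records):
    """Return (Mass, Policy, Quality); their product is Term 1 (Call Gain)."""
    fail = [r for r in records if not r.intrinsic_ok]               # D_fail(t)
    mass = len(fail) / len(records)                                 # P(D_fail)
    called = [r for r in fail if r.called]
    policy = len(called) / max(len(fail), 1)                        # P(c | D_fail)
    quality = sum(r.tool_ok for r in called) / max(len(called), 1)  # P(correct | c, D_fail)
    return mass, policy, quality
```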
Diagnostic Insights: This factorization uncovers two critical training dynamics: (1) Intrinsic-Tool Trade-off: As intrinsic capability improves, \(\mathcal{D}_{\text{fail}}\) shrinks, limiting the upper bound of Call Gain even if Quality improves; (2) Policy-Quality Decoupling: We distinguish learning to attempt (Policy) from learning to succeed (Quality).
Three Key Observations:
Detailed Analysis:
Root Cause Analysis:
Because \(\mathcal{D}_{\text{fail}}(t)\) shifts in difficulty over training, we control for this by evaluating on: (i) a fixed initial failure cohort \(\mathcal{D}_{\text{fail}}(0)\), and (ii) persistent failures that remain unsolved. We find that quality gains do not extend to the hardest remaining failures.
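As an illustration of this cohort control, the sketch below assumes per-checkpoint results keyed by item id (a hypothetical data layout) and tracks execution quality on the fixed initial failure cohort versus the persistent failures.

```python
def cohort_quality(per_ckpt):
    """per_ckpt: list of {item_id: Record}, one dict per checkpoint (assumed layout)."""
    fail_sets = [{i for i, r in ck.items() if not r.intrinsic_ok} for ck in per_ckpt]
    fail0 = fail_sets[0]                         # fixed initial failure cohort D_fail(0)
    persistent = set.intersection(*fail_sets)    # never solved without tools at any checkpoint
    def quality(ck, cohort):
        called = [r for i, r in ck.items() if i in cohort and r.called]
        return sum(r.tool_ok for r in called) / max(len(called), 1)
    return [(quality(ck, fail0), quality(ck, persistent)) for ck in per_ckpt]
```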
Synthesizing the above findings, we answer the central question. Contrary to the ideal of tool mastery, current vision tool-use RL learns a more conservative policy: it suppresses harmful tool calls on items the model can already solve, rather than learning to convert intrinsic failures into tool-assisted successes.
Ultimately, the model learns to safely coexist with the tool rather than master it.
In this work, we present a systematic analysis of what vision tool-use RL actually learns. By disentangling intrinsic capability drift from tool-induced effects, and further decomposing tool utility into gain, harm, and their underlying mechanisms, we show that performance improvements are dominated by intrinsic learning rather than by tool-induced effects.
Across models and benchmarks, vision tool-use RL mainly reduces tool-induced harm, while showing limited improvement in tool contribution. Overall, vision tool-use RL learns a conservative policy for VLMs that makes tool availability less harmful, but does not reliably extend tool utility beyond the intrinsic hard core.
Limitations: The analysis focuses on a single vision tool (crop-and-zoom); more complex tools or multi-tool settings may exhibit different dynamics. We analyze outcome-only RL with sparse rewards; tool-aware reward shaping or additional supervision may produce stronger execution learning. Future work could incorporate efficiency, tool-use traces, and more interpretability metrics.
@article{ma2026does,
title={What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom},
author={Ma, Yan and Zhang, Weiyu and Li, Tianle and Du, Linge and Shen, Xuyang and Liu, Pengfei},
journal={arXiv preprint arXiv:2602.01334},
year={2026}
}
Website template adapted from V*: Guided Visual Search.