Learning from Failures: Error-Driven Reinforcement Learning for Tool Use

Ziyi Wang1 · Yuxuan Lu1 · Dakuo Wang1 · et al.
1 Northeastern University
Paper coming soon Code coming soon Data coming soon Leaderboard
TL;DR

We propose RAFT (Reinforcement from Agent Failure Tasks), turning tool-use agent failed trajectories into targeted executable tasks and building an error-driven RL pipeline, achieving 89.25 Pass^1 with RAFT-27B and 82.5 Pass^1 with RAFT-30B-A3B on Tau2-Bench Retail.

Latest Result Updated: June 25, 2026

RAFT-27B — Base model: Qwen3.6-27B, fine-tuned on synthetic retail-domain trajectories

89.25
Pass^1
84.76
Pass^2
81.36
Pass^3
78.95
Pass^4

Background

Tool-use agents are systems that autonomously call APIs, query databases, and execute multi-step workflows in response to natural-language instructions, and are increasingly central to real-world AI deployment. Training them well requires more than teaching a model which tools exist: the agent must ground tool arguments precisely, reason across multiple records, and maintain coherent plans over long interaction horizons. Supervised fine-tuning on synthetic trajectories gets you most of the way there, but consistently leaves a long tail of failure modes that imitation alone cannot fix. This work asks how to close that gap efficiently — without simply generating ever more generic training data.

A Failure Most Agents Make

After SFT, agents fail in stubbornly specific ways — not because they lack tool knowledge, but because they're confidently wrong in the edge cases imitation learning never covers. Here is a real failure mode we saw repeatedly. The user has two open orders, both containing similar items:

User "Can you cancel the blue mug from my recent order?"
Agent (calls cancel_item on Order #A — the wrong order, which also contains a blue mug)
Agent "Done — your blue mug has been cancelled."

The agent understood the high-level intent. It knew the right tool. It generated a syntactically valid call. And it still failed, because resolving "my recent order" requires asking the user, or at minimum disambiguating between two candidate records.

This kind of failure — confident, plausible, and wrong — is exactly the kind that imitation learning struggles to fix. The SFT data covers the easy path; the model only encounters two-order ambiguity in its own rollouts, where there's no teacher to correct it. That observation is the seed of this work.

Why SFT Alone Hits a Ceiling

The standard recipe for tool-use agents is supervised fine-tuning on synthetic interaction trajectories — successful rollouts of an agent, a user simulator, and an executable environment. It works: it's how you get a model that knows tool schemas, generates valid arguments, and tracks state across turns.

But SFT is fundamentally off-policy: the model imitates trajectories produced by some other policy, not the states it actually visits when it's the one driving. In multi-turn tool use, small mistakes compound into states that supervised data barely covers. SFT builds broad competence, then leaves the model alone exactly where it most needs help.

Reinforcement learning on the executable environment is the natural next step. The catch: RL is only as good as its task distribution. If the training tasks don't target the model's actual weaknesses, RL just rehearses what SFT already taught, or worse, finds reward-hacking shortcuts. So the real question after SFT isn't more RL — it's which tasks.

Method

We present RAFT (Reinforcement from Agent Failure Tasks), an error-driven RL pipeline that converts failed tool-use trajectories into targeted executable tasks and trains the agent to correct its own mistakes. The loop has four stages:

RAFT Pipeline

Figure 1: The RAFT pipeline — SFT bootstrap → error discovery → targeted task generation → RL training, with the loop feeding new tasks back into the pool.

Stage 1

SFT Bootstrap

Fine-tune on synthetic Tau-bench-style trajectories. Each trajectory is a full multi-turn record — observations, tool calls, environment feedback, and reasoning — so the model learns the mechanics: which tool to pick, how to follow schemas, how to keep state across turns. This stage exists to give RL a stable, reasonably competent starting policy.

Stage 2

Mining Failure Patterns

Run the SFT model against validation tasks and examine the failures. For each losing trajectory, an LLM judge reads the user request, dialogue history, tool calls, and tool outputs, and answers one question: what went wrong? The explanations are then clustered into recurring patterns.

Three families dominate the post-SFT failure landscape on Retail:

PatternWhat it looks like
Item & argument selection Right intent, wrong specifics: picks the wrong product variant, mixes item IDs across orders, or hands the tool an invalid argument value.
Multi-order reasoning The user has multiple active or historical orders; the model updates the wrong one or attributes an item to the wrong record.
Execution control Recognizes that confirmation is needed but never confirms. Terminates before all required steps. Calls the same tool repeatedly without making progress.

These are not "the model needs more data" failures — they're a precise diagnosis: the model needs to be challenged on precise grounding, multi-record reasoning, and long-horizon execution.

Stage 3

Synthesizing Targeted Tasks

For each error pattern, generate new RL tasks that deliberately provoke it — while preserving a feasible solution path so the reward signal is well-defined. A strong LLM agent first explores the executable environment to produce grounded successful trajectories. A second LLM converts each trajectory into a natural-language user task, injecting a chosen difficulty pattern into the wording.

Worked example — multi-order ambiguity injection

Source trajectory: User has Orders #A (blue mug, ceramic vase) and #B (blue mug, frying pan). Agent successfully cancels the blue mug from Order #A.

Naïve generated task: "Cancel the blue mug from Order #A."

Targeted generated task: "Hey, can you cancel the blue mug from my recent order? I think it was the one with the vase too."

The feasible solution still exists, but the agent must now either resolve the order from the side hint, or — better — ask a clarifying question before acting. The difficulty isn't bolted on by hand-written rules; it's shaped by what the model actually got wrong last round.

Stage 4

RL with Anti-Hacking Rewards

Train with GRPO against the executable environment. Task-success reward is the backbone, but pure outcome rewards on tool-use environments are easy to game, so three guardrails are added:

  • A repeated-call penalty — discourages tool-call loops and cosmetic "progress."
  • Action-level reward checks — verify intermediate tool calls are valid and aligned with the task, not just the final state.
  • A token-length penalty — keeps reasoning traces from ballooning.

Together these keep the policy honest: it must solve the task, with valid intermediate steps, in a reasonable budget.

Implementation note: In our current setup, every RL round restarts from the same SFT checkpoint and trains on the union of all targeted tasks accumulated so far. What grows between rounds is the task pool; the policy always launches from the same well-conditioned starting point. This isn't the only valid choice — continuing from the previous RL checkpoint or weighting recent tasks more heavily are alternatives worth exploring — but it makes runs reproducible and isolates "the new tasks helped" from "the model just got more gradient steps."

Results

We evaluate on Tau2-Bench Retail with Qwen3-30B-A3B-Thinking-2507 as the base model. RL training tasks are generated from databases that share Retail's tool schemas and rules but whose users, orders, and products are disjoint from the test environment. We report Pass^k: the probability that all k repeated trials succeed.

82.5
+6.6 vs SFT
Pass^1
73.0
+8.5 vs SFT
Pass^2
66.2
+8.7 vs SFT
Pass^3
61.4
+8.8 vs SFT
Pass^4

Pass^k results on Tau2-Bench Retail (Qwen3-30B-A3B-Thinking-2507). RAFT (SFT + RL) consistently outperforms both the base model and SFT baseline. Higher is better; Pass^k is the probability that all k independent trials succeed.

Two things stand out. SFT does most of the work — it alone closes about three-quarters of the gap between Base and the final model on Pass^1. Error-driven RL widens the gap most where it counts: the Pass^1 lift from RL is +6.6, but the Pass^4 lift is +8.8. RL helps more on the strict consistency metric than on the lenient one — exactly what you'd expect if RL is working on the long tail of failure modes rather than the average case.

Leaderboard Comparison

Our 82.5 Pass^1 places a 3B-active-parameter model between Qwen3.5-397B-A17B (84.4) and GPT-5.2 (81.6) on the public τ²-Bench Retail leaderboard, ahead of Claude Opus 4.5, Gemini 3 Pro/Flash, and GLM-5. The base model is a mixture-of-experts with ~3B active parameters; the leaderboard leader has 17B active out of 397B total.

Pass^1 on τ²-Bench Retail. Our SFT+RL Qwen3-30B-A3B (3B active) slots between Qwen3.5-397B-A17B and GPT-5.2. Leaderboard models are zero-shot evaluations of general-purpose systems; our model is domain-specialized.

We're not claiming "small model beats big model" in any absolute sense. We are saying something more specific: on a well-defined tool-use domain, a few hundred targeted RL tasks generated from the model's own failures can close the gap that scale alone would otherwise pay for.

Reward Design Matters As Much As the Task Pool

We ablated the reward design to see how much each piece contributes. Same model, same task pool — only the reward changes.

Reward Pass^1 Pass^2 Pass^3 Pass^4
Full reward (ours) 82.5 73.0 66.2 61.4
− repeated-tool-call penalty 79.2 66.4 57.2 50.0
Task-success only (sparse) 66.7

The repeated-tool-call penalty is mostly a consistency knob. Removing it costs 3.3 points on Pass^1 but 11.4 points on Pass^4. The penalty discourages the model from spamming the same tool call when stuck — that degenerate loop hurts consistency far more than average behavior, because every now-and-then a stuck rollout tanks the "all k trials succeed" criterion.

Sparse outcome rewards collapse. Stripping everything to "did the task succeed" drops Pass^1 to 66.7 — worse than SFT alone. The model exploited the environment exactly as the auxiliary rewards were designed to prevent: looping on the same tool call, padding reasoning traces, stopping following its own plans.

The high-level lesson: the task pool tells the model what to learn; the auxiliary rewards tell it what not to do. You need both.

What This Means for Tool-Use Agents

The traditional advice is "scale up your synthetic data." This work suggests a complementary lever: let the model's failure distribution shape your training distribution. SFT builds breadth; error-driven RL closes specific gaps. They're not substitutes — they're stages.

Limitations and What's Next

We've shown the recipe on one domain (Retail) and one base model (Qwen3-30B-A3B-Thinking). The error taxonomy that fell out — item selection, multi-order reasoning, execution control — is Retail-flavored, even though we suspect the types of pattern (precise grounding, multi-record reasoning, long-horizon execution) generalize. Obvious next steps:

  1. Run the loop on Airline and Telecom domains to see whether the error taxonomy is domain-shaped or capability-shaped.
  2. Run multiple rounds of iteration and track how the error distribution shifts or eventually saturates.
  3. Push the same recipe on smaller base models, where post-SFT failures are likely concentrated in different patterns and the loop should be most informative.

If you're training tool-use agents and your post-SFT model still fails on things you "thought were obvious," try clustering its failures before generating more data. The model already knows what it's bad at — you just have to ask it.