Efficient-Reasoner: Adaptive Compute Allocation via Reinforcement Learning
Overview
Efficient-Reasoner treats tool invocation as a learnable policy rather than a fixed pipeline component, training a language model to dynamically route between fast direct reasoning (System 1) and slow tool-augmented retrieval (System 2) via Group Relative Policy Optimization (GRPO). The core insight — inspired by Kahneman’s dual-process framework — is that most LLM agent systems invoke external tools unconditionally, even though a significant fraction of queries can be answered from parametric memory alone. By designing a multi-component reward function that penalizes tool usage (−0.05 per call) while awarding a bonus for correct direct answers (+0.1), the GRPO-trained 3B-parameter model discovers a cost-benefit decision boundary purely from the reward signal, maintaining ~65% accuracy while reducing token generation by 40% and tool calls by 60% compared to always-search baselines. The system implements a complete two-phase pipeline (SFT cold start → GRPO optimization) with a sub-millisecond mock retrieval environment enabling >1,000 RL training steps/hour, Pareto frontier analysis for accuracy-vs-cost trade-off visualization, and 4-bit QLoRA training on consumer GPUs (8GB+ VRAM).
Dual-Process Reasoning Architecture
The model generates structured XML output following an agentic Think → Call → Observe → Answer protocol:
```xml
<thought>Step-by-step reasoning about whether retrieval is needed...</thought>
<call>search_wiki("query")</call>
<obs>Retrieved information from knowledge base...</obs>
<answer>Final answer</answer>
```
The system prompt instructs the model to think before searching — if confident, output <answer> directly (System 1); if uncertain, invoke <call> for retrieval (System 2). After GRPO training, this routing behavior emerges naturally from the reward signal without explicit complexity classification — the model learns to skip retrieval for common knowledge questions and invoke it for obscure factual queries.
A custom stopping criterion (StopOnXMLTags) monitors generation for </call> and </answer> tags, enabling mid-generation tool execution: when </call> is detected, generation pauses, the tool executes, results are injected as <obs> tags into context, and generation resumes.
Two-Phase Training Pipeline
Phase 1: SFT Cold Start (~30 min)
Format-tuning phase teaching the model the XML tool-call protocol. Synthetic reasoning traces are generated from HotpotQA data with a controlled distribution:
| Trace Type | Distribution | Pattern |
|---|---|---|
| Direct answer | 20% | <thought> → <answer> (no tool calls) |
| Single-search | 60% | <thought> → <call> → <obs> → <answer> |
| Multi-hop | 20% | Multiple <call>/<obs> pairs before <answer> |
A heuristic is_simple_question() classifier routes questions to the appropriate trace type. Training uses SFTTrainer from TRL with lr=2e-5, batch size 4 × 2 gradient accumulation, 3 epochs, max sequence length 1,024.
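A minimal sketch of how traces for the three types might be assembled, assuming the XML pattern from the table above (the helper names `build_trace` and `sample_trace_type` are illustrative, not taken from the codebase):

```python
import random

def build_trace(question: str, answer: str, context: str,
                trace_type: str) -> str:
    """Assemble a synthetic SFT trace following the XML tool-call protocol."""
    if trace_type == "direct":
        # 20%: <thought> -> <answer>, no tool calls
        return (f"<thought>I can answer this directly.</thought>\n"
                f"<answer>{answer}</answer>")
    if trace_type == "single":
        # 60%: one <call>/<obs> round before answering
        return (f"<thought>I should look this up.</thought>\n"
                f'<call>search_wiki("{question}")</call>\n'
                f"<obs>{context}</obs>\n"
                f"<answer>{answer}</answer>")
    # 20% multi-hop: multiple <call>/<obs> pairs before <answer>
    return (f"<thought>This needs multiple lookups.</thought>\n"
            f'<call>search_wiki("{question}")</call>\n'
            f"<obs>{context}</obs>\n"
            f'<call>search_wiki("{answer}")</call>\n'
            f"<obs>{context}</obs>\n"
            f"<answer>{answer}</answer>")

def sample_trace_type(rng: random.Random) -> str:
    """Draw a trace type with the controlled 20/60/20 distribution."""
    return rng.choices(["direct", "single", "multihop"],
                       weights=[0.2, 0.6, 0.2])[0]
```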
Phase 2: GRPO Optimization (~2–4 hrs, A100)
The core policy optimization phase where the model learns when to use tools.
| Parameter | Value |
|---|---|
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Group size | 8 completions per prompt |
| KL penalty (β) | 0.04 |
| Learning rate | 2e-5 with cosine decay + 10% warmup |
| Batch size | 4 × 4 gradient accumulation = 16 effective |
| Max steps | 500 |
| Max prompt length | 1,024 tokens |
| Max completion length | 1,024 tokens |
| Checkpoint interval | Every 100 steps (max 3 kept) |
| Precision | bf16 (auto-detected) |
Why GRPO over PPO: GRPO eliminates the critic/value network entirely — instead of training a separate value head, advantages are computed relative to the group of 8 completions per prompt. This is more parameter-efficient (no critic parameters) and more stable for sequence-level rewards where per-token value estimation is unreliable.
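The group-relative advantage computation can be sketched in a few lines (a simplified view of what GRPO does per prompt; TRL's GRPOTrainer handles this internally):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """Advantages for one prompt's group of sampled completions.

    Each completion's advantage is its reward standardized against the
    group's mean and standard deviation, so no learned critic/value
    network is needed: the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a group of 8 completions per prompt, correct-and-cheap completions receive positive advantages and wasteful or wrong ones negative, which is exactly the relative signal the policy gradient needs.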
Multi-Component Reward Function
The reward function encodes the cost-benefit trade-off between accuracy and computational efficiency:
R = Correctness + Format − Cost − IncompletePenalty + EfficientBonus
| Component | Condition | Value | Purpose |
|---|---|---|---|
| Correctness | Answer matches ground truth | +1.0 | Primary accuracy signal |
| Correctness | Wrong answer | −0.5 | Penalty for incorrect responses |
| Format | Valid XML structure | +0.1 | Encourage parseable output |
| Format | Invalid XML | −0.5 | Penalize unparseable output |
| Tool Cost | Per tool call executed | −0.05 | Core efficiency pressure |
| Incomplete | <call> without matching <obs> | −0.2 | Penalize unfinished reasoning |
| Efficient Bonus | Correct without any tools | +0.1 | Reward confident direct answers |
The −0.05 per-call cost and +0.1 efficient bonus create a clear incentive gradient: for questions answerable from parametric memory, direct answers yield +1.0 + 0.1 + 0.1 = +1.2 reward, while unnecessary single-search answers yield +1.0 + 0.1 − 0.05 = +1.05. The 0.15 margin is sufficient for GRPO to learn the routing policy.
Correctness assessment uses fuzzy matching with multiple strategies: exact match (after normalization), case-insensitive comparison, containment detection, and word-level subset matching.
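The reward table and the layered matcher can be sketched together, assuming a simplified XML-validity check (tag counts balance) in place of the real parser; `compute_reward` and `fuzzy_match` are illustrative names:

```python
def fuzzy_match(pred: str, truth: str) -> bool:
    """Layered matching: normalized exact, containment, word-level subset."""
    p, t = pred.strip().lower(), truth.strip().lower()
    if p == t or t in p or p in t:
        return True
    return set(t.split()) <= set(p.split())

def compute_reward(trace: str, prediction: str, truth: str) -> float:
    """Multi-component reward: correctness + format - cost - incomplete + bonus."""
    r = 0.0
    n_calls = trace.count("</call>")
    n_obs = trace.count("<obs>")
    correct = fuzzy_match(prediction, truth)
    r += 1.0 if correct else -0.5                    # correctness signal
    tags_ok = all(trace.count(f"<{t}>") == trace.count(f"</{t}>")
                  for t in ("thought", "call", "obs", "answer"))
    r += 0.1 if tags_ok else -0.5                    # format reward/penalty
    r -= 0.05 * n_calls                              # per-call tool cost
    if n_calls > n_obs:                              # <call> without <obs>
        r -= 0.2
    if correct and n_calls == 0:                     # efficient direct answer
        r += 0.1
    return r
```

This reproduces the incentive gradient described above: a correct direct answer scores +1.2, a correct single-search answer +1.05.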
Mock Retrieval Environment
A sub-millisecond knowledge base simulator enabling fast RL training without API latency:
Three-tier retrieval cascade:
- O(1) exact match — Dictionary lookup on normalized titles
- O(N) partial match — Substring search across all keys
- O(N) content search — Full-text search across paragraph content
Results are truncated to 500 characters maximum. The environment supports single-query (search_wiki), batch multi-query (search_wiki_multi), and entity lookup (get_entity_info) tool functions. Data is sourced from HotpotQA’s distractor subset (pre-built JSON indices or freshly constructed).
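The three-tier cascade reduces to a short lookup function over a title-to-paragraph dictionary (a sketch of the `search_wiki` behavior described above, not the actual implementation):

```python
def search_wiki(query: str, kb: dict[str, str], max_chars: int = 500) -> str:
    """Three-tier retrieval over a {normalized title: paragraph} index.

    1. O(1) exact match: dictionary lookup on the normalized title.
    2. O(N) partial match: query as a substring of any title.
    3. O(N) content search: query inside any paragraph's text.
    Results are truncated to max_chars, mirroring the 500-character cap.
    """
    q = query.strip().lower()
    if q in kb:                                   # tier 1: exact title
        return kb[q][:max_chars]
    for title, text in kb.items():                # tier 2: partial title
        if q in title:
            return text[:max_chars]
    for text in kb.values():                      # tier 3: full-text
        if q in text.lower():
            return text[:max_chars]
    return "No results found."
```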
Performance impact: Sub-millisecond latency enables >1,000 GRPO training steps/hour, compared to ~100 steps/hour with live Wikipedia API calls — a 10× throughput improvement that makes RL training on agentic reasoning tractable.
Model Architecture
| Component | Specification |
|---|---|
| Base Model | Qwen2.5-3B-Instruct (3B parameters) |
| Quantization | 4-bit NF4 with double quantization via bitsandbytes (~75% VRAM reduction) |
| LoRA Rank / Alpha | 16 / 32 (alpha/r = 2.0) |
| LoRA Targets | All 7 projections: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA Dropout | 0.05 |
| Trainable parameters | ~0.5% of total |
| Optimizer | 8-bit AdamW (memory-efficient) |
| Gradient checkpointing | Enabled |
Graceful degradation chain: Unsloth (4× faster training) → standard Transformers + bitsandbytes → CPU float32. Device detection: CUDA → MPS (Apple Silicon) → CPU.
Agentic Inference Loop
The inference system implements a Think → Call → Execute → Observe → Resume cycle with 5 safety mechanisms:
- Max agentic steps: 5 (configurable) — bounds the reasoning depth
- Per-step token budget: 256 new tokens maximum
- Stuck detection: N-gram diversity analysis via detect_stuck_generation(); if diversity drops below threshold within 50 tokens, generation is terminated
- Context length monitoring: 80% max context ratio triggers warning/truncation with a 256-token safety margin
- Unclosed tag detection: regex-based detection of incomplete <call> and <obs> tags
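The stuck-detection heuristic can be sketched as a unique-n-gram ratio over the recent token window (the window size 50 comes from the list above; the n-gram size and 0.3 threshold here are illustrative defaults):

```python
def ngram_diversity(tokens: list[int], n: int = 3) -> float:
    """Fraction of unique n-grams in a token sequence; near 0 means looping."""
    if len(tokens) < n:
        return 1.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams)

def detect_stuck_generation(tokens: list[int], window: int = 50,
                            threshold: float = 0.3) -> bool:
    """Flag generation as stuck when diversity collapses in the last window."""
    return ngram_diversity(tokens[-window:]) < threshold
```

A repeating loop such as `[1, 2, 3, 1, 2, 3, ...]` produces only three distinct 3-grams across the whole window, so its diversity falls well below the threshold and generation is cut off.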
Results
| Configuration | Accuracy | Avg Tokens | Avg Tool Calls |
|---|---|---|---|
| Base (zero-shot) | ~40% | ~200 | ~1.5 |
| SFT (always search) | ~65% | ~450 | ~2.0 |
| GRPO (learned policy) | ~65% | ~270 | ~0.8 |
Key findings: (1) GRPO matches SFT accuracy while using 40% fewer tokens and 60% fewer tool calls, validating that tool-use is a learnable policy. (2) The model exhibits emergent dual-process behavior — fast direct answers for common knowledge, deliberate multi-hop retrieval for obscure facts — purely from the reward signal without explicit complexity classification. (3) Base model’s ~1.5 tool calls reflect uncontrolled, often malformed attempts, demonstrating that SFT is necessary for format learning before GRPO can optimize the policy.
Pareto Frontier Analysis
The evaluation framework generates accuracy-vs-tokens Pareto frontier visualizations via dominance checking: a configuration lies on the frontier only if no other configuration achieves both higher accuracy and lower token usage. Models are color-coded (base: red, SFT: teal, GRPO: blue) with the “efficient zone” (top-left quadrant) highlighted. The GRPO checkpoint occupies the Pareto-optimal position — maximum accuracy at minimum computational cost.
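The dominance check is small enough to state directly (a sketch; `pareto_frontier` is an illustrative name, with points given as (accuracy, avg tokens) pairs):

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the (accuracy, tokens) points not dominated by any other point.

    A point is dominated if some other point has accuracy >= and tokens <=,
    with at least one of the two inequalities strict.
    """
    frontier = []
    for acc, tok in points:
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for a, t in points
        )
        if not dominated:
            frontier.append((acc, tok))
    return frontier
```

Applied to the results table, SFT (~65% at ~450 tokens) is dominated by GRPO (~65% at ~270 tokens) and drops off the frontier.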
Ablation support: Multi-checkpoint evaluation (eval/benchmark.py) automatically compares base, SFT, and GRPO configurations with result caching, GPU memory monitoring, and per-sample detailed results for post-hoc analysis.
Tech Stack
Python (3.10+), PyTorch (2.4+), Hugging Face Transformers (4.44+), TRL (GRPOTrainer, SFTTrainer), PEFT (LoRA), bitsandbytes (NF4), Accelerate, Unsloth (optional 4× speedup), Datasets (HotpotQA), Matplotlib, NumPy