Tool-Use DPO: Schema-Constrained Alignment via Identity Preference Optimization
Overview
Tool-Use DPO addresses a critical production failure mode: LLMs generating tool calls that are syntactically valid JSON but violate the API schema — wrong types ("5" instead of 5), hallucinated parameters, invalid enum values, missing required fields. The system implements a two-stage pipeline (SFT cold start → DPO with Identity Preference Optimization) trained on GPT-4o-generated hard negative preference pairs — rejected responses that are valid JSON but fail schema validation in exactly the ways real models fail. Using Qwen2.5-Coder-7B-Instruct with RSLoRA (rank 32, all 7 projection modules), the method achieves a 3.1× improvement in Strict Schema Pass Rate (7.48% → 23.19%) with 67% fewer JSON syntax errors and 22% fewer missing required fields on 802 diverse tool-calling samples.
The Hard Negative Philosophy
The core insight: easy negatives don’t teach discrimination. Malformed JSON or gibberish is trivially distinguishable — models already avoid obvious garbage. Instead, the training data contains rejected responses that are plausible, syntactically correct JSON that fails in precisely the four ways production models fail:
| Error Category | Example |
|---|---|
| Type Mismatch | "limit": "10" instead of "limit": 10 |
| Enum Violation | "priority": "urgent" when schema allows ["high", "medium", "low"] |
| Hallucinated Parameter | Adding "verbose": true when no such field exists in the schema |
| Missing Required | Omitting a mandatory "title" field |
Each training triplet (prompt, chosen, rejected) carries its own JSON Schema, enabling the model to generalize across diverse API contracts rather than memorizing a fixed schema.
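An illustrative triplet for a hypothetical task-creation tool — field names and serialization here are assumptions for illustration, not the project's exact format:

```python
import json

# Hypothetical triplet; the schema travels with the pair.
triplet = {
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "limit": {"type": "integer"},
            "priority": {"enum": ["high", "medium", "low"]},
        },
        "required": ["title"],
        "additionalProperties": False,
    },
    "prompt": "Create a task 'Ship release', limit 10, high priority.",
    # chosen: conforms to the schema
    "chosen": json.dumps({"title": "Ship release", "limit": 10, "priority": "high"}),
    # rejected: plausible hard negative with a type mismatch ("10" vs 10)
    "rejected": json.dumps({"title": "Ship release", "limit": "10", "priority": "high"}),
}
```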
Synthetic Data Generation
GPT-4o generates triplets with a structured system prompt enforcing the hard negative constraint. Every generated sample passes through a strict validation gate before inclusion:
```python
chosen_valid, _ = validate_tool_call(chosen_str, schema)      # Must pass
rejected_valid, _ = validate_tool_call(rejected_str, schema)  # Must fail
if chosen_valid and not rejected_valid:
    accept(sample)   # Correct: chosen passes, rejected fails
else:
    discard(sample)  # GPT-4o made an error; self-correcting pipeline
```
Validation uses the jsonschema library with full Draft 4/6/7 support, including additionalProperties checks. The extract_json() utility handles models wrapping output in markdown code blocks or embedding JSON in surrounding text via brace-depth matching.
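The brace-depth idea behind extract_json() can be sketched as follows — a minimal reimplementation for illustration, not the project's actual utility:

```python
from typing import Optional

def extract_json(text: str) -> Optional[str]:
    """Return the first balanced top-level JSON object found in free-form text.

    Handles markdown code fences and surrounding prose by scanning from the
    first '{' and tracking brace depth, ignoring braces inside JSON strings.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escape = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if escape:
            escape = False           # skip the character after a backslash
        elif ch == "\\":
            escape = in_string       # escapes only matter inside strings
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start:i + 1]
    return None                      # unbalanced braces: no complete object
```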
Two-Stage Training Pipeline
Stage 1: SFT Cold Start
Format-tuning phase teaching the model to output raw JSON with proper ChatML structure (<|im_start|>/<|im_end|> delimiters). Uses only (prompt, chosen) pairs from the DPO triplets.
| Parameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Epochs | 6 |
| Batch size | 8 × 2 gradient accumulation = 16 effective |
| Warmup | 100 steps |
| Optimizer | adamw_8bit |
Why SFT first? DPO alone struggles without foundational format knowledge. The SFT stage ensures the model can reliably produce JSON before learning which JSON to prefer — without this, the DPO signal is too noisy to distinguish “bad format” from “bad content.”
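With these hyperparameters, the stage-1 setup might look like the following sketch using TRL's SFTConfig/SFTTrainer; `model` and `sft_dataset` are assumed to be defined, and every name not taken from the table above is illustrative:

```python
from trl import SFTConfig, SFTTrainer

# Hyperparameters from the Stage 1 table; output_dir is illustrative.
sft_args = SFTConfig(
    learning_rate=3e-5,
    num_train_epochs=6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # 8 x 2 = 16 effective
    warmup_steps=100,
    optim="adamw_8bit",
    output_dir="outputs/sft-cold-start",
)
trainer = SFTTrainer(model=model, args=sft_args, train_dataset=sft_dataset)
```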
Stage 2: DPO with IPO Loss
Preference optimization using the SFT checkpoint as initialization.
| Parameter | Value |
|---|---|
| Learning rate | 1e-6 (30× smaller than SFT) |
| Epochs | 3 |
| Batch size | 4 × 4 gradient accumulation = 16 effective |
| Beta | 0.05 |
| Loss type | IPO (Identity Preference Optimization) |
| Max prompt length | 768 tokens |
| Reference model | None (implicit via PEFT adapter trick) |
Why IPO over Standard DPO
IPO (Azar et al., 2023) replaces DPO’s log-sigmoid loss with a squared error objective targeting an explicit margin:
L_IPO = ( log(π(y_w|x)/π_ref(y_w|x)) − log(π(y_l|x)/π_ref(y_l|x)) − 1/(2β) )²
| Property | DPO | IPO |
|---|---|---|
| Loss shape | Log-sigmoid (saturates) | Squared error (no saturation) |
| Gradient behavior | Vanishes for well-separated pairs | Constant pressure toward target margin |
| Small-data stability | Prone to overfitting | Robust with ~500 samples |
| Target margin | Implicit (maximize gap) | Explicit: 1/(2β) = 10 |
With β = 0.05, the target margin 1/(2β) = 10 creates strong pressure to separate chosen from rejected. IPO’s non-saturating loss is critical with only ~500 synthetic training samples where standard DPO would overfit.
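Per pair, the objective reduces to a squared error around that margin — a minimal scalar sketch of the formula above, not the TRL implementation:

```python
def ipo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.05) -> float:
    """Squared-error IPO objective: push the log-ratio gap toward 1/(2*beta)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    gap = chosen_logratio - rejected_logratio
    return (gap - 1.0 / (2.0 * beta)) ** 2
```

A gap of exactly 10 zeroes the loss; on either side of the target the penalty grows quadratically, so even well-separated pairs keep receiving gradient, pulling them back toward the margin rather than pushing it without bound.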
Implicit reference model: Setting ref_model=None in TRL’s DPOTrainer leverages the PEFT adapter architecture — the frozen base model weights serve as the implicit reference while the LoRA delta represents the policy. This halves GPU memory by avoiding a separate reference model copy.
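Wired together, the stage-2 setup might look like this sketch; `peft_model` and `preference_dataset` are assumed to be defined, and names not taken from the table above are illustrative:

```python
from trl import DPOConfig, DPOTrainer

# Values from the Stage 2 table; dataset columns: prompt / chosen / rejected.
dpo_args = DPOConfig(
    learning_rate=1e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # 4 x 4 = 16 effective
    beta=0.05,
    loss_type="ipo",
    max_prompt_length=768,
    output_dir="outputs/dpo-ipo",
)
trainer = DPOTrainer(
    model=peft_model,   # SFT checkpoint with the LoRA adapter attached
    ref_model=None,     # frozen base weights act as the implicit reference
    args=dpo_args,
    train_dataset=preference_dataset,
)
```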
Model Architecture
| Component | Specification |
|---|---|
| Base Model | Qwen2.5-Coder-7B-Instruct (4-bit NF4 via Unsloth) |
| LoRA Rank / Alpha | 32 / 64 (alpha/r = 2.0) |
| LoRA Targets | All 7 projections: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| RSLoRA | Enabled: scales the adapter by α/√r instead of α/r, keeping update magnitudes stable as rank grows (standard 1/r scaling suppresses learning at high rank) |
| Dropout | 0 (optimized for Unsloth) |
| Gradient Checkpointing | Unsloth-optimized variant |
Rank 32 is unusually high for LoRA (typical: 8–16), giving the adapter significantly more capacity — justified for a task requiring fine-grained structural understanding of JSON schema constraints across diverse API contracts.
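A sketch of the model and adapter setup with Unsloth; the checkpoint name and any argument not listed in the table above are assumptions:

```python
from unsloth import FastLanguageModel

# Checkpoint name is an assumption; table values fill the rest.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    load_in_4bit=True,                 # NF4 via bitsandbytes
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,                     # alpha/r = 2.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,                    # 0 enables Unsloth's fast path
    use_rslora=True,                   # 1/sqrt(r) scaling
    use_gradient_checkpointing="unsloth",
)
```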
Evaluation: Strict Schema Pass Rate (SSPR)
SSPR is a binary, all-or-nothing metric: a response either fully conforms to the schema or it does not. No partial credit.
Failures are classified into 6 categories via jsonschema.validate() error messages: json_error, hallucinated_param ("Additional properties"), type_mismatch ("is not of type"), enum_violation ("is not one of"), missing_required ("is a required property"), and other_schema_error.
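The classification step can be sketched as substring matching over the validator's error message — the substrings come from the taxonomy above, while the helper name and signature are hypothetical:

```python
import json

def classify_failure(response_str: str, validation_error_msg: str = "") -> str:
    """Map a failed response to one of the 6 categories (illustrative helper).

    Assumes the caller only passes responses that failed validation, supplying
    the jsonschema error message whenever the JSON itself parsed.
    """
    try:
        json.loads(response_str)
    except json.JSONDecodeError:
        return "json_error"
    if "Additional properties" in validation_error_msg:
        return "hallucinated_param"
    if "is not of type" in validation_error_msg:
        return "type_mismatch"
    if "is not one of" in validation_error_msg:
        return "enum_violation"
    if "is a required property" in validation_error_msg:
        return "missing_required"
    return "other_schema_error"
```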
Evaluation uses greedy decoding (max_new_tokens=128) with the same ChatML prompt template used during training. Both baseline and DPO models are evaluated on identical data for direct comparison.
Results
| Metric | Baseline | DPO-Aligned | Change |
|---|---|---|---|
| SSPR | 7.48% | 23.19% | 3.1× |
| JSON syntax errors | — | — | −67% |
| Missing required fields | — | — | −22% |
Evaluated on 802 diverse tool-calling samples. The 3.1× improvement demonstrates that hard negative preference learning effectively teaches subtle schema constraint satisfaction that SFT alone cannot achieve.
Tech Stack
Python, PyTorch (2.2+), Hugging Face Transformers, TRL (DPOTrainer, DPOConfig), Unsloth (FastLanguageModel), PEFT (RSLoRA), bitsandbytes (NF4), jsonschema, OpenAI GPT-4o, Datasets, Accelerate