Tool-Use DPO: Schema-Constrained Alignment via Identity Preference Optimization
Overview
Tool-Use DPO addresses a critical production failure mode: LLMs generating tool calls that are syntactically valid JSON but violate the API schema — wrong types ("5" instead of 5), hallucinated parameters, invalid enum values, missing required fields. The system implements a two-stage pipeline (SFT cold start → DPO with Identity Preference Optimization) trained on GPT-4o-generated hard negative preference pairs — rejected responses that are valid JSON but fail schema validation in exactly the ways real models fail. Using Qwen2.5-Coder-7B-Instruct with RSLoRA (rank 32, all 7 projection modules), the method achieves a 3.1× improvement in Strict Schema Pass Rate (7.48% → 23.19%) with 67% fewer JSON syntax errors and 22% fewer missing required fields on 802 diverse tool-calling samples.
The Hard Negative Philosophy
The core insight: easy negatives don’t teach discrimination. Malformed JSON or gibberish is trivially distinguishable — models already avoid obvious garbage. Instead, the training data contains rejected responses that are plausible, syntactically correct JSON that fails in precisely the four ways production models fail:
| Error Category | Example |
|---|---|
| Type Mismatch | "limit": "10" instead of "limit": 10 |
| Enum Violation | "priority": "urgent" when schema allows ["high", "medium", "low"] |
| Hallucinated Parameter | Adding "verbose": true when no such field exists in the schema |
| Missing Required | Omitting a mandatory "title" field |
Each training triplet (prompt, chosen, rejected) carries its own JSON Schema, enabling the model to generalize across diverse API contracts rather than memorizing a fixed schema.
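An illustrative triplet for a hypothetical task-creation tool — field names and serialization here are assumptions for illustration, not the project's exact format:

```python
import json

# Hypothetical triplet; the schema travels with the pair.
triplet = {
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "limit": {"type": "integer"},
            "priority": {"enum": ["high", "medium", "low"]},
        },
        "required": ["title"],
        "additionalProperties": False,
    },
    "prompt": "Create a task 'Ship release', limit 10, high priority.",
    # chosen: conforms to the schema
    "chosen": json.dumps({"title": "Ship release", "limit": 10, "priority": "high"}),
    # rejected: plausible hard negative with a type mismatch ("10" vs 10)
    "rejected": json.dumps({"title": "Ship release", "limit": "10", "priority": "high"}),
}
```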
Synthetic Data Generation
GPT-4o generates triplets with a structured system prompt enforcing the hard negative constraint. Every generated sample passes through a strict validation gate before inclusion:
```python
chosen_valid, _ = validate_tool_call(chosen_str, schema)      # Must pass
rejected_valid, _ = validate_tool_call(rejected_str, schema)  # Must fail
if chosen_valid and not rejected_valid:
    accept(sample)   # Correct: chosen passes, rejected fails
else:
    discard(sample)  # GPT-4o made an error; self-correcting pipeline
```
Validation uses the jsonschema library with full Draft 4/6/7 support, including additionalProperties checks. The extract_json() utility handles models wrapping output in markdown code blocks or embedding JSON in surrounding text via brace-depth matching.
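The brace-depth idea behind extract_json() can be sketched as follows — a minimal reimplementation for illustration, not the project's actual utility:

```python
from typing import Optional

def extract_json(text: str) -> Optional[str]:
    """Return the first balanced top-level JSON object found in free-form text.

    Handles markdown code fences and surrounding prose by scanning from the
    first '{' and tracking brace depth, ignoring braces inside JSON strings.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escape = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if escape:
            escape = False           # skip the character after a backslash
        elif ch == "\\":
            escape = in_string       # escapes only matter inside strings
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start:i + 1]
    return None                      # unbalanced braces: no complete object
```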
Two-Stage Training Pipeline
Stage 1: SFT Cold Start
Format-tuning phase teaching the model to output raw JSON with proper ChatML structure (<|im_start|>/<|im_end|> delimiters). Uses only (prompt, chosen) pairs from the DPO triplets.
| Parameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Epochs | 6 |
| Batch size | 8 × 2 gradient accumulation = 16 effective |
| Warmup | 100 steps |
| Optimizer | adamw_8bit |
Why SFT first? DPO alone struggles without foundational format knowledge. The SFT stage ensures the model can reliably produce JSON before learning which JSON to prefer — without this, the DPO signal is too noisy to distinguish “bad format” from “bad content.”
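With these hyperparameters, the stage-1 setup might look like the following sketch using TRL's SFTConfig/SFTTrainer; `model` and `sft_dataset` are assumed to be defined, and every name not taken from the table above is illustrative:

```python
from trl import SFTConfig, SFTTrainer

# Hyperparameters from the Stage 1 table; output_dir is illustrative.
sft_args = SFTConfig(
    learning_rate=3e-5,
    num_train_epochs=6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # 8 x 2 = 16 effective
    warmup_steps=100,
    optim="adamw_8bit",
    output_dir="outputs/sft-cold-start",
)
trainer = SFTTrainer(model=model, args=sft_args, train_dataset=sft_dataset)
```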
Stage 2: DPO with IPO Loss
Preference optimization using the SFT checkpoint as initialization.
| Parameter | Value |
|---|---|
| Learning rate | 1e-6 (30× smaller than SFT) |
| Epochs | 3 |
| Batch size | 4 × 4 gradient accumulation = 16 effective |
| Beta | 0.05 |
| Loss type | IPO (Identity Preference Optimization) |
| Max prompt length | 768 tokens |
| Reference model | None (implicit via PEFT adapter trick) |
Why IPO over Standard DPO
IPO (Azar et al., 2023) replaces DPO’s log-sigmoid loss with a squared error objective targeting an explicit margin:
L_IPO = ( log(π(y_w|x)/π_ref(y_w|x)) − log(π(y_l|x)/π_ref(y_l|x)) − 1/(2β) )²
| Property | DPO | IPO |
|---|---|---|
| Loss shape | Log-sigmoid (saturates) | Squared error (no saturation) |
| Gradient behavior | Vanishes for well-separated pairs | Constant pressure toward target margin |
| Small-data stability | Prone to overfitting | Robust with ~500 samples |
| Target margin | Implicit (maximize gap) | Explicit: 1/(2β) = 10 |
With β = 0.05, the target margin 1/(2β) = 10 creates strong pressure to separate chosen from rejected. IPO’s non-saturating loss is critical with only ~500 synthetic training samples where standard DPO would overfit.
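Per pair, the objective reduces to a squared error around that margin — a minimal scalar sketch of the formula above, not the TRL implementation:

```python
def ipo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.05) -> float:
    """Squared-error IPO objective: push the log-ratio gap toward 1/(2*beta)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    gap = chosen_logratio - rejected_logratio
    return (gap - 1.0 / (2.0 * beta)) ** 2
```

A gap of exactly 10 zeroes the loss; on either side of the target the penalty grows quadratically, so even well-separated pairs keep receiving gradient, pulling them back toward the margin rather than pushing it without bound.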
Implicit reference model: Setting ref_model=None in TRL’s DPOTrainer leverages the PEFT adapter architecture — the frozen base model weights serve as the implicit reference while the LoRA delta represents the policy. This halves GPU memory by avoiding a separate reference model copy.
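Wired together, the stage-2 setup might look like this sketch; `peft_model` and `preference_dataset` are assumed to be defined, and names not taken from the table above are illustrative:

```python
from trl import DPOConfig, DPOTrainer

# Values from the Stage 2 table; dataset columns: prompt / chosen / rejected.
dpo_args = DPOConfig(
    learning_rate=1e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # 4 x 4 = 16 effective
    beta=0.05,
    loss_type="ipo",
    max_prompt_length=768,
    output_dir="outputs/dpo-ipo",
)
trainer = DPOTrainer(
    model=peft_model,   # SFT checkpoint with the LoRA adapter attached
    ref_model=None,     # frozen base weights act as the implicit reference
    args=dpo_args,
    train_dataset=preference_dataset,
)
```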
Model Architecture
| Component | Specification |
|---|---|
| Base Model | Qwen2.5-Coder-7B-Instruct (4-bit NF4 via Unsloth) |
| LoRA Rank / Alpha | 32 / 64 (alpha/r = 2.0) |
| LoRA Targets | All 7 projections: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| RSLoRA | Enabled: scales the adapter by α/√r instead of α/r, keeping update magnitudes stable as rank grows (standard 1/r scaling suppresses learning at high rank) |
| Dropout | 0 (optimized for Unsloth) |
| Gradient Checkpointing | Unsloth-optimized variant |
Rank 32 is unusually high for LoRA (typical: 8–16), giving the adapter significantly more capacity — justified for a task requiring fine-grained structural understanding of JSON schema constraints across diverse API contracts.
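A sketch of the model and adapter setup with Unsloth; the checkpoint name and any argument not listed in the table above are assumptions:

```python
from unsloth import FastLanguageModel

# Checkpoint name is an assumption; table values fill the rest.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    load_in_4bit=True,                 # NF4 via bitsandbytes
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,                     # alpha/r = 2.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,                    # 0 enables Unsloth's fast path
    use_rslora=True,                   # 1/sqrt(r) scaling
    use_gradient_checkpointing="unsloth",
)
```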
Evaluation: Strict Schema Pass Rate (SSPR)
SSPR is a binary, all-or-nothing metric: a response either fully conforms to the schema or it does not. No partial credit.
Failures are classified into 6 categories via jsonschema.validate() error messages: json_error, hallucinated_param ("Additional properties"), type_mismatch ("is not of type"), enum_violation ("is not one of"), missing_required ("is a required property"), and other_schema_error.
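The classification step can be sketched as substring matching over the validator's error message — the substrings come from the taxonomy above, while the helper name and signature are hypothetical:

```python
import json

def classify_failure(response_str: str, validation_error_msg: str = "") -> str:
    """Map a failed response to one of the 6 categories (illustrative helper).

    Assumes the caller only passes responses that failed validation, supplying
    the jsonschema error message whenever the JSON itself parsed.
    """
    try:
        json.loads(response_str)
    except json.JSONDecodeError:
        return "json_error"
    if "Additional properties" in validation_error_msg:
        return "hallucinated_param"
    if "is not of type" in validation_error_msg:
        return "type_mismatch"
    if "is not one of" in validation_error_msg:
        return "enum_violation"
    if "is a required property" in validation_error_msg:
        return "missing_required"
    return "other_schema_error"
```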
Evaluation uses greedy decoding (max_new_tokens=128) with the same ChatML prompt template used during training. Both baseline and DPO models are evaluated on identical data for direct comparison.
Results
| Metric | Baseline | DPO-Aligned | Change |
|---|---|---|---|
| SSPR | 7.48% | 23.19% | 3.1× |
| JSON syntax errors | — | — | −67% |
| Missing required fields | — | — | −22% |
Evaluated on 802 diverse tool-calling samples. The 3.1× improvement demonstrates that hard negative preference learning effectively teaches subtle schema constraint satisfaction that SFT alone cannot achieve.
Tech Stack
Python, PyTorch (2.2+), Hugging Face Transformers, TRL (DPOTrainer, DPOConfig), Unsloth (FastLanguageModel), PEFT (RSLoRA), bitsandbytes (NF4), jsonschema, OpenAI GPT-4o, Datasets, Accelerate