Visual-CoT: Pixel-Grounded Reasoning with Multi-Modal Chain-of-Thought
Overview
Visual-CoT fine-tunes a Vision-Language Model to produce spatially-grounded chain-of-thought reasoning traces — explanations that interleave natural language deduction with precise bounding box coordinates anchoring each claim to specific image regions. The model articulates what it observes, where it observes it (via <ref>object</ref><box>[x1, y1, x2, y2]</box> annotations normalized to a 0–1000 coordinate space), and how it reaches conclusions. By training on 135K+ samples from the VisCOT dataset where 100% of training examples contain valid bounding boxes, the system reduces object hallucination — every object mention must be backed by a verifiable spatial location. The framework uses QLoRA (rank 64, 7 target modules) on Qwen2.5-VL-7B-Instruct with a pre-tokenization strategy that sidesteps VLM multimodal collator issues, achieving a final training loss of ~0.62 over ~16,800 steps.
Grounded Reasoning Format
The model’s output interleaves reasoning text with structured spatial references:
```
First, I identify the <ref>red gear</ref><box>[100, 200, 300, 400]</box> in the mechanism.
Looking at how it connects to the <ref>blue lever</ref><box>[350, 180, 500, 250]</box>...
Therefore, when the gear rotates clockwise, the lever will move upward.
```
Bounding boxes are generated autoregressively as text tokens — there is no separate detection head. The model learns during fine-tuning to emit coordinate tokens as part of its language generation, leveraging Qwen2.5-VL’s pretrained spatial understanding. All coordinates use integers from 0 to 1000 (where (0,0) is top-left, (1000,1000) is bottom-right), reducing token count per coordinate (3–4 tokens vs. 5–6 for floats) and avoiding floating-point precision issues in text generation.
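Since the boxes are plain text tokens, downstream code recovers them with string parsing. A minimal sketch of how such annotations can be extracted and denormalized to pixel space (the regex and helper names here are illustrative, not the project's actual code):

```python
import re

# Matches <ref>label</ref><box>[x1, y1, x2, y2]</box> spans in generated text.
# Pattern and function names are illustrative, not taken from the project.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.+?)</ref>\s*<box>\[(?P<coords>[\d\s,]+)\]</box>"
)

def parse_grounded_spans(text, img_width, img_height):
    """Extract (label, pixel-space box) pairs from a grounded reasoning trace."""
    spans = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(v) for v in m.group("coords").split(","))
        # Denormalize from the 0-1000 coordinate space to pixel coordinates.
        spans.append((
            m.group("label"),
            (x1 * img_width // 1000, y1 * img_height // 1000,
             x2 * img_width // 1000, y2 * img_height // 1000),
        ))
    return spans
```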
Training Data Pipeline
Primary Source: VisCOT Dataset — 150,000+ samples with real bounding box annotations from deepcs233/Visual-CoT on HuggingFace. The conversion pipeline parses conversation turns, extracts bounding boxes via regex (handling both 0–1 float and 0–1000 integer formats with automatic normalization), and constructs the <ref>/<box> training format. Quality filtering (filter_quality=True) discards samples without valid bounding boxes, ensuring 100% grounding coverage across training data.
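The dual-format handling can be sketched as a small normalization helper, assuming the simple heuristic that a box whose values all fall in [0, 1] is in relative-float format (this is an assumption about the pipeline's logic, not its exact code):

```python
def normalize_box(box):
    """Map a bounding box to the 0-1000 integer coordinate space.

    Handles both source formats: 0-1 relative floats and 0-1000 integers.
    Heuristic (assumed, not the project's exact rule): if every value is
    <= 1, treat the box as relative and scale it up.
    """
    if all(0.0 <= v <= 1.0 for v in box):
        return [round(v * 1000) for v in box]
    return [int(round(v)) for v in box]
```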
Secondary Source: ScienceQA + GPT-4o Distillation — For image-containing ScienceQA samples, images are base64-encoded and sent to GPT-4o with a system prompt enforcing the <ref>/<box> format (“CRITICAL RULE: Whenever you mention a physical object… you MUST immediately follow it with its bounding box”). Rate-limited to 10 concurrent requests with exponential-backoff retries.
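The concurrency-limiting and retry pattern can be sketched with a semaphore plus exponential backoff; `call_gpt4o` below is a stand-in for the actual OpenAI API call, and all names are illustrative:

```python
import asyncio
import random

# Sketch of the rate-limiting pattern: at most 10 requests in flight,
# exponential backoff with jitter on failure. Names are assumptions.
SEMAPHORE_LIMIT = 10

async def with_backoff(coro_fn, max_retries=5, base_delay=1.0):
    """Retry an async call with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return await coro_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Backoff schedule: ~1s, ~2s, ~4s, ... (scaled by jitter).
            await asyncio.sleep(base_delay * 2 ** attempt * (1 + random.random()))

async def distill_all(samples, call_gpt4o):
    """Run distillation over all samples with bounded concurrency."""
    sem = asyncio.Semaphore(SEMAPHORE_LIMIT)

    async def one(sample):
        async with sem:
            return await with_backoff(lambda: call_gpt4o(sample))

    return await asyncio.gather(*(one(s) for s in samples))
```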
Final split: 135K training / 15K validation (90/10 split). Data quality analysis tracks <ref> tag coverage, V-CoT <box> tag density, and average response lengths.
Model Architecture
| Component | Specification |
|---|---|
| Base VLM | Qwen2.5-VL-7B-Instruct (4-bit NF4 quantization via Unsloth) |
| LoRA Rank / Alpha | 64 / 128 (alpha/r = 2.0) |
| LoRA Targets | All 7 linear layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA Dropout | 0.05 |
| Gradient Checkpointing | Unsloth-optimized variant |
Dual-path model loading: Auto-detects Unsloth’s FastVisionModel at import time; falls back to standard HuggingFace Transformers + PEFT + BitsAndBytesConfig (NF4, double quantization, float16 compute dtype) when Unsloth is unavailable. The model ID is automatically remapped from the Unsloth quantized variant to Qwen/Qwen2.5-VL-7B-Instruct.
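The detection-and-remap logic might look like the following sketch; the heavy loading calls are shown only as comments, and the helper name is hypothetical:

```python
# Sketch of the dual-path detection described above; names are illustrative.
try:
    from unsloth import FastVisionModel  # noqa: F401
    HAS_UNSLOTH = True
except ImportError:
    HAS_UNSLOTH = False

def resolve_model_id(model_id, has_unsloth):
    """Remap the Unsloth 4-bit variant to the plain HF ID for the fallback path."""
    if not has_unsloth and model_id.startswith("unsloth/"):
        # e.g. unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit -> Qwen/Qwen2.5-VL-7B-Instruct
        return "Qwen/Qwen2.5-VL-7B-Instruct"
    return model_id

# Fallback path (no Unsloth) would then load via Transformers + PEFT, e.g.:
#   BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
#                      bnb_4bit_use_double_quant=True,
#                      bnb_4bit_compute_dtype=torch.float16)
```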
Pre-Tokenization Strategy
A key engineering contribution addresses VLM multimodal collator issues during training. The approach:
- Extracts the text-only tokenizer from the VLM processor (via `tokenizer.tokenizer`, `tokenizer.text_tokenizer`, or fallback to `AutoTokenizer`)
- Converts multimodal messages to plain text using Qwen ChatML format (`<|im_start|>`/`<|im_end|>` tokens), stripping image references
- Tokenizes with the text tokenizer only, producing standard `input_ids`, `attention_mask`, and `labels`
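The message-to-ChatML step can be sketched as follows; this is a simplified illustration that keeps only text content parts, whereas the real pipeline presumably relies on the tokenizer's own chat template:

```python
def to_chatml_text(messages):
    """Render multimodal chat messages as plain Qwen ChatML text.

    Image content parts are dropped; only text survives, matching the
    pre-tokenization strategy above. Simplified sketch, not project code.
    """
    parts = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):  # multimodal: keep text parts only
            content = "".join(
                c["text"] for c in content if c.get("type") == "text"
            )
        parts.append(f"<|im_start|>{msg['role']}\n{content}<|im_end|>")
    return "\n".join(parts)
```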
This trades away pixel-level fine-tuning in exchange for training stability: the assumption is that Qwen2.5-VL's pretrained visual understanding is already strong enough, and fine-tuning only needs to teach the structured <ref>/<box> output format. Training then uses `DataCollatorForSeq2Seq` with dynamic padding rather than the VLM's multimodal collator.
Training Configuration
| Parameter | Value |
|---|---|
| Learning rate | 1e-4 with cosine decay |
| Warmup ratio | 3% of training |
| Epochs | 2 |
| Batch size | 4 per device × 4 gradient accumulation = 16 effective |
| Weight decay | 0.01 |
| Max gradient norm | 0.3 |
| Max sequence length | 2,048 tokens |
| Optimizer | adamw_8bit (memory-efficient) |
| NEFTune noise alpha | 5 |
| Early stopping | Patience 3 on eval_loss (min improvement 0.001) |
| Precision | BF16 if supported, else FP16 |
| Logging | TensorBoard (every 50 steps) |
NEFTune (Jain et al., 2023) adds uniform random noise to the embedding layer during training, scaled by alpha / sqrt(seq_length × hidden_dim). This regularization technique improves instruction-following quality and helps prevent the model from memorizing surface-level patterns.
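The scaling rule can be made concrete with a small pure-Python sketch (real implementations add the noise to tensors inside the embedding layer's forward pass, and only at train time):

```python
import math
import random

def neftune_noise_scale(alpha, seq_length, hidden_dim):
    """Magnitude of NEFTune embedding noise: alpha / sqrt(L * d)."""
    return alpha / math.sqrt(seq_length * hidden_dim)

def add_neftune_noise(embeddings, alpha=5.0):
    """Add uniform noise in [-scale, +scale] to a (seq_len x dim) embedding list.

    Pure-Python illustration of the scheme, not the trainer's actual code.
    """
    scale = neftune_noise_scale(alpha, len(embeddings), len(embeddings[0]))
    return [
        [x + random.uniform(-scale, scale) for x in row]
        for row in embeddings
    ]
```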
Hallucination Reduction Mechanism
The system addresses object hallucination through four interconnected mechanisms: (1) Grounded training data — 100% of samples contain valid bounding boxes, teaching the model that every object claim must be spatially verifiable; (2) Structured output format — the <ref>/<box> format forces the model to commit to specific spatial locations, making hallucination harder than generating unconstrained text; (3) NEFTune regularization — noise injection prevents memorization of surface patterns; (4) Evaluation-time IoU verification — predicted boxes are extracted and validated against ground truth.
Evaluation Framework
Three metrics quantify grounding quality:
Grounding IoU: Standard Intersection-over-Union between predicted and ground-truth bounding boxes, with 1e-6 epsilon for numerical stability.
IoU Success Rate: Fraction of predicted boxes achieving IoU > 0.5 against their best-matching ground truth.
Box Detection Rate: Percentage of model responses containing at least one valid <box> annotation — measures the model’s consistency in producing grounded outputs.
Batch evaluation iterates over a JSONL evaluation set, runs greedy inference (do_sample=False, repetition_penalty=1.2) on each sample, parses predicted boxes, and aggregates per-sample IoU scores.
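The two core computations can be sketched as follows (function names are illustrative, but the epsilon and 0.5 threshold match the description above):

```python
def iou(box_a, box_b, eps=1e-6):
    """Intersection-over-Union for [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Epsilon guards against division by zero for degenerate boxes.
    return inter / (area_a + area_b - inter + eps)

def iou_success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predicted boxes whose best-matching ground truth beats the threshold."""
    if not pred_boxes:
        return 0.0
    hits = sum(
        1 for p in pred_boxes
        if max(iou(p, g) for g in gt_boxes) > threshold
    )
    return hits / len(pred_boxes)
```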
Results
| Metric | Value |
|---|---|
| Training samples | 135,000 |
| Validation samples | 15,000 |
| Samples with bounding boxes | 100% |
| Training steps | ~16,800 |
| Final training loss | ~0.62 |
The Gradio demo provides real-time streaming visualization: a TextIteratorStreamer runs generation in a separate thread while the main thread continuously parses accumulated text for <ref>/<box> patterns, denormalizes coordinates to pixel space, draws green rectangles with labels via OpenCV, and yields updated images to the UI — bounding boxes appear on the image live as the model reasons.
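A key detail of this loop is that only fully closed <ref>/<box> spans in the accumulated text should be drawn; half-emitted boxes must wait for a later streamer update. A sketch of that incremental parse (OpenCV drawing omitted; names illustrative):

```python
import re

# Only fully closed <ref>...</ref><box>[...]</box> spans match.
COMPLETE_SPAN = re.compile(
    r"<ref>(.+?)</ref>\s*<box>\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]</box>"
)

def boxes_so_far(partial_text):
    """Return (label, normalized box) pairs fully closed in the accumulated text.

    Called on every streamer update: a half-finished span (e.g. an open
    <box> without its closing tag) simply doesn't match yet and is picked
    up on a later pass. cv2.rectangle drawing is omitted from this sketch.
    """
    return [
        (m.group(1), tuple(int(m.group(i)) for i in range(2, 6)))
        for m in COMPLETE_SPAN.finditer(partial_text)
    ]
```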
Tech Stack
Python, PyTorch (2.0+), Hugging Face Transformers (4.40+), TRL (SFTTrainer), PEFT (QLoRA), Unsloth (FastVisionModel), bitsandbytes (NF4), Qwen2.5-VL, OpenAI GPT-4o, Gradio, OpenCV, Datasets