Nano-Reasoner: Unified Post-Training Framework for Math Reasoning Models
Overview
Nano-Reasoner is a unified, research-grade post-training framework for training small language models to perform chain-of-thought mathematical reasoning via Reinforcement Learning from Verifiable Rewards (RLVR). The framework implements a complete two-phase pipeline — Supervised Fine-Tuning (SFT) cold start followed by RL optimization — with six RL algorithms implemented from scratch in pure PyTorch (no dependency on TRL or external RL libraries), achieving 84%+ accuracy on GSM8K with a 1.5B-parameter model.
Architecture
Base Model: Qwen2.5-Math-1.5B-Instruct (1.5B parameters) with ChatML-style tokenization.
Parameter-Efficient Training: LoRA adapters (rank 16, alpha 32) targeting all linear transformations — attention projections (q_proj, k_proj, v_proj, o_proj) and MLP layers (gate_proj, up_proj, down_proj) — training only ~1–2% of total parameters.
Memory Optimization Stack: Three complementary techniques enable training on GPUs as small as 16GB:
- 4-bit NF4 quantization with double quantization via bitsandbytes (~8x weight storage reduction)
- LoRA adapters (minimal trainable parameters)
- Gradient checkpointing (Unsloth-optimized when available)
Dual Backend Support: Auto-detects and uses Unsloth’s FastLanguageModel for optimized CUDA training, with graceful fallback to standard HuggingFace Transformers + PEFT + bitsandbytes on non-CUDA hardware. Flash Attention 2 is auto-detected for 2–3x attention speedup on A100/H100.
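The backend auto-detection can be sketched as a simple import probe. This is a hypothetical helper (`select_backend` is not the framework's actual API), shown only to illustrate the graceful-fallback idea:

```python
import importlib.util

def select_backend():
    # Hypothetical sketch of the auto-detection: prefer Unsloth's
    # FastLanguageModel when the package is importable, otherwise fall back
    # to standard Transformers + PEFT + bitsandbytes.
    if importlib.util.find_spec("unsloth") is not None:
        return "unsloth"
    return "transformers+peft"
```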
Training Pipeline
Phase 1: SFT Cold Start (~30 min, A100)
Format-tuning phase that teaches the model to emit <think>...</think> chain-of-thought reasoning traces and \boxed{} answer formatting — a prerequisite for RL, since the reward function requires structured output to extract and verify answers.
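As an illustration, an SFT target in this format might be assembled as follows. The exact prompt template is an assumption, not the framework's; what matters is the `<think>` trace plus `\boxed{}` answer:

```python
def format_sft_target(reasoning: str, answer: str) -> str:
    # Hypothetical template: wrap the chain-of-thought in <think> tags and
    # the final answer in \boxed{} so the verifier can extract it.
    return f"<think>\n{reasoning}\n</think>\nThe answer is \\boxed{{{answer}}}."

example = format_sft_target("18 eggs / 2 = 9", "9")
```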
- Standard causal LM loss with padding-masked cross-entropy (ignore index `-100`)
- Learning rate `2e-5`, batch size `4`, up to 5,000 samples from OpenR1-Math-220K
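The padding mask reduces to a label transform, with `-100` marking positions that cross-entropy skips. A minimal sketch (`build_labels` is an illustrative helper, not the framework's API):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def build_labels(input_ids, pad_token_id):
    # Mask padding positions so they contribute zero loss (illustrative helper).
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
```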
Phase 2: Reinforcement Learning (~10 hrs, A100)
Six RL algorithms with a unified rollout-and-optimize loop:
- Prompt expansion: each prompt replicated `G` times (default `G=8`) for group-based advantage estimation
- Rollout generation: sampling with `temperature=0.8`, `max_new_tokens=384`
- Binary reward: rule-based verification via `\boxed{}` extraction with nested-brace handling (correct = 1.0, incorrect = 0.0)
- Old-policy log-probs: computed under `torch.no_grad()` for importance-sampling ratios
- Inner optimization: `ppo_epochs=2` gradient updates per rollout with `clip_grad_norm=1.0`
- Aggressive memory management: explicit tensor deletion, `gc.collect()`, and `torch.cuda.empty_cache()` after each step
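The reward step can be sketched in plain Python. `extract_boxed` and `binary_reward` are illustrative names, not the framework's API; the nested-brace handling mirrors the behavior described above:

```python
def extract_boxed(text):
    # Pull the contents of the last \boxed{...}, balancing nested braces
    # (illustrative re-implementation of the rule-based verifier's extraction).
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer

def binary_reward(completion, gold_answer):
    # Rule-based verification: correct = 1.0, incorrect = 0.0
    return 1.0 if extract_boxed(completion) == gold_answer else 0.0
```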
RL Algorithms
All algorithms are implemented from scratch in pure PyTorch with a unified interface — they share the same rollout/generation code and differ only in their loss computation:
GRPO (Group Relative Policy Optimization): Generates G completions per prompt, normalizes advantages within each group (mean/std), and optimizes a clipped surrogate objective (ε=0.2, clip range [0.8, 1.2]).
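A minimal sketch of the group normalization and clipped surrogate, written in plain Python for clarity (the framework operates on batched PyTorch tensors):

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    # Normalize each rollout's reward by its group's mean and std.
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    # Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```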
Dr. GRPO (Length-Corrected GRPO): Addresses length bias in vanilla GRPO by introducing a per-sample length correction factor L_i / μ_L, normalizing gradient contributions so that concise correct answers are not under-weighted relative to verbose ones. Best-performing algorithm (84% GSM8K).
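The correction factor can be sketched as per-sample weights, assuming the `L_i / μ_L` form described above (illustrative helper, not the framework's API):

```python
import statistics

def length_correction_weights(lengths):
    # Per-sample factor L_i / mean(L), applied to each sample's gradient
    # contribution (assumed form, following the description above).
    mu = statistics.fmean(lengths)
    return [length / mu for length in lengths]
```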
PPO (Proximal Policy Optimization): Adds a learned value head (nn.Linear(hidden_size, 1)) with a separate AdamW optimizer (lr=1e-4). Implements Generalized Advantage Estimation (GAE) with γ=0.99, λ=0.95 over sparse terminal rewards. Value loss (0.5 × MSE) added to the clipped surrogate objective.
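GAE over a single trajectory with a sparse terminal reward can be sketched as a backward recursion (plain-Python illustration of the standard formula):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Backward recursion: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    # A_t = delta_t + gamma * lam * A_{t+1}. `values` carries one extra
    # bootstrap entry for the state after the last token (0 at terminal).
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```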
GSPO (Group Sequence Policy Optimization): Replaces token-level importance ratios with a sequence-level geometric mean ratio ρ_seq = exp(Σ log(π/π_old) / |seq|), providing more stable optimization for sequence-level rewards.
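The sequence-level ratio reduces to a few lines (illustrative):

```python
import math

def sequence_ratio(new_logps, old_logps):
    # rho_seq = exp(mean(log pi_new - log pi_old)) over the sequence's tokens,
    # i.e. the geometric mean of the per-token importance ratios.
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))
```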
DAPO (Decoupled Clip and Dynamic Sampling): Two innovations — (1) dynamic sampling that filters out zero-variance prompt groups (all correct or all incorrect) since they provide no gradient signal, and (2) asymmetric clipping with a wider upper clip (1.28) for positive advantages to encourage exploitation of good actions.
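Both pieces can be sketched as follows (illustrative helpers; the clip bounds follow the values above):

```python
def keep_group(rewards):
    # Dynamic sampling: drop zero-variance groups (all correct or all
    # incorrect), since identical rewards yield zero advantage and no gradient.
    return len(set(rewards)) > 1

def dapo_clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # Decoupled clip: a wider upper bound (1 + eps_high = 1.28) lets the
    # policy push harder on positively-advantaged tokens.
    clipped = max(1 - eps_low, min(1 + eps_high, ratio))
    return min(ratio * advantage, clipped * advantage)
```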
GRPO-LEAD (Length and Difficulty Aware): Curriculum-style training with (1) a length penalty exp(-0.1 × |z_score|) on correct answers to encourage conciseness, and (2) difficulty weighting (2.0 - pass_rate) that up-weights harder problems (lower group pass rates) by up to 2x.
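A sketch of the combined shaping, assuming the length penalty applies only to correct answers and the difficulty weight scales the whole sample (illustrative form; `lead_reward` is not the framework's API):

```python
import math
import statistics

def lead_reward(base_reward, length, group_lengths, pass_rate):
    # Assumed GRPO-LEAD shaping: correct answers get a conciseness penalty
    # exp(-0.1 * |z|) on their length z-score within the group, and every
    # sample is scaled by the difficulty weight (2.0 - pass_rate).
    sigma = statistics.pstdev(group_lengths) or 1.0  # guard all-equal groups
    z = (length - statistics.fmean(group_lengths)) / sigma
    shaped = base_reward * math.exp(-0.1 * abs(z)) if base_reward > 0 else base_reward
    return shaped * (2.0 - pass_rate)
```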
Results
| Stage | GSM8K Accuracy | Training Time |
|---|---|---|
| Base model (zero-shot) | ~65% | — |
| + SFT cold start | ~72% | ~30 min |
| + GRPO | ~82% | ~10 hrs |
| + Dr. GRPO | ~84% | ~10 hrs |
A +19-percentage-point improvement over the base model with a 1.5B-parameter model, competitive with significantly larger models on the GSM8K grade-school math benchmark.
Monitoring & Evaluation
- Weights & Biases integration for tracking loss, mean reward, KL divergence, policy entropy, average response length, and accuracy per training step
- Standalone evaluation script with greedy and temperature-controlled decoding, reporting Pass@1 accuracy on the GSM8K test split
Tech Stack
Python, PyTorch, Hugging Face Transformers, PEFT (LoRA), bitsandbytes (NF4), Unsloth, Flash Attention 2, Accelerate, Weights & Biases