Video-DPO: Temporal Alignment for Video Diffusion via Direct Preference Optimization

Overview

Video-DPO adapts the Diffusion-DPO framework (Wallace et al., 2023) from image generation to video generation, targeting temporal consistency in AnimateDiff-based models. The core insight: temporal coherence can be framed as a preference-learning problem — the model learns to prefer smoothly transitioning videos over temporally jittery ones through synthetic preference pairs, without requiring a separate reward model. LoRA adapters are applied exclusively to motion-module attention layers, preserving spatial generation quality while improving temporal coherence. A memory-optimized adapter-toggling scheme halves GPU memory requirements by eliminating the need for a separate reference-model copy. On 47 evaluation samples, the method achieves a 25.7% reduction in optical-flow warping error and an 18.3% reduction in frame flickering.

Diffusion-DPO for Video

The method extends DPO to the diffusion setting by defining an implicit reward as the denoising quality gap between the policy and a frozen reference model:

r(x) = ||ref_pred − noise||² − ||policy_pred − noise||²

The DPO loss maximizes the reward margin between winner (temporally consistent) and loser (jittery) videos:

L_DPO = −log σ( β · (r_w − r_l) )

where β = 2500 (notably higher than typical LLM DPO betas of 0.1–0.5, reflecting the different reward scale in diffusion MSE space). A critical design choice: identical noise vectors and timesteps are shared between winner and loser samples within each training step to ensure a fair comparison — any reward difference is attributable purely to the latent quality, not random noise variation.
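As a concrete sketch of this loss with a shared noise target (the function name, tensor shapes, and batching convention are illustrative, not taken from the implementation):

```python
import torch
import torch.nn.functional as F

def video_dpo_loss(policy_w, policy_l, ref_w, ref_l, noise, beta=2500.0):
    """Diffusion-DPO loss with one shared noise target for winner and loser.

    Each *_w / *_l tensor is a noise prediction, e.g. [B, C, F, H, W];
    `noise` is the identical ground-truth noise added to both samples.
    """
    def reward(policy_pred, ref_pred):
        # Implicit reward: how much better the policy denoises than the reference.
        mse_ref = (ref_pred - noise).pow(2).flatten(1).mean(dim=1)
        mse_policy = (policy_pred - noise).pow(2).flatten(1).mean(dim=1)
        return mse_ref - mse_policy

    r_w = reward(policy_w, ref_w)
    r_l = reward(policy_l, ref_l)
    # -log sigmoid(beta * margin), averaged over the batch.
    return -F.logsigmoid(beta * (r_w - r_l)).mean()
```

With β this large, `F.logsigmoid` matters for numerical stability; a naive `torch.log(torch.sigmoid(...))` can underflow to −inf.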

Training operates entirely in VAE latent space — videos are pre-encoded offline as [C=4, F=16, H=64, W=64] tensors at 1/8 spatial resolution, decoupling expensive pixel-space operations from the training loop. Each .pt file stores (latents_w, latents_l, prompt_embeds) for efficient DataLoader throughput.
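A minimal dataset over such files might look like this (the class name, file layout, and dict keys are assumptions; the source only specifies the stored tuple):

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset

class LatentPreferenceDataset(Dataset):
    """Loads pre-encoded (latents_w, latents_l, prompt_embeds) tuples.

    Assumes each .pt file in `root` was saved as a dict with those three
    keys; the layout is illustrative.
    """
    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.pt"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = torch.load(self.paths[idx], map_location="cpu")
        return sample["latents_w"], sample["latents_l"], sample["prompt_embeds"]
```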

Preference Data Construction

Winners are standard AnimateDiff outputs with smooth temporal transitions. Losers are synthetically degraded with temporal jitter via two methods:

Noise Injection (default, fast): Independent per-frame Gaussian noise N(0, s²) plus random per-channel color shifts U(-s/4, s/4) with s = 0.15. Each frame receives different perturbations, breaking temporal coherence while preserving per-frame visual quality.
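The noise-injection degradation can be sketched as follows (function name and signature are illustrative):

```python
import torch

def degrade_temporal(latents, s=0.15, generator=None):
    """Synthesize a temporally jittery 'loser' from smooth 'winner' latents.

    latents: [C, F, H, W]. Each frame receives independent Gaussian noise
    N(0, s^2) plus a per-frame, per-channel color shift U(-s/4, s/4), so
    per-frame quality is preserved but frame-to-frame coherence breaks.
    """
    c, f, h, w = latents.shape
    noise = torch.randn(c, f, h, w, generator=generator) * s
    # (rand - 0.5) * (s/2) is uniform on [-s/4, s/4], broadcast over H, W.
    shift = (torch.rand(c, f, 1, 1, generator=generator) - 0.5) * (s / 2)
    return latents + noise + shift
```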

Img2Img Regeneration (higher quality): Each frame is independently regenerated through StableDiffusionImg2ImgPipeline with a unique random seed per frame — semantically coherent but temporally inconsistent because each frame’s generation is stochastically independent.

Multi-prompt training (8 diverse prompts covering cinematic shots, drone footage, slow motion, timelapse, wildlife, etc.) ensures the temporal preference signal generalizes across content types.

Architecture and LoRA Targeting

| Component | Specification |
| --- | --- |
| SD Backbone | epiCRealism (Stable Diffusion 1.5 fine-tune) |
| Motion Adapter | AnimateDiff v1.5.2 (guoyww/animatediff-motion-adapter-v1-5-2) |
| LoRA Targets | Motion module attention only: to_q, to_k, to_v, to_out.0 |
| LoRA Rank / Alpha | 32 / 64 (A100), 16 / 32 (L4), 4 / 8 (T4) |

The LoRA targeting strategy is critical: by training only temporal attention projections within motion_modules, the base UNet’s spatial generation weights remain frozen — the model learns better temporal coherence without degrading spatial fidelity. Target modules are identified via named module traversal with a regex fallback: r".*motion_modules.*(to_q|to_k|to_v|to_out\.0)$".
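Target discovery via the stated regex can be sketched as follows (the helper name is illustrative):

```python
import re

# Pattern from the text: temporal attention projections inside motion_modules.
PATTERN = re.compile(r".*motion_modules.*(to_q|to_k|to_v|to_out\.0)$")

def find_lora_targets(model):
    """Return fully qualified names of motion-module attention projections.

    `model` is any nn.Module (e.g. the AnimateDiff UNet); only names matching
    the regex are returned, leaving spatial attention layers untouched.
    """
    return [name for name, _ in model.named_modules() if PATTERN.match(name)]
```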

Memory-Optimized Reference Model

Two modes for computing the frozen reference model’s predictions:

Standard mode: Deep-copies the full UNet before LoRA injection (~8GB additional VRAM). Reference and policy are separate model copies.

Memory-optimized mode (adapter toggling): No deep copy. The reference forward pass temporarily disables LoRA adapters:

```python
model.disable_adapter_layers()   # LoRA off → frozen reference weights
with torch.no_grad():            # no gradients through the reference pass
    ref_pred = model(noisy, t, prompt_embeds)
model.enable_adapter_layers()    # LoRA on → trainable policy weights
policy_pred = model(noisy, t, prompt_embeds)
```

This saves ~50% GPU memory at the cost of additional forward passes, enabling training on 16GB GPUs (T4).
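One way to package the toggling safely is a small context manager (a sketch; it assumes a PEFT-wrapped model exposing the same `disable_adapter_layers`/`enable_adapter_layers` methods used above, and the name `as_reference` is illustrative):

```python
from contextlib import contextmanager

import torch

@contextmanager
def as_reference(model):
    """Run forward passes with LoRA disabled, i.e. as the frozen reference.

    The try/finally guarantees the adapters are re-enabled even if the
    reference forward pass raises, so the policy is never left corrupted.
    """
    model.disable_adapter_layers()
    try:
        with torch.no_grad():  # reference predictions never need gradients
            yield model
    finally:
        model.enable_adapter_layers()
```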

Training Configuration

| Parameter | Value |
| --- | --- |
| Optimizer | AdamW (lr=1e-5, weight_decay=0.01) |
| Batch size | 1 (4-step gradient accumulation → effective batch 4) |
| Mixed precision | fp16 via Accelerate |
| Max gradient norm | 1.0 |
| Noise scheduler | DDPM (1,000 diffusion timesteps) |
| Inference scheduler | DDIM (25 steps) |
| Checkpointing | Every 200 steps, LoRA-only saves via PEFT |
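One effective optimizer step under this configuration (accumulation 4, clipping at 1.0) can be sketched as follows; the helper name is illustrative, and Accelerate's fp16 handling is omitted for brevity:

```python
import torch

def train_step(model, batches, optimizer, loss_fn, accum_steps=4, max_grad_norm=1.0):
    """One effective step: accumulate gradients over micro-batches, clip, update.

    `loss_fn(model, batch)` returns a scalar loss (e.g. the DPO loss).
    """
    optimizer.zero_grad()
    total = 0.0
    for batch in batches[:accum_steps]:
        loss = loss_fn(model, batch) / accum_steps  # average over micro-batches
        loss.backward()                             # gradients accumulate in .grad
        total += loss.item()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return total
```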

At inference, LoRA weights are merged into the base UNet via PeftModel.merge_and_unload() for zero-overhead generation.
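Per layer, the merge amounts to folding the low-rank update into the frozen weight, W ← W + (α/r)·BA; a toy sketch of that arithmetic (names and shapes are illustrative, not PEFT internals):

```python
import torch

def merge_lora_weight(w, lora_a, lora_b, rank, alpha):
    """Fold a LoRA update into a frozen weight, as merging does per layer.

    w: [out, in] frozen weight, lora_a: [r, in], lora_b: [out, r].
    After merging there is no adapter indirection left at inference time.
    """
    return w + (alpha / rank) * (lora_b @ lora_a)
```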

Evaluation: Optical Flow Warping Error

Two temporal consistency metrics, both computed over consecutive frame pairs:

Warping Error (primary): Computes simplified Lucas-Kanade optical flow between frames t and t+1 (Sobel gradients + regularized least squares), warps frame t forward via F.grid_sample with bilinear interpolation, and measures MSE against actual frame t+1. Lower warping error indicates the model produces frame transitions that are well-predicted by optical flow — i.e., smooth, physically plausible motion.

Frame Difference: Average absolute pixel difference between consecutive frames — a direct measure of flickering intensity.
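Both metrics can be sketched as follows, with the optical flow taken as given (the Lucas-Kanade estimation itself is omitted; function names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """Warp `frame` [1, C, H, W] by `flow` [1, 2, H, W] (dx, dy in pixels)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = xs.float() + flow[:, 0]              # sample position per pixel
    grid_y = ys.float() + flow[:, 1]
    grid_x = 2.0 * grid_x / (w - 1) - 1.0         # normalize to [-1, 1]
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)  # [1, H, W, 2], (x, y) order
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

def warping_error(frame_t, frame_t1, flow):
    """MSE between flow-warped frame t and actual frame t+1 (lower = smoother)."""
    return F.mse_loss(warp_with_flow(frame_t, flow), frame_t1).item()

def frame_difference(frames):
    """Mean absolute pixel difference over consecutive frames [F, C, H, W]."""
    return (frames[1:] - frames[:-1]).abs().mean().item()
```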

Results

| Metric | Base Model | DPO Model | Improvement |
| --- | --- | --- | --- |
| Warping Error | 150.32 | 111.62 | −25.7% |
| Frame Difference | 7.20 | 5.88 | −18.3% |

Evaluated on 47 video samples generated with identical seeds for direct comparison — differences are attributable solely to the LoRA weights, not random variation. DPO loss converges from ~0.7 to near zero with consistent reward_w > reward_l separation throughout training.

Tech Stack

Python, PyTorch (2.1+), Diffusers (AnimateDiffPipeline, DDPMScheduler, DDIMScheduler), PEFT (LoRA), Hugging Face Transformers, Accelerate, OpenCV, imageio, NumPy