How Not to Give a FLOP: Combining Regularization and Structured Pruning

Overview

Published as arXiv:2003.13593, this project investigates the synergy between data-level regularization and structured filter pruning for efficient deep network inference. The central finding is that Cutout + Soft Filter Pruning achieves Pareto dominance over unpruned baselines: it improves accuracy by up to +2.03% while reducing FLOPs by ~15% across five ResNet depths (20/32/44/56/110) on CIFAR-10. The core insight is that regularization redistributes discriminative capacity more uniformly across filters, so that when the lowest-norm filters are zeroed by pruning, they are truly redundant rather than containing critical specialized features. The study evaluates a 5×6 experiment matrix (5 architectures × 6 configurations: baseline, Mixup-only, Cutout-only, pruning-only, Mixup+pruning, Cutout+pruning) totaling 30 experiments. Cutout+Pruning outperforms Mixup+Pruning at every depth, by 0.22–1.53 percentage points, even though Mixup is the stronger standalone regularizer on average. Implemented in PyTorch with PyTorch Lightning and Weights & Biases experiment tracking.

ResNet Architectures: CIFAR-Scale Family

All architectures follow He et al. (2015) with 3 stages, channel widths 16 → 32 → 64, and spatial resolutions 32×32 → 16×16 → 8×8. Each residual block uses two 3×3 convolutions with BatchNorm + ReLU and Option A shortcut connections (identity with zero-padding for dimension changes — no learnable projection parameters):

self.shortcut = LambdaLayer(lambda x:
    F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, planes//4, planes//4), "constant", 0))
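
The padding tuple in the snippet above is easy to misread, because F.pad lists pad widths starting from the last dimension: (W left, W right, H top, H bottom, C front, C back). The following is a minimal NumPy analogue of the same shortcut, not the project's code. The function name option_a_shortcut is hypothetical, and the sketch assumes NCHW arrays and a stage transition that doubles the channel count.

```python
import numpy as np


def option_a_shortcut(x, planes):
    """Parameter-free shortcut: halve H and W, then zero-pad the channel axis.

    x is an (N, C, H, W) array and planes is the output channel count
    (assumed here to be 2 * C, as in the 16 -> 32 -> 64 stage transitions).
    """
    pad = planes // 4                       # channels added on each side
    x = x[:, :, ::2, ::2]                   # stride-2 spatial subsampling
    return np.pad(x, ((0, 0), (pad, pad), (0, 0), (0, 0)),
                  mode="constant", constant_values=0)


x = np.ones((2, 16, 32, 32), dtype=np.float32)
out = option_a_shortcut(x, planes=32)
print(out.shape)
```

The original 16 channels end up in the middle of the 32-channel output, with 8 all-zero channels on either side.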

Model	Blocks per Stage	Total Layers	Parameters	MFLOPs
ResNet-20	[3, 3, 3]	20	~0.27M	40.6
ResNet-32	[5, 5, 5]	32	~0.46M	68.9
ResNet-44	[7, 7, 7]	44	~0.66M	97.2
ResNet-56	[9, 9, 9]	56	~0.85M	125.5
ResNet-110	[18, 18, 18]	110	~1.73M	252.9
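
The table's columns can be cross-checked with a few lines of arithmetic. The sketch below is an illustration rather than the project's code. It assumes bias-free 3×3 convolutions, BatchNorm with a learnable scale and shift per channel, parameter-free Option A shortcuts, stride-2 downsampling in the first convolution of stages 2 and 3, and that one multiply-accumulate counts as one FLOP. The function name cifar_resnet_stats is hypothetical.

```python
def cifar_resnet_stats(blocks_per_stage, widths=(16, 32, 64),
                       image_size=32, num_classes=10):
    """Return (depth, parameter count, multiply-accumulates) for one model."""
    # Two convolutions per block, three stages, plus the stem and classifier.
    depth = 6 * blocks_per_stage + 2

    # Stem: 3 -> widths[0] channels at full resolution, followed by BatchNorm.
    params = 3 * widths[0] * 9 + 2 * widths[0]
    macs = 3 * widths[0] * 9 * image_size ** 2

    in_ch, size = widths[0], image_size
    for stage, out_ch in enumerate(widths):
        if stage > 0:
            size //= 2                      # stages 2 and 3 halve the resolution
        for _ in range(blocks_per_stage):
            for conv_in in (in_ch, out_ch):  # first conv, then second conv
                params += conv_in * out_ch * 9 + 2 * out_ch
                macs += conv_in * out_ch * 9 * size ** 2
            in_ch = out_ch

    # Linear classifier on the globally pooled features (weights + biases).
    params += widths[-1] * num_classes + num_classes
    macs += widths[-1] * num_classes
    return depth, params, macs


for name, n in [("ResNet-20", 3), ("ResNet-32", 5), ("ResNet-44", 7),
                ("ResNet-56", 9), ("ResNet-110", 18)]:
    depth, params, macs = cifar_resnet_stats(n)
    print(f"{name}: {depth} layers, {params / 1e6:.2f}M params, "
          f"{macs / 1e6:.1f}M MACs")
```

Under these assumptions the totals round to the figures in the table, which suggests the MFLOPs column counts multiply-accumulates and ignores BatchNorm and ReLU.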

Stem: a single 3×3 Conv2D (16 channels, no max pooling). Head: global average pooling → linear classifier. All weights are initialized via Kaiming Normal (init.kaiming_normal_).

Two Regularization Techniques

Mixup (Zhang et al., 2018)

Applied during training_step() by creating convex combinations of input-label pairs within each minibatch:

x̃ = λ·xᵢ + (1−λ)·xⱼ,   λ ~ Beta(α, α),  α = 1.0
L_mixup = λ·L(pred, yᵢ) + (1−λ)·L(pred, yⱼ),   pred = f(x̃)

Permutation index generated via torch.randperm(batch_size). When α=1.0, lambda follows a uniform distribution over [0, 1].
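
The mixing step and the blended loss can be sketched as follows. This is a NumPy illustration under stated assumptions, not the project's training_step(): the function names are hypothetical, np.random.Generator.permutation stands in for torch.randperm, and a hand-written softmax cross-entropy stands in for the framework loss.

```python
import numpy as np


def mixup_batch(x, y, alpha=1.0, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # Beta(1, 1) is Uniform(0, 1)
    perm = rng.permutation(x.shape[0])      # partner index j for every i
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam


def cross_entropy(logits, targets):
    """Mean softmax cross-entropy for integer class targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def mixup_loss(logits, y_a, y_b, lam):
    """Weight the two label losses by the same lambda used for the inputs."""
    return (lam * cross_entropy(logits, y_a)
            + (1.0 - lam) * cross_entropy(logits, y_b))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 32, 32))
y = rng.integers(0, 10, size=8)
x_mixed, y_a, y_b, lam = mixup_batch(x, y, alpha=1.0, rng=rng)
logits = rng.normal(size=(8, 10))           # placeholder for model(x_mixed)
print(mixup_loss(logits, y_a, y_b, lam))
```

A single λ is drawn per minibatch here. Drawing one per sample is a common variant that the text above does not describe.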

Cutout (DeVries & Taylor, 2017)

Applied as a torchvision transform (after normalization, generating a different random mask per image access):

n_holes=1 square patch per image, length=16 pixels per side
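Random center (x, y) sampled uniformly over the image; the patch may extend past the image boundary, in which case its coordinates are clipped with np.clip, so the masked area can be smaller than 16×16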
Zeros out normalized pixel values, forcing the network to recognize objects from partial observations
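
The masking steps listed above can be sketched in NumPy as follows. This is a hedged illustration, not the project's transform: the function name cutout and the rng argument are hypothetical, and the input is assumed to be an already-normalized (C, H, W) array.

```python
import numpy as np


def cutout(img, length=16, n_holes=1, rng=None):
    """Zero out n_holes square patches of side `length` in a (C, H, W) image."""
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = img.shape
    mask = np.ones((h, w), dtype=img.dtype)
    for _ in range(n_holes):
        cy = rng.integers(h)                # patch center, uniform over the image
        cx = rng.integers(w)
        # Clip the patch to the image, so patches near a border are smaller.
        y1, y2 = np.clip([cy - length // 2, cy + length // 2], 0, h)
        x1, x2 = np.clip([cx - length // 2, cx + length // 2], 0, w)
        mask[y1:y2, x1:x2] = 0
    return img * mask                       # the same mask is applied to every channel


img = np.ones((3, 32, 32), dtype=np.float32)
masked = cutout(img, length=16, rng=np.random.default_rng(0))
print(int((masked[0] == 0).sum()), "pixels zeroed per channel")
```

With length=16 on a 32×32 image, the zeroed area ranges from 8×8 pixels (center in a corner) to the full 16×16.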

Design distinction: Cutout perturbs only the input (spatial masking) and leaves labels untouched, while Mixup blends both inputs and labels across samples (inter-class blending). This difference proves critical for pruning compatibility.

Soft Filter Pruning (He et al., 2018)

A structured pruning method operating at the filter level — entire convolutional filters are zeroed, not individual weights.

Algorithm (executed at end of each epoch):

L2-norm ranking: For each Conv2D layer, compute ‖Fᵢ‖₂ = √(Σ w²) across all kernel dimensions per filter
Selection: Sort by L2-norm; select the bottom (1 − pruning_rate) fraction for zeroing
Binary codebook masking: Multiply filter weights element-wise by {0, 1} mask
Soft reset: Unlike hard pruning, zeroed filters remain in the architecture and can be restored by backpropagation in subsequent epochs
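
The per-layer masking step above can be sketched as follows. This is a NumPy illustration under stated assumptions rather than the project's implementation. The function name soft_prune_filters is hypothetical, the weight layout is assumed to be (out_channels, in_channels, kH, kW), and truncating the pruned-filter count with int() is an assumption about how fractional counts are handled.

```python
import numpy as np


def soft_prune_filters(weight, pruning_rate=0.9):
    """Zero the lowest-L2-norm filters of one convolutional layer.

    pruning_rate follows the document's convention: it is the fraction of
    filters that is kept, so 0.9 zeroes roughly the weakest 10%.
    """
    num_filters = weight.shape[0]
    # One L2 norm per filter, taken over input channels and kernel positions.
    norms = np.sqrt((weight ** 2).sum(axis=(1, 2, 3)))
    num_pruned = int(num_filters * (1.0 - pruning_rate))
    pruned_idx = np.argsort(norms)[:num_pruned]

    # Binary {0, 1} mask with one entry per filter.
    mask = np.ones(num_filters, dtype=weight.dtype)
    mask[pruned_idx] = 0
    return weight * mask[:, None, None, None], mask


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 16, 3, 3))
w_pruned, mask = soft_prune_filters(w, pruning_rate=0.9)
print(int(mask.sum()), "of", mask.size, "filters kept")
```

The function returns a masked copy and leaves the layer shape unchanged. In a soft-pruning loop the masked weights would be written back at the end of each epoch and training would continue on all filters, so a zeroed filter can regain a nonzero norm before the next ranking.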

The “soft” aspect is key: the network continuously reallocates capacity, re-evaluating which filters to keep at every epoch — analogous to a dynamic lottery ticket search. With default pruning_rate=0.9 (90% retained), ~15% FLOP reduction is achieved uniformly across architectures.

Layer selection: Pruning applies to every convolutional layer except downsample shortcuts and the final classifier. For CIFAR-scale ResNets using Option A (zero-padding) shortcuts, no skip lists are needed since shortcuts have no learnable parameters.

Training Configuration

Parameter	Default	Cutout Variant
Optimizer	SGD (momentum=0.9, weight_decay=1e-4)	Same
Learning rate	0.1	0.1
LR schedule	MultiStepLR [100, 150], γ=0.1	MultiStepLR [60, 120, 160], γ=0.2
Batch size	128 (train), 100 (test)	Same
Epochs	200	200
Augmentation	RandomCrop(32, pad=4) + HorizontalFlip	Same + Cutout(n=1, l=16)
Pruning frequency	Every epoch	Every epoch
Pruning rate	0.9 (90% retained)	0.9
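
The two learning-rate schedules in the table can be compared with a closed-form sketch of step decay. This is plain Python written for illustration, not a call into PyTorch; the function name multistep_lr is hypothetical, and it assumes the rate is multiplied by γ once for every milestone that the epoch index has reached.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=0.1, milestones=(100, 150), gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    # bisect_right counts how many milestones are <= epoch.
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 60, 100, 120, 150, 160, 199):
    default = multistep_lr(epoch)
    cutout_variant = multistep_lr(epoch, milestones=(60, 120, 160), gamma=0.2)
    print(f"epoch {epoch:3d}: default {default:.4f}, cutout {cutout_variant:.4f}")
```

The default schedule ends at 0.001 after two tenfold drops, while the Cutout variant decays earlier and more gently, ending at 0.1 × 0.2³ = 0.0008.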

Results: 5×6 Experiment Matrix on CIFAR-10

Method	ResNet-20	ResNet-32	ResNet-44	ResNet-56	ResNet-110
Baseline	91.63%	92.11%	92.54%	92.49%	92.58%
Mixup	92.67%	93.39%	94.16%	94.15%	94.71%
Cutout	91.81%	93.48%	93.60%	93.86%	93.95%
Pruning only	91.47%	91.40%	92.23%	91.42%	91.88%
Mixup + Pruning	91.94%	93.06%	93.12%	93.80%	93.04%
Cutout + Pruning	92.87%	93.28%	94.12%	94.52%	94.57%

Accuracy Deltas vs. Baseline

Method	ResNet-20	ResNet-32	ResNet-44	ResNet-56	ResNet-110
Mixup only	+1.04	+1.28	+1.62	+1.66	+2.13
Cutout only	+0.18	+1.37	+1.06	+1.37	+1.37
Pruning only	−0.16	−0.71	−0.31	−1.07	−0.70
Mixup + Pruning	+0.31	+0.95	+0.58	+1.31	+0.46
Cutout + Pruning	+1.24	+1.17	+1.58	+2.03	+1.99
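
The delta table can be recomputed directly from the accuracy matrix, which also makes it easy to compare any two configurations. The sketch below only restates the numbers from the results table; the helper names deltas and mean are hypothetical.

```python
# Top-1 accuracy (%) for ResNet-20/32/44/56/110, copied from the results table.
results = {
    "Baseline":         [91.63, 92.11, 92.54, 92.49, 92.58],
    "Mixup":            [92.67, 93.39, 94.16, 94.15, 94.71],
    "Cutout":           [91.81, 93.48, 93.60, 93.86, 93.95],
    "Pruning only":     [91.47, 91.40, 92.23, 91.42, 91.88],
    "Mixup + Pruning":  [91.94, 93.06, 93.12, 93.80, 93.04],
    "Cutout + Pruning": [92.87, 93.28, 94.12, 94.52, 94.57],
}


def deltas(method, reference="Baseline"):
    """Per-architecture accuracy difference in percentage points."""
    return [round(a - b, 2)
            for a, b in zip(results[method], results[reference])]


def mean(values):
    return round(sum(values) / len(values), 2)


for method in results:
    if method != "Baseline":
        d = deltas(method)
        print(f"{method:17s} {d}  mean {mean(d):+.2f}")

# Head-to-head comparison of the two regularizer + pruning combinations.
print(deltas("Cutout + Pruning", reference="Mixup + Pruning"))
```

Passing a different reference, as in the last line, gives the per-depth gap between any pair of rows.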

Key Findings

Pruning alone degrades accuracy by 0.16–1.07% across all depths — removing filters without regularization loses critical features.
Cutout + Pruning achieves Pareto dominance: Better accuracy AND ~15% fewer FLOPs vs. unpruned baselines at every depth.
Cutout > Mixup for pruning compatibility: Cutout+Pruning outperforms Mixup+Pruning at every depth, by 0.22–1.53 percentage points (0.88 on average), even though Mixup is the stronger standalone regularizer on average (+1.55% vs. +1.07%). The proposed explanation is that Cutout’s spatial masking forces filter-level redundancy (each filter must be robust to missing input regions), while Mixup’s inter-class blending spreads information across classes but not necessarily across filters.
Benefits tend to grow with depth: the Cutout+Pruning improvement rises from +1.24% (ResNet-20) to +2.03% (ResNet-56) and +1.99% (ResNet-110), though not monotonically (ResNet-32 gains +1.17%). This suggests deeper networks have more prunable redundancy when properly regularized.
Near-free compression: Cutout+Pruning is on average 0.53 percentage points more accurate than Cutout-only (it trails only on ResNet-32, by 0.20), so the ~15% FLOP reduction comes essentially “for free” relative to Cutout alone.

FLOP Reduction Analysis

Analytical FLOP computation models the per-layer reduction: for the first convolution per block (only output channels pruned), FLOPs scale by retention rate r; for the second convolution (both input and output pruned), FLOPs scale by r².

Pruning Rate	Filter Retention	Approx. FLOP Reduction
0.9	90%	~15%
0.8	80%	~28%
0.7	70%	~40%
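
The approximate reductions in the table follow from the r and r² scaling described above. The sketch below is a simplified model, not the project's FLOP counter. It assumes the two convolutions of a block cost the same when unpruned (true except in the first block of stages 2 and 3) and ignores the stem and the classifier. The function names are hypothetical.

```python
def block_flop_fraction(retention):
    """Fraction of a residual block's FLOPs that remains after pruning."""
    first_conv = retention            # only output channels are pruned
    second_conv = retention ** 2      # input and output channels are pruned
    return (first_conv + second_conv) / 2


def flop_reduction(retention):
    """Approximate network-wide FLOP reduction at a given retention rate."""
    return 1.0 - block_flop_fraction(retention)


for r in (0.9, 0.8, 0.7):
    print(f"retention {r:.1f}: ~{100 * flop_reduction(r):.1f}% fewer FLOPs")
```

Under these assumptions the reduction is 1 − (r + r²)/2, which evaluates to 14.5%, 28.0%, and 40.5% for the three rows of the table.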

Tech Stack

Python (3.7+), PyTorch, PyTorch Lightning, Weights & Biases (experiment tracking), TensorBoard, NumPy