GANime: Generating Anime Characters from Sketches with Deep Learning
Overview
Published as arXiv:2508.09207 and developed as a Stanford CS230 project, GANime benchmarks four neural approaches to anime sketch colorization on 17,769 sketch-color image pairs at 256×256 resolution: optimization-based Neural Style Transfer (NST), single-pass Fast NST, paired conditional GAN (Pix2Pix), and unpaired cycle-consistent GAN (CycleGAN). The primary technical contribution is adding Total Variation (TV) regularization to the standard Pix2Pix objective, which suppresses color-bleeding artifacts at region boundaries and improves FID by a further 3.3% over vanilla Pix2Pix (227.9 → 220.5). The TV-regularized Pix2Pix reaches an SSIM of 0.756 and a 36.2% FID reduction relative to the NST baseline (345.5), while CycleGAN reaches an FID of 272.6 without requiring paired supervision. All architectures share a modular Pix2Pix_Net backbone that provides U-Net generators and PatchGAN discriminators, plus a custom InstanceNormalization Keras layer for CycleGAN's per-instance normalization. The project is implemented in TensorFlow 2.1 with @tf.function graph-mode training.
Four Colorization Approaches
| Model | Approach | Supervision | FID | SSIM |
|---|---|---|---|---|
| Neural Style Transfer | Per-image VGG19 optimization | Per-image (1,000 steps) | 345.5 | 0.655 |
| Fast NST | Pre-trained TF Hub forward pass | None (pretrained) | — | — |
| Pix2Pix (C-GAN) | Conditional adversarial + L1 | Paired | 227.9 | 0.747 |
| Pix2Pix + TV Loss | Conditional adversarial + L1 + TV | Paired | 220.5 | 0.756 |
| CycleGAN | Cycle-consistent adversarial | Unpaired | 272.6 | 0.724 |
U-Net Generator Architecture
The generator follows a modified U-Net with 8 encoder blocks, 7 decoder blocks, and a final transposed convolution, and it accepts flexible-resolution 3-channel inputs. All convolutional layers use tf.random_normal_initializer(0., 0.02) with use_bias=False.
Encoder (downsampling path) — each block: Conv2D(k=4, s=2, 'same') → Normalization → LeakyReLU:
| Block | Filters | Output (256×256 input) | Normalization |
|---|---|---|---|
| down1 | 64 | 128×128×64 | None |
| down2 | 128 | 64×64×128 | Yes |
| down3 | 256 | 32×32×256 | Yes |
| down4 | 512 | 16×16×512 | Yes |
| down5 | 512 | 8×8×512 | Yes |
| down6 | 512 | 4×4×512 | Yes |
| down7 | 512 | 2×2×512 | Yes |
| down8 | 512 | 1×1×512 | Yes (bottleneck) |
Decoder (upsampling path) — each block: Conv2DTranspose(k=4, s=2, 'same') → Normalization → [Dropout(0.5)] → ReLU → Concatenate with corresponding encoder skip connection:
| Block | Filters | After Skip Concat | Dropout |
|---|---|---|---|
| up1 | 512 | 2×2×1024 | 0.5 |
| up2 | 512 | 4×4×1024 | 0.5 |
| up3 | 512 | 8×8×1024 | 0.5 |
| up4 | 512 | 16×16×1024 | No |
| up5 | 256 | 32×32×512 | No |
| up6 | 128 | 64×64×256 | No |
| up7 | 64 | 128×128×128 | No |
Final layer: Conv2DTranspose(3, k=4, s=2, activation='tanh') → output (256, 256, 3) in [−1, 1]. The bottleneck output (down8) is excluded from skip connections, yielding 7 skip connections total that preserve fine-grained spatial details like hair strands, eye highlights, and clothing edges.
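The encoder and decoder tables can be cross-checked with a small shape trace. The sketch below is pure Python and does not build the actual Keras model; the names `ENCODER_FILTERS`, `DECODER_FILTERS`, and `trace_unet` are hypothetical and exist only for this illustration. It assumes an input size that halves cleanly eight times (such as 256).

```python
# Filter counts copied from the encoder and decoder tables above.
ENCODER_FILTERS = [64, 128, 256, 512, 512, 512, 512, 512]   # down1..down8
DECODER_FILTERS = [512, 512, 512, 512, 256, 128, 64]        # up1..up7


def trace_unet(size=256, in_channels=3, out_channels=3):
    """Return (stage name, spatial size, channels) for every generator stage."""
    rows = []
    skips = []
    # Encoder: each stride-2 'same' convolution halves the spatial size.
    for i, filters in enumerate(ENCODER_FILTERS, start=1):
        size //= 2
        rows.append((f"down{i}", size, filters))
        skips.append((size, filters))
    # The bottleneck (down8) is not a skip; the other seven are used deepest-first.
    skips = list(reversed(skips[:-1]))
    # Decoder: each stride-2 transposed convolution doubles the spatial size,
    # then the matching encoder activation is concatenated on the channel axis.
    for i, (filters, (skip_size, skip_channels)) in enumerate(
            zip(DECODER_FILTERS, skips), start=1):
        size *= 2
        assert size == skip_size, "skip connection must match spatially"
        rows.append((f"up{i}", size, filters + skip_channels))
    # Final transposed convolution restores the input resolution.
    size *= 2
    rows.append(("final", size, out_channels))
    return rows


for name, side, channels in trace_unet():
    print(f"{name:>6}: {side}x{side}x{channels}")
```

The trace reproduces the "After Skip Concat" column because each decoder block's channel count is its own filter count plus that of the encoder block it is paired with.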
70×70 PatchGAN Discriminator
Rather than classifying the entire image, the discriminator produces a 30×30 grid of patch predictions, each with a ~70×70 pixel receptive field — enforcing high-frequency structural accuracy. For Pix2Pix (conditional), input and target/generated images are concatenated along the channel dimension (6-channel input); for CycleGAN, only a single 3-channel image is input.
| Layer | Type | Filters | Kernel/Stride | Output |
|---|---|---|---|---|
| down1 | Conv2D | 64 | k=4, s=2 | 128×128×64 |
| down2 | Conv2D + Norm | 128 | k=4, s=2 | 64×64×128 |
| down3 | Conv2D + Norm | 256 | k=4, s=2 | 32×32×256 |
| zero_pad + conv | ZeroPad + Conv2D + Norm | 512 | k=4, s=1 | 31×31×512 |
| zero_pad + last | ZeroPad + Conv2D | 1 | k=4, s=1 | 30×30×1 |
No sigmoid activation — raw logits used with BinaryCrossentropy(from_logits=True).
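The 30×30 output grid and the 70×70 receptive field both follow from the layer table. The sketch below derives them with standard receptive-field arithmetic; it is pure Python, and `LAYERS` and `patchgan_geometry` are hypothetical names for this illustration. It assumes the zero-padding layers add one pixel on each side, which is what the 31×31 and 30×30 outputs in the table imply.

```python
# (name, kernel, stride, padding mode) for each convolution in the table above.
# "same" keeps size/stride; "pad1_valid" is ZeroPadding2D(1) followed by a 'valid' conv.
LAYERS = [
    ("down1", 4, 2, "same"),
    ("down2", 4, 2, "same"),
    ("down3", 4, 2, "same"),
    ("zero_pad + conv", 4, 1, "pad1_valid"),
    ("zero_pad + last", 4, 1, "pad1_valid"),
]


def patchgan_geometry(size=256):
    """Return (output grid size, receptive field in input pixels)."""
    receptive_field = 1   # one output unit initially sees one pixel
    jump = 1              # distance in input pixels between adjacent units
    for _, kernel, stride, mode in LAYERS:
        if mode == "same":
            size = -(-size // stride)                 # ceil(size / stride)
        else:
            size = (size + 2 - kernel) // stride + 1  # pad 1 per side, valid conv
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return size, receptive_field


grid, rf = patchgan_geometry(256)
print(f"output grid: {grid}x{grid}, receptive field: {rf}x{rf}")
```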
BatchNorm vs. InstanceNorm: Normalization Design
A critical architectural distinction: Pix2Pix uses BatchNormalization, while CycleGAN uses a custom InstanceNormalization Keras layer that computes statistics over the spatial axes [1, 2] separately for each sample and each channel:
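
    import tensorflow as tf
    from tensorflow import keras

    class InstanceNormalization(keras.layers.Layer):
        """Instance normalization with a learnable per-channel scale and offset."""

        def build(self, input_shape):
            # One scale and one offset per channel (last axis).
            self.scale = self.add_weight(name='scale', shape=input_shape[-1:],
                                         initializer=tf.random_normal_initializer(1., 0.02))
            self.offset = self.add_weight(name='offset', shape=input_shape[-1:],
                                          initializer='zeros')

        def call(self, x):
            # Statistics over the spatial axes only: one mean/variance per sample and channel.
            mean, variance = tf.nn.moments(x, axes=[1, 2], keepdims=True)
            return self.scale * (x - mean) * tf.math.rsqrt(variance + 1e-5) + self.offset
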
Learnable affine parameters: scale ~ N(1.0, 0.02), offset = 0. This avoids the TensorFlow Addons dependency and provides full control over initialization — important because InstanceNorm prevents cross-sample style leakage in CycleGAN’s unpaired training regime.
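To make the difference in normalization axes concrete, the following NumPy sketch contrasts per-instance statistics (axes 1 and 2) with batch statistics (axes 0, 1, and 2). NumPy stands in for TensorFlow here, the affine parameters are left at their identity values, and the function names are hypothetical.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Normalize a (batch, height, width, channels) array per sample and per channel."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def batch_stat_norm(x, eps=1e-5):
    """Normalize with statistics shared across the whole batch (training-mode batch norm)."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8, 3))
x[1] += 5.0  # give the second sample a very different brightness

inst = instance_norm(x)
batch = batch_stat_norm(x)

# Instance norm centers every sample on its own; batch statistics do not.
print("instance-norm per-sample means:", inst.mean(axis=(1, 2, 3)))
print("batch-stat   per-sample means:", batch.mean(axis=(1, 2, 3)))
```

With batch statistics, the brightness offset of one sample shifts the normalized values of the other, which is the cross-sample coupling that per-instance normalization avoids.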
Loss Functions
Pix2Pix: Adversarial + L1 + TV Regularization
L_G = L_adversarial + 100 · L_L1 + 1e-4 · L_TV
| Component | Weight | Formula |
|---|---|---|
| Adversarial | 1.0 | BCE(ones_like(D(G(x))), D(G(x))) — fool discriminator |
| L1 Reconstruction | 100.0 | mean(\|target − G(x)\|) — pixel-wise fidelity |
| Total Variation | 1e-4 | mean(\|∇_x G\|) + mean(\|∇_y G\|) — spatial smoothness |
TV regularization (the novel contribution) computes L1-norm finite differences between horizontally and vertically adjacent pixels, which suppresses artifacts and color bleeding at region boundaries. The implementation supports both L1 (anisotropic) and L2 (isotropic) variants via the --norm-tv-loss flag. The 3.3% FID improvement (227.9 → 220.5) supports its effectiveness for anime colorization, where sharp color boundaries between hair, skin, and clothing are critical.
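As a rough illustration of how the three terms combine, the sketch below implements the L1 (anisotropic) TV term and the weighted generator objective in NumPy. It follows the formulas in the table rather than the project's TensorFlow code; the function names are hypothetical, and the binary cross-entropy helper is a standard numerically stable formulation, not necessarily the one Keras uses internally.

```python
import numpy as np

LAMBDA_L1 = 100.0   # weight of the L1 reconstruction term (from the table)
LAMBDA_TV = 1e-4    # weight of the TV term (from the table)


def bce_from_logits(labels, logits):
    """Mean binary cross-entropy on raw logits, written in a numerically stable form."""
    return float(np.mean(np.maximum(logits, 0) - logits * labels
                         + np.log1p(np.exp(-np.abs(logits)))))


def tv_loss_l1(img):
    """Anisotropic TV: mean |horizontal difference| + mean |vertical difference|."""
    dy = img[:, 1:, :, :] - img[:, :-1, :, :]   # vertically adjacent pixels
    dx = img[:, :, 1:, :] - img[:, :, :-1, :]   # horizontally adjacent pixels
    return float(np.mean(np.abs(dx)) + np.mean(np.abs(dy)))


def generator_loss(disc_logits_fake, generated, target):
    """Return (total, adversarial, l1, tv) for one batch."""
    adv = bce_from_logits(np.ones_like(disc_logits_fake), disc_logits_fake)
    l1 = float(np.mean(np.abs(target - generated)))
    tv = tv_loss_l1(generated)
    return adv + LAMBDA_L1 * l1 + LAMBDA_TV * tv, adv, l1, tv


# Tiny example: a 2x2 single-channel image with a vertical edge.
edge = np.array([[0.0, 1.0], [0.0, 1.0]]).reshape(1, 2, 2, 1)
flat = np.zeros((1, 2, 2, 1))
print("TV(edge) =", tv_loss_l1(edge), " TV(flat) =", tv_loss_l1(flat))
```

Because the TV weight is six orders of magnitude smaller than the L1 weight, the term acts as a mild smoothness prior rather than competing with pixel-wise fidelity.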
CycleGAN: Adversarial + Cycle Consistency + Identity
L_G = L_adversarial(G) + 10 · L_cycle(X→Y→X) + 10 · L_cycle(Y→X→Y) + 5 · L_identity(G)
Cycle consistency loss enforces F(G(x)) ≈ x and G(F(y)) ≈ y (L1 distance), enabling unpaired training. Identity loss \|G(y) − y\| preserves color when the input is already in the target domain. Discriminator losses use a 0.5 scaling factor (distinct from Pix2Pix’s unscaled discriminators).
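The weighting above can be sketched in NumPy with toy stand-ins for the networks. Everything named here (`generator_loss`, `discriminator_loss`, the lambda "generators" and "discriminator") is hypothetical and only illustrates how the adversarial, cycle, and identity terms are weighted; it is not the project's training step.

```python
import numpy as np

LAMBDA_CYCLE = 10.0      # weight on each cycle-consistency term
LAMBDA_IDENTITY = 5.0    # weight on the identity term


def l1(a, b):
    return float(np.mean(np.abs(a - b)))


def bce_from_logits(labels, logits):
    """Mean binary cross-entropy on raw logits (numerically stable form)."""
    return float(np.mean(np.maximum(logits, 0) - logits * labels
                         + np.log1p(np.exp(-np.abs(logits)))))


def generator_loss(gen, other_gen, disc, source, target):
    """Loss for `gen` (source -> target domain); `other_gen` maps back."""
    fake = gen(source)
    logits = disc(fake)
    adversarial = bce_from_logits(np.ones_like(logits), logits)
    # Both cycles: source -> target -> source and target -> source -> target.
    cycle = l1(source, other_gen(fake)) + l1(target, gen(other_gen(target)))
    identity = l1(target, gen(target))
    return adversarial + LAMBDA_CYCLE * cycle + LAMBDA_IDENTITY * identity


def discriminator_loss(disc, real, fake):
    """Real/fake BCE with the 0.5 scaling factor mentioned above."""
    real_logits, fake_logits = disc(real), disc(fake)
    return 0.5 * (bce_from_logits(np.ones_like(real_logits), real_logits)
                  + bce_from_logits(np.zeros_like(fake_logits), fake_logits))


# Toy stand-ins: G shifts brightness up, F shifts it back, D is undecided (logit 0).
G = lambda img: img + 0.5
F = lambda img: img - 0.5
D_y = lambda img: np.zeros((img.shape[0], 30, 30, 1))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(2, 16, 16, 3))
y = rng.uniform(-1, 1, size=(2, 16, 16, 3))
print("generator loss:", generator_loss(G, F, D_y, x, y))
print("discriminator loss:", discriminator_loss(D_y, y, G(x)))
```

In this toy setup the cycle terms vanish because F exactly inverts G, so the generator loss reduces to the adversarial term plus the identity penalty for altering an image that is already in the target domain.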
Neural Style Transfer: Gram Matrix Optimization
L = (1e-2 / 5) · Σ MSE(Gram(style_out), Gram(style_target))
+ (1e4 / 1) · Σ MSE(content_out, content_target)
Style layers: block{1-5}_conv1 (5 layers). Content layer: block5_conv2. Gram matrices are computed via Einstein summation (tf.linalg.einsum('bijc,bijd->bcd')) and normalized by the spatial dimensions. Optimization uses Adam (lr=0.02, β₁=0.99, ε=0.1) for 1,000 steps × 100 epochs per image.
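The Gram-matrix step and the weighted objective can be sketched with NumPy's `einsum`, which accepts the same subscript string. This is an illustration under assumptions: random arrays stand in for VGG19 activations, and `gram_matrix` and `style_content_loss` are hypothetical names that mirror the formula above rather than the project's code.

```python
import numpy as np

STYLE_WEIGHT = 1e-2
CONTENT_WEIGHT = 1e4


def gram_matrix(features):
    """(batch, h, w, c) feature maps -> (batch, c, c) Gram matrices, divided by h*w."""
    _, height, width, _ = features.shape
    gram = np.einsum("bijc,bijd->bcd", features, features)
    return gram / (height * width)


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def style_content_loss(style_out, style_target, content_out, content_target):
    """Weighted sum of Gram-matrix MSE (style) and feature MSE (content).

    Each argument is a list of feature maps, one per layer.
    """
    style = sum(mse(gram_matrix(o), gram_matrix(t))
                for o, t in zip(style_out, style_target))
    content = sum(mse(o, t) for o, t in zip(content_out, content_target))
    return (STYLE_WEIGHT / len(style_out)) * style \
        + (CONTENT_WEIGHT / len(content_out)) * content


# Cross-check the einsum against an explicit matrix product on a tiny example.
features = np.arange(8, dtype=float).reshape(1, 2, 2, 2)
flattened = features.reshape(4, 2)            # (h*w, c)
expected = flattened.T @ flattened / 4.0      # (c, c)
print(np.allclose(gram_matrix(features)[0], expected))
```

Dividing by the number of layers keeps the style and content weights comparable regardless of how many layers contribute to each term.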
Dataset: 17,769 Anime Sketch-Color Pairs
| Split | Count | Ratio |
|---|---|---|
| Training | 14,224 | 80% |
| Test | 3,545 | 20% |
| Total | 17,769 | 100% |
Source: Kaggle ktaebum/anime-sketch-colorization-pair, automated download via Kaggle API. Each source image is a horizontally concatenated color | sketch pair, split at the midpoint and resized to 256×256 via nearest-neighbor interpolation.
Training augmentation (via @tf.function-decorated random_jitter()): resize to 286×286 → random crop to 256×256 → random horizontal flip (50%). All pixels normalized to [−1, 1] via (pixel / 127.5) − 1.
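The preprocessing steps can be sketched in NumPy as follows. This is not the project's `random_jitter()`; the helper names are hypothetical, and the nearest-neighbor resize uses simple floor indexing, which may pick slightly different source pixels than `tf.image.resize`. The point it illustrates is that the crop offset and the flip decision must be shared between the color image and its sketch so the pair stays aligned.

```python
import numpy as np


def split_pair(combined):
    """Split a side-by-side image at the horizontal midpoint into (color, sketch)."""
    mid = combined.shape[1] // 2
    return combined[:, :mid], combined[:, mid:]


def resize_nearest(img, height, width):
    """Nearest-neighbor resize using floor indexing (an approximation of tf.image.resize)."""
    rows = (np.arange(height) * img.shape[0] / height).astype(int)
    cols = (np.arange(width) * img.shape[1] / width).astype(int)
    return img[rows][:, cols]


def normalize(img):
    """Map uint8 pixel values [0, 255] to floats in [-1, 1]."""
    return img.astype(np.float32) / 127.5 - 1.0


def random_jitter(color, sketch, rng, enlarged=286, crop=256):
    """Resize to 286x286, take one shared random 256x256 crop, flip both with p=0.5."""
    color = resize_nearest(color, enlarged, enlarged)
    sketch = resize_nearest(sketch, enlarged, enlarged)
    top = rng.integers(0, enlarged - crop + 1)
    left = rng.integers(0, enlarged - crop + 1)
    color = color[top:top + crop, left:left + crop]
    sketch = sketch[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        color, sketch = color[:, ::-1], sketch[:, ::-1]
    return color, sketch


rng = np.random.default_rng(0)
combined = rng.integers(0, 256, size=(512, 1024, 3), dtype=np.uint8)  # synthetic stand-in
color, sketch = split_pair(combined)
color, sketch = random_jitter(color, sketch, rng)
color, sketch = normalize(color), normalize(sketch)
print(color.shape, sketch.shape, color.min(), color.max())
```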
Training Configuration
| Parameter | Pix2Pix | CycleGAN | NST |
|---|---|---|---|
| Optimizer | Adam | Adam | Adam |
| Learning rate | 2e-4 | 2e-4 | 0.02 |
| β₁ / β₂ | 0.5 / 0.999 | 0.5 / 0.999 | 0.99 / 0.999 |
| Batch size | 32 | 8 | 1 |
| Epochs | 150 | 150 | 1,000 steps × 100 |
| Image size | 256×256 | 256×256 | max 512px |
| Checkpoint freq | Every 5 epochs | Every 5 epochs | — |
| Networks | 2 (G + D) | 4 (G_g, G_f, D_x, D_y) | 0 (VGG19 frozen) |
CycleGAN uses tf.GradientTape(persistent=True) since gradients for all four networks must be computed from the same forward pass. Both GAN models use @tf.function-decorated training steps for graph-mode execution with TensorBoard logging for all loss components.
Evaluation
FID (Fréchet Inception Distance): InceptionV3 (include_top=False, pooling='avg') extracts 2048-dimensional features; FID = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2√(Σ₁Σ₂)) via scipy.linalg.sqrtm.
SSIM: tf.image.ssim(max_val=255, filter_size=11, filter_sigma=1.5, k1=0.01, k2=0.03), averaged across all test pairs.
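The FID formula can be checked on synthetic features. The sketch below is NumPy-only and makes one substitution that should be read as an assumption: instead of `scipy.linalg.sqrtm`, it obtains Tr(√(Σ₁Σ₂)) as the sum of the square roots of the eigenvalues of Σ₁Σ₂, which gives the same trace for covariance matrices. Random vectors stand in for InceptionV3 features, and `frechet_distance` is a hypothetical name.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(Sigma_a Sigma_b)) via eigenvalues; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))   # stand-in for 2048-dimensional Inception features
shifted = real + 2.0               # same covariance, mean moved by 2 in each of 8 dims

print("identical sets:", frechet_distance(real, real))
print("shifted set:   ", frechet_distance(real, shifted))
```

When the two sets share a covariance matrix, the trace term cancels and the distance reduces to the squared distance between the means, which is a convenient sanity check for any FID implementation.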
| Model | FID ↓ | SSIM ↑ | FID Reduction vs. NST |
|---|---|---|---|
| Neural Style Transfer | 345.5 | 0.655 | baseline |
| CycleGAN | 272.6 | 0.724 | 21.1% |
| Pix2Pix | 227.9 | 0.747 | 34.0% |
| Pix2Pix + TV Loss | 220.5 | 0.756 | 36.2% |
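The "FID Reduction vs. NST" column can be reproduced from the FID values in the table. The short check below is plain Python; `FID` and `reduction_vs` are hypothetical names used only here.

```python
# FID values copied from the results table above.
FID = {
    "Neural Style Transfer": 345.5,
    "CycleGAN": 272.6,
    "Pix2Pix": 227.9,
    "Pix2Pix + TV Loss": 220.5,
}


def reduction_vs(baseline, score):
    """Relative FID reduction in percent (lower FID is better)."""
    return 100.0 * (baseline - score) / baseline


baseline = FID["Neural Style Transfer"]
for model, score in FID.items():
    if model != "Neural Style Transfer":
        print(f"{model}: {reduction_vs(baseline, score):.1f}% lower FID than NST")
```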
Key findings: (1) Paired supervision (Pix2Pix) clearly outperforms the unpaired approach (CycleGAN), with a 44.7-point FID gap (272.6 vs. 227.9). (2) TV regularization improves both FID and SSIM. (3) SSIM plateaus around epoch 10 while FID continues improving until epoch 35, suggesting that structural fidelity converges before perceptual quality. (4) PatchGAN effectively preserves fine details in hair, eyes, and clothing.
Tech Stack
Python (3.7+), TensorFlow (2.1), Keras, VGG19 (feature extraction), InceptionV3 (FID evaluation), TensorFlow Hub (Fast NST), SciPy (matrix square root), OpenCV, Kaggle API, TensorBoard, AWS EC2, Google Colab