FlapAI Bird: Deep Reinforcement Learning for Flappy Bird

Overview

Published as arXiv:2003.09579, FlapAI Bird implements six progressively sophisticated value-based RL algorithms — from a random baseline through tabular SARSA and Q-Learning, linear function approximation, feed-forward DQN, to convolutional DQN with experience replay — all evaluated on the same Flappy Bird environment via OpenAI Gym wrappers around the PyGame Learning Environment (PLE). The system achieves superhuman performance of 2,069 pipes (vs. human baseline of ~20–50) after 10,000 training episodes. Two key design innovations drive performance: (1) backward trajectory processing that propagates the −1000 death penalty in reverse chronological order for dramatically faster credit assignment, and (2) biased exploration priors via a tunable probFlap parameter (default 0.1) encoding domain knowledge that no-flap is the safer action. The convolutional DQN operates directly on 80×80 binary-thresholded pixel frames with a 50,000-transition experience replay buffer, while tabular methods use a 3-feature discretized state representation. All agents share a unified Gym wrapper interface with shaped rewards (+0.5 survival, +5.0 pipe passage, −1000 death) and undiscounted returns (γ=1.0).

Six RL Algorithm Progression

| Agent | Algorithm | State Input | Value Storage | Parameters | Update |
|---|---|---|---|---|---|
| Baseline | Random (probFlap) | None | None | 0 | None |
| SARSA | On-policy TD(0) | Discretized string | Dict | O(\|S\|×\|A\|) | Tabular |
| Q-Learning | Off-policy TD(0) | Discretized string | Dict | O(\|S\|×\|A\|) | Tabular |
| Linear FA | Semi-gradient TD | 5 normalized features | Weight vector | 5 | Manual SGD |
| DNN-DQN | DQN (3-layer MLP) | 3 continuous | Neural net | 1,262 | PyTorch SGD |
| CNN-DQN | DQN + Experience Replay | 80×80 pixels | Neural net | 31,746 | PyTorch Adam |

State Representations

The system provides four specialized Gym wrappers (FlappyBirdNormal, FlappyBirdLR, FlappyBirdDNN, FlappyBirdCNN), each exposing algorithm-appropriate state representations from the same underlying PLE environment:

Tabular methods (SARSA, Q-Learning): Three features — player_vel, hor_dist_to_next_pipe, ver_dist_to_next_pipe — each discretized via discretize(x, rounding) = rounding × ⌊x / rounding⌋, then concatenated into a string key for dictionary-based Q-tables. The rounding parameter controls a bias-variance trade-off: too fine yields sparse tables with slow convergence; too coarse causes state aliasing.
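
The discretization and string-key encoding can be sketched as follows (feature names follow the PLE state dict; the default rounding value here is an illustrative assumption):

```python
# Floor each feature to the nearest multiple of `rounding`.
def discretize(x, rounding):
    return rounding * (x // rounding)

# Join the three discretized features into a single string key,
# suitable for a defaultdict(float) Q-table.
def encode_state(state, rounding=10):
    feats = (
        discretize(state["player_vel"], rounding),
        discretize(state["hor_dist_to_next_pipe"], rounding),
        discretize(state["ver_dist_to_next_pipe"], rounding),
    )
    return " ".join(str(f) for f in feats)
```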

Linear FA: Five normalized features — player_vel/10, hor_dist/288 (screen width), ver_dist/512 (screen height), action, bias — with sparse dictionary-based weight vectors and manual semi-gradient updates.

DNN-DQN: Raw 3D continuous vector [player_vel, hor_dist, ver_dist] without normalization.

CNN-DQN: 80×80 binary-thresholded pixel frames (see Frame Preprocessing Pipeline below).

Reward Shaping

| Event | Reward | Purpose |
|---|---|---|
| Surviving a timestep | +0.5 | Continuous survival incentive |
| Passing through a pipe | +5.0 | Milestone reward (PLE returns +1, reshaped to +5) |
| Death/collision | −1000 | Large terminal penalty for credit assignment |

With undiscounted returns (γ=1.0), the −1000 death penalty treats all future rewards equally, appropriate for Flappy Bird where survival at any future point is equally valuable — but Q-values can grow unboundedly for long episodes.
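
A minimal sketch of the reward reshaping applied by the wrappers, assuming PLE's raw convention of +1 per pipe and 0 otherwise (the function name is illustrative):

```python
# Map PLE's raw reward and the terminal flag to the shaped reward
# used by all agents: +0.5 survival, +5.0 per pipe, -1000 on death.
def shape_reward(raw_reward, done):
    if done:
        return -1000.0      # large terminal penalty for credit assignment
    if raw_reward > 0:      # PLE gives +1 for passing a pipe
        return 5.0          # reshaped milestone reward
    return 0.5              # per-timestep survival bonus
```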

Tabular Methods: SARSA vs. Q-Learning

Both use defaultdict(float) Q-tables with the same epsilon-greedy exploration, differing only in their TD targets:

SARSA (on-policy): Q(s,a) ← (1−η)·Q(s,a) + η·(r + γ·Q(s',a')) where a' is the action actually selected by the current epsilon-greedy policy — the next action is chosen before the update, making transitions (s, a, r, s', a').

Q-Learning (off-policy): Q(s,a) ← (1−η)·Q(s,a) + η·(r + γ·max_{a'} Q(s',a')) using the greedy max over all next actions regardless of the exploration policy.

This makes SARSA more conservative — it accounts for the “cost of exploration” since it evaluates the policy it’s following, while Q-Learning always targets the optimal policy.
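
The two updates can be placed side by side on a shared defaultdict Q-table (η and the action set match the defaults described elsewhere in this document; helper names are illustrative):

```python
from collections import defaultdict

# On-policy SARSA: bootstrap from the action a_next actually chosen
# by the epsilon-greedy policy.
def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.8, gamma=1.0):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target

# Off-policy Q-Learning: bootstrap from the greedy max over next actions,
# regardless of which action the exploration policy takes.
def q_learning_update(Q, s, a, r, s_next, actions=(0, 1), eta=0.8, gamma=1.0):
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
```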

Backward Trajectory Processing

One of the most impactful design decisions: transitions within an episode can be processed in reverse chronological order (--order backward). The large −1000 death penalty at episode termination is propagated backward through the trajectory before earlier transitions are updated, dramatically improving credit assignment for long episodes where the death signal would otherwise take many episodes to propagate to early states through forward processing alone. Q-Learning with backward updates achieves the strongest overall performance.
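
A sketch of the backward pass, assuming the episode's transitions are buffered and then replayed in reverse with the tabular Q-Learning update (function name is illustrative):

```python
from collections import defaultdict

# Replay one episode's transitions newest-first so the terminal penalty
# reaches early states within a single episode.
def replay_backward(Q, trajectory, eta=0.8, gamma=1.0, actions=(0, 1)):
    # trajectory: list of (s, a, r, s_next, done) in chronological order
    for s, a, r, s_next, done in reversed(trajectory):
        bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * bootstrap)
```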

Neural Network Architectures

Feed-Forward DNN (1,262 parameters)

Input [3] → Linear(3→50) → ReLU → Linear(50→20) → ReLU → Linear(20→2) → ReLU → Output [2]

Per-transition online updates via SGD (lr=0.1), no experience replay. Uses MSE loss between Q(s,a) and r + γ·max_{a'} Q(s',a').
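
A minimal PyTorch sketch of this architecture and its per-transition update (hyperparameters from the section above; note the trailing ReLU, as described, clamps Q-estimates to be non-negative):

```python
import torch
import torch.nn as nn

# 3 -> 50 -> 20 -> 2 MLP with ReLU after every layer, 1,262 parameters.
net = nn.Sequential(
    nn.Linear(3, 50), nn.ReLU(),
    nn.Linear(50, 20), nn.ReLU(),
    nn.Linear(20, 2), nn.ReLU(),
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# One online update per transition, no replay buffer.
def online_update(s, a, r, s_next, done, gamma=1.0):
    with torch.no_grad():  # TD target: r + gamma * max_a' Q(s', a')
        target = r + gamma * (0.0 if done else net(s_next).max().item())
    q_sa = net(s)[a]
    loss = loss_fn(q_sa, torch.tensor(target))
    opt.zero_grad()
    loss.backward()
    opt.step()
```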

Convolutional DQN (31,746 parameters)

Input [1×80×80] → Conv2d(1→16, k=5, s=2) → ReLU → Conv2d(16→32, k=5, s=2) → ReLU
  → Flatten [9,248] → Linear(9248→2) → Output [2]
| Layer | Input → Output | Config | Parameters |
|---|---|---|---|
| Conv2d 1 | 1×80×80 → 16×38×38 | k=5, s=2 | 416 |
| Conv2d 2 | 16×38×38 → 32×17×17 | k=5, s=2 | 12,832 |
| Linear | 9,248 → 2 | — | 18,498 |

Deliberately compact compared to the original Atari DQN (~1.7M parameters). Uses Adam optimizer, minibatch size 32, and 50,000-transition experience replay buffer with 5,000-step warm-up. Actions encoded as one-hot vectors via torch.eye(2) with Q-value selection via element-wise multiplication — an elegant vectorized alternative to index-based selection. The survived mask (1 − done) zeros out the bootstrap term for terminal states.
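
The one-hot selection and terminal masking can be sketched on a batch (the function name is illustrative; tensors follow the transition layout described below):

```python
import torch

# Batched Q-value selection via torch.eye one-hot encoding, plus the
# survived mask (1 - done) that zeroes the bootstrap at terminals.
def batch_targets_and_q(q_values, q_next, actions, rewards, survived, gamma=1.0):
    onehot = torch.eye(2)[actions]            # [B, 2] one-hot action rows
    q_sa = (q_values * onehot).sum(dim=1)     # Q(s, a) per sample
    target = rewards + gamma * survived * q_next.max(dim=1).values
    return q_sa, target
```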

Experience Replay (CNN Agent)

| Property | Specification |
|---|---|
| Buffer | collections.deque(maxlen=50,000) — FIFO ring buffer |
| Transition | (state, action_onehot, reward, nextState, survived) |
| Sampling | Uniform random without replacement |
| Warm-up | 5,000 transitions before training begins |
| Batch size | 32 |

During warm-up, the agent explores and stores transitions but does not call updateWeights, ensuring sufficient replay diversity before gradient updates start.
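
A minimal sketch of such a buffer, using the constants from the table above (the class and method names are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=50_000, warmup=5_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted FIFO
        self.warmup = warmup

    def push(self, transition):
        self.buffer.append(transition)

    def ready(self):
        # Gate gradient updates until the warm-up fill level is reached.
        return len(self.buffer) >= self.warmup

    def sample(self, batch_size=32):
        # Uniform random sampling without replacement.
        return random.sample(self.buffer, batch_size)
```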

Frame Preprocessing Pipeline

The CNN agent processes raw game frames through a torchvision.transforms.Compose chain:

  1. Crop: frame[:404] — removes bottom score display from the 512-pixel-high frame
  2. Grayscale: 3 RGB channels → 1 intensity channel
  3. Resize: ~288×404 → 80×80 pixels (significant spatial compression)
  4. Binary threshold: fixed cutoff at intensity 128 — all pixels >128 become 255 (white), others 0 (black)
  5. Tensor conversion: ToTensor() normalizes to [0, 1]
  6. Batch dimension: Final output [1, 1, 80, 80]

The aggressive binarization eliminates color variation, texture noise, and lighting effects, reducing the visual problem to pure structural/positional information. Unlike the original Atari DQN which stacks 4 consecutive frames to capture motion, this implementation uses single frames only — the network must infer dynamics purely from static positions.
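
The steps above can be approximated in a dependency-light NumPy sketch (the actual pipeline uses a torchvision Compose chain; the nearest-neighbor resize here stands in for torchvision's Resize):

```python
import numpy as np

# Crop -> grayscale -> resize -> binarize -> scale -> add batch/channel dims.
def preprocess(frame):
    frame = frame[:404]                      # 1. crop off the score display
    gray = frame.mean(axis=2)                # 2. RGB -> single intensity channel
    rows = np.linspace(0, gray.shape[0] - 1, 80).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 80).astype(int)
    small = gray[np.ix_(rows, cols)]         # 3. nearest-neighbor resize to 80x80
    binary = np.where(small > 128, 255.0, 0.0)  # 4. fixed threshold at 128
    return (binary / 255.0)[None, None]      # 5-6. scale to [0,1], shape [1,1,80,80]
```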

Exploration: Biased Epsilon-Greedy

All agents use biased random action selection during exploration:

def randomAct():
    return 0 if random.random() < probFlap else 1  # flap only 10% of the time

Rather than uniform random actions, the probFlap parameter (default 0.1) encodes domain knowledge that no-flap is the safer action — a form of informed exploration prior. With probability ε, the agent executes this biased random action; otherwise, it selects the greedy action (ties broken by randomAct).

Decay schedules (Robbins-Monro conditions): ε_t = ε₀ / (t + 1) (harmonic epsilon decay), η_t = η · (t+1) / (t+2) (slow learning rate decay).
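
Putting the pieces together, the full selection rule looks roughly like this (the ε₀ default below is an assumption, as is the function name; action 0 = flap, 1 = no-flap, matching the snippet above):

```python
import random

# Biased random action: flap only with probability prob_flap.
def random_act(prob_flap=0.1):
    return 0 if random.random() < prob_flap else 1

# Epsilon-greedy with harmonic decay and biased exploration / tie-breaking.
def select_action(q_values, episode, eps0=0.1, prob_flap=0.1):
    eps = eps0 / (episode + 1)           # harmonic epsilon decay
    if random.random() < eps:
        return random_act(prob_flap)     # explore with the biased coin
    if q_values[0] == q_values[1]:
        return random_act(prob_flap)     # break ties with the same coin
    return 0 if q_values[0] > q_values[1] else 1
```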

Training Configuration

| Parameter | Tabular | Linear FA | DNN | CNN |
|---|---|---|---|---|
| Training episodes | 10,000 | 10,000 | 10,000 | 10,000 |
| Eval frequency | Every 250 | Every 250 | Every 250 | Every 250 |
| Learning rate | 0.8 | 0.1 | 0.1 | 0.1 |
| Discount (γ) | 1.0 | 1.0 | 1.0 | 1.0 |
| Optimizer | Direct update | Manual SGD | PyTorch SGD | PyTorch Adam |
| Replay buffer | — | — | — | 50,000 |
| Batch size | — | — | 1 (online) | 32 |
| Update timing | Once per episode | Once per episode | Once per episode | Every timestep |

Results

| Configuration | Peak Score | Consistent Performance |
|---|---|---|
| Human baseline | 20–50 pipes | — |
| Q-Learning (backward, η=0.8) | 2,069 pipes | >1,400 pipes after 10K episodes |
| SARSA (backward) | ~1,800 pipes | >1,200 pipes |
| CNN-DQN (with replay) | ~1,000 pipes | Improving with more training |

Key findings: (1) Backward trajectory processing provides the single largest performance boost by efficiently propagating the −1000 death signal. (2) Off-policy Q-Learning outperforms on-policy SARSA — learning the optimal policy regardless of exploration noise. (3) State discretization critically impacts tabular methods — the rounding parameter controls a bias-variance trade-off. (4) CNN agents learn directly from pixels but require higher sample complexity, demonstrating the cost of learning visual features from scratch vs. hand-crafted state representations.

Tech Stack

Python (3.7+), PyTorch (neural networks + autograd), OpenAI Gym (environment interface), PyGame Learning Environment (Flappy Bird simulator), torchvision (frame preprocessing), OpenCV (binary thresholding), NumPy