./ascension-ai/
ACTIVE · Nov 2025 – Present
Distributed reinforcement learning system for Slay the Spire with behavior cloning, PPO fine-tuning, learned deck-building, and a self-healing headless cloud deployment.
- [domain] single-player deck-building roguelike (Slay the Spire)
- [agent] PPO + Behavior Cloning warm-start + adaptive auto-tuning
- [framework] PyTorch · Gymnasium · NumPy
- [integration] CommunicationMod (live game stdin/stdout JSON)
- [observation] 717-d structured vector · 132-d per-card deck count vector · 19 monster power slots · 66-monster knowledge base
- [action space] 134 discrete actions (legal-action masked via -∞ logits)
- [deck-building] learned (Path 2): RL-controlled card removal & upgrade; potential-based deck-quality reward; new deck inputs still integrating (first-layer weight ratio ~0.59 relative to the original inputs, and climbing)
- [network] 717→512→256→256→{134 logits + 1 value} · GELU · ~599K params (weights + biases of the listed layers) · CPU-only
- [warm transfer] 530→585 and 585→717 obs expansions via zero-initialized new inputs; earlier 256×256 Tanh → 512×256×256 GELU migration
- [training scale] 24,000+ rollout games · 3,200+ PPO updates · 86,297 BC transitions
- [auto-tuning] BC coef (0.001–0.009) · entropy coef · learning rate, all adapted to policy behavior
- [reward shaping] boss-specific shaping (Guardian, Hexaghost, Slime Boss, Automaton, Champ, Donu & Deca) + upgrade reward + HP-urgency heal reward + elite bonus +4.0
- [baseline] heuristic avg floor 15.78 · 39% boss conversion · 26% Act 2 rate · BC val acc 84.95%
- [200-game eval] avg floor 14.7 · 38.1% boss WR · 69.6% elite WR · 20% Act 2 reach · best floor 46
- [deployment] local Windows GUI, or headless GCP c3-standard-22 spot VM (22 vCPU, no GPU): 8 workers under per-worker Xvfb + software GL, ~90+ games/hr · self-healing (cron auto-resume + Cloud Scheduler preemption restart)
§P1 problem
Slay the Spire is hard for RL agents for three reasons: the observation space is unstructured (cards, relics, intents — all categorical), the action space is large and conditionally legal, and reward is sparse (you only really learn if you survive an act).
Naive observations force the agent to rediscover, over tens of thousands of games, that Gremlin Nob punishes skills or that Cultist scales strength every turn. That's slow, wasteful, and breaks when new content is added.
Live-game training is bottlenecked by simulation speed — even at max Fast Mode, one game costs 30–90 seconds. There is no headless simulator, so throughput scales only by running multiple concurrent game instances on a single machine (~4–8 workers on a 16-core system).
§P2 architecture
Parallel rollout workers feed a central offline trainer via checkpoint-tagged .pt files. Stale rollouts (workers running an older policy than the current checkpoint) are rejected at ingest. The GUI acts as a process supervisor with crash recovery and restart-every cycling.
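The stale-rollout check can be sketched as a filename filter. This is a minimal illustration, not the project's ingest code: it assumes the checkpoint tag in `rollout-{ckpt}.pt` is an integer version, and the function names `rollout_checkpoint` and `split_fresh_and_stale` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the tag in `rollout-{ckpt}.pt` is an integer checkpoint version.
ROLLOUT_RE = re.compile(r"^rollout-(\d+)\.pt$")


def rollout_checkpoint(path):
    """Return the checkpoint tag encoded in a rollout filename, or None."""
    match = ROLLOUT_RE.match(Path(path).name)
    return int(match.group(1)) if match else None


def split_fresh_and_stale(paths, current_ckpt):
    """Keep rollouts produced by the current policy; reject older or untagged files."""
    fresh, stale = [], []
    for path in paths:
        tag = rollout_checkpoint(path)
        if tag is not None and tag >= current_ckpt:
            fresh.append(path)
        else:
            stale.append(path)
    return fresh, stale
```

With `current_ckpt=42`, a file named `rollout-41.pt` would be discarded and `rollout-42.pt` would be passed on to the trainer.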
             ┌────────────────────────────────────────────────┐
             │  tkinter control panel                         │
             │  auto-detect HW · worker count · live logs     │
             │  per-instance JVM heap · restart-every cycle   │
             └────────────────────────┬───────────────────────┘
                                      │ spawn + supervise
      ┌───────────────────────────────┼───────────────────────────────┐
      ▼                               ▼                               ▼
┌────────────┐                  ┌────────────┐                  ┌────────────┐
│ worker[0]  │                  │ worker[1]  │       ...        │ worker[N]  │
│ STS+commod │                  │ STS+commod │                  │ STS+commod │
│ policy →   │                  │ policy →   │                  │ policy →   │
│ rollout    │                  │ rollout    │                  │ rollout    │
└─────┬──────┘                  └─────┬──────┘                  └─────┬──────┘
      │ rollout-{ckpt}.pt             │                               │
      └──────────────┬────────────────┴───────────────────────────────┘
                     ▼
           ┌────────────────────┐      ┌────────────────────────┐
           │  ingest + filter   │ ───▶ │ PPO trainer (offline)  │
           │  stale → discard   │      │ clipped surrogate      │
           └────────────────────┘      │ GAE advantages         │
                                       │ target-KL early stop   │
                                       │ BC anchor loss         │
                                       │ adaptive auto-tuning   │
                                       │ boss reward shaping    │
                                       └──────────┬─────────────┘
                                                  ▼
                                        ┌────────────────────┐
                                        │ atomic checkpoint  │
                                        │ → ppo_sts.pt       │
                                        └─────────┬──────────┘
                                                  ▼
                                          (workers reload)
§P3 observation encoder (717-d)
Hand-engineered, structured, dense. Every dimension has a known meaning. Each monster gets 19 power slots (all STS1-verified: strength, vulnerable, weakened, artifact, ritual, curl up, thorns, angry, sharp hide, mode shift, enrage, curiosity, intangible, invincible, time warp, beat of death, malleable, life link, regenerate). The full database of all 66 STS enemies is embedded directly into the observation space, so the agent knows enemy patterns from the first encounter. As of Path 2, a 132-d per-card count vector over the full Ironclad card pool is appended, so the policy sees its exact deck composition rather than just an aggregate, which is what makes card removal and upgrade learnable.
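A per-card count vector of this kind can be sketched in a few lines. The four-card `CARD_POOL` below is a hypothetical stand-in for the real 132-entry pool, whose ordering is not given here; silently ignoring unknown cards is likewise an assumption.

```python
from collections import Counter

# Hypothetical miniature pool; the encoder described above indexes 132 entries.
CARD_POOL = ["Strike", "Defend", "Bash", "Inflame"]
CARD_INDEX = {name: i for i, name in enumerate(CARD_POOL)}


def deck_count_vector(deck):
    """Return one slot per pool card holding how many copies the deck contains."""
    vec = [0.0] * len(CARD_POOL)
    for name, copies in Counter(deck).items():
        if name in CARD_INDEX:  # assumption: cards outside the pool are skipped
            vec[CARD_INDEX[name]] = float(copies)
    return vec
```

A deck of five Strikes, four Defends, and one Bash maps to `[5.0, 4.0, 1.0, 0.0]`, so removing one Strike changes exactly one input dimension.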
§P4 action space (134 masked)
A flat 134-dim action head, with a legal-action mask computed at every step from the live game state. Illegal actions get -∞ logits before softmax — so the policy never wastes capacity on impossible moves.
00–09    end turn / open map / cancel / confirm
10–69    play card[i] on target[j] · 10×6 grid
70–79    potion[k] on target[j] · 5×2 grid
80–99    card-select choices (rewards, upgrades, shop)
100–119  map node selection (next-floor pick)
120–133  event choices · 14 enumerated slots
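The layout and the masking step can be sketched together. This is an illustration under assumptions, not the project's code: the row-major ordering inside the card and potion grids (index = slot × targets + target) is assumed, and `decode_action` and `masked_softmax` are hypothetical names. Plain Python stands in for the PyTorch tensors the policy would use.

```python
import math

NUM_ACTIONS = 134


def decode_action(a):
    """Map a flat action index to (kind, slot, target) using the ranges above."""
    if not 0 <= a < NUM_ACTIONS:
        raise ValueError(f"action index out of range: {a}")
    if a < 10:
        return ("control", a, None)
    if a < 70:
        card, target = divmod(a - 10, 6)    # assumed row-major 10x6 grid
        return ("play_card", card, target)
    if a < 80:
        potion, target = divmod(a - 70, 2)  # assumed row-major 5x2 grid
        return ("potion", potion, target)
    if a < 100:
        return ("card_select", a - 80, None)
    if a < 120:
        return ("map_node", a - 100, None)
    return ("event", a - 120, None)


def masked_softmax(logits, legal):
    """Set illegal logits to -inf, then apply a numerically stable softmax."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("no legal action available")
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

Because `exp(-inf)` is exactly zero, illegal actions receive zero probability and can never be sampled, while the remaining mass is renormalized over the legal set.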
§P5 training
Behavior cloning warm-start from a hand-coded heuristic (150–200 demo games → 86,297 labeled transitions, 84.95% validation accuracy with label smoothing 0.02). BC is resumable — per-game checkpointing survives STS crashes mid-collection.
PPO fine-tuning on top of the BC warm-start, using a from-scratch PPO implementation: clipped surrogate objective, GAE advantage estimation (λ=0.95), target-KL early stopping (0.03), and a BC anchor loss that prevents catastrophic forgetting during fine-tuning. Three hyperparameters auto-tune during training: the BC coefficient moves within 0.001–0.009 based on policy improvement, the entropy coefficient adjusts based on normalized entropy to prevent premature collapse, and the learning rate drops when KL divergence exceeds the target bound.
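The two core PPO pieces named above can be sketched in plain Python. Only λ=0.95 comes from the text; the discount `gamma=0.99` and the clip range `clip=0.2` are assumed common defaults, and the function names are hypothetical.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout, computed backwards.

    gamma is an assumed value; lam=0.95 matches the text.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error, with bootstrapping cut at episode ends.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_surrogate(ratio, advantage, clip=0.2):
    """Per-sample PPO objective: min of the unclipped and clipped terms."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

In a training loop, the mean of `clipped_surrogate` over a batch would be maximized, and the epoch loop would stop early once the measured KL between old and new policies passes the 0.03 target.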
Boss-specific reward shaping for six bosses across Acts 1–3: Guardian (phase-aware offensive-mode bonus), Hexaghost (inferno cycling), Slime Boss (split threshold), Bronze Automaton (hyper beam charging), The Champ (phase transitions), and Donu & Deca (priority targeting). Dense per-step shaping also covers gold, relics, HP delta, floor progression, spawner-priority incentives, elite win bonus (+4.0), rest-site upgrade reward (+0.30), and HP-urgency-scaled heal reward (+0.025/hp scaled by missing HP%: healing at 25% HP gives ~0.45, beating the 0.30 upgrade reward, while healing at 75% HP gives ~0.15, so the upgrade wins).
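The rest-site trade-off can be checked with a short worked example. The formula below is reconstructed from the two quoted data points and is an assumption: reward = 0.025 × HP healed × missing-HP fraction, with an 80 max-HP character and a rest that heals a nominal 30% of max HP (24 HP, not capped at missing HP). Function names are hypothetical.

```python
HEAL_REWARD_PER_HP = 0.025  # from the text
UPGRADE_REWARD = 0.30       # from the text


def heal_reward(current_hp, max_hp, heal_amount):
    """Assumed form: per-HP reward scaled by the fraction of HP currently missing."""
    missing_fraction = 1.0 - current_hp / max_hp
    return HEAL_REWARD_PER_HP * heal_amount * missing_fraction


def preferred_rest_action(current_hp, max_hp=80, heal_fraction=0.30):
    """Return which rest-site option the shaping favors under these assumptions."""
    heal_amount = heal_fraction * max_hp  # 24 HP for an 80 max-HP character
    if heal_reward(current_hp, max_hp, heal_amount) > UPGRADE_REWARD:
        return "rest"
    return "smith"
```

Under these assumptions, 20/80 HP gives 0.025 × 24 × 0.75 = 0.45 (rest wins) and 60/80 HP gives 0.025 × 24 × 0.25 = 0.15 (smith wins), with the break-even at half health.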
Network upgraded from a 256×256 Tanh MLP (~236K params) to a (512, 256, 256) GELU MLP (~504K params, both counted at the then-current 530-d input) via warm transfer: compatible weights copied exactly, widened layers zero-padded, new layers identity-initialized to preserve learned behavior while adding capacity. Observation expanded 530→585 (8→19 monster power slots), then 585→717 (a 132-d per-card deck count vector, Path 2), each via warm transfer with zero-initialized new inputs so prior behavior is preserved.
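Why zero-initialized inputs preserve behavior can be shown on a toy first layer. This sketch covers only the observation-expansion case, not the layer widening or identity-initialized layers; `linear` and `widen_inputs` are hypothetical helpers, and nested lists stand in for the PyTorch weight tensors.

```python
def linear(weights, bias, x):
    """Dense layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def widen_inputs(weights, n_new):
    """Append n_new zero-initialized input columns to every row of W."""
    return [row + [0.0] * n_new for row in weights]


# Toy "old" layer with 3 inputs and 2 outputs.
W_old = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
b_old = [0.1, -0.2]
x_old = [1.0, 2.0, 3.0]

# Expanded layer sees 2 extra features, whatever their values.
W_new = widen_inputs(W_old, 2)
x_new = x_old + [7.0, -4.0]

y_old = linear(W_old, b_old, x_old)
y_new = linear(W_new, b_old, x_new)
```

Here `y_new` equals `y_old` because every new column multiplies its feature by zero, so the expanded network starts from the old policy and only diverges as gradients move the new weights away from zero.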
Parallel rollout architecture: 4–8 concurrent workers feed a central offline trainer via checkpoint-tagged .pt files. Stale-rollout rejection keeps importance ratios fresh. 19,400+ games collected across 2,410+ update batches. A Tkinter GUI control panel acts as process supervisor — auto-detects hardware, recommends worker counts, streams live logs, manages per-instance JVM heap limits, and supports restart-every cycling to prevent memory growth.
Headless cloud deployment: the same worker + trainer stack runs unattended on a GPU-less GCP c3-standard-22 spot VM (22 vCPU) via a one-shot, idempotent installer, sustaining ~90+ games/hour on 8 instances. Running a GUI-bound, mod-loaded desktop game headless required per-worker Xvfb virtual displays with software OpenGL (a shared display serializes GL ~100× slower), per-worker JVM tmpdirs (shared /tmp triggers LWJGL native-extraction SIGSEGV races), a Java 8 pin (mods silently fail to load on 17+), headless OpenAL wiring, signaling CommunicationMod's READY handshake before the slow torch import (10 s timeout), and a 2 GB heap + 25-game restart to fix a silent OOM that had capped throughput at ~55 games/hour. The run is fully self-healing: a per-worker watchdog relaunches a wedged JVM, a VM-side cron continuously auto-resumes training (with a 10-min heartbeat log) after any death or reboot, and a Cloud Scheduler job restarts the VM after spot preemption — so it trains constantly with no session and only stops when told to.
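The per-worker isolation described above can be sketched as a launch plan. Everything specific here is an assumption for illustration: the display numbering, the `/tmp/sts-workers` root, the screen geometry, and the use of Mesa's `LIBGL_ALWAYS_SOFTWARE` variable as the way software GL is selected. The game launch command itself is not shown because it is not given in the text.

```python
def worker_launch_plan(worker_id, base_display=100, tmp_root="/tmp/sts-workers", heap="2g"):
    """Build per-worker Xvfb, environment, and JVM settings (nothing is executed)."""
    display = f":{base_display + worker_id}"   # one virtual display per worker
    tmpdir = f"{tmp_root}/worker-{worker_id}"  # private JVM tmpdir per worker
    return {
        # Standard Xvfb invocation: display name plus a screen geometry.
        "xvfb_cmd": ["Xvfb", display, "-screen", "0", "1280x720x24"],
        # Assumed mechanism for software OpenGL under Mesa.
        "env": {"DISPLAY": display, "LIBGL_ALWAYS_SOFTWARE": "1"},
        # Heap cap and tmpdir are standard JVM options.
        "jvm_flags": [f"-Xmx{heap}", f"-Djava.io.tmpdir={tmpdir}"],
        "tmpdir": tmpdir,
    }


plans = [worker_launch_plan(i) for i in range(8)]
```

Giving each worker its own display and tmpdir is what removes the shared-display GL serialization and the shared-`/tmp` native-extraction race mentioned above.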
Path 2, learned deck-building: the policy now sees its exact deck (the 132-d per-card count vector inside the 717-d observation) and controls card removal (purge) and upgrade (smith) selection, which were previously heuristic. The flat per-removal/per-upgrade rewards were replaced with potential-based deck-quality shaping (reward = Δ of mean card quality, upgrades boosted), so cutting junk and upgrading impactful cards is rewarded and cutting a key card is penalized; the signal is context-dependent and symmetric. The trained model was warm-transferred 585→717, de-risked with a light BC anchor of fresh 717-d heuristic demos, and the lr was re-raised (--override-lr) so the zero-initialized deck inputs learn at a useful pace. Progress is tracked with a weight-ratio diagnostic, the mean magnitude of the 132 new input columns vs the original 585, which has climbed from ~0.14 to ~0.59, confirming the policy is learning to use the deck. Behavior change lags the weight growth, and tuning has stayed deliberately conservative (one BC-anchor change was tried, regressed the noisy sampled training-floor, and was reverted); a clean greedy fixed-seed eval is planned once the ratio reaches ~0.70 for a real read vs the 585-d baseline.
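Both the shaping signal and the diagnostic can be sketched briefly. The quality scores, the 1.25 upgrade boost, the trailing `+` convention for upgraded cards, and the undiscounted potential difference are all assumptions for illustration, as are the function names.

```python
def deck_potential(deck, quality, upgrade_boost=1.25):
    """Potential = mean card quality; upgraded cards ('+' suffix) are boosted."""
    if not deck:
        return 0.0
    total = 0.0
    for card in deck:
        base_quality = quality[card.rstrip("+")]
        total += base_quality * upgrade_boost if card.endswith("+") else base_quality
    return total / len(deck)


def shaping_reward(deck_before, deck_after, quality):
    """Potential-based shaping: change in deck potential caused by the action."""
    return deck_potential(deck_after, quality) - deck_potential(deck_before, quality)


def weight_ratio(first_layer, n_old):
    """Mean |w| over the new input columns divided by mean |w| over the original ones."""
    old = [abs(w) for row in first_layer for w in row[:n_old]]
    new = [abs(w) for row in first_layer for w in row[n_old:]]
    return (sum(new) / len(new)) / (sum(old) / len(old))


QUALITY = {"Strike": 0.2, "Bash": 0.8}  # hypothetical scores
deck = ["Strike", "Strike", "Bash"]
remove_junk = shaping_reward(deck, ["Strike", "Bash"], QUALITY)
remove_key = shaping_reward(deck, ["Strike", "Strike"], QUALITY)
upgrade_key = shaping_reward(deck, ["Strike", "Strike", "Bash+"], QUALITY)
```

In this toy deck, removing a Strike raises the mean and is rewarded, removing Bash lowers it and is penalized, and undoing either action returns exactly the opposite reward, which is the symmetry the paragraph refers to. For the diagnostic, `n_old` would be 585 on the real first layer.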
§P6 next
- train the 717-d learned-deck-building model to maturity, then run a fresh fixed-seed eval vs the 585-d baseline, targeting the Act 2 wall and a first win
- migrate the last heuristic strategic screen (shop purchases) into the RL policy
- checkpoint versioning with named snapshots and rollback support for PPO regression detection
- integrate a headless simulator for 10–100× throughput over the live-game bottleneck
- extend to additional characters (Silent → Defect → Watcher) and higher ascension levels
- explore transformer / attention-pooled architecture over variable-length sub-vectors (hand, monsters, deck)
§G gallery
[img.01] Slay the Spire — the game AscensionAI is trained to play
