ascension-ai — justin-chan(1)

§P1 problem

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Slay the Spire is hard for RL agents for three reasons: the observation space is unstructured (cards, relics, intents — all categorical), the action space is large and conditionally legal, and reward is sparse (you only really learn if you survive an act).

Naive observations force the agent to rediscover, over tens of thousands of games, that Gremlin Nob punishes skills or that Cultist scales strength every turn. That's slow, wasteful, and breaks when new content is added.

Live-game training is bottlenecked by simulation speed — even at max Fast Mode, one game costs 30–90 seconds. There is no headless simulator, so throughput scales only by running multiple concurrent game instances on a single machine (~4–8 workers on a 16-core system).

§P2 architecture

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Parallel rollout workers feed a central offline trainer via checkpoint-tagged .pt files. Stale rollouts (workers running an older policy than the current checkpoint) are rejected at ingest. The GUI acts as a process supervisor with crash recovery and restart-every cycling.
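The stale-rollout check described above can be sketched as a filename filter. This is a minimal sketch, assuming the checkpoint tag in `rollout-{ckpt}.pt` is a monotonically increasing integer; the function name `fresh_rollouts` and the regex are illustrative, not taken from the project.

```python
import re
from typing import Iterable, List

# Assumed filename convention: "rollout-<integer checkpoint id>.pt".
_TAG = re.compile(r"^rollout-(\d+)\.pt$")


def fresh_rollouts(filenames: Iterable[str], current_ckpt: int) -> List[str]:
    """Keep rollouts whose tag is not older than the current checkpoint."""
    kept = []
    for name in filenames:
        match = _TAG.match(name)
        if match is None:
            continue  # not a rollout file, ignore it
        if int(match.group(1)) < current_ckpt:
            continue  # produced by an older policy: stale, discard
        kept.append(name)
    return kept


if __name__ == "__main__":
    print(fresh_rollouts(["rollout-41.pt", "rollout-42.pt", "notes.txt"], 42))
```

Rejecting by tag keeps the PPO importance ratio close to 1 at the start of each update, since every accepted sample was drawn from the policy being updated.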

              ┌────────────────────────────────────────────────┐
              │              tkinter control panel              │
              │   auto-detect HW · worker count · live logs    │
              │   per-instance JVM heap · restart-every cycle  │
              └────────────────────────┬───────────────────────┘
                                       │ spawn + supervise
       ┌───────────────────────────────┼───────────────────────────────┐
       ▼                               ▼                               ▼
┌────────────┐                  ┌────────────┐                  ┌────────────┐
│ worker[0]  │                  │ worker[1]  │       ...        │ worker[N]  │
│ STS+commod │                  │ STS+commod │                  │ STS+commod │
│   policy → │                  │   policy → │                  │   policy → │
│   rollout  │                  │   rollout  │                  │   rollout  │
└─────┬──────┘                  └─────┬──────┘                  └─────┬──────┘
      │  rollout-{ckpt}.pt            │                              │
      └──────────────┬────────────────┴──────────────────────────────────┘
                     ▼
            ┌────────────────────┐         ┌────────────────────────┐
            │  ingest + filter   │ ───▶    │  PPO trainer (offline) │
            │  stale → discard   │         │  clipped surrogate     │
            └────────────────────┘         │  GAE advantages        │
                                           │  target-KL early stop  │
                                           │  BC anchor loss        │
                                           │  adaptive auto-tuning  │
                                           │  boss reward shaping   │
                                           └──────────┬─────────────┘
                                                      ▼
                                            ┌────────────────────┐
                                            │ atomic checkpoint  │
                                            │  → ppo_sts.pt      │
                                            └─────────┬──────────┘
                                                      ▼
                                                (workers reload)
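An atomic checkpoint write, as named in the diagram, is commonly done by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that pattern with plain bytes; `atomic_write` is a hypothetical helper, and the project may serialize and publish `ppo_sts.pt` differently.

```python
import os
import tempfile


def atomic_write(path: str, payload: bytes) -> None:
    """Write payload so readers see either the old file or the new one, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before publishing
        os.replace(tmp_path, path)     # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A worker that reloads the checkpoint while the trainer is saving then reads a complete file in either case.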

§P3 observation encoder (717-d)

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Hand-engineered, structured, dense: every dimension has a known meaning. Each monster gets 19 power slots (all STS1-verified: strength, vulnerable, weakened, artifact, ritual, curl up, thorns, angry, sharp hide, mode shift, enrage, curiosity, intangible, invincible, time warp, beat of death, malleable, life link, regenerate). The full database of all 66 STS enemies is embedded directly into the observation space, so the agent knows enemy patterns from the first encounter. As of Path 2, a 132-d per-card count vector over the full Ironclad card pool is appended, so the policy sees its exact deck composition rather than only an aggregate, which is what makes card removal and upgrade learnable.

dim      component                                                                      type
0–31     player stats: hp, max-hp, energy, gold, block, strength, dex, etc.             scalar
32–95    hand cards (up to 10 × 6-d card embedding)                                     embedding
96–278   monsters: identity, behavior flags, intents, 19 power slots, scaling rules     embedding+flags
279–342  draw pile + discard pile profile (counts by type, cost, family)                histogram
343–390  relic inventory (one-hot over 178 relics, bucketed by tier)                    one-hot
391–438  potion inventory + slot count + brewable hints                                 one-hot
439–502  screen context: what UI state am I in? combat / map / shop / event             one-hot
503–550  map path lookahead: next 3 floors of node types + risk                         graph
551–584  deck profile: synergies, energy curve, redundancy                              derived
585–716  per-card deck count vector: exact composition over the Ironclad pool (Path 2)  counts
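The layout above can be written down as named slices and checked for gaps. This is a sketch only: the block names (`player_stats`, `deck_counts`, and so on) and the helper `build_slices` are hypothetical labels chosen here, while the index ranges are copied from the table.

```python
# (name, first index, last index inclusive), copied from the table above.
OBS_LAYOUT = [
    ("player_stats", 0, 31),
    ("hand_cards", 32, 95),
    ("monsters", 96, 278),
    ("pile_profile", 279, 342),
    ("relics", 343, 390),
    ("potions", 391, 438),
    ("screen_context", 439, 502),
    ("map_lookahead", 503, 550),
    ("deck_profile", 551, 584),
    ("deck_counts", 585, 716),
]


def build_slices(layout):
    """Return ({name: slice}, total_dim), raising if the blocks are not contiguous."""
    slices = {}
    expected_start = 0
    for name, first, last in layout:
        if first != expected_start:
            raise ValueError(f"{name} starts at {first}, expected {expected_start}")
        slices[name] = slice(first, last + 1)
        expected_start = last + 1
    return slices, expected_start


OBS_SLICES, OBS_DIM = build_slices(OBS_LAYOUT)

if __name__ == "__main__":
    for name, s in OBS_SLICES.items():
        print(f"{name:15s} {s.stop - s.start:4d} dims")
    print("total", OBS_DIM)
```

The blocks are contiguous and sum to 717. The hand block spans 64 dims, slightly more than the 10 × 6 = 60 the table describes; what the remaining 4 hold is not stated in the source.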

§P4 action space (134 masked)

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

A flat 134-dim action head, with a legal-action mask computed at every step from the live game state. Illegal actions get -∞ logits before softmax — so the policy never wastes capacity on impossible moves.
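The masking step can be illustrated in a few lines of plain Python. This is a minimal sketch of the general technique, not the project's code; a real implementation would most likely apply the same idea to a tensor of logits.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], legal: Sequence[bool]) -> List[float]:
    """Softmax over logits where illegal entries are forced to -inf first."""
    if len(logits) != len(legal):
        raise ValueError("logits and mask must have the same length")
    if not any(legal):
        raise ValueError("at least one action must be legal")
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal)]
    peak = max(masked)                           # finite, since one action is legal
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(masked_softmax([1.0, 2.0, 3.0], [True, False, True]))
```

Illegal actions receive exactly zero probability, so they are never sampled and contribute no gradient through the log-probability of the chosen action.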

00–09    end turn / open map / cancel / confirm
10–69    play card[i] on target[j] · 10×6 grid
70–79    potion[k] on target[j] · 5×2 grid
80–99    card-select choices (rewards, upgrades, shop)
100–119  map node selection (next-floor pick)
120–133  event choices · 14 enumerated slots
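A flat index in this layout can be decoded back into a structured action. The sketch below assumes row-major ordering inside each grid (index = start + slot × targets + target), which the source does not state; the block names and `decode_action` are hypothetical.

```python
# (kind, start, stop exclusive, targets per slot); ranges follow the table above.
ACTION_BLOCKS = [
    ("system", 0, 10, 1),
    ("play_card", 10, 70, 6),    # 10 hand slots x 6 targets
    ("potion", 70, 80, 2),       # 5 potion slots x 2 targets
    ("card_select", 80, 100, 1),
    ("map_node", 100, 120, 1),
    ("event", 120, 134, 1),
]


def decode_action(index: int):
    """Map a flat action index to (kind, slot, target); row-major order is an assumption."""
    for kind, start, stop, targets in ACTION_BLOCKS:
        if start <= index < stop:
            slot, target = divmod(index - start, targets)
            return kind, slot, target
    raise ValueError(f"action index {index} is outside 0..133")


if __name__ == "__main__":
    for i in (0, 10, 69, 79, 133):
        print(i, decode_action(i))
```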

§P5 training

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Behavior cloning warm-start from a hand-coded heuristic (150–200 demo games → 86,297 labeled transitions, 84.95% validation accuracy with label smoothing 0.02). BC is resumable — per-game checkpointing survives STS crashes mid-collection.
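For reference, label smoothing with ε = 0.02 replaces the one-hot target with a mixture of the one-hot label and a uniform distribution. The sketch below uses the common convention q = (1 − ε)·onehot + ε/K and smooths over every logit it is given; whether the project smooths over all 134 actions or only the legal ones is not stated, so treat that as an assumption.

```python
import math
from typing import Sequence


def smoothed_cross_entropy(logits: Sequence[float], label: int, smoothing: float = 0.02) -> float:
    """Cross-entropy against the target (1 - smoothing) * onehot + smoothing / K."""
    k = len(logits)
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))  # stable log-sum-exp
    log_probs = [x - log_z for x in logits]
    uniform_mass = smoothing / k
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        target = uniform_mass + ((1.0 - smoothing) if i == label else 0.0)
        loss -= target * log_p
    return loss


if __name__ == "__main__":
    print(smoothed_cross_entropy([4.0, 0.0, 0.0], label=0))
    print(smoothed_cross_entropy([4.0, 0.0, 0.0], label=0, smoothing=0.0))
```

Smoothing penalizes over-confident predictions slightly, which tends to leave the cloned policy with some entropy for PPO to work with afterwards.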

PPO implemented from scratch: clipped surrogate objective, GAE advantage estimation (λ=0.95), target-KL early stopping (0.03), and a BC anchor loss that prevents catastrophic forgetting during fine-tuning. Three hyperparameters auto-tune during training: the BC coefficient oscillates between 0.001 and 0.009 based on policy improvement, the entropy coefficient adjusts based on normalized entropy to prevent premature collapse, and the learning rate is reduced when KL divergence exceeds its target bounds.
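The three core pieces named above (GAE, the clipped surrogate, and the target-KL stop) can be sketched in plain Python using the document's values for γ, λ, ε, and target KL. This is a textbook-style sketch for one trajectory segment, not the project's trainer; the function names are illustrative, and the KL estimate is the simple mean of old minus new log-probabilities.

```python
import math
from typing import List, Sequence


def gae(rewards: Sequence[float], values: Sequence[float], last_value: float,
        gamma: float = 0.995, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimates; pass last_value=0.0 if the segment ended the game."""
    advantages = [0.0] * len(rewards)
    running, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(new_logp: Sequence[float], old_logp: Sequence[float],
                      advantages: Sequence[float], clip_eps: float = 0.2) -> float:
    """PPO policy loss (to be minimized): negative mean of the clipped objective."""
    total = 0.0
    for n, o, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(n - o)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


def should_stop(new_logp: Sequence[float], old_logp: Sequence[float],
                target_kl: float = 0.03) -> bool:
    """Stop the remaining epochs once the approximate KL exceeds the target."""
    approx_kl = sum(o - n for n, o in zip(new_logp, old_logp)) / len(new_logp)
    return approx_kl > target_kl
```

Some implementations stop at a multiple of the target (for example 1.5×) rather than at the target itself; which rule this project uses is not specified.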

Boss-specific reward shaping for all Act 1–3 bosses: Guardian (phase-aware offensive-mode bonus), Hexaghost (inferno cycling), Slime Boss (split threshold), Bronze Automaton (hyper beam charging), The Champ (phase transitions), and Donu & Deca (priority targeting). Dense per-step shaping also covers gold, relics, HP delta, floor progression, spawner-priority incentives, elite win bonus (+4.0), rest-site upgrade reward (+0.30), and HP-urgency-scaled heal reward (+0.025/hp scaled by missing HP% — healing at 25% HP gives ~0.45, beating the 0.30 upgrade reward; healing at 75% HP gives ~0.15 so upgrade wins).
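The heal figures quoted above are consistent with a simple product of heal amount, per-HP rate, and missing-HP fraction. The sketch below reproduces them under two assumptions not stated in the text: an 80 max-HP character and a rest that nominally heals 30% of max HP (24 HP). The function name `heal_reward` is hypothetical.

```python
def heal_reward(heal_hp: float, current_hp: float, max_hp: float, per_hp: float = 0.025) -> float:
    """Heal reward scaled by how much HP is missing before the heal."""
    missing_fraction = (max_hp - current_hp) / max_hp
    return per_hp * heal_hp * missing_fraction


if __name__ == "__main__":
    MAX_HP = 80               # assumption: starting Ironclad max HP
    REST_HEAL = 0.30 * MAX_HP  # assumption: nominal rest heal of 24 HP
    UPGRADE_REWARD = 0.30      # from the text above
    for hp in (20, 60):        # 25% and 75% of max HP
        r = heal_reward(REST_HEAL, hp, MAX_HP)
        choice = "rest" if r > UPGRADE_REWARD else "upgrade"
        print(f"hp={hp}/{MAX_HP}: heal reward {r:.3f} -> {choice}")
```

This uses the nominal heal amount. At 75% HP only 20 HP is actually missing, so an implementation that caps the heal at missing HP would give 0.125 rather than 0.15; the source does not say which applies.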

Network upgraded from 256×256 Tanh MLP (~236K params) to (512, 256, 256) GELU MLP (~504K params) via warm transfer — compatible weights copied exactly, widened layers zero-padded, new layers identity-initialized to preserve learned behavior while adding capacity. Observation expanded 530→585 (8→19 monster power slots), then 585→717 (a 132-d per-card deck count vector, Path 2), each via warm transfer with zero-initialized new inputs so prior behavior is preserved.
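The observation-widening half of the warm transfer can be shown on a single linear layer: new input columns start at zero, so the layer's output is unchanged no matter what the new features contain. This is a small pure-Python sketch with made-up sizes (5 inputs widened by 2); `widen_inputs` and `linear` are illustrative helpers, not the project's code.

```python
import random
from typing import List


def widen_inputs(weight: List[List[float]], extra_inputs: int) -> List[List[float]]:
    """Append zero-initialized columns for new observation features (weight is out x in)."""
    return [row + [0.0] * extra_inputs for row in weight]


def linear(weight: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    """y = W x + b for a dense layer stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


if __name__ == "__main__":
    rng = random.Random(0)
    old_w = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
    bias = [rng.uniform(-1, 1) for _ in range(3)]
    obs = [rng.uniform(-1, 1) for _ in range(5)]

    new_w = widen_inputs(old_w, extra_inputs=2)
    widened_obs = obs + [7.0, -3.0]  # arbitrary values in the new features

    print(linear(old_w, bias, obs))
    print(linear(new_w, bias, widened_obs))  # same outputs as the line above
```

Because the new columns are exactly zero, the transferred policy starts out behaving like the old one, and the new inputs only gain influence as gradient updates move those weights away from zero.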

Parallel rollout architecture: 4–8 concurrent workers feed a central offline trainer via checkpoint-tagged .pt files. Stale-rollout rejection keeps importance ratios fresh. 19,400+ games collected across 2,410+ update batches. A Tkinter GUI control panel acts as process supervisor — auto-detects hardware, recommends worker counts, streams live logs, manages per-instance JVM heap limits, and supports restart-every cycling to prevent memory growth.

Headless cloud deployment: the same worker + trainer stack runs unattended on a GPU-less GCP c3-standard-22 spot VM (22 vCPU) via a one-shot, idempotent installer, sustaining ~90+ games/hour on 8 instances. Running a GUI-bound, mod-loaded desktop game headless required:

· per-worker Xvfb virtual displays with software OpenGL (a shared display serializes GL ~100× slower)
· per-worker JVM tmpdirs (a shared /tmp triggers LWJGL native-extraction SIGSEGV races)
· a Java 8 pin (mods silently fail to load on 17+)
· headless OpenAL wiring
· signaling CommunicationMod's READY handshake before the slow torch import (10 s timeout)
· a 2 GB heap + 25-game restart to fix a silent OOM that had capped throughput at ~55 games/hour

The run is fully self-healing: a per-worker watchdog relaunches a wedged JVM, a VM-side cron continuously auto-resumes training (with a 10-min heartbeat log) after any death or reboot, and a Cloud Scheduler job restarts the VM after spot preemption, so it trains constantly with no session and stops only when told to.
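The per-worker isolation described above (own display, own JVM tmpdir, fixed heap) can be sketched as a function that builds the commands and environment for worker `i` without launching anything. The display numbering, the base directory, the screen geometry, and `launcher.jar` are placeholders chosen for illustration; `java` is assumed to resolve to a Java 8 runtime. The Xvfb `-screen` option, the JVM `-Xmx` and `-Djava.io.tmpdir` options, and Mesa's `LIBGL_ALWAYS_SOFTWARE` variable are standard.

```python
import os
from typing import Dict, List, Tuple


def worker_launch_plan(index: int, base_tmp: str = "/tmp/sts-workers",
                       heap: str = "2g", display_base: int = 99
                       ) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Return (xvfb_cmd, java_cmd, env) giving each worker its own display and tmpdir."""
    display = f":{display_base + index}"             # one virtual display per worker
    tmpdir = os.path.join(base_tmp, f"worker-{index}")  # avoids shared-/tmp extraction races
    xvfb_cmd = ["Xvfb", display, "-screen", "0", "1280x720x24"]
    java_cmd = [
        "java",                        # assumed to be a Java 8 binary
        f"-Xmx{heap}",                 # per-instance heap cap
        f"-Djava.io.tmpdir={tmpdir}",  # private native-library extraction dir
        "-jar", "launcher.jar",        # placeholder for the real mod-loader jar
    ]
    env = {"DISPLAY": display, "LIBGL_ALWAYS_SOFTWARE": "1"}
    return xvfb_cmd, java_cmd, env


if __name__ == "__main__":
    for i in range(2):
        xvfb, java, env = worker_launch_plan(i)
        print(" ".join(xvfb), "|", " ".join(java), "|", env)
```

A supervisor would create the tmpdir, start Xvfb, then start the JVM with `env` merged into its environment, and restart the pair on the restart-every cycle.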

Path 2 (learned deck-building): the policy now sees its exact deck (a 132-d per-card count vector inside the 717-d observation) and controls card removal (purge) and upgrade (smith) selection, which were previously heuristic. The flat per-removal/per-upgrade rewards were replaced with potential-based deck-quality shaping (reward = Δ of mean card quality, upgrades boosted), so cutting junk and upgrading impactful cards is rewarded and cutting a key card is penalized; the signal is context-dependent and symmetric. The trained model was warm-transferred 585→717, de-risked with a light BC anchor of fresh 717-d heuristic demos, and the lr was re-raised (--override-lr) so the zero-initialized deck inputs learn at a useful pace. Progress is tracked with a weight-ratio diagnostic, the mean magnitude of the 132 new input columns vs the original 585, which has climbed from ~0.14 to ~0.59, suggesting the policy is learning to use the deck. Behavior change lags the weight growth, and tuning has stayed deliberately conservative (one BC-anchor change was tried, regressed the noisy sampled training-floor, and was reverted); a clean greedy fixed-seed eval is planned once the ratio reaches ~0.70 for a real read vs the 585-d baseline.
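Both ideas in this paragraph are small enough to sketch. The shaping reward below is the undiscounted potential difference Φ(after) − Φ(before) with Φ = mean card quality (the general potential-based form is γΦ(s′) − Φ(s)); the quality scores are invented for illustration. The diagnostic compares mean absolute first-layer weights on new versus original input columns. All names are hypothetical.

```python
from typing import Dict, List, Sequence


def mean_quality(deck: Sequence[str], quality: Dict[str, float]) -> float:
    """Potential function: average quality score of the cards in the deck."""
    return sum(quality[card] for card in deck) / len(deck)


def deck_shaping_reward(before: Sequence[str], after: Sequence[str],
                        quality: Dict[str, float]) -> float:
    """Reward for a deck edit as the change in potential, Phi(after) - Phi(before)."""
    return mean_quality(after, quality) - mean_quality(before, quality)


def input_weight_ratio(weight: List[List[float]], n_old: int) -> float:
    """Mean |w| over new input columns divided by mean |w| over the original columns."""
    old = [abs(w) for row in weight for w in row[:n_old]]
    new = [abs(w) for row in weight for w in row[n_old:]]
    return (sum(new) / len(new)) / (sum(old) / len(old))


if __name__ == "__main__":
    quality = {"Strike": 0.2, "Bash": 0.8}  # made-up scores, not from the project
    deck = ["Strike", "Strike", "Bash"]
    print(deck_shaping_reward(deck, ["Strike", "Bash"], quality))    # removed junk: positive
    print(deck_shaping_reward(deck, ["Strike", "Strike"], quality))  # removed key card: negative
    print(input_weight_ratio([[1.0, -1.0, 0.5], [2.0, 0.0, -0.5]], n_old=2))
```

Undoing an edit yields exactly the negated reward, which is the symmetry the paragraph refers to. A ratio near 0 means the new inputs are still close to their zero initialization, and a ratio near 1 means they carry weights of similar size to the original inputs.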

hyperparameters

γ (discount)          0.995
λ (GAE)               0.95
clip ε                0.2
lr (policy)           3e-4 (adaptive)
lr (value)            1e-3
target KL             0.03
entropy coef          adaptive
bc anchor             0.001–0.009 (auto-tuned)
batch size            4096
epochs/update         4
label smoothing (BC)  0.02
network               (512, 256, 256) GELU
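The parameter counts quoted for the two networks (~236K and ~504K) can be reproduced with a short calculation under assumptions the source does not state: a shared MLP trunk over the 530-d observation in use at the time of the upgrade, a 134-way policy head, and a scalar value head. The function name `mlp_params` is illustrative.

```python
from typing import Sequence


def mlp_params(obs_dim: int, hidden: Sequence[int], n_actions: int = 134) -> int:
    """Count weights + biases for a shared trunk with a policy head and a value head."""
    total = 0
    width = obs_dim
    for h in hidden:
        total += width * h + h              # dense layer: weights + biases
        width = h
    total += width * n_actions + n_actions  # policy head (assumed 134 logits)
    total += width + 1                      # value head (assumed scalar output)
    return total


if __name__ == "__main__":
    print(mlp_params(530, (256, 256)))       # old 256x256 network
    print(mlp_params(530, (512, 256, 256)))  # upgraded (512, 256, 256) network
```

Under these assumptions the totals are 236,423 and 503,687, which round to the figures in the text; a different head layout would shift them slightly.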

§P6 next

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

· train the 717-d learned-deck-building model to maturity, then run a fresh fixed-seed eval vs the 585-d baseline — targeting the Act 2 wall and a first win
· migrate the last heuristic strategic screen (shop purchases) into the RL policy
· checkpoint versioning with named snapshots and rollback support for PPO regression detection
· integrate a headless simulator for 10–100× throughput over the live-game bottleneck
· extend to additional characters (Silent → Defect → Watcher) and higher ascension levels
· explore transformer / attention-pooled architecture over variable-length sub-vectors (hand, monsters, deck)
