CHIP-8 in ONNX

A complete CHIP-8 emulator implemented as a pure ONNX computation graph. No custom operators, no execution-provider extensions, no Python in the hot loop: the entire CPU lives inside the model. The standard ONNX Runtime 1.26 CPU EP runs it unmodified.

This is not a machine-learning model. There are no weights, no training, no inference in the statistical sense. It is a CPU expressed as a computation graph, because it turns out ONNX has all the primitives a CPU needs: bitwise ops, indexed memory access, conditional dispatch, and a Loop operator that, combined with the rest of the op set, makes ONNX Turing-complete.

Snake title screen rendered by the model

The image above is the output of Run() on chip8_snake_demo.onnx: a uint8[90, 32, 64] tensor returned in one call, with no inputs.

Models

| File | Inputs | Outputs | Notes |
| --- | --- | --- | --- |
| chip8_cpu.onnx | RAM + register state + display + key state + trip count | Updated RAM + register state + display | Load any CHIP-8 ROM into RAM, call once per game tick |
| chip8_snake_demo.onnx | (none, fully baked) | uint8[90, 32, 64] frame stack | Single Run() returns a 90-frame movie of the Snake title screen |

The two models share the same inner CPU. The demo wraps that CPU in an outer Loop whose body executes 30 instructions per frame and whose scan output is the framebuffer. That is how ONNX naturally accumulates "one frame per outer iteration" into a single tensor.

How it works

State

CHIP-8 has 4 KB of RAM, sixteen 8-bit registers, a 12-bit program counter, a 12-bit index register, a tiny stack, two 8-bit timers, and a 64×32 monochrome display. All of it lives in three tensors that flow through the Loop as carried dependencies:

| Tensor | Shape | Dtype | Holds |
| --- | --- | --- | --- |
| regs | [40] | int32 | V0..VF, I, PC, SP, DT, ST, RNG seed, tick counter, stack[16] |
| ram | [4096] | uint8 | Program code, font, sprite data, working memory |
| display | [2048] | uint8 | 64×32 framebuffer, one byte per pixel |
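The usage example sets regs[17] as PC and regs[21] as the RNG seed, which matches reading the "Holds" column left to right. A minimal sketch of that layout, assuming the remaining slots follow the same order; the constant names and every index other than 17 and 21 are assumptions for illustration, not taken from the model:

```python
# Hypothetical index map for the 40-entry `regs` tensor.
# Only PC = 17 and RNG_SEED = 21 are confirmed by the usage example;
# the rest assumes the "Holds" column is listed in storage order.
V0 = 0            # V0..VF occupy indices 0..15
I = 16
PC = 17
SP = 18
DT = 19
ST = 20
RNG_SEED = 21
TICK = 22
STACK_BASE = 23   # assumed: 16 stack slots at 23..38, index 39 unused
REGS_LEN = 40


def make_initial_regs(seed: int = 0xAB) -> list[int]:
    """Return a zeroed register file with PC at the CHIP-8 program start."""
    regs = [0] * REGS_LEN
    regs[PC] = 0x200
    regs[RNG_SEED] = seed
    return regs
```

Passing the result through np.array(..., dtype=np.int32) would give a tensor of the shape and dtype listed in the table.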

The Loop body: one CHIP-8 instruction per iteration

Each iteration of the inner Loop fetches, decodes, and executes one CHIP-8 instruction.
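As a rough illustration of the fetch and decode steps, here is a plain-Python sketch. The opcode layout is standard CHIP-8 (two bytes, big-endian); the function names are made up for this example, and the actual graph presumably expresses the same arithmetic with Gather, BitShift, and BitwiseAnd nodes:

```python
def fetch(ram, pc: int) -> int:
    """Read the big-endian 16-bit opcode stored at ram[pc], ram[pc + 1]."""
    # int() guards against uint8 overflow when `ram` is a NumPy array.
    return (int(ram[pc]) << 8) | int(ram[pc + 1])


def decode(opcode: int) -> dict:
    """Split an opcode into the standard CHIP-8 fields."""
    return {
        "op":  (opcode >> 12) & 0xF,   # top nibble selects the instruction family
        "x":   (opcode >> 8) & 0xF,    # first register index
        "y":   (opcode >> 4) & 0xF,    # second register index
        "n":   opcode & 0xF,           # 4-bit immediate
        "kk":  opcode & 0xFF,          # 8-bit immediate
        "nnn": opcode & 0xFFF,         # 12-bit address
    }
```

All fields are extracted on every instruction whether or not the opcode uses them, which fits the branchless style of the rest of the graph.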

The dispatch is branchless: every opcode subgraph runs on every iteration, and a chain of Where ops at the end selects the result whose opcode pattern matches. This trades wasted work for a flat, regular graph that is much easier to read than a 35-deep nested If ladder. In practice it does not cost more, because the per-node overhead of ONNX Runtime's Loop is the dominant cost anyway.
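The select-chain idea can be sketched in plain Python for a three-opcode subset (6xkk, 7xkk, 1nnn). This is an illustration of the technique only, not the model's actual node layout; `where` and `step` are hypothetical helpers standing in for ONNX Where and one Loop iteration:

```python
def where(cond: bool, a, b):
    """Scalar stand-in for ONNX Where: both a and b are already computed."""
    return a if cond else b


def step(v: list[int], pc: int, opcode: int) -> tuple[list[int], int]:
    """Execute one instruction from a three-opcode subset via select chains."""
    op = (opcode >> 12) & 0xF
    x = (opcode >> 8) & 0xF
    kk = opcode & 0xFF
    nnn = opcode & 0xFFF

    # Every candidate result is computed unconditionally.
    vx_ld = kk                     # 6xkk: Vx = kk
    vx_add = (v[x] + kk) & 0xFF    # 7xkk: Vx = Vx + kk, wrapping at 8 bits
    pc_jp = nnn                    # 1nnn: jump to nnn
    pc_next = pc + 2               # default: advance to the next instruction

    # A chain of selects keeps the result whose pattern matched.
    new_vx = where(op == 0x6, vx_ld, where(op == 0x7, vx_add, v[x]))
    new_pc = where(op == 0x1, pc_jp, pc_next)

    new_v = list(v)
    new_v[x] = new_vx
    return new_v, new_pc
```

When no pattern matches, the chain falls through to the unchanged value, so unrelated opcodes leave the register untouched without any explicit branch.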

The outer structure: wrapping the CPU into a movie

Loop in ONNX has two output kinds:

  • Carried outputs: values threaded between iterations (here: regs, ram, display).
  • Scan outputs: values emitted per iteration and concatenated along a new leading axis (here: the framebuffer).

The movie model exploits scan outputs: one outer iteration = one frame emitted = one row of the final frames tensor. There is no Python loop anywhere in this pipeline; the entire 90-frame animation is produced inside a single sess.run() call.
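The two output kinds can be mimicked in a few lines of Python. This is a sketch of the Loop semantics only; `run_loop` and `toy_body` are hypothetical names, and the toy body uses a counter in place of the real CPU state:

```python
from typing import Callable, TypeVar

State = TypeVar("State")
Frame = TypeVar("Frame")


def run_loop(trip_count: int,
             state: State,
             body: Callable[[int, State], tuple[State, Frame]],
             ) -> tuple[State, list[Frame]]:
    """Mimic ONNX Loop: thread `state` through `body`, collect one scan value per trip."""
    scan = []
    for i in range(trip_count):
        state, emitted = body(i, state)   # carried value in, carried value out
        scan.append(emitted)              # scan output gains one leading-axis entry
    return state, scan


# Toy body: the "CPU state" is a counter, and each frame is a snapshot of it.
def toy_body(i: int, counter: int) -> tuple[int, list[int]]:
    counter += 30                 # stand-in for 30 inner CPU instructions
    return counter, [counter]     # (carried, scan)


final_state, frames = run_loop(90, 0, toy_body)
```

With a body that returns a 32×64 framebuffer as its scan value, 90 trips would stack into the [90, 32, 64] shape the demo returns.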

What's in the model file

A Loop operator wrapping a single GraphProto body. The body has ~600 nodes, mostly Gather, ScatterND, BitShift, BitwiseAnd, Equal, and Where. No node is a custom op. The whole chip8_cpu.onnx file is ~40 KB.
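A dataflow graph has no in-place store, so memory reads and writes are most naturally expressed as Gather (read by index) and ScatterND (produce an updated copy). The sketch below imitates both on a 1-D list; it simplifies ScatterND's index format to a flat list of positions, and how the model actually wires these nodes is an assumption:

```python
def gather(data: list[int], indices: list[int]) -> list[int]:
    """Read: like ONNX Gather on a 1-D tensor."""
    return [data[i] for i in indices]


def scatter_nd(data: list[int], indices: list[int], updates: list[int]) -> list[int]:
    """Write: like ONNX ScatterND on a 1-D tensor, returning a new tensor."""
    out = list(data)              # the input tensor is never modified
    for i, value in zip(indices, updates):
        out[i] = value
    return out


ram = [0] * 4096
ram2 = scatter_nd(ram, [0x300, 0x301], [0x12, 0x34])   # store two bytes
word = gather(ram2, [0x300, 0x301])                    # load them back
```

Because every write yields a new tensor, the updated RAM has to be passed on as a carried Loop value rather than mutated in place.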

Model file structure graph

Usage

Run the bundled demo

import onnxruntime as ort
import numpy as np
from PIL import Image

sess = ort.InferenceSession("chip8_snake_demo.onnx",
                            providers=["CPUExecutionProvider"])
frames, = sess.run(None, {})  # no inputs!

print(frames.shape, frames.dtype)
# (90, 32, 64) uint8

# Save the final frame
final = (frames[-1] > 0).astype(np.uint8) * 255
# Nearest-neighbor resampling keeps the upscaled pixels crisp
Image.fromarray(final).resize((512, 256), Image.NEAREST).save("snake_frame.png")

That's the entire usage. No tokenizer, no preprocessing, no postprocessing: Run() returns pixels.

Load any CHIP-8 ROM into the generic CPU

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("chip8_cpu.onnx",
                            providers=["CPUExecutionProvider"])

# Initial state
def initial_ram(rom: bytes) -> np.ndarray:
    FONT = bytes.fromhex("F0909090F02060202070F010F080F0F010F010F0"
                         "9090F01010F080F010F0F080F090F0F010204040"
                         "F090F090F0F090F010F0F090F09090E090E090E0"
                         "F0808080F0E0909090E0F080F080F0F080F08080")
    ram = np.zeros(4096, dtype=np.uint8)
    ram[0x50:0x50+80] = np.frombuffer(FONT, dtype=np.uint8)
    ram[0x200:0x200+len(rom)] = np.frombuffer(rom, dtype=np.uint8)
    return ram

regs = np.zeros(40, dtype=np.int32)
regs[17] = 0x200   # PC
regs[21] = 0xAB    # RNG seed
ram = initial_ram(open("snake.ch8", "rb").read())
display = np.zeros(2048, dtype=np.uint8)
keys = np.zeros(16, dtype=np.uint8)

# Run 30 CHIP-8 instructions per tick
for tick in range(60):
    regs, ram, display = sess.run(None, {
        "regs_in": regs,
        "ram_in": ram,
        "display_in": display,
        "keys": keys,
        "trip_count": np.array(30, dtype=np.int64),
    })

# `display` is now a uint8[2048] framebuffer; reshape to (32, 64) to view.
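For a quick look at the result without an image library, the flat framebuffer can be printed as text. A minimal sketch, assuming the buffer is row-major with 64 pixels per row (which the reshape to (32, 64) implies); `render_ascii` is a hypothetical helper that accepts either a list or a NumPy array:

```python
def render_ascii(display, width: int = 64, height: int = 32) -> str:
    """Render a flat row-major framebuffer as text, '#' for lit pixels."""
    rows = []
    for y in range(height):
        row = display[y * width:(y + 1) * width]
        rows.append("".join("#" if px else "." for px in row))
    return "\n".join(rows)
```

Calling print(render_ascii(display)) after the tick loop shows the current screen in the terminal.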

A bundled ROM (snake.ch8, public domain) is included so you can try this straight away.

Why this exists

It's a question about what ONNX is. The ONNX operator set, once it grew Loop, If, the Bitwise* family (opset 18) and ScatterND with reduction modes, became Turing-complete in any reasonable sense of the phrase. This model demonstrates the consequence: ONNX Runtime, designed for evaluating neural networks, can also evaluate arbitrary computations, including a working game console, without modification.

Concretely the project exists to:

  • Probe how far the standard ONNX op set actually goes as a general computation target.
  • Demonstrate that Loop plus scan outputs give you a clean way to express "run a program for N steps, return one tensor per step" in a single Run() call.
  • Provide a tiny, complete, self-contained reference for anyone who wants to do non-ML things with ONNX.

If you want to play CHIP-8 games, there are a hundred better emulators. If you want to see what happens when you treat ONNX as a programming language, you're in the right place.

Performance

Measured on a Windows ARM64 laptop with ONNX Runtime 1.26 CPU EP, opset 21:

| Metric | Value |
| --- | --- |
| CHIP-8 instructions per second | ~2,500 |
| Full snake-title demo (90 frames × 30 ipf = 2,700 instructions) | ~1.1 s |
| Inner Loop body | ~600 ONNX nodes |
| Generic CPU model file size | ~40 KB |
| Snake demo model file size | ~48 KB (includes ROM) |

This is plenty fast for CHIP-8: most CHIP-8 games target a 500–1000 Hz CPU, and the model handily exceeds that. ONNX-as-a-CPU is not, however, going to be competitive with a purpose-built emulator; per-node overhead in Loop bodies dominates everything.
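The throughput figure follows from the demo timing. A short check of the arithmetic, using only the numbers reported above (the 1000 Hz comparison point is the upper end of the quoted range):

```python
instructions = 90 * 30          # frames x instructions per frame
elapsed_s = 1.1                 # measured demo run time

ips = instructions / elapsed_s  # effective instructions per second
fps = 90 / elapsed_s            # frames produced per second of wall time
headroom = ips / 1000           # ratio against a 1000 Hz CHIP-8 target

print(f"{ips:.0f} instr/s, {fps:.0f} frames/s, {headroom:.1f}x a 1000 Hz target")
```

The result lands close to the ~2,500 instructions per second reported above.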

What's inside the box

.
├── chip8_cpu.onnx         # Generic CHIP-8 CPU (40 KB)
├── chip8_snake_demo.onnx  # Self-contained Snake-title movie (48 KB)
├── snake.ch8              # Public-domain Snake ROM (1.4 KB)
├── example_output.gif     # What you get when you Run() the demo
└── README.md              # This file

License

Credits

  • CHIP-8 was created by Joseph Weisbecker in 1977 for the COSMAC VIP.
  • snake.ch8 is by John Earnest, CC0.
  • Built with the standard ONNX op set (opset 21) and tested with ONNX Runtime 1.26.