vincentoh/ctm-experiments

vincentoh committed 22 days ago

Commit

63d759d

verified ·

1 Parent(s): 0480e79

Upload README.md with huggingface_hub

Files changed (1)

README.md +212 -159

README.md CHANGED

@@ -1,197 +1,250 @@
-# CTM Experiments
-Personal experiments with [Continuous Thought Machines](https://github.com/SakanaAI/continuous-thought-machines) (SakanaAI).
-**Interactive Demo**: https://pub.sakana.ai/ctm/
-## Core Insight: Thinking Takes Time
-CTM's key innovation: **accuracy improves with more internal iterations**. The model "thinks longer" to reach better answers.
-This enables CTM to learn algorithmic reasoning that feedforward networks struggle with:
-| Task | Challenge | What CTM Learns |
-|------|-----------|-----------------|
-| **Parity** | Count bits across sequence | Iterative accumulation |
-| **Brackets** | Track nested structure | Stack-like memory (LIFO) |
-| **Object Tracking** | Extrapolate motion | Physics simulation |
-| **Mazes** | Navigate 2D paths | Sequential decision making |
-| **Jigsaw** | Classify shuffled patches | Part-whole integration |
-## Results Summary
-| Experiment | Accuracy | Notes |
-|------------|----------|-------|
-| **MNIST** | **97.9%** | Digit classification, 5 min training |
-| **Parity-16** | **99.0%** | 16-bit cumulative parity |
-| **QAMNIST** | **100%** | Multi-step arithmetic (3-5 digits, 3-5 ops) |
-| **Brackets** | **94.7%** | Stack-like reasoning for `(()[])` vs `([)]` |
-| **Object Tracking** | **100%** | Quadrant prediction from motion (4 classes) |
-| **Velocity Prediction** | **100%** | Direction prediction (9 classes) |
-| **Position Prediction** | **93.8%** | Exact position (256 classes, 16x16 grid) |
-| **Transfer Learning** | **94.5%** | Parity→Brackets (core frozen) |
-| **Maze Solving** | **Visualized** | Pretrained model inference on 15x15 mazes |
-| **Jigsaw MNIST** | **92%** | Classify digits from shuffled patches (no positional encoding) |
-## Key Findings
-### 1. Architecture Matters More Than Scale
-Early experiments showed 50% accuracy on parity (random guessing). The fix wasn't more parameters - it was using the **correct architecture**:
-| Parameter | Wrong | Correct (Official) |
-|-----------|-------|-------------------|
-| `n_synch_out` | 512 | **32** |
-| `n_synch_action` | 512 | **32** |
-| `synapse_depth` | 4 (U-NET) | **1** (linear) |
-The official parity implementation uses surprisingly small synchronization dimensions with a linear synapse - this is critical for learning.
-### 2. "Thinking Longer" = Higher Accuracy
-![MNIST Inference per Tick](continuous-thought-machines/experiments/results/mnist_inference.png)
-CTM accuracy improves with more internal iterations:
-- **Tick 0**: 7% (random)
-- **Tick 10-11**: 100% (peak)
-- **Final tick**: 98%
-Harder tasks need more "thinking time" - parity peaks at tick 35.
-### 3. Transfer Learning Works
-Pretrained parity model transfers to brackets:
-- **Baseline**: 52.5% (random)
-- **After transfer**: 94.5% (core frozen, only backbone/output trained)
-The iterative counting learned for parity transfers to stack tracking for brackets - matching from-scratch performance with only 37.7% of parameters trainable.
-### 4. Maze Solving "The Hard Way"
-CTM solves mazes by outputting action trajectories (Up/Down/Left/Right/Wait), not pixel masks:
-- **Step accuracy**: 60%+ after 2000 iterations
-- Uses auto-extending curriculum (loss only on trajectory up to first error)
-- Demonstrates sequential reasoning capability
-![Maze Attention Overlay](continuous-thought-machines/experiments/results/maze_attention.gif)
-*CTM "thinking" through a 15x15 maze: blue = predicted path, red = attention focus, green = start position. The attention heatmap shows where CTM looks at each internal tick (T=75 iterations).*
-## Detailed Results
-### MNIST Digit Classification (97.9%)
-![MNIST Training Accuracy](continuous-thought-machines/experiments/results/mnist-ctm_smoothed.png)
-CTM learns digit classification in ~5 minutes on RTX 4070 Ti.
-### Parity-16 Cumulative Parity (99.0%)
-![Parity Inference per Tick](continuous-thought-machines/experiments/results/parity_inference.png)
-16-bit parity with cumulative outputs - harder task shows clearer "thinking" benefit.
-### QAMNIST Multi-Step Arithmetic (100%)
-![QAMNIST Training Accuracy](continuous-thought-machines/experiments/results/qamnist-ctm-10_smoothed.png)
-100% accuracy on multi-step arithmetic (3-5 MNIST digits, 3-5 operations) after 300k iterations.
-### Maze Navigation (Pretrained Model)
-Using the authors' pretrained checkpoint (`ctm_mazeslarge_D=2048_T=75_M=25.pt`), we ran inference on the small-mazes dataset:
-- **Model**: D=2048 neurons, T=75 thinking steps, M=25 max trajectory length
-- **Dataset**: 1000 test mazes (15x15 grid)
-- **Output**: Action trajectories (Up/Down/Left/Right/Wait)
-The visualization shows CTM's attention patterns as it navigates:
-1. **Red heatmap**: Where CTM "looks" at each thinking step
-2. **Blue path**: Predicted solution trajectory
-3. **Green marker**: Start position
-Key insight: CTM learns sequential decision-making through iterative internal computation, not memorization.
-### Object Tracking - Position Prediction (93.8%)
-![Position Tracking Training](continuous-thought-machines/experiments/results/tracking_position.png)
-The hardest tracking task: predict exact cell (256 classes) from 5 frames of motion. CTM reaches 93.8% test accuracy, demonstrating temporal reasoning across video frames.
-## Experiment Tracking
-- **Configs**: [`experiments/experiments.json`](continuous-thought-machines/experiments/experiments.json)
-- **Training Scripts**: [`experiments/training/`](continuous-thought-machines/experiments/training/)
-- **Inference Scripts**: [`experiments/inference/`](continuous-thought-machines/experiments/inference/)
-- **Results**: [`experiments/results/`](continuous-thought-machines/experiments/results/)
-## Custom Experiments
-### Bracket Matching
-Classify bracket strings as valid or invalid: `(()[])` vs `([)]`
-Requires tracking nested depth and bracket types - implementing a stack through iterative thinking.
-### Object Tracking
-Predict properties of a moving dot from 5 video frames (16x16 grid).
-```
-Frame 0    Frame 1    Frame 2    Frame 3    Frame 4
-. . . .    . . . .    . . . .    . . . .    . . . .
-. * . .    . . * .    . . . *    . . . .    . . . .
-. . . .    . . . .    . . . .    . . . *    . . . .
-. . . .    . . . .    . . . .    . . . .    . . . *
-```
-Three prediction tasks tested:
-| Task | Classes | Accuracy | Notes |
-|------|---------|----------|-------|
-| **Quadrant** | 4 | 100% | TL/TR/BL/BR - easiest |
-| **Velocity** | 9 | 100% | 8 directions + stationary |
-| **Position** | 256 | 93.8% | Exact cell (16x16) - hardest |
-All tasks converged, demonstrating CTM's ability to learn temporal/spatial reasoning.
-### Transfer Learning
-Freeze core CTM dynamics from parity-16, train only backbone/output for brackets.
-### Maze Inference
-Run pretrained maze model on small-mazes dataset to visualize CTM's "thinking" process:
-```bash
-python -m tasks.mazes.analysis.run \
-  --actions viz \
-  --checkpoint checkpoints/mazes/ctm_mazeslarge_D=2048_T=75_M=25.pt \
-  --dataset_for_viz small-mazes
-```
-Outputs attention overlay GIFs to `tasks/mazes/analysis/outputs/viz/`.
-### Jigsaw MNIST
-Classify MNIST digits from **randomly shuffled patches** without positional encoding.
-```
-Original:        Shuffled (input):
-┌───┬───┬───┬───┐    ┌───┬───┬───┬───┐
-│ 1 │ 2 │ 3 │ 4 │    │12 │ 7 │ 2 │15 │
-├───┼───┼───┼───┤    ├───┼───┼───┼───┤
-│ 5 │ 6 │ 7 │ 8 │ => │ 4 │11 │ 9 │ 1 │
-├───┼───┼───┼───┤    ├───┼───┼───┼───┤
-│ 9 │10 │11 │12 │    │ 6 │ 3 │14 │ 5 │
-├───┼───┼───┼───┤    ├───┼───┼───┼───┤
-│13 │14 │15 │16 │    │16 │ 8 │10 │13 │
-└───┴───┴───┴───┘    └───┴───┴───┴───┘
-```
-**Task**: Given 16 shuffled 7x7 patches, predict the digit class (0-9).
-**Challenge**: No positional encoding - CTM must learn to recognize digit parts and integrate them correctly through its internal synchronization dynamics.
-**Result**: **92% test accuracy** - CTM successfully learns part-whole relationships without explicit position information.
-![Jigsaw Training](continuous-thought-machines/experiments/results/jigsaw_training.png)
-## Resources
-- [CTM Paper](2505.05522v4.pdf)
-- [Original SakanaAI Repo](https://github.com/SakanaAI/continuous-thought-machines)

+# CTM Experiments - Continuous Thought Machine Models
+Experimental checkpoints trained on the [Continuous Thought Machine](https://github.com/SakanaAI/continuous-thought-machines) architecture by Sakana AI.
+**These are community experiments on the original work - not official SakanaAI models.**
+## Paper Reference
+> **Continuous Thought Machines**
+>
+> Sakana AI
+>
+> [arXiv:2505.05522](https://arxiv.org/abs/2505.05522)
+>
+> [Interactive Demo](https://pub.sakana.ai/ctm/) | [Blog Post](https://sakana.ai/ctm/)
+```bibtex
+@article{sakana2025ctm,
+  title={Continuous Thought Machines},
+  author={Sakana AI},
+  journal={arXiv preprint arXiv:2505.05522},
+  year={2025}
+}
+```
+## Core Insight
+CTM's key innovation: **accuracy improves with more internal iterations**. The model "thinks longer" to reach better answers. This enables CTM to learn algorithmic reasoning that feedforward networks struggle with.
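One way to check the "thinks longer" claim is to score the model's prediction at every internal tick separately. The sketch below is only illustrative: it assumes the per-tick class predictions have already been extracted into nested lists (`predictions[t][i]` for tick `t` and sample `i`), and both the helper name `accuracy_per_tick` and the toy data are hypothetical, not taken from the repository.

```python
def accuracy_per_tick(predictions, targets):
    """Return the fraction of correct predictions at each internal tick.

    predictions[t][i] is the predicted class for sample i at tick t.
    targets[i] is the true class for sample i.
    """
    n = len(targets)
    return [
        sum(p == y for p, y in zip(tick_preds, targets)) / n
        for tick_preds in predictions
    ]


# Toy data, made up for illustration: 3 ticks, 4 samples.
targets = [3, 1, 4, 1]
predictions = [
    [0, 0, 4, 0],  # tick 0: 1 of 4 correct
    [3, 0, 4, 1],  # tick 1: 3 of 4 correct
    [3, 1, 4, 1],  # tick 2: all correct
]

per_tick = accuracy_per_tick(predictions, targets)
print(per_tick)  # [0.25, 0.75, 1.0]
```

A curve that rises with the tick index is the pattern the README describes; a flat curve would suggest the extra iterations are not being used.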
+## Models
+| Model | File | Size | Task | Accuracy | Description |
+|-------|------|------|------|----------|-------------|
+| MNIST | `ctm-mnist.pt` | 1.3M | Digit classification | 97.9% | 10-class MNIST |
+| Parity-16 | `ctm-parity-16.pt` | 2.5M | Cumulative parity | 99.0% | 16-bit sequences |
+| Parity-64 | `ctm-parity-64.pt` | 66M | Cumulative parity | 58.6% | 64-bit sequences (custom config) |
+| Parity-64 Official | `ctm-parity-64-official.pt` | 21M | Cumulative parity | 57.7% | 64-bit sequences (official config) |
+| QAMNIST | `ctm-qamnist.pt` | 39M | Multi-step arithmetic | 100% | 3-5 digits, 3-5 ops |
+| Brackets | `ctm-brackets.pt` | 6.1M | Bracket matching | 94.7% | Valid/invalid `(()[])` |
+| Tracking-Quadrant | `ctm-tracking-quadrant.pt` | 6.7M | Motion quadrant | 100% | 4-class prediction |
+| Tracking-Position | `ctm-tracking-position.pt` | 6.7M | Exact position | 93.8% | 256-class (16x16 grid) |
+| Transfer | `ctm-transfer-parity-brackets.pt` | 2.5M | Transfer learning | 94.5% | Parity core to brackets |
+| Jigsaw MNIST | `ctm-jigsaw-mnist.pt` | 19M | Jigsaw puzzle solving | 92.3% | Reassemble 2x2 shuffled MNIST |
+| Rotation MNIST | `ctm-rotation-mnist.pt` | 4.2M | Rotation prediction | 89.1% | Predict rotation angle (4 classes) |
+| Brackets Transfer | `ctm-brackets-transfer-depth4.pt` | 6.1M | Transfer learning | 95.1% | Parity→Brackets (depth 4 synapse) |
+| Dual-Task | `ctm-dual-task-brackets-parity.pt` | 2.8M | Multi-task | 86.1% | Brackets (94%) + Parity (78%) jointly |
+| Parity-64 (8x8) | `ctm-parity-64-8x8.pt` | 4.1M | Long parity | 58.6% | 64-bit (8x8) cumulative parity |
+| Parity-144 | `ctm-parity-144-12x12.pt` | 4.1M | Long parity | 51.7% | 144-bit (12x12) cumulative parity |
+## Model Configurations
+### MNIST CTM
+```python
+config = {
+    "iterations": 15,
+    "memory_length": 10,
+    "d_model": 128,
+    "d_input": 128,
+    "heads": 2,
+    "n_synch_out": 16,
+    "n_synch_action": 16,
+    "memory_hidden_dims": 8,
+    "out_dims": 10,
+    "synapse_depth": 1,
+}
+```
+### Parity-16 CTM
+```python
+config = {
+    "iterations": 50,
+    "memory_length": 25,
+    "d_model": 256,
+    "d_input": 32,
+    "heads": 8,
+    "synapse_depth": 8,
+    "out_dims": 16,  # cumulative parity
+}
+```
+### Parity-64 Official CTM
+```python
+config = {
+    "iterations": 75,
+    "memory_length": 25,
+    "d_model": 1024,
+    "d_input": 64,
+    "heads": 8,
+    "n_synch_out": 32,
+    "n_synch_action": 32,
+    "synapse_depth": 1,  # linear synapse (official)
+    "out_dims": 64,  # cumulative parity
+}
+```
+### QAMNIST CTM
+```python
+config = {
+    "iterations": 10,
+    "memory_length": 30,
+    "d_model": 1024,
+    "d_input": 64,
+    "synapse_depth": 1,
+    "heads": 4,
+    "n_synch_out": 32,
+    "n_synch_action": 32,
+}
+```
+### Brackets CTM
+```python
+config = {
+    "iterations": 30,
+    "memory_length": 15,
+    "d_model": 256,
+    "d_input": 64,
+    "heads": 4,
+    "n_synch_out": 32,
+    "n_synch_action": 32,
+    "out_dims": 2,  # valid/invalid
+}
+```
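For reference, the valid/invalid target that the Brackets model predicts can be computed classically with an explicit stack. This is a minimal sketch of the labelling rule only, assuming strings over `()` and `[]`; the function name `is_valid` is hypothetical and the dataset generator in the experiment repository may differ.

```python
# Map each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "["}


def is_valid(s):
    """Return True if every bracket in s is closed in LIFO order."""
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    # Any bracket left on the stack was never closed.
    return not stack


print(is_valid("(()[])"))  # True
print(is_valid("([)]"))    # False
```

The CTM has no explicit stack, so the 94.7% accuracy reflects how well its internal dynamics approximate this LIFO bookkeeping.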
+### Tracking CTM
+```python
+config = {
+    "iterations": 20,
+    "memory_length": 15,
+    "d_model": 256,
+    "d_input": 64,
+    "heads": 4,
+    "n_synch_out": 32,
+    "n_synch_action": 32,
+}
+```
+### Jigsaw MNIST CTM
+```python
+config = {
+    "iterations": 30,
+    "memory_length": 20,
+    "d_model": 512,
+    "d_input": 128,
+    "heads": 8,
+    "n_synch_out": 32,
+    "n_synch_action": 32,
+    "synapse_depth": 1,
+    "out_dims": 24,  # 4 tiles x 6 permutation options
+    "backbone_type": "jigsaw",
+}
+```
+### Rotation MNIST CTM
+```python
+config = {
+    "iterations": 20,
+    "memory_length": 15,
+    "d_model": 256,
+    "d_input": 64,
+    "heads": 4,
+    "n_synch_out": 32,
+    "n_synch_action": 32,
+    "synapse_depth": 1,
+    "out_dims": 4,  # 0°, 90°, 180°, 270°
+    "backbone_type": "rotation",
+}
+```
+## Usage
+```python
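+import torch
+from huggingface_hub import hf_hub_download
+# Assumes the original CTM repository is on the Python path
+from models.ctm import ContinuousThoughtMachine
+# Download the checkpoint from the Hub
+model_path = hf_hub_download(
+    repo_id="vincentoh/ctm-experiments",
+    filename="ctm-mnist.pt",
+)
+# Load the checkpoint on CPU
+checkpoint = torch.load(model_path, map_location="cpu")
+# `config` is the dict from "Model Configurations" that matches the
+# checkpoint (here: MNIST CTM), plus any other constructor arguments
+# the CTM class requires
+model = ContinuousThoughtMachine(**config)
+model.load_state_dict(checkpoint["model_state_dict"])
+model.eval()
+# Inference; `input_tensor` is a preprocessed input batch for the task
+with torch.no_grad():
+    output = model(input_tensor)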
+```
+## Training Details
+- **Hardware**: NVIDIA RTX 4070 Ti SUPER
+- **Framework**: PyTorch
+- **Optimizer**: AdamW
+- **Training time**: 5 minutes (MNIST) to 17 hours (QAMNIST)
+## Key Findings
+1. **Architecture > Scale**: Small sync dimensions (32) with linear synapses work better than large/deep variants
+2. **"Thinking Longer" = Higher Accuracy**: CTM accuracy improves with more internal iterations
+3. **Transfer Learning Works**: Parity-trained core transfers to brackets with 94.5% accuracy
+4. **Architectural Limits**: Both tested configurations plateau at ~58% on 64-bit parity, which suggests a limit of the architecture rather than of the hyperparameters
+## Parity Scaling Experiments
+We tested CTM on increasingly long parity sequences to find where it breaks down:
+| Sequence (bits) | Grid | Accuracy | vs. Random (pp) | Status |
+|-----------------|------|----------|-----------------|--------|
+| 16 | 4x4 | **99.0%** | +49.0 | ✅ Solved |
+| 36 | 6x6 | **66.3%** | +16.3 | ⚠️ Degraded |
+| 64 | 8x8 | **58.6%** | +8.6 | ❌ Struggling |
+| 64 (official) | 8x8 | **57.7%** | +7.7 | ❌ Same ceiling |
+| 144 | 12x12 | **51.7%** | +1.7 | ❌ Random |
+**Key insight**: The ~58% ceiling for parity-64 appears to be an **architectural limit**, not a hyperparameter issue: the custom config (d_model=512, synapse_depth=4) and the official config (d_model=1024, synapse_depth=1) reach essentially the same accuracy (58.6% vs. 57.7%).
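The margin over random guessing in the table above is the accuracy minus the 50% chance level of a binary parity label, measured in percentage points. A short sketch of that arithmetic, using the accuracies reported in the table (the variable names are illustrative):

```python
# Chance level for a binary parity label, in percent.
CHANCE = 50.0

# Accuracies (percent) as reported in the parity scaling table.
results = {
    "16": 99.0,
    "36": 66.3,
    "64": 58.6,
    "64 (official)": 57.7,
    "144": 51.7,
}

# Margin over chance in percentage points, rounded to one decimal.
margins = {name: round(acc - CHANCE, 1) for name, acc in results.items()}

for name, margin in margins.items():
    print(f"{name:>14} bits: +{margin} pp over chance")
```

Read this way, the 144-bit result is within 2 percentage points of guessing.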
+### Why CTM Fails on Long Parity
+Parity requires **strict sequential computation**: process bit 1 before bit 2 before bit 3, and so on. CTM's attention-based "thinking" is fundamentally parallel, with all positions attended to simultaneously. The model learns approximate sequential patterns for short sequences (99.0% at 16 bits), but accuracy falls to 66.3% at 36 bits and is close to chance at 144 bits.
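To make the sequential dependency concrete, here is a minimal sketch of how cumulative parity targets can be computed. It assumes bits encoded as 0/1, which is an assumption for illustration and may not match the encoding used in the training scripts. Each target is the XOR of all bits seen so far, so flipping one early bit flips every later target.

```python
from itertools import accumulate
from operator import xor


def cumulative_parity(bits):
    """Return the running parity: output[i] is the XOR of bits[0..i]."""
    return list(accumulate(bits, xor))


bits = [1, 0, 1, 1, 0, 0, 1, 0]
targets = cumulative_parity(bits)
print(targets)  # [1, 1, 0, 1, 1, 1, 0, 0]

# Flipping only the first bit inverts every target in the sequence.
flipped = cumulative_parity([1 - bits[0]] + bits[1:])
print(flipped)  # [0, 0, 1, 0, 0, 0, 1, 1]
```

Because the last target depends on every input bit, no position can be predicted from local context alone, which is consistent with the drop in accuracy as sequences grow.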
+**CTM excels at:**
+- Moderate sequence lengths (< 64 elements)
+- Local dependencies (brackets: track depth, not full history)
+- Parallelizable structure (MNIST: patches contribute independently)
+**CTM struggles with:**
+- Long strict sequential dependencies (parity-144)
+- Tasks requiring O(n) sequential steps where n > ~64
+## License
+MIT License (same as original CTM repository)
+## Acknowledgments
+- [Sakana AI](https://sakana.ai/) for the Continuous Thought Machine architecture
+- Original [CTM Repository](https://github.com/SakanaAI/continuous-thought-machines)
+## Links
+- [Experiment Repository](https://github.com/bigsnarfdude/ctm-experiments)
+- [Original Paper](https://arxiv.org/abs/2505.05522)
+- [Interactive Demo](https://pub.sakana.ai/ctm/)