DynamicVLA — DOM (full fine-tune checkpoint)

A DynamicVLA policy trained on the DOM dataset (hzxie/DOM) for dynamic-object manipulation.

⚠️ Mid-training checkpoint (epoch 9, train loss ≈ 0.002). The file is self-contained and eval-ready (normalization buffers are included), but optimizer and scheduler state are not, so optimizer momentum cannot be resumed from this file.

Model

Architecture: DynamicVLA = SmolLM2-360M VLM backbone (16 layers) + FastViT vision encoder + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
This checkpoint is a full fine-tune: the vision, text, and connector modules are all unfrozen (freeze_* = False), so all 430M parameters are trainable. The stock config, by contrast, freezes the backbone and trains only ~99M.
Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384, cameras opst_cam + wrist_cam.
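
As a quick sanity check on the parameter counts above, the sketch below works out what share of the model each setup trains. It uses only the approximate figures quoted in this card (430M total, ~99M trainable in the stock config). The helper name trainable_fraction is illustrative and is not part of the DynamicVLA code.

```python
# Approximate counts quoted in this card (not measured from the checkpoint).
TOTAL_PARAMS = 430_000_000
STOCK_TRAINABLE = 99_000_000  # stock config: backbone frozen


def trainable_fraction(trainable: int, total: int = TOTAL_PARAMS) -> float:
    """Fraction of parameters that receive gradient updates."""
    return trainable / total


stock = trainable_fraction(STOCK_TRAINABLE)      # share trained by the stock config
full = trainable_fraction(TOTAL_PARAMS)          # share trained by this full fine-tune
extra_trained = TOTAL_PARAMS - STOCK_TRAINABLE   # parameters unfrozen on top of stock

print(f"stock: {stock:.0%}, full: {full:.0%}, extra: {extra_trained / 1e6:.0f}M")
```

Under these figures the stock config updates a little under a quarter of the model, and this run unfreezes roughly 331M additional parameters.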

Training

Hardware: 8× NVIDIA H200.
Effective global batch 1280 = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's effective batch; the paper used 32× A100 × 40/GPU = 1280).
Optimizer: AdamW, lr 1e-4, betas (0.9, 0.95), weight decay 1e-10, cosine schedule with 1000 warmup steps.
Current state: epoch 9 of a 500-epoch cap, train loss ≈ 0.002 (the training loss has converged; the 500-epoch cap is nominal).
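
The batch arithmetic and the learning-rate schedule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code from the repository. The linear warmup shape, the decay to zero, and the total_steps value are assumptions, since the card states only "cosine schedule" and "1000 warmup steps".

```python
import math


def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu * num_gpus * grad_accum


def lr_at(step: int, total_steps: int, base_lr: float = 1e-4,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay (assumed shape, see lead-in)."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup window.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


this_run = effective_batch(per_gpu=80, num_gpus=8, grad_accum=2)  # 8x H200
paper = effective_batch(per_gpu=40, num_gpus=32)                  # 32x A100
assert this_run == paper == 1280

TOTAL_STEPS = 10_000  # hypothetical; the real step count is not given in this card
for step in (0, 999, 1000, 5500, TOTAL_STEPS):
    print(step, lr_at(step, TOTAL_STEPS))
```

Both configurations reach the same 1280 samples per optimizer step, which is why the 8-GPU run needs gradient accumulation of 2.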

Load / evaluate

Use the DynamicVLA code (https://github.com/hzxie/DynamicVLA):

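from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

# Load the checkpoint (weights and normalization buffers) by its Hub repo ID.
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")
# Switch to inference mode and move the policy to the GPU.
policy.eval().cuda()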

from_pretrained restores the normalization buffers from model.safetensors, so no dataset is needed to load the policy or run inference. To run the DOM benchmark, serve the policy with scripts/inference.py -p <dir> and evaluate it against the Isaac Lab eval server (simulations/evaluate.py).
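
To confirm that the normalization buffers really are stored in model.safetensors, the tensor names can be listed without instantiating the policy. The sketch below uses huggingface_hub.hf_hub_download and safetensors.safe_open. The "normalize" substring used to spot the buffers is an assumption about how they are named in this checkpoint, so adjust it after looking at the actual key list.

```python
def normalization_keys(tensor_names, markers=("normalize",)):
    """Return the tensor names that look like normalization buffers.

    The default marker is a guess at the naming scheme; it also matches
    names containing "unnormalize".
    """
    return sorted(
        name for name in tensor_names
        if any(marker in name.lower() for marker in markers)
    )


def list_checkpoint_tensors(repo_id="mickeykang/dynamic-vla-DOM",
                            filename="model.safetensors"):
    """Download the checkpoint file and return its tensor names."""
    # Imported lazily so normalization_keys works without these packages.
    from huggingface_hub import hf_hub_download
    from safetensors import safe_open

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    # framework="np" reads metadata through NumPy, so PyTorch is not required.
    with safe_open(path, framework="np") as f:
        return list(f.keys())


if __name__ == "__main__":
    names = list_checkpoint_tensors()
    print(f"{len(names)} tensors in checkpoint")
    for key in normalization_keys(names):
        print(key)
```

If the filter prints nothing, inspect the full list of names before concluding that the buffers are missing, since the marker may simply not match.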

Notes

DOM contains some corrupt/truncated videos; a small local resilience patch in utils/datasets.py (substitute a valid sample on any decode error) is needed to train on the full set, but is not needed to load or evaluate this checkpoint.
Trained from the HF-pretrained SmolLM2-360M initialization, with no prior DynamicVLA checkpoint. This is therefore the main training run, not a fine-tune of a released DynamicVLA model.
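
The resilience patch mentioned in the notes above (substitute a valid sample on any decode error) could look roughly like the wrapper below. This is a generic sketch of the idea, not the actual change made in utils/datasets.py. The class name ResilientDataset, the retry limit, and the random choice of substitute are all illustrative assumptions.

```python
import random


class ResilientDataset:
    """Wrap a map-style dataset and replace samples that fail to load."""

    def __init__(self, dataset, max_retries=10, seed=0):
        self.dataset = dataset
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        try:
            return self.dataset[index]
        except IndexError:
            # Out-of-range access is a caller error, not a corrupt sample.
            raise
        except Exception as first_error:
            # Decode failed: try randomly chosen substitute samples instead.
            for _ in range(self.max_retries):
                substitute = self.rng.randrange(len(self.dataset))
                try:
                    return self.dataset[substitute]
                except Exception:
                    continue
            raise RuntimeError(
                f"no valid substitute found for sample {index}"
            ) from first_error
```

Any object with __len__ and __getitem__ can be wrapped this way, so the wrapper can sit between the dataset and a data loader. Silent substitution slightly changes the sampling distribution, so it is worth logging how often it triggers.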

Safetensors: 0.4B params, tensor types F32 and BF16.
