DynamicVLA: DOM (full fine-tune checkpoint)

A DynamicVLA policy trained on the DOM dataset (hzxie/DOM) for dynamic-object manipulation.

โš ๏ธ Mid-training checkpoint (epoch 9, loss โ‰ˆ 0.002). Self-contained and eval-ready โ€” includes normalization buffers โ€” but optimizer/scheduler state is not included (cannot resume optimizer momentum from this file).

Model

  • Architecture: DynamicVLA = SmolLM2-360M VLM backbone (16 layers) + FastViT vision encoder + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
  • This checkpoint is a FULL fine-tune: vision, text, and connector are unfrozen (freeze_* = False), so all 430M parameters are trainable (the stock config freezes the backbone and trains only ~99M).
  • Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384, cameras opst_cam + wrist_cam.
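To make "action chunk 20, 2 observation steps" concrete, here is a minimal sketch of one common way to consume chunked predictions: keep the last two observations, ask the policy for a 20-action chunk only when the previous chunk is exhausted, and execute one action per control step. `ChunkedController` and `predict_chunk` are hypothetical names for illustration; how the DynamicVLA inference script actually schedules chunk execution is not described here, and it may differ.

```python
from collections import deque
from typing import Any, Callable, Deque, List

class ChunkedController:
    """Hypothetical helper: buffers observations and replays predicted action chunks."""

    def __init__(self, predict_chunk: Callable[[List[Any]], List[Any]],
                 n_obs_steps: int = 2, chunk_size: int = 20) -> None:
        self.predict_chunk = predict_chunk  # stand-in for the policy call (assumption)
        self.obs_history: Deque[Any] = deque(maxlen=n_obs_steps)
        self.pending: Deque[Any] = deque()
        self.chunk_size = chunk_size

    def step(self, observation: Any) -> Any:
        self.obs_history.append(observation)
        # At episode start, pad the history by repeating the first observation.
        while len(self.obs_history) < self.obs_history.maxlen:
            self.obs_history.append(observation)
        if not self.pending:
            chunk = self.predict_chunk(list(self.obs_history))
            if len(chunk) != self.chunk_size:
                raise ValueError(f"expected {self.chunk_size} actions, got {len(chunk)}")
            self.pending.extend(chunk)
        return self.pending.popleft()

# Toy predictor that records what it was asked and returns labelled actions.
calls: List[List[int]] = []

def toy_predict(history: List[int]) -> List[str]:
    calls.append(list(history))
    return [f"a{len(calls)}_{k}" for k in range(20)]

controller = ChunkedController(toy_predict)
actions = [controller.step(t) for t in range(21)]
```

With this scheme the predictor is queried on step 0 and again on step 20, so 21 control steps cost two forward passes.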

Training

  • Hardware: 8× NVIDIA H200.
  • Effective global batch 1280 = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's effective batch; the paper used 32× A100 × 40/GPU = 1280).
  • AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
  • Current state: epoch 9 / 500-cap, train loss ≈ 0.002 (converged; the 500 cap is nominal).
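The batch arithmetic and learning-rate schedule above can be sketched in plain Python. This is an illustration, not the training code: it assumes a linear warmup, a cosine decay to zero, and a caller-supplied `total_steps`, none of which are stated in this card. The paper's setup is modelled with `grad_accum=1` because the card lists no accumulation for it.

```python
import math

def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_gpu * num_gpus * grad_accum

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-4,
               warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

this_run = effective_batch(per_gpu=80, num_gpus=8, grad_accum=2)
paper = effective_batch(per_gpu=40, num_gpus=32, grad_accum=1)
```

Both configurations give 1280 samples per optimizer step, which is why the same peak learning rate can reasonably be reused across the two hardware setups.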

Load / evaluate

Use the DynamicVLA code (https://github.com/hzxie/DynamicVLA):

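# Assumes the DynamicVLA repository is on the Python path so `policies` is importable.
from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy

# Load the weights (normalization buffers included) by repository ID.
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")
policy.eval().cuda()  # switch to inference mode and move the model to the GPU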

from_pretrained restores the normalization buffers from model.safetensors, so no dataset is needed to load or run inference. For the DOM benchmark, serve it with scripts/inference.py -p <dir> against the Isaac Lab simulations/evaluate.py eval server.

Notes

  • DOM contains some corrupt/truncated videos; a small local resilience patch in utils/datasets.py (substitute a valid sample on any decode error) is needed to train on the full set, but is not needed to load or evaluate this checkpoint.
  • Trained from the HF-pretrained SmolLM2-360M init (no prior DynamicVLA checkpoint), i.e. this is the main training, not a fine-tune of a released DynamicVLA model.
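
The resilience patch mentioned in the first note is not reproduced in this card. The sketch below only illustrates the idea of substituting a valid sample on a decode error; `ResilientDataset` and `FlakyClips` are hypothetical names, and falling back to the next index is an assumption, not necessarily what the local patch in utils/datasets.py does.

```python
from typing import Any, Set

class ResilientDataset:
    """Hypothetical wrapper: on a load error, fall back to the next index."""

    def __init__(self, base: Any, max_retries: int = 10) -> None:
        self.base = base
        self.max_retries = max_retries

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, index: int) -> Any:
        n = len(self.base)
        if not 0 <= index < n:
            raise IndexError(index)  # keep normal out-of-range behaviour
        for _ in range(self.max_retries + 1):
            try:
                return self.base[index]
            except Exception:
                # Any decode failure: try the neighbouring sample instead.
                index = (index + 1) % n
        raise RuntimeError("no loadable sample found within max_retries")

class FlakyClips:
    """Toy dataset whose `bad` indices raise, mimicking corrupt videos."""

    def __init__(self, n: int, bad: Set[int]) -> None:
        self.n, self.bad = n, bad

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, i: int) -> dict:
        if i in self.bad:
            raise ValueError(f"cannot decode clip {i}")
        return {"index": i}

clips = ResilientDataset(FlakyClips(8, bad={3}))
good = clips[2]["index"]         # loads normally
substituted = clips[3]["index"]  # clip 3 is corrupt, so clip 4 is returned
```

Substitution slightly over-samples the neighbours of corrupt clips, which is usually acceptable when only a small fraction of the videos is affected.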