Argus-Lite

Multi-task perception on a single frozen EUPE-ViT-S backbone, adapted from phanerozoic/argus at roughly ¼ the parameter budget.

Architecture

Image → EUPE-ViT-S (frozen, 21M) → shared features
                                    │
                    ┌───────────────┼──────────────┬──────────────┐
                    ▼               ▼              ▼              ▼
              Classification  Segmentation      Depth        Detection
              Linear(384,1K)  BN+Conv(384,150)  DPT-style   Split-tower (384-D)
              385 K params     58 K params     1.54 M params 2.91 M params

Plus correspondence via cosine max on patch tokens (0 params).
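
As a rough illustration of what "cosine max on patch tokens" can look like, here is a minimal NumPy sketch. The function name `cosine_max_matches` and the random stand-in tokens are assumptions for illustration, not the code in argus_lite.py.

```python
import numpy as np

def cosine_max_matches(tokens_a: np.ndarray, tokens_b: np.ndarray):
    """For each patch token of image A, find its most similar patch in image B.

    tokens_a, tokens_b: (N, D) patch-token arrays, e.g. N = 32 * 32, D = 384.
    Returns (matches, scores), both of shape (N,).
    """
    # L2-normalise so that a plain dot product equals cosine similarity.
    a = tokens_a / np.maximum(np.linalg.norm(tokens_a, axis=1, keepdims=True), 1e-12)
    b = tokens_b / np.maximum(np.linalg.norm(tokens_b, axis=1, keepdims=True), 1e-12)
    sim = a @ b.T                   # (N, N) cosine similarity matrix
    matches = sim.argmax(axis=1)    # index of the best B patch for each A patch
    scores = sim.max(axis=1)        # cosine similarity of that best match
    return matches, scores

# Stand-in data: image B holds the same tokens as image A in shuffled order.
rng = np.random.default_rng(0)
tokens_a = rng.standard_normal((1024, 384)).astype(np.float32)
perm = rng.permutation(1024)
tokens_b = tokens_a[perm]

matches, scores = cosine_max_matches(tokens_a, tokens_b)
```

No learned weights are involved, which is consistent with the 0-parameter figure quoted above.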

Component                    | Params
-----------------------------|--------
EUPE-ViT-S backbone (frozen) | 21.59 M
Classifier head              | 0.39 M
Segmentation head            | 0.06 M
Depth head (DPT decoder)     | 1.54 M
Detection head               | 2.91 M
Total                        | ~26.5 M

Roughly ¼ the parameter count of the Argus-B system (103 M).
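
The two smallest head sizes can be checked from their layer shapes. The sketch below assumes the linear and 1×1 conv layers carry a bias and the BatchNorm has a learnable scale and shift (an assumption; the card only gives the layer shapes). The depth and detection figures are taken from the architecture diagram rather than derived.

```python
def linear_params(in_features: int, out_features: int) -> int:
    # weight matrix plus bias vector
    return in_features * out_features + out_features

def batchnorm_params(channels: int) -> int:
    # learnable scale and shift only; running statistics are buffers
    return 2 * channels

def conv2d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    # kernel weights plus bias
    return in_ch * out_ch * kernel * kernel + out_ch

cls_head = linear_params(384, 1000)                             # 385_000
seg_head = batchnorm_params(384) + conv2d_params(384, 150, 1)   # 58_518

# Millions of parameters; depth and detection come from the diagram.
parts_m = {
    "backbone": 21.59,
    "classifier": cls_head / 1e6,
    "segmentation": seg_head / 1e6,
    "depth": 1.54,
    "detection": 2.91,
}
total_m = sum(parts_m.values())
ratio_to_argus_b = total_m / 103.0

print(f"total = {total_m:.2f} M, ratio to Argus-B = {ratio_to_argus_b:.2f}")
```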

Training

All four heads are trained on pre-cached ViT-S features, produced by a single forward pass over each target dataset. The backbone is frozen throughout.
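
The cache-then-train pattern can be sketched as follows. Everything here is a toy stand-in: `frozen_backbone` is a fixed random projection rather than EUPE-ViT-S, the data is synthetic, and the head is fit with plain full-batch gradient descent instead of the optimizers listed in the table below.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, NUM_CLASSES, NUM_SAMPLES = 384, 10, 500

# Hypothetical stand-in for the frozen backbone: a fixed random projection.
W_frozen = rng.standard_normal((64, FEAT_DIM)) / 8.0

def frozen_backbone(images: np.ndarray) -> np.ndarray:
    feats = np.tanh(images @ W_frozen)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

# Synthetic "dataset": noisy copies of one prototype vector per class.
labels = rng.integers(0, NUM_CLASSES, size=NUM_SAMPLES)
prototypes = rng.standard_normal((NUM_CLASSES, 64))
images = prototypes[labels] + 0.3 * rng.standard_normal((NUM_SAMPLES, 64))

# Step 1: a single forward pass over the dataset; the result is the cache.
cached_feats = frozen_backbone(images)          # (500, 384), could be saved to disk

# Step 2: train only the head, reading from the cache and never the backbone.
W = np.zeros((FEAT_DIM, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def softmax_ce(W, b):
    logits = cached_feats @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(NUM_SAMPLES), labels].mean()
    return loss, np.exp(log_probs)

initial_loss, _ = softmax_ce(W, b)
for _ in range(200):
    _, probs = softmax_ce(W, b)
    probs[np.arange(NUM_SAMPLES), labels] -= 1.0   # gradient w.r.t. the logits
    grad_logits = probs / NUM_SAMPLES
    W -= 0.5 * cached_feats.T @ grad_logits
    b -= 0.5 * grad_logits.sum(axis=0)
final_loss, _ = softmax_ce(W, b)
```

Because the backbone never changes, the expensive forward pass is paid once per dataset and each head epoch only touches the cached arrays.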

Head | Dataset | Input | Recipe | Result
---|---|---|---|---
Classifier | ImageNet-1k train | 224 px CLS token | SGD, lr 30, WD 0, cosine, 30 epochs | 82.87 % train top-1 / 79.13 % val top-1 / 95.53 % val top-5
Segmentation | ADE20K (20,210 train / 2,000 val) | 512 px, 32×32 grid | Linear probe (BN + 1×1 conv), AdamW, lr 1e-3, 32 epochs (~40k iters) | mIoU 0.419
Depth | NYUv2 (32K train / 5K val) | 416 px, 4 hooked blocks at strides 4/8/16/32 | SILog, AdamW, lr 1e-4 cosine, DPT decoder with reflection-padded 3×3 convs | RMSE 0.537
Detection | COCO train 2017 (117 K) | 768 px, 48×48 grid | FCOS targets, AdamW, lr 1e-4, 2 epochs | COCO val2017 mAP 27.3 (AP@50 49.6 · AR@100 43.2); RF100-VL AR@100 0.266 (20-domain subset)
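
For readers unfamiliar with the SILog loss named in the depth recipe, a minimal sketch of the usual scale-invariant log formulation follows. The variance weight `lam=0.85` is a common choice in the depth literature and is an assumption here; the card does not state the value used, nor whether an extra scale factor is applied.

```python
import math

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss over flat sequences of depths in meters.

    Pixels with a non-positive target are treated as invalid and skipped.
    """
    diffs = [
        math.log(max(p, eps)) - math.log(t)
        for p, t in zip(pred, target)
        if t > 0
    ]
    if not diffs:
        return 0.0
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))

exact = silog_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
scaled = silog_loss([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])             # uniform 2x error
scaled_full = silog_loss([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], lam=1.0)
```

With `lam=1.0` a uniform scale error costs essentially nothing, while smaller values keep some penalty on absolute scale, which matters for a head that outputs metric depth.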

Files

cls_head.safetensors          Linear(384, 1000) classifier
seg_head.safetensors          BN + Conv2d(384, 150, 1)
depth_head.safetensors        DPT decoder over 4 hooked ViT-S blocks
det_head.safetensors          SplitTowerHead (feat_dim=384)
argus_lite.py                 ArgusLite class
infer.py                      CLI dispatcher (6 subcommands)

Usage

from argus_lite import ArgusLite

model = ArgusLite.from_pretrained('phanerozoic/argus-lite').cuda().eval()

# Single-image multi-task inference
out = model.perceive('image.jpg')
# out['classification']  {'label': 'tabby', 'score': 0.62, 'top5': [...]}
# out['segmentation']    (512, 512) int array of ADE20K class ids
# out['depth']           (416, 416) float array, metric depth in meters
# out['detection']       [{'box': [x1,y1,x2,y2], 'score': 0.78, 'class_id': 17}, ...]
# out['correspondence']  None  (needs a second image)

# Per-task methods
model.classify('image.jpg', top_k=5)
model.segment('street.jpg')                          # (512, 512)
model.depth('room.jpg')                              # (416, 416) metric meters
model.detect('photo.jpg', score_thresh=0.3)          # list of dicts
model.correspond('a.jpg', 'b.jpg')                   # cosine-max patch matches

# Paired inference with correspondence populated
out = model.perceive('a.jpg', image_b='b.jpg')
# out['correspondence']  {'matches': (1024,), 'scores': (1024,), 'grid': 32}

CLI dispatcher in infer.py (six subcommands):

python infer.py classify   cat.jpg
python infer.py segment    street.jpg --save seg.png
python infer.py depth      room.jpg --save depth.png
python infer.py detect     photo.jpg --thresh 0.3
python infer.py correspond a.jpg b.jpg
python infer.py perceive   image.jpg --second image2.jpg --save out/
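
A dispatcher with this command surface could be laid out with `argparse` subparsers as sketched below. This is an assumed structure and not the contents of infer.py; the positional names and the 0.3 default for `--thresh` are illustrative.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="infer.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("classify").add_argument("image")

    # segment and depth both take an image and an optional output path
    for name in ("segment", "depth"):
        p = sub.add_parser(name)
        p.add_argument("image")
        p.add_argument("--save")

    p = sub.add_parser("detect")
    p.add_argument("image")
    p.add_argument("--thresh", type=float, default=0.3)

    p = sub.add_parser("correspond")
    p.add_argument("image_a")
    p.add_argument("image_b")

    p = sub.add_parser("perceive")
    p.add_argument("image")
    p.add_argument("--second")   # optional second image for correspondence
    p.add_argument("--save")
    return parser

parser = build_parser()
detect_args = parser.parse_args(["detect", "photo.jpg", "--thresh", "0.3"])
perceive_args = parser.parse_args(
    ["perceive", "image.jpg", "--second", "image2.jpg", "--save", "out/"]
)
```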

Each head runs at its own training resolution: classifier 224 px (CLS token), segmentation 512 px (32×32 patch grid), depth 416 px (26×26 grid, DPT decoder over hooked blocks 2, 5, 8, 11), detection 768 px (48×48 grid). perceive() therefore does four backbone forward passes per image.
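
The grid sizes above follow from dividing each input resolution by the patch size. The sketch below assumes a patch size of 16, which is inferred from the stated 512 px → 32×32 pairing and is not given explicitly in this card.

```python
PATCH_SIZE = 16  # assumption inferred from 512 px -> 32x32 grid

HEAD_RESOLUTION = {
    "classification": 224,
    "segmentation": 512,
    "depth": 416,
    "detection": 768,
}

def patch_grid(resolution: int, patch: int = PATCH_SIZE) -> int:
    """Side length of the patch grid for a square input."""
    if resolution % patch != 0:
        raise ValueError(f"{resolution} px is not a multiple of {patch}")
    return resolution // patch

grids = {task: patch_grid(res) for task, res in HEAD_RESOLUTION.items()}
tokens = {task: g * g for task, g in grids.items()}

# One backbone forward pass per distinct input resolution.
backbone_passes = len(set(HEAD_RESOLUTION.values()))

for task, g in grids.items():
    print(f"{task:15s} {HEAD_RESOLUTION[task]} px -> {g}x{g} grid, {tokens[task]} patch tokens")
```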

Requires argus.py from phanerozoic/argus on sys.path for the DinoVisionTransformer and SplitTowerHead classes.
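
One way to satisfy this is to put a local copy of the Argus repository on `sys.path` before importing. The directory below is a hypothetical location; point it at wherever argus.py actually lives on your machine.

```python
import sys
from pathlib import Path

# Hypothetical location of a local copy of phanerozoic/argus (contains argus.py).
ARGUS_REPO = Path("third_party/argus").resolve()

if str(ARGUS_REPO) not in sys.path:
    sys.path.insert(0, str(ARGUS_REPO))

# Once argus.py is importable, the import from the usage section should resolve:
# from argus_lite import ArgusLite
```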

Cross-domain detection benchmark (RF100-VL subset)

Same 20-domain class-agnostic AR@100 protocol as Argus, evaluated live through the ViT-S backbone at 768 px input.

Model                         | Total params | Mean AR@100
------------------------------|--------------|------------
Argus+FCOS (ViT-B backbone)   | 102.1 M      | 0.251
Argus-Lite (this model)       | ~26.5 M      | 0.266
Argus+(current picker, ViT-B) | 89.0 M       | 0.289

Per-domain numbers live in rf100vl_results.json.
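
For orientation, a simplified sketch of a class-agnostic AR@100 computation is given below. It keeps the top 100 boxes per image by score, ignores class labels, and averages recall over IoU thresholds 0.50 to 0.95. It counts a ground-truth box as recalled if any kept box overlaps it enough, which is looser than the one-to-one matching used by the COCO tooling, so it is an approximation and not the exact benchmark protocol.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(dets_per_image, gts_per_image, max_dets=100):
    """dets: per-image lists of {'box', 'score'} dicts; gts: per-image lists of boxes."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]   # 0.50 ... 0.95
    recalls = []
    for t in thresholds:
        hits = total = 0
        for dets, gts in zip(dets_per_image, gts_per_image):
            top = sorted(dets, key=lambda d: d["score"], reverse=True)[:max_dets]
            for gt in gts:
                total += 1
                if any(iou(d["box"], gt) >= t for d in top):
                    hits += 1
        recalls.append(hits / total if total else 0.0)
    return sum(recalls) / len(recalls)

# One image, one ground-truth box, two detections (class ids are not used).
gts = [[[0, 0, 10, 10]]]
dets = [[
    {"box": [0, 0, 10, 7.8], "score": 0.9},    # IoU 0.78 with the ground truth
    {"box": [20, 20, 30, 30], "score": 0.4},   # no overlap
]]
ar = average_recall(dets, gts)
```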

Evaluation details

  • Classifier val top-1 is 79.13 % and top-5 is 95.53 % on the 50K ImageNet val 2012 images, using the TensorFlow Models repo's synset-label mapping for ground truth. This is above the EUPE-ViT-S paper's kNN baseline (78.2).
  • Detection head: COCO val2017 mAP 0.273 (AP@50 0.496, AP@75 0.268, AR@100 0.432). See coco_val_eval.json for the full breakdown including per-size AP.
  • Depth head is a DPT decoder reassembling the four hooked ViT-S block activations (blocks 2, 5, 8, 11) at strides [4, 8, 16, 32], followed by 4 FeatureFusion blocks with residual conv units and a 256-bin depth head. Trained on NYUv2 (32K train / 5K val held-out split) with SILog loss and AdamW lr 1e-4 on a cosine schedule. For reference, the same DPT decoder on EUPE-ViT-B (Argus, 4× larger backbone) reaches 0.391 RMSE on the equivalent split.
  • Segmentation head is a linear probe trained for 32 epochs (~40k iterations); the EUPE-ViT-S paper reports mIoU 0.466 at a much longer schedule.
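
As a reference for the segmentation metric, mIoU can be computed from a confusion matrix as sketched below. The `ignore_index=255` convention and the choice to skip classes absent from both prediction and ground truth are assumptions; the card does not describe the exact evaluation script.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    """Mean intersection-over-union between two integer label maps."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0          # skip classes absent from both maps
    return float((inter[present] / union[present]).mean())

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, target, num_classes=3)
```

For ADE20K the same function would be called with `num_classes=150` on the (512, 512) label maps.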

Source backbone

EUPE-ViT-S from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
