Argus-Lite
Multi-task perception on a single frozen EUPE-ViT-S backbone, adapted from phanerozoic/argus at roughly ¼ the parameter budget.
Architecture
Image → EUPE-ViT-S (frozen, 21M) → shared features
                         │
       ┌─────────────────┼─────────────────┬─────────────────┐
       ▼                 ▼                 ▼                 ▼
Classification     Segmentation          Depth           Detection
Linear(384,1K)   BN+Conv(384,150)      DPT-style    Split-tower (384-D)
 385 K params       58 K params      1.54 M params     2.91 M params
Correspondence is computed by a cosine-similarity argmax over patch tokens and adds no parameters.
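Because the correspondence step has no learned weights, it can be illustrated in a few lines. The sketch below is an assumption about how a cosine-max match might work (L2-normalise the patch tokens, then for each token in image A take the argmax of cosine similarity over image B). The function name `cosine_max_matches` is hypothetical and this is not the code in argus_lite.py; it uses plain Python lists to stay self-contained, where a real implementation would use batched tensor ops.

```python
import math

def cosine_max_matches(tokens_a, tokens_b):
    """For each token in A, return the index of its best match in B and the cosine score."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
        return [x / norm for x in vec]

    a_unit = [normalize(v) for v in tokens_a]
    b_unit = [normalize(v) for v in tokens_b]
    matches, scores = [], []
    for a in a_unit:
        sims = [sum(x * y for x, y in zip(a, b)) for b in b_unit]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append(best)
        scores.append(sims[best])
    return matches, scores

# Toy 2-D "tokens"; the real model would use 384-D patch tokens (1024 of them on a 32x32 grid).
tokens_a = [[1.0, 0.0], [0.0, 2.0]]
tokens_b = [[0.0, 1.0], [3.0, 0.1], [1.0, 1.0]]
matches, scores = cosine_max_matches(tokens_a, tokens_b)
print(matches)  # [1, 0]
```

The match is one-directional (A to B) in this sketch; whether the model also enforces mutual consistency is not stated here.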
| Component | Params |
|---|---|
| EUPE-ViT-S backbone (frozen) | 21.59 M |
| Classifier head | 0.39 M |
| Segmentation head | 0.06 M |
| Depth head (DPT decoder) | 13.06 M |
| Detection head | 2.91 M |
| Total | ~38.0 M |
Roughly ⅓ of the Argus-B system's parameter count (103 M).
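The two smallest head sizes can be checked by hand. The sketch below assumes standard layer definitions with bias terms enabled and an affine BatchNorm (one scale and one shift per channel); the helper names are illustrative, not part of the repository.

```python
def linear_params(in_features, out_features, bias=True):
    return in_features * out_features + (out_features if bias else 0)

def conv2d_params(in_ch, out_ch, kernel, bias=True):
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

def batchnorm_params(channels):
    # Learnable scale and shift only; running statistics are buffers, not parameters.
    return 2 * channels

cls_head = linear_params(384, 1000)                            # Linear(384, 1000)
seg_head = batchnorm_params(384) + conv2d_params(384, 150, 1)  # BN + 1x1 Conv(384, 150)
print(cls_head, seg_head)  # 385000 58518

# Sum of the table rows above, in millions of parameters.
table_rows_m = [21.59, 0.39, 0.06, 13.06, 2.91]
print(round(sum(table_rows_m), 2))  # 38.01
```

Under these assumptions the counts round to the 0.39 M and 0.06 M listed in the table.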
Training
All four heads are trained on pre-cached ViT-S features produced by a single forward pass over each target dataset. The backbone is frozen throughout.
| Head | Dataset | Input | Recipe | Result |
|---|---|---|---|---|
| Classifier | ImageNet-1k train | 224 px CLS token | SGD, lr 30, WD 0, cosine, 30 epochs | 82.87 % train top-1 / 79.13 % val top-1 / 95.53 % val top-5 |
| Segmentation | ADE20K (20,210 train / 2,000 val) | 512 px, 32×32 grid | Linear probe (BN + 1×1 conv), AdamW, lr 1e-3, 32 epochs (~40k iters) | mIoU 0.419 |
| Depth | NYUv2 (32K train / 5K val) | 416 px, 4 hooked blocks at strides 4/8/16/32 | SILog, AdamW, lr 1e-4 cosine, DPT decoder with reflection-padded 3×3 convs | RMSE 0.537 |
| Detection | COCO train 2017 (117 K) | 768 px, 48×48 grid | FCOS targets, AdamW, lr 1e-4, 2 epochs | COCO val2017 mAP 27.3 (AP@50 49.6 · AR@100 43.2); RF100-VL AR@100 0.266 (20-domain subset) |
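For readers unfamiliar with the SILog objective used for the depth head, here is a minimal sketch. The square-root form and the weight `lam=0.85` are assumptions (one common variant of the scale-invariant log loss); the table does not state which variant or weight was used, and the function name is illustrative.

```python
import math

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss over pixels with valid (positive) ground-truth depth."""
    diffs = [math.log(p + eps) - math.log(t + eps)
             for p, t in zip(pred, target) if t > 0]
    if not diffs:
        return 0.0
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # Clamp at zero to absorb tiny negative values caused by floating-point rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))

target = [1.0, 2.0, 4.0]           # toy metric depths in meters
doubled = [2.0 * t for t in target]
print(silog_loss(target, target))           # 0.0 for a perfect prediction
print(silog_loss(doubled, target))          # > 0: a global scale error is only partly forgiven
print(silog_loss(doubled, target, lam=1.0)) # close to 0: fully scale-invariant
```

With `lam` below 1 the loss still penalises a global scale error, which matters when the output is meant to be metric depth.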
Files
cls_head.safetensors      Linear(384, 1000) classifier
seg_head.safetensors      BN + Conv2d(384, 150, 1)
depth_head.safetensors    DPT decoder over 4 hooked ViT-S blocks
det_head.safetensors      SplitTowerHead (feat_dim=384)
argus_lite.py             ArgusLite class
infer.py                  CLI dispatcher (6 subcommands)
Usage
from argus_lite import ArgusLite
model = ArgusLite.from_pretrained('phanerozoic/argus-lite').cuda().eval()
# Single-image multi-task inference
out = model.perceive('image.jpg')
# out['classification'] {'label': 'tabby', 'score': 0.62, 'top5': [...]}
# out['segmentation'] (512, 512) int array of ADE20K class ids
# out['depth'] (416, 416) float array, metric depth in meters
# out['detection'] [{'box': [x1,y1,x2,y2], 'score': 0.78, 'class_id': 17}, ...]
# out['correspondence'] None (needs a second image)
# Per-task methods
model.classify('image.jpg', top_k=5)
model.segment('street.jpg') # (512, 512)
model.depth('room.jpg') # (416, 416) metric meters
model.detect('photo.jpg', score_thresh=0.3) # list of dicts
model.correspond('a.jpg', 'b.jpg') # cosine-max patch matches
# Paired inference with correspondence populated
out = model.perceive('a.jpg', image_b='b.jpg')
# out['correspondence'] {'matches': (1024,), 'scores': (1024,), 'grid': 32}
CLI dispatcher in infer.py (six subcommands):
python infer.py classify cat.jpg
python infer.py segment street.jpg --save seg.png
python infer.py depth room.jpg --save depth.png
python infer.py detect photo.jpg --thresh 0.3
python infer.py correspond a.jpg b.jpg
python infer.py perceive image.jpg --second image2.jpg --save out/
Each head runs at its own training resolution: classifier 224 px (CLS token), segmentation 512 px (32×32 patch grid), depth 416 px (26×26 grid, DPT decoder over hooked blocks 2, 5, 8, 11), detection 768 px (48×48 grid). perceive() therefore does four backbone forward passes per image.
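The grid sizes follow directly from the input resolution and the patch size. The sketch below assumes a patch size of 16 px, which is inferred from the figures above (512 / 32 = 416 / 26 = 768 / 48 = 16) rather than stated; the dictionary and function names are illustrative.

```python
PATCH = 16  # assumption: inferred from 512 px -> 32x32 grid

TASK_RESOLUTION = {
    "classification": 224,
    "segmentation": 512,
    "depth": 416,
    "detection": 768,
}

def patch_grid(resolution, patch=PATCH):
    """Return (grid side, number of patch tokens) for a square input."""
    if resolution % patch:
        raise ValueError(f"{resolution} px is not a multiple of the {patch} px patch size")
    side = resolution // patch
    return side, side * side

for task, res in TASK_RESOLUTION.items():
    side, n_tokens = patch_grid(res)
    print(f"{task:15s} {res} px -> {side}x{side} grid, {n_tokens} patch tokens")
```

Since every task uses a different resolution, the patch-token count differs per task, which is why one cached forward pass cannot serve all four heads.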
Requires argus.py from phanerozoic/argus on sys.path for the DinoVisionTransformer and SplitTowerHead classes.
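One way to satisfy this requirement is to add a local copy of the Argus repository to `sys.path` before importing `argus_lite`. This is a sketch only: the directory `third_party/argus` is a hypothetical location for wherever argus.py was downloaded or cloned, and `add_to_sys_path` is an illustrative helper.

```python
import sys
from pathlib import Path

# Hypothetical location of a local copy of phanerozoic/argus containing argus.py.
ARGUS_DIR = Path("third_party/argus")

def add_to_sys_path(directory):
    """Prepend a directory to sys.path once and return its absolute path."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path

added = add_to_sys_path(ARGUS_DIR)
# With argus.py present in that directory, `from argus_lite import ArgusLite`
# should then be able to resolve DinoVisionTransformer and SplitTowerHead.
```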
Cross-domain detection benchmark (RF100-VL subset)
Same 20-domain class-agnostic AR@100 protocol as Argus, evaluated live through the ViT-S backbone at 768 px input.
| Model | Total params | Mean AR@100 |
|---|---|---|
| Argus+FCOS (ViT-B backbone) | 102.1 M | 0.251 |
| Argus-Lite (this model) | ~26.5 M | 0.266 |
| Argus+(current picker, ViT-B) | 89.0 M | 0.289 |
Per-domain numbers live in rf100vl_results.json.
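To make the metric concrete, the following is a simplified sketch of class-agnostic AR@100 for a single image. It assumes detections shaped like the output of `detect()` (dicts with `box` and `score`), ignores `class_id`, and averages recall over IoU thresholds 0.50 to 0.95. Unlike the full COCO protocol it does not enforce one-to-one matching between detections and ground truth, so it is an approximation and not the evaluation code behind the table.

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_recall(detections, gt_boxes, max_dets=100):
    """Class-agnostic average recall over IoU 0.50:0.05:0.95 using the top-scoring detections."""
    if not gt_boxes:
        return 0.0
    top = sorted(detections, key=lambda d: d["score"], reverse=True)[:max_dets]
    # Best IoU achieved for each ground-truth box by any kept detection.
    best_iou = [max((box_iou(d["box"], g) for d in top), default=0.0) for g in gt_boxes]
    thresholds = [t / 100 for t in range(50, 100, 5)]
    recalls = [sum(b >= t for b in best_iou) / len(gt_boxes) for t in thresholds]
    return sum(recalls) / len(recalls)

# Toy example: one ground-truth box, one detection overlapping it with IoU 0.72.
dets = [{"box": [0, 0, 10, 7.2], "score": 0.9, "class_id": 3}]
gts = [[0, 0, 10, 10]]
ar = average_recall(dets, gts)
print(ar)
```

In the toy example the detection clears the five thresholds from 0.50 to 0.70 and misses the five from 0.75 upward.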
Evaluation details
- Classifier: 79.13 % top-1 and 95.53 % top-5 on the 50K ImageNet val 2012 images, using the TensorFlow Models repo's synset-label mapping for ground truth. Top-1 is above the kNN baseline reported in the EUPE-ViT-S paper (78.2).
- Detection head: COCO val2017 mAP 0.273 (AP@50 0.496, AP@75 0.268, AR@100 0.432). See coco_val_eval.json for the full breakdown including per-size AP.
- Depth head is a DPT decoder reassembling the four hooked ViT-S block activations (blocks 2, 5, 8, 11) at strides [4, 8, 16, 32], followed by 4 FeatureFusion blocks with residual conv units and a 256-bin depth head. Trained on NYUv2 (32K train / 5K val held-out split) with SILog loss and AdamW lr 1e-4 on a cosine schedule. For reference, the same DPT decoder on EUPE-ViT-B (Argus, 4× larger backbone) reaches 0.391 RMSE on the equivalent split.
- Segmentation head is a linear probe at 5 epochs; the EUPE-ViT-S paper reports mIoU 0.466 at a much longer schedule.
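The top-1 and top-5 figures in the first bullet are standard top-k accuracies. A minimal sketch of that computation follows, using toy logits; the function name is illustrative and the snippet does not reproduce the actual evaluation script or the synset-label mapping.

```python
def topk_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(logits, labels):
        topk = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        if label in topk:
            hits += 1
    return hits / len(labels)

# Toy 3-class example; the real evaluation uses 1000 classes and 50K images.
logits = [
    [0.1, 0.7, 0.2],  # predicts class 1 (correct)
    [0.5, 0.3, 0.2],  # predicts class 0, true class 1 is second
    [0.2, 0.3, 0.5],  # predicts class 2 (correct)
]
labels = [1, 1, 2]
top1 = topk_accuracy(logits, labels, k=1)
top2 = topk_accuracy(logits, labels, k=2)
print(top1, top2)
```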
Source backbone
EUPE-ViT-S from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).