CleverHans-Evaluation

Scripts and a small in-domain test set to evaluate Qwen3-Omni variants on Video-MME, LVBench, and audio–video sync (custom JSONL).

What’s in the repo

| Path | Purpose |
| --- | --- |
| setup_env.sh | Installs Anaconda if conda is missing, then creates the video env (or CONDA_ENV) and pip-installs eval dependencies |
| setup_data.sh | Downloads all eval data: Video-MME, LVBench, sync eval media, and VGG-Sound Sync test clips from Rakancorle11/VGGSoundSync (to /opt/dlami/nvme) |
| COMMANDS.md | Copy-paste commands: data download, merge, eval per model/dataset |
| data/kto_training_data_v2_test.jsonl | In-domain sync test set (426 lines) |
| scripts/*.py | Download, merge, eval, and metrics helpers |
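The in-domain test set is plain JSONL, one example per line. A minimal reading sketch (the synthetic file and the id field below are illustrative assumptions, not the actual schema of kto_training_data_v2_test.jsonl):

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Tiny synthetic file standing in for data/kto_training_data_v2_test.jsonl:
tmp = Path(tempfile.mkdtemp()) / "demo.jsonl"
tmp.write_text('{"id": 0}\n{"id": 1}\n', encoding="utf-8")
print(len(load_jsonl(tmp)))  # prints 2: one record per line
```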

Quick start

```bash
git clone https://huggingface.co/Rakancorle11/CleverHans-Evaluation
cd CleverHans-Evaluation

huggingface-cli login   # if needed for gated models

chmod +x setup_env.sh setup_data.sh
bash setup_env.sh       # on a machine with no conda: downloads Anaconda to ~/anaconda3 first
source ~/anaconda3/etc/profile.d/conda.sh   # if this is your first shell after install
conda activate video

bash setup_data.sh      # Video-MME, LVBench, sync media, VGG-Sound Sync → /opt/dlami/nvme (~7 GB extra for the vggsoundsync tarball)

# Then follow COMMANDS.md — you choose which model on which benchmark.
```

Fresh OS notes: install wget before running (sudo apt install -y wget). System ffmpeg is recommended (sudo apt install -y ffmpeg). Override INSTALL_DIR / ANACONDA_VERSION / CUDA_INDEX_URL via environment variables if needed (see comments in setup_env.sh).

Models (HF IDs)

| Role | Model |
| --- | --- |
| Vanilla | Qwen/Qwen3-Omni-30B-A3B-Instruct |
| Full SFT (merge / eval base) | Rakancorle11/qwen3omni_full_sft_revised_thinker_key |
| DPO LoRA | Rakancorle11/Qwen3Omni-onpolicy-dpo-lora-w_audio_v2_8632, _v3_8632, _v4_8632, _v5_12075 |

Merge a LoRA adapter into a full checkpoint for vLLM with scripts/merge_adapter.py. For transformers-only Video-MME/LVBench runs you can pass --adapter instead of merging.

Data

  • Video-MME / LVBench / Sync eval data: all downloaded by bash setup_data.sh.
  • Sync eval media (original oops videos, random-shift videos, extracted audio): pulled from hasnat79/ual_bench, Rakancorle11/random_shift_video, Rakancorle11/extracted_audio into /opt/dlami/nvme/video_source/.
  • VGG-Sound Sync (out-of-domain sync eval): Rakancorle11/VGGSoundSync → /opt/dlami/nvme/vggsoundsync/ (videos/, vggsoundsync.csv, metadata.csv).
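After setup_data.sh finishes, a quick sanity check that the expected layout exists can save a failed eval run later. A sketch using the directory names listed above (check_data_root is a hypothetical helper, not one of the repo scripts):

```python
from pathlib import Path

# Subdirectories the Data section says setup_data.sh creates under the data root.
EXPECTED = ["video_source", "vggsoundsync"]

def check_data_root(root):
    """Return the list of expected subdirectories missing under root."""
    root = Path(root)
    return [d for d in EXPECTED if not (root / d).is_dir()]

missing = check_data_root("/opt/dlami/nvme")
if missing:
    print("missing:", missing)  # re-run setup_data.sh if anything is listed
```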

Default paths (convention)

Scripts assume a fixed split on every machine:

| What | Where |
| --- | --- |
| Benchmark videos, merged full models, and sync media (video_source/: original + shifted + audio) | /opt/dlami/nvme/... |
| Eval outputs (eval_results.jsonl, metrics.json, …) | /home/ubuntu/eval_results/videomme, .../lvbench, .../sync |

Override with --video-dir, --output-dir, --data-root if your layout differs.

Requirements

  • Strong GPU(s), ~200GB+ disk for benchmarks + merged weights (add ~7GB if you include VGG-Sound Sync via setup_data.sh)
  • vLLM: --tp must divide 20 (audio encoder heads); e.g. --tp 4, not 8
  • setup_env.sh uses CUDA 12.4 PyTorch wheels by default; override CUDA_INDEX_URL if needed
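The --tp constraint exists because tensor parallelism must split the audio encoder's 20 attention heads evenly across GPUs. A quick way to sanity-check a tensor-parallel size before launching vLLM (a standalone helper written for illustration, not part of the repo):

```python
AUDIO_HEADS = 20  # Qwen3-Omni audio encoder attention heads, per the note above

def valid_tp_sizes(num_heads=AUDIO_HEADS, max_gpus=8):
    """Tensor-parallel sizes that divide the head count and fit on the node."""
    return [tp for tp in range(1, max_gpus + 1) if num_heads % tp == 0]

print(valid_tp_sizes())  # prints [1, 2, 4, 5] on an 8-GPU node: --tp 8 is invalid
```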

Notes

  • Eval scripts resume from existing eval_results.jsonl.
  • In-domain sync: use --data-root so paths are not tied to /home/ubuntu/video_source.
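Resume works because each finished example lands in eval_results.jsonl as it completes, so a restart only needs the set of already-done IDs. The repo scripts' record schema isn't documented here; this sketch assumes an id field per line for illustration:

```python
import json
from pathlib import Path

def completed_ids(results_path):
    """Collect IDs already present in a results JSONL (empty set if absent)."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def pending(examples, results_path):
    """Filter out examples whose ID already appears in the results file."""
    done = completed_ids(results_path)
    return [ex for ex in examples if ex["id"] not in done]
```

With this shape, re-running an eval simply iterates over pending(...) and appends each new result line, leaving earlier work untouched.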