# CleverHans-Evaluation
Scripts and a small in-domain test set to evaluate Qwen3-Omni variants on Video-MME, LVBench, and audio–video sync (custom JSONL).
## What’s in the repo
| Path | Purpose |
|---|---|
| `setup_env.sh` | Installs Anaconda if `conda` is missing, then creates the `video` env (or `CONDA_ENV`) and pip-installs eval dependencies |
| `setup_data.sh` | Downloads all eval data: Video-MME, LVBench, sync eval media, and VGG-Sound Sync test clips from `Rakancorle11/VGGSoundSync` (to `/opt/dlami/nvme`) |
| `COMMANDS.md` | Copy-paste commands: data download, merge, and eval per model/dataset |
| `data/kto_training_data_v2_test.jsonl` | In-domain sync test set (426 lines) |
| `scripts/*.py` | Download, merge, eval, and metrics helpers |
## Quick start
```bash
git clone https://huggingface.co/Rakancorle11/CleverHans-Evaluation
cd CleverHans-Evaluation
huggingface-cli login                        # if needed for gated models
chmod +x setup_env.sh setup_data.sh
bash setup_env.sh                            # on a machine with no conda: downloads Anaconda to ~/anaconda3 first
source ~/anaconda3/etc/profile.d/conda.sh    # if this is your first shell after install
conda activate video
bash setup_data.sh                           # Video-MME, LVBench, sync media, VGG-Sound Sync → /opt/dlami/nvme (~7 GB extra for the vggsoundsync tarball)
# Then follow COMMANDS.md: you choose which model to run on which benchmark.
```
**Fresh OS notes:** install `wget` before running (`sudo apt install -y wget`). System `ffmpeg` is recommended (`sudo apt install -y ffmpeg`). Override `INSTALL_DIR`, `ANACONDA_VERSION`, or `CUDA_INDEX_URL` via environment variables if needed (see the comments in `setup_env.sh`).
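As a quick preflight, the system-tool requirements above can be checked from Python; a minimal sketch (the helper name and tool list are illustrative, not part of the repo's scripts):

```python
import shutil

def missing_tools(tools=("wget", "ffmpeg")):
    """Return the subset of required system tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        # Suggest the same apt command the notes above recommend
        print("Install first: sudo apt install -y " + " ".join(missing))
```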
## Models (HF IDs)
| Role | Model |
|---|---|
| Vanilla | `Qwen/Qwen3-Omni-30B-A3B-Instruct` |
| Full SFT (merge / eval base) | `Rakancorle11/qwen3omni_full_sft_revised_thinker_key` |
| DPO LoRA | `Rakancorle11/Qwen3Omni-onpolicy-dpo-lora-w_audio_v2_8632` (also `_v3_8632`, `_v4_8632`, `_v5_12075`) |
Merge a LoRA adapter into a full checkpoint for vLLM with `scripts/merge_adapter.py`. For transformers-only Video-MME/LVBench runs you can pass `--adapter` instead of merging.
## Data
- Video-MME / LVBench / sync eval data: all downloaded by `bash setup_data.sh`.
- Sync eval media (original oops videos, random-shift videos, extracted audio): pulled from `hasnat79/ual_bench`, `Rakancorle11/random_shift_video`, and `Rakancorle11/extracted_audio` into `/opt/dlami/nvme/video_source/`.
- VGG-Sound Sync (out-of-domain sync eval): `Rakancorle11/VGGSoundSync` → `/opt/dlami/nvme/vggsoundsync/` (`videos/`, `vggsoundsync.csv`, `metadata.csv`).
## Default paths (convention)
Scripts assume a fixed split on every machine:
| What | Where |
|---|---|
| Benchmark videos, merged full models, sync `video_source/` (original + shifted + audio) | `/opt/dlami/nvme/...` |
| Eval outputs (`eval_results.jsonl`, `metrics.json`, …) | `/home/ubuntu/eval_results/videomme`, `.../lvbench`, `.../sync` |
Override with `--video-dir`, `--output-dir`, or `--data-root` if your layout differs.
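The split above (bulky inputs on the NVMe volume, small outputs under the home directory) can be expressed as a small helper; this is a sketch only, and the function and constant names are illustrative rather than taken from `scripts/*.py`:

```python
from pathlib import Path

# Illustrative defaults mirroring the convention table above.
DATA_ROOT = Path("/opt/dlami/nvme")               # bulky inputs: videos, merged models
RESULTS_ROOT = Path("/home/ubuntu/eval_results")  # small outputs: jsonl, metrics

def default_output_dir(benchmark: str) -> Path:
    """Per-benchmark results directory (videomme / lvbench / sync)."""
    if benchmark not in {"videomme", "lvbench", "sync"}:
        raise ValueError(f"unknown benchmark: {benchmark}")
    return RESULTS_ROOT / benchmark

print(default_output_dir("sync"))  # /home/ubuntu/eval_results/sync
```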
## Requirements
- Strong GPU(s) and ~200 GB+ of disk for benchmarks plus merged weights (add ~7 GB if you include VGG-Sound Sync via `setup_data.sh`)
- vLLM: `--tp` must divide 20 (audio encoder heads); e.g. `--tp 4`, not `--tp 8`
- `setup_env.sh` uses CUDA 12.4 PyTorch wheels by default; override `CUDA_INDEX_URL` if needed
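The `--tp` constraint can be sanity-checked before launching vLLM; a hedged sketch (the helper name is made up, only the divides-20 rule comes from this README):

```python
# The audio encoder has 20 attention heads, so the tensor-parallel size
# must divide 20 evenly; on an 8-GPU node that rules out --tp 8.
AUDIO_ENCODER_HEADS = 20

def valid_tp_sizes(num_gpus: int) -> list:
    """Tensor-parallel sizes usable on this node: divisors of 20 up to num_gpus."""
    return [tp for tp in range(1, num_gpus + 1) if AUDIO_ENCODER_HEADS % tp == 0]

print(valid_tp_sizes(8))  # [1, 2, 4, 5]
```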
## Notes
- Eval scripts resume from an existing `eval_results.jsonl`.
- In-domain sync: use `--data-root` so paths are not tied to `/home/ubuntu/video_source`.
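The resume behaviour can be sketched as: collect the ids already present in `eval_results.jsonl` and skip those examples on the next run. This is an assumption-laden sketch; in particular the `"id"` field name is a guess, not necessarily what the eval scripts write:

```python
import json
from pathlib import Path

def completed_ids(results_path: Path) -> set:
    """Ids already evaluated; empty set if the results file doesn't exist yet."""
    done = set()
    if results_path.exists():
        with results_path.open() as f:
            for line in f:
                if line.strip():  # tolerate blank lines in the JSONL
                    done.add(json.loads(line)["id"])
    return done
```

Since results are appended one JSON object per line, a killed run can simply be restarted and will skip everything already recorded.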