CleverHans-Evaluation

Scripts and a small in-domain test set to evaluate Qwen3-Omni variants on Video-MME, LVBench, and audio–video sync (custom JSONL).

What’s in the repo

| Path | Purpose |
| --- | --- |
| setup_env.sh | Installs Anaconda if conda is missing, then creates the video env (or CONDA_ENV) and pip-installs eval dependencies |
| setup_data.sh | Downloads all eval data: Video-MME, LVBench, sync eval media, and VGG-Sound Sync test clips from Rakancorle11/VGGSoundSync (to /opt/dlami/nvme) |
| COMMANDS.md | Copy-paste commands: data download, merge, eval per model/dataset |
| data/kto_training_data_v2_test.jsonl | In-domain sync test set (426 lines) |
| scripts/*.py | Download, merge, eval, and metrics helpers |
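The in-domain test set is plain JSONL, one example per line. A minimal reading sketch (the synthetic file and the id field below are illustrative assumptions, not the actual schema of kto_training_data_v2_test.jsonl):

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Tiny synthetic file standing in for data/kto_training_data_v2_test.jsonl:
tmp = Path(tempfile.mkdtemp()) / "demo.jsonl"
tmp.write_text('{"id": 0}\n{"id": 1}\n', encoding="utf-8")
print(len(load_jsonl(tmp)))  # prints 2: one record per line
```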

Quick start

```bash
git clone https://huggingface.co/Rakancorle11/CleverHans-Evaluation
cd CleverHans-Evaluation

huggingface-cli login   # if needed for gated models

chmod +x setup_env.sh setup_data.sh
bash setup_env.sh       # on a machine with no conda: downloads Anaconda to ~/anaconda3 first
source ~/anaconda3/etc/profile.d/conda.sh   # if this is your first shell after install
conda activate video

bash setup_data.sh      # Video-MME, LVBench, sync media, VGG-Sound Sync → /opt/dlami/nvme (~7 GB extra for the vggsoundsync tarball)

# Then follow COMMANDS.md — you choose which model on which benchmark.
```

Fresh OS notes: install wget before running (sudo apt install -y wget). System ffmpeg is recommended (sudo apt install -y ffmpeg). Override INSTALL_DIR / ANACONDA_VERSION / CUDA_INDEX_URL via environment variables if needed (see comments in setup_env.sh).

Models (HF IDs)

| Role | Model |
| --- | --- |
| Vanilla | Qwen/Qwen3-Omni-30B-A3B-Instruct |
| Full SFT (merge / eval base) | Rakancorle11/qwen3omni_full_sft_revised_thinker_key |
| DPO LoRA | Rakancorle11/Qwen3Omni-onpolicy-dpo-lora-w_audio_v2_8632, _v3_8632, _v4_8632, _v5_12075 |

Merge a LoRA adapter into a full checkpoint for vLLM with scripts/merge_adapter.py. For transformers-only Video-MME/LVBench runs you can pass --adapter instead of merging.

Data

  • Video-MME / LVBench / Sync eval data: all downloaded by bash setup_data.sh.
  • Sync eval media (original oops videos, random-shift videos, extracted audio): pulled from hasnat79/ual_bench, Rakancorle11/random_shift_video, Rakancorle11/extracted_audio into /opt/dlami/nvme/video_source/.
  • VGG-Sound Sync (out-of-domain sync eval): Rakancorle11/VGGSoundSync → /opt/dlami/nvme/vggsoundsync/ (videos/, vggsoundsync.csv, metadata.csv).
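After setup_data.sh finishes, a quick sanity check that the expected layout exists can save a failed eval run later. A sketch using the directory names listed above (check_data_root is a hypothetical helper, not one of the repo scripts):

```python
from pathlib import Path

# Subdirectories the Data section says setup_data.sh creates under the data root.
EXPECTED = ["video_source", "vggsoundsync"]

def check_data_root(root):
    """Return the list of expected subdirectories missing under root."""
    root = Path(root)
    return [d for d in EXPECTED if not (root / d).is_dir()]

missing = check_data_root("/opt/dlami/nvme")
if missing:
    print("missing:", missing)  # re-run setup_data.sh if anything is listed
```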

Default paths (convention)

Scripts assume a fixed split on every machine:

| What | Where |
| --- | --- |
| Benchmark videos, merged full models, and sync media (video_source/: original + shifted + audio) | /opt/dlami/nvme/... |
| Eval outputs (eval_results.jsonl, metrics.json, …) | /home/ubuntu/eval_results/videomme, .../lvbench, .../sync |

Override with --video-dir, --output-dir, --data-root if your layout differs.

Requirements

  • Strong GPU(s), ~200GB+ disk for benchmarks + merged weights (add ~7GB if you include VGG-Sound Sync via setup_data.sh)
  • vLLM: --tp must divide 20 (audio encoder heads); e.g. --tp 4, not 8
  • setup_env.sh uses CUDA 12.4 PyTorch wheels by default; override CUDA_INDEX_URL if needed
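The --tp constraint exists because tensor parallelism must split the audio encoder's 20 attention heads evenly across GPUs. A quick way to sanity-check a tensor-parallel size before launching vLLM (a standalone helper written for illustration, not part of the repo):

```python
AUDIO_HEADS = 20  # Qwen3-Omni audio encoder attention heads, per the note above

def valid_tp_sizes(num_heads=AUDIO_HEADS, max_gpus=8):
    """Tensor-parallel sizes that divide the head count and fit on the node."""
    return [tp for tp in range(1, max_gpus + 1) if num_heads % tp == 0]

print(valid_tp_sizes())  # prints [1, 2, 4, 5] on an 8-GPU node: --tp 8 is invalid
```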

Notes

  • Eval scripts resume from existing eval_results.jsonl.
  • In-domain sync: use --data-root so paths are not tied to /home/ubuntu/video_source.
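Resume works because each finished example lands in eval_results.jsonl as it completes, so a restart only needs the set of already-done IDs. The repo scripts' record schema isn't documented here; this sketch assumes an id field per line for illustration:

```python
import json
from pathlib import Path

def completed_ids(results_path):
    """Collect IDs already present in a results JSONL (empty set if absent)."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def pending(examples, results_path):
    """Filter out examples whose ID already appears in the results file."""
    done = completed_ids(results_path)
    return [ex for ex in examples if ex["id"] not in done]
```

With this shape, re-running an eval simply iterates over pending(...) and appends each new result line, leaving earlier work untouched.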