cahlen posted an update 3 days ago
So I built a multimodal video annotation pipeline in my spare time, as you do.

corpus-mill turns any long-form video with people on camera into a time-aligned event corpus across audio, vision, OCR, faces, brand observations, music, and clip-worthy moments. Runs entirely on local GPU because — and I cannot stress this enough — your footage has no business being on someone else's servers.
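To make "time-aligned event corpus" concrete, here's roughly what a single event row looks like. Purely illustrative; the field names below are shorthand, not the exact corpus-mill schema:

```python
# Illustrative sketch of one time-aligned event; field names are
# shorthand for this post, not the exact corpus-mill schema.
event = {
    "video_id": "episode_042",
    "modality": "ocr",                 # asr, face, brand, music, clip, ...
    "t_start": 812.4,                  # seconds from video start
    "t_end": 814.1,
    "payload": {"text": "LIMITED TIME OFFER", "bbox": [104, 52, 398, 90]},
    "provenance": {"stage": "vision_ocr", "model": "Qwen2.5-VL-7B", "frame": 24372},
}
```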

The honest origin: I needed real multimodal supervision data, and the public corpora are weirdly thin once you need per-frame / per-speaker / per-second labels with provenance, so I built one. Then it grew. Then I looked up and it was 30K LOC and ~30 stages and I thought, ok, maybe other people would want this.

Stack is the usual suspects: Whisper-large-v3 (faster-whisper), pyannote-3.1 (which secretly drags in 433 NeMo modules — surprise!), Qwen2.5-VL-7B for vision/OCR/shoppable detection, dlib + YuNet for faces, qwen2.5:7b / qwen3:14b via local Ollama for the LLM passes, chromaprint + PDQ for fingerprinting. Outputs as Parquet + SQLite. Apache 2.0.
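If you just want to poke at the outputs, the Parquet + SQLite pair is consumable with plain pandas and sqlite3. File and column names here are illustrative, not the shipped schema:

```python
# Illustrative only: file names and columns are assumptions, not the
# shipped corpus-mill schema. Shows consuming the Parquet + SQLite pair.
import sqlite3
import pandas as pd

# Per-event rows from an assumed Parquet file.
events = pd.read_parquet("output/events.parquet")

# Speaker turns from an assumed SQLite index.
con = sqlite3.connect("output/corpus.db")
turns = pd.read_sql_query("SELECT speaker, t_start, t_end FROM speaker_turns", con)
con.close()

# Example selection: OCR events that fall inside the first speaker turn.
t0, t1 = turns.loc[0, "t_start"], turns.loc[0, "t_end"]
ocr_in_turn = events[
    (events["modality"] == "ocr")
    & (events["t_start"] >= t0)
    & (events["t_end"] <= t1)
]
print(len(ocr_in_turn), "OCR events in that turn")
```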

There's a Docker compose that works, after I spent a day discovering that faster-whisper wants CUDA 12 cuBLAS while pyannote 4 wants CUDA 13, and the answer is "install both, point LD_LIBRARY_PATH at the cu12 wheels, ship it." That's now baked in. You're welcome.
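If you're wiring that up outside the compose file, the general shape of the trick is a launcher that points LD_LIBRARY_PATH at the cu12 wheel lib dirs and re-execs. Sketch only: the library dirs come from the nvidia-cublas-cu12 / nvidia-cudnn-cu12 pip packages, and the entrypoint module is a placeholder, not corpus-mill's actual CLI:

```python
# Sketch of the LD_LIBRARY_PATH workaround, not corpus-mill's entrypoint.
# Prepend the CUDA 12 wheel library dirs (nvidia-cublas-cu12 /
# nvidia-cudnn-cu12 pip packages), then re-exec so the dynamic loader
# actually sees them. "run_pipeline" is a placeholder module name.
import os
import sys

import nvidia.cublas.lib
import nvidia.cudnn.lib

cu12_dirs = [
    os.path.dirname(nvidia.cublas.lib.__file__),
    os.path.dirname(nvidia.cudnn.lib.__file__),
]
existing = os.environ.get("LD_LIBRARY_PATH", "")
os.environ["LD_LIBRARY_PATH"] = ":".join(cu12_dirs + ([existing] if existing else []))

# LD_LIBRARY_PATH is read at process start, so re-exec with the new env
# instead of importing faster-whisper from this same process.
os.execvpe(sys.executable, [sys.executable, "-m", "run_pipeline"], os.environ)
```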

Spare-time project: bugs are real, and fixing them for your specific footage is on you. If you're training multimodal models and want a corpus pipeline you fully control on-prem, this might save you months. If not, the README is at least mildly entertaining.

https://github.com/cahlen/corpus-mill
