# AIST-87M GGUF
This repository contains GGUF quantizations of augmem/AIST-87M.
Base model: augmem/AIST-87M

Quantizations:

- AIST-87M_q8_0.gguf
- AIST-87M_q5_1.gguf
The source model is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads. It is the single-audio-encoder successor to
the earlier dual-audio-tower line and uses a single merged native mn20_as
EfficientAT audio encoder with no separate runtime LoRA pass.
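Either file can be fetched with the `huggingface_hub` client; a minimal download sketch:

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

# Fetch the higher-accuracy q8_0 file from this repository; swap the
# filename for AIST-87M_q5_1.gguf to get the smaller quantization.
local_path = hf_hub_download(
    repo_id="augmem/AIST-87M-GGUF",
    filename="AIST-87M_q8_0.gguf",
)
print(local_path)  # path to the cached GGUF file
```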
## Evaluation Scope
The quantized files correspond to the same release checkpoint and human-memory evaluation slice as the base repo.
| Embedding dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---|---|---|---|---|---|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
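The three rows score the same checkpoint at three output dimensions. Assuming the lower-dimension rows come from prefix truncation of the 1280-d embedding followed by re-normalization (Matryoshka-style; the exact mechanism is not spelled out here), a minimal sketch:

```python
import numpy as np

def truncate_and_renorm(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length.
    Assumption: lower-dim table rows use prefix truncation of the
    1280-d embedding; this repo does not document the exact mechanism."""
    out = emb[..., :dim]
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

full = np.random.randn(4, 1280)           # stand-in for 1280-d model output
emb_768 = truncate_and_renorm(full, 768)  # 768-d row of the table
emb_512 = truncate_and_renorm(full, 512)  # 512-d row of the table
```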
Primary metrics are `main_score` for text continuity tasks and NDCG@10 for
image/audio retrieval tasks.
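For retrieval with a single relevant item per query, NDCG@10 reduces to 1/log2(rank + 1) when the hit lands in the top 10. A self-contained sketch (the single-relevant-item simplification is an illustrative assumption, not a statement about the exact task construction):

```python
import numpy as np

def ndcg_at_10(ranked_ids: list[int], relevant_id: int) -> float:
    """NDCG@10 for one query with exactly one relevant item.
    With a single relevant document the ideal DCG is 1, so the score
    is 1 / log2(rank + 1) if the item is in the top 10, else 0."""
    top10 = ranked_ids[:10]
    if relevant_id not in top10:
        return 0.0
    rank = top10.index(relevant_id) + 1  # 1-based rank
    return 1.0 / np.log2(rank + 1)

print(ndcg_at_10([7, 3, 9], relevant_id=3))  # rank 2 -> 1/log2(3) ≈ 0.631
```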
## Runtime Footprint vs Dual-Audio Tower
The base AIST-87M release replaces the dual-audio tower's separate
EfficientAT + Whisper-Tiny branches with one merged native mn20_as
EfficientAT encoder.
| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
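The Delta column is the plain relative change against the dual-audio baseline; the table values can be reproduced directly:

```python
def rel_delta(new: float, old: float) -> float:
    """Relative change of the merged model against the dual-audio baseline."""
    return (new - old) / old * 100

print(f"{rel_delta(87_118_774, 95_315_959):.1f}%")  # -8.6%  loaded parameters
print(f"{rel_delta(32_193_126, 40_390_311):.1f}%")  # -20.3% audio-path parameters
print(f"{rel_delta(1_280, 2_304):.1f}%")            # -44.4% projection input width
```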
Exact-gate tradeoff at 1280d against the same dual-audio local baseline:
| Slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
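R@1 in these slices is retrieval recall at rank 1: the fraction of queries whose nearest neighbor in the other modality is the correct match. Assuming the "avg" columns average the two retrieval directions (e.g. audio->text and text->audio), a minimal sketch over unit-normalized paired embeddings:

```python
import numpy as np

def recall_at_1(queries: np.ndarray, gallery: np.ndarray) -> float:
    """R@1 for paired, unit-normalized embeddings: query row i should
    retrieve gallery row i as its nearest neighbor. The dot product of
    unit vectors is cosine similarity."""
    sims = queries @ gallery.T           # (n, n) similarity matrix
    top1 = sims.argmax(axis=1)           # best gallery match per query
    return float((top1 == np.arange(len(queries))).mean())

def r_at_1_avg(a: np.ndarray, b: np.ndarray) -> float:
    # Average of the two retrieval directions, e.g. audio->text and text->audio.
    return 0.5 * (recall_at_1(a, b) + recall_at_1(b, a))
```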
Reference PyTorch audio-stack throughput for the base release was measured on an NVIDIA L4 with synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is taken over 50 timed iterations after 20 warmup iterations, and excludes audio file decode, dataset download, and MTEB result serialization. A sketch of the timing loop follows the table below.
| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---|---|---|---|---|---|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
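A minimal sketch of the timing loop described above; `encode` is a placeholder callable, since this release does not document the model's Python API:

```python
import statistics
import time
import torch

def bench_audio_stack(encode, batch: int, warmup: int = 20, iters: int = 50,
                      clip_s: float = 10.0, sr: int = 32_000) -> dict:
    """Median wall time for the waveform -> audio encoder -> projection ->
    normalized embedding path on synthetic CPU waveforms, mirroring the
    methodology above. `encode` is a stand-in for the real model call."""
    wave = torch.randn(batch, int(clip_s * sr))  # synthetic 10 s, 32 kHz clips
    for _ in range(warmup):                      # warmup iterations, not timed
        encode(wave)
    times_ms = []
    for _ in range(iters):                       # timed iterations
        if torch.cuda.is_available():
            torch.cuda.synchronize()             # flush queued GPU work
        t0 = time.perf_counter()
        encode(wave)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    median_ms = statistics.median(times_ms)
    clips_per_s = batch * 1000.0 / median_ms
    return {"median_ms": median_ms,
            "clips_per_s": clips_per_s,
            "audio_s_per_s": clips_per_s * clip_s}
```

The throughput columns follow directly: clips/s = batch × 1000 / median ms, and audio-s/s is 10× clips/s for 10 s clips (e.g. 8 × 1000 / 16.46 ≈ 486 clips/s).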
The GGUF files are quantized distribution artifacts and were not separately
rebenchmarked in a GGUF runtime. Raw PyTorch benchmark output is included as
aist87m_vs_dual_audio_throughput_l4_20260504.json.
## Files
| File | Purpose |
|---|---|
| AIST-87M_q8_0.gguf | Higher-accuracy GGUF |
| AIST-87M_q5_1.gguf | Smaller GGUF |
| manifest.json | Release manifest |
| parameter_breakdown.json | Exact parameter accounting |
| aist87m_memory_slice_release_report.md | Human-memory slice report |
| aist87m_memory_slice_release_report.json | Machine-readable evaluation summary |
| aist87m_vs_dual_audio_throughput_l4_20260504.json | Reference L4 throughput benchmark vs dual-audio tower |
## Notes
- These are GGUF exports of the same merged-audio release artifact.
- This is not a generic MTEB/MIEB/MAEB leaderboard claim; the reported gate is selected for human-memory embedding workloads.