AIST-87M GGUF

This repository contains GGUF quantizations of augmem/AIST-87M.

Base model:

  • augmem/AIST-87M

Quantizations:

  • AIST-87M_q8_0.gguf
  • AIST-87M_q5_1.gguf

The source model is a compact audio + image + speech + text embedding model for human-memory augmentation workloads. It is the single-audio evolution of the earlier dual-audio tower line and uses a merged native mn20_as EfficientAT audio encoder with no separate runtime LoRA pass.

Evaluation Scope

The quantized files correspond to the same release checkpoint and human-memory evaluation slice as the base repo.

| Dim  | Tasks | Text continuity | Image recall | Audio recall | Overall |
|------|-------|-----------------|--------------|--------------|---------|
| 1280 | 8 / 8 | 0.763           | 0.425        | 0.104        | 0.349   |
| 768  | 8 / 8 | 0.762           | 0.424        | 0.104        | 0.349   |
| 512  | 8 / 8 | 0.762           | 0.424        | 0.104        | 0.349   |
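The near-identical scores at 1280, 768, and 512 dimensions suggest the embedding tolerates prefix truncation followed by re-normalization (Matryoshka-style evaluation). A minimal sketch under that assumption; `truncate_embedding` is an illustrative helper, not part of the release:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

# Stand-in for a model output: a unit-normalized 1280-d vector.
rng = np.random.default_rng(0)
full = rng.standard_normal(1280)
full /= np.linalg.norm(full)

for dim in (1280, 768, 512):
    emb = truncate_embedding(full, dim)
    print(dim, emb.shape[0], round(float(np.linalg.norm(emb)), 6))
```

Whether the release checkpoint was trained for truncation robustness is not stated here; this only shows the evaluation mechanics implied by the table.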

Primary metrics are main_score for text continuity tasks and NDCG@10 for image/audio retrieval tasks.
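For reference, NDCG@10 over a ranked list of graded relevances can be computed as below; this is a minimal textbook sketch, not the evaluation harness used for the release:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevances (binary or graded)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Single relevant item ranked 3rd within the top 10:
print(round(ndcg_at_k([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]), 4))  # -> 0.5
```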

Runtime Footprint vs Dual-Audio Tower

The base AIST-87M release replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny branches with one merged native mn20_as EfficientAT encoder.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio path parameters (incl. projection) | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
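The percentage deltas in the table follow directly from the raw counts; a quick check:

```python
def pct_delta(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100

print(round(pct_delta(87_118_774, 95_315_959), 1))  # loaded parameters   -> -8.6
print(round(pct_delta(32_193_126, 40_390_311), 1))  # audio path params   -> -20.3
print(round(pct_delta(1280, 2304), 1))              # projection width    -> -44.4
```

Note that the loaded-parameter and audio-path deltas share the same absolute difference (8,197,185 parameters), consistent with the removed Whisper branch living entirely in the audio path.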

Exact-gate tradeoff at 1280d against the same dual-audio local baseline:

| Slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
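R@1 here is standard paired cross-modal retrieval recall: the fraction of queries whose nearest gallery item is the ground-truth pair. A minimal sketch, assuming L2-normalized embeddings with matching row indices:

```python
import numpy as np

def recall_at_1(query_emb: np.ndarray, gallery_emb: np.ndarray) -> float:
    """Fraction of queries whose top-1 cosine neighbor in the gallery
    is the same-index ground-truth match (rows are unit-normalized)."""
    sims = query_emb @ gallery_emb.T            # cosine similarity matrix
    top1 = sims.argmax(axis=1)                  # best gallery index per query
    return float((top1 == np.arange(len(query_emb))).mean())

# Self-retrieval on distinct unit vectors is a perfect 1.0:
rng = np.random.default_rng(1)
g = rng.standard_normal((4, 8))
g /= np.linalg.norm(g, axis=1, keepdims=True)
print(recall_at_1(g, g))  # -> 1.0
```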

Reference PyTorch audio-stack throughput for the base release was measured on an NVIDIA L4 with synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is computed over 50 timed iterations after 20 warmup iterations. This excludes audio file decode, dataset download, and MTEB result serialization.

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---|---|---|---|---|---|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
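The timing protocol above (warmup, then median wall time over timed iterations; throughput = batch / median seconds, and audio-s/s = clips/s x 10 s clip length) can be reproduced with a generic harness. `run_audio_stack` below is a placeholder for the real waveform -> encoder -> projection -> normalize forward pass:

```python
import statistics
import time

def median_wall_ms(fn, *, warmup: int = 20, iters: int = 50) -> float:
    """Median wall-clock milliseconds of fn() after warmup calls."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

def run_audio_stack():
    # Placeholder workload; swap in the real audio-stack forward pass.
    sum(i * i for i in range(10_000))

batch, clip_seconds = 1, 10
ms = median_wall_ms(run_audio_stack)
clips_per_s = batch / (ms / 1e3)
print(f"{ms:.3f} ms/iter, {clips_per_s:.1f} clips/s, "
      f"{clips_per_s * clip_seconds:.0f} audio-s/s")
```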

The GGUF files are quantized distribution artifacts and were not separately rebenchmarked in a GGUF runtime. Raw PyTorch benchmark output is included as aist87m_vs_dual_audio_throughput_l4_20260504.json.

Files

| File | Purpose |
|---|---|
| AIST-87M_q8_0.gguf | Higher-accuracy GGUF |
| AIST-87M_q5_1.gguf | Smaller GGUF |
| manifest.json | Release manifest |
| parameter_breakdown.json | Exact parameter accounting |
| aist87m_memory_slice_release_report.md | Human-memory slice report |
| aist87m_memory_slice_release_report.json | Machine-readable evaluation summary |
| aist87m_vs_dual_audio_throughput_l4_20260504.json | Reference L4 throughput benchmark vs dual-audio tower |

Notes

  • These are GGUF exports of the same merged-audio release artifact.
  • This is not a generic MTEB/MIEB/MAEB leaderboard claim; the reported gate is selected for human-memory embedding workloads.