# AIST-87M GGUF
This repository contains GGUF quantizations of augmem/AIST-87M.
Base model: augmem/AIST-87M

Quantizations:

- AIST-87M_q8_0.gguf
- AIST-87M_q5_1.gguf
The source model is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads. It is the single-audio-encoder successor to
the earlier dual-audio-tower line and uses a single merged native mn20_as
EfficientAT audio encoder with no separate runtime LoRA pass.
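Either file can be fetched with the `huggingface_hub` client; a minimal download sketch:

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

# Fetch the higher-accuracy q8_0 file from this repository; swap the
# filename for AIST-87M_q5_1.gguf to get the smaller quantization.
local_path = hf_hub_download(
    repo_id="augmem/AIST-87M-GGUF",
    filename="AIST-87M_q8_0.gguf",
)
print(local_path)  # path to the cached GGUF file
```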
## Evaluation Scope
The quantized files correspond to the same release checkpoint and human-memory evaluation slice as the base repo.
| Embedding dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---|---|---|---|---|---|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
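The three rows score the same checkpoint at three output dimensions. Assuming the lower-dimension rows come from prefix truncation of the 1280-d embedding followed by re-normalization (Matryoshka-style; the exact mechanism is not spelled out here), a minimal sketch:

```python
import numpy as np

def truncate_and_renorm(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length.
    Assumption: lower-dim table rows use prefix truncation of the
    1280-d embedding; this repo does not document the exact mechanism."""
    out = emb[..., :dim]
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

full = np.random.randn(4, 1280)           # stand-in for 1280-d model output
emb_768 = truncate_and_renorm(full, 768)  # 768-d row of the table
emb_512 = truncate_and_renorm(full, 512)  # 512-d row of the table
```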
Primary metrics are `main_score` for text continuity tasks and NDCG@10 for
image/audio retrieval tasks.
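For retrieval with a single relevant item per query, NDCG@10 reduces to 1/log2(rank + 1) when the hit lands in the top 10. A self-contained sketch (the single-relevant-item simplification is an illustrative assumption, not a statement about the exact task construction):

```python
import numpy as np

def ndcg_at_10(ranked_ids: list[int], relevant_id: int) -> float:
    """NDCG@10 for one query with exactly one relevant item.
    With a single relevant document the ideal DCG is 1, so the score
    is 1 / log2(rank + 1) if the item is in the top 10, else 0."""
    top10 = ranked_ids[:10]
    if relevant_id not in top10:
        return 0.0
    rank = top10.index(relevant_id) + 1  # 1-based rank
    return 1.0 / np.log2(rank + 1)

print(ndcg_at_10([7, 3, 9], relevant_id=3))  # rank 2 -> 1/log2(3) ≈ 0.631
```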
## Runtime Footprint vs Dual-Audio Tower
The base AIST-87M release replaces the dual-audio tower's separate
EfficientAT + Whisper-Tiny branches with one merged native mn20_as
EfficientAT encoder.
| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
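The Delta column is the plain relative change against the dual-audio baseline; the table values can be reproduced directly:

```python
def rel_delta(new: float, old: float) -> float:
    """Relative change of the merged model against the dual-audio baseline."""
    return (new - old) / old * 100

print(f"{rel_delta(87_118_774, 95_315_959):.1f}%")  # -8.6%  loaded parameters
print(f"{rel_delta(32_193_126, 40_390_311):.1f}%")  # -20.3% audio-path parameters
print(f"{rel_delta(1_280, 2_304):.1f}%")            # -44.4% projection input width
```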
Exact-gate tradeoff at 1280d against the same dual-audio local baseline:
| Slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
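R@1 in these slices is retrieval recall at rank 1: the fraction of queries whose nearest neighbor in the other modality is the correct match. Assuming the "avg" columns average the two retrieval directions (e.g. audio->text and text->audio), a minimal sketch over unit-normalized paired embeddings:

```python
import numpy as np

def recall_at_1(queries: np.ndarray, gallery: np.ndarray) -> float:
    """R@1 for paired, unit-normalized embeddings: query row i should
    retrieve gallery row i as its nearest neighbor. The dot product of
    unit vectors is cosine similarity."""
    sims = queries @ gallery.T           # (n, n) similarity matrix
    top1 = sims.argmax(axis=1)           # best gallery match per query
    return float((top1 == np.arange(len(queries))).mean())

def r_at_1_avg(a: np.ndarray, b: np.ndarray) -> float:
    # Average of the two retrieval directions, e.g. audio->text and text->audio.
    return 0.5 * (recall_at_1(a, b) + recall_at_1(b, a))
```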
Reference PyTorch audio-stack throughput for the base release was measured on an NVIDIA L4 with synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is taken over 50 timed iterations after 20 warmup iterations, and excludes audio file decode, dataset download, and MTEB result serialization. A sketch of the timing loop follows the table below.
| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---|---|---|---|---|---|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
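A minimal sketch of the timing loop described above; `encode` is a placeholder callable, since this release does not document the model's Python API:

```python
import statistics
import time
import torch

def bench_audio_stack(encode, batch: int, warmup: int = 20, iters: int = 50,
                      clip_s: float = 10.0, sr: int = 32_000) -> dict:
    """Median wall time for the waveform -> audio encoder -> projection ->
    normalized embedding path on synthetic CPU waveforms, mirroring the
    methodology above. `encode` is a stand-in for the real model call."""
    wave = torch.randn(batch, int(clip_s * sr))  # synthetic 10 s, 32 kHz clips
    for _ in range(warmup):                      # warmup iterations, not timed
        encode(wave)
    times_ms = []
    for _ in range(iters):                       # timed iterations
        if torch.cuda.is_available():
            torch.cuda.synchronize()             # flush queued GPU work
        t0 = time.perf_counter()
        encode(wave)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    median_ms = statistics.median(times_ms)
    clips_per_s = batch * 1000.0 / median_ms
    return {"median_ms": median_ms,
            "clips_per_s": clips_per_s,
            "audio_s_per_s": clips_per_s * clip_s}
```

The throughput columns follow directly: clips/s = batch × 1000 / median ms, and audio-s/s is 10× clips/s for 10 s clips (e.g. 8 × 1000 / 16.46 ≈ 486 clips/s).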
The GGUF files are quantized distribution artifacts and were not separately
rebenchmarked in a GGUF runtime. Raw PyTorch benchmark output is included as
aist87m_vs_dual_audio_throughput_l4_20260504.json.
## Files
| File | Purpose |
|---|---|
| AIST-87M_q8_0.gguf | Higher-accuracy GGUF |
| AIST-87M_q5_1.gguf | Smaller GGUF |
| manifest.json | Release manifest |
| parameter_breakdown.json | Exact parameter accounting |
| aist87m_memory_slice_release_report.md | Human-memory slice report |
| aist87m_memory_slice_release_report.json | Machine-readable evaluation summary |
| aist87m_vs_dual_audio_throughput_l4_20260504.json | Reference L4 throughput benchmark vs dual-audio tower |
## Notes
- These are GGUF exports of the same merged-audio release artifact.
- This is not a generic MTEB/MIEB/MAEB leaderboard claim; the reported gate is selected for human-memory embedding workloads.