majentik: Model Garden
A curated collection of quantized open-weight models with inference-time KV-cache compression. Every model keeps upstream tokenizers and architectures; the only thing we change is how the weights and KV cache are stored during generation.
307 repositories · 12 families · 6 quantization lanes
What this garden is for
Running bigger models on the laptop you already have. Every release combines a standard weight-quantization format (GGUF, MLX, or AWQ) with one of two KV-cache compressors:
| Compressor | What it does | When to use |
|---|---|---|
| RotorQuant | Rotational isotropic KV-cache compression | Long-context work; 2–4× KV memory savings with minimal drift |
| TurboQuant | Turbo-variant targeted at throughput | Short-context, high-throughput serving |
Both compressors are applied at inference time. They compose with any weight-quantized file in this garden, so you can mix and match.
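The 2–4× figure is easiest to read against the raw size of an uncompressed KV cache. The sketch below uses the standard estimate (a key tensor and a value tensor per layer, each holding `n_kv_heads × head_dim` values per token). The model shape and the 8-bit and 4-bit widths are illustrative assumptions, not the actual storage format of RotorQuant or TurboQuant.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Bytes needed to hold K and V for one sequence.

    The leading 2 counts the separate key and value tensors.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# Hypothetical model shape, not taken from any specific repo in this garden.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 128 * 1024  # 128k tokens

baseline = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits_per_value=16)
for bits in (8, 4):  # illustrative compressed widths, not a real compressor's format
    compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits)
    print(f"{bits}-bit: {compressed / 2**30:.1f} GiB "
          f"vs {baseline / 2**30:.1f} GiB fp16 ({baseline / compressed:.0f}x smaller)")
```

For this made-up shape, an fp16 cache at 128k tokens takes 24 GiB on its own, which shows why long-context work is where KV compression pays off.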
Families
| Family | Repos | Notes |
|---|---|---|
| Gemma 4 | 127 | E2B / E4B / 26B-A4B / 31B, base + instruct |
| Nemotron | 41 | Nano 4B + Super (Thinking + Base) variants |
| Qwen 3.5 | 28 | 27B dense + 397B-A17B MoE |
| GPT-OSS | 28 | 20B and 120B |
| Voxtral | 24 | ASR + voice chat, 3 sub-families |
| MERaLiON | 30 | MERaLiON 2 (20 repos) and 3 (10 repos); ASR + multimodal |
| MiniMax M2.7 | 9 | Mixed quantization lanes |
| Mistral Small 4 | 8 | Instruct + reasoning |
| Leanstral | 8 | Distilled Mistral reasoning variant |
| DeepSeek V3.2 | 2 | Mostly upstream, KV-quant wrappers |
Quantization lanes
Every model lands in one or more of these lanes (the README for each repo specifies which):
- GGUF: Q2_K, Q3_K_M, IQ4_XS, Q4_K_M, Q5_K_M, Q8_0. Load with `llama.cpp`, `ollama`, `lmstudio`, or any GGUF-compatible runtime.
- MLX: 2-bit, 4-bit, 8-bit. Targets Apple Silicon. `pip install mlx-lm`, point it at the repo, done.
- AWQ: 4-bit and 8-bit. Targets CUDA GPUs with vLLM or `autoawq`.
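Each lane's bit width translates fairly directly into a rough file size. A minimal sketch, assuming size ≈ parameters × bits ÷ 8 and ignoring the per-group scales, zero-points, and unquantized layers (embeddings, norms) that make real GGUF, MLX, and AWQ files somewhat larger:

```python
def weight_gb(n_params: int, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in GB (10**9 bytes).

    Ignores quantization scales, zero-points, and unquantized layers,
    so real files come out somewhat larger than this estimate.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Example: a 27B-parameter model at each nominal MLX/AWQ bit width.
for bits in (2, 4, 8):
    print(f"27B params @ {bits}-bit ~ {weight_gb(27_000_000_000, bits):.1f} GB")
```

At a nominal 4 bits, 27B parameters come to about 13.5 GB of raw weights, which leaves roughly 10 GB of a 24 GB card for activations and the KV cache.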
Pick a starting point
- I have a MacBook with 16 GB RAM: try a 4-bit MLX variant of Gemma 4 E4B.
- I have a 24 GB GPU: try an AWQ-4bit Qwen3.5-27B with RotorQuant.
- I need 128k context on modest hardware: try any GGUF variant with RotorQuant.
- I want to compare: each repo card links to the corresponding upstream model, so perplexity drift is one `eval_hf.py` run away.
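Perplexity drift here means the gap between a quantized model's perplexity and its upstream parent's on the same text. The card does not describe what `eval_hf.py` does internally, so the following is only a generic sketch of the metric, assuming you already have per-token natural-log probabilities from each model (for example on WikiText-2). The toy numbers are made up.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def relative_drift(upstream_ppl: float, quantized_ppl: float) -> float:
    """Fractional perplexity increase of the quantized model over upstream."""
    return (quantized_ppl - upstream_ppl) / upstream_ppl


# Toy log-probs for the same four tokens under each model (made-up numbers).
upstream = [math.log(0.5)] * 4
quantized = [math.log(0.4)] * 4
up_ppl, q_ppl = perplexity(upstream), perplexity(quantized)
print(f"upstream ppl  = {up_ppl:.2f}")
print(f"quantized ppl = {q_ppl:.2f}")
print(f"drift         = {relative_drift(up_ppl, q_ppl):+.1%}")
```

Comparing relative rather than absolute drift keeps numbers comparable across families whose baseline perplexities differ.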
What is not in each repo
- Training data. These are quantization-only releases. The base model's training data is upstream's concern; we inherit the upstream license and disclaimers.
- Benchmarks for every axis. We publish per-lane perplexity on WikiText-2 plus family-specific evals (MMLU for reasoning, LibriSpeech for ASR). If you want an axis we haven't measured yet, open a discussion on the relevant repo.
Who we are
majentik publishes these as a side-project to keep our own fleet running cheaply on commodity hardware, and to close the gap between research releases and "can I actually run this tonight". Issues, quant requests, and benchmark PRs are welcome.
Contact
- Discussions: use the Community tab on the specific model repo.
- Hardware donations / compute partnerships: majentik on Hugging Face.
- Everything else: open a discussion on the closest repo and we'll see it.
Versioning
Each repo version tracks `upstream@base-model-revision × quant-lane`. When upstream ships a new base revision, we re-run the quant lane and bump the repo version. Card changes (docs, benchmarks) do not bump the version.
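The bump rule can be sketched as a small state transition. This is a hypothetical model for illustration only: the `RepoVersion` record, its field names, and the lane label `"gguf-q4_k_m"` are assumptions, and the quantization run itself is not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RepoVersion:
    """Hypothetical record of what a garden repo version is keyed on."""
    base_revision: str  # upstream base-model revision (e.g. a commit hash)
    quant_lane: str     # lane label; this naming is illustrative
    version: int


def on_upstream_release(current: RepoVersion, new_base_revision: str) -> RepoVersion:
    """Return the record after re-running the lane on a new upstream revision."""
    if new_base_revision == current.base_revision:
        return current  # same upstream revision: nothing to re-run, no bump
    return replace(current, base_revision=new_base_revision,
                   version=current.version + 1)


def on_card_edit(current: RepoVersion) -> RepoVersion:
    """Docs or benchmark edits to the card leave the version untouched."""
    return current


v1 = RepoVersion(base_revision="abc123", quant_lane="gguf-q4_k_m", version=1)
v2 = on_upstream_release(v1, "def456")
print(v1)
print(v2)
print(on_card_edit(v2) == v2)
```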
License
Each repo inherits the base model's license, not this
organization-level license. Check the license field in the
repository's card before deploying.
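On Hugging Face, a model card keeps this metadata in a YAML front-matter block at the top of its `README.md`. As a minimal stdlib-only sketch, you could pull the field out of a downloaded card like this. It uses a simple line scan rather than a real YAML parser, so nested or multi-line values are not handled. The example card and its `base_model` value are made up.

```python
from typing import Optional


def card_license(readme_text: str) -> Optional[str]:
    """Return the top-level `license:` value from a card's YAML front matter.

    Line-based scan, not a full YAML parser: handles the common
    `license: apache-2.0` form and ignores nested or multi-line values.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no front-matter block at the top
    for line in lines[1:]:
        if line.strip() == "---":
            break  # reached the end of the front matter
        if line.startswith("license:"):
            value = line.split(":", 1)[1].strip().strip("'\"")
            return value or None
    return None


card = """---
license: apache-2.0
base_model: some-org/some-model
---
# Model card body
"""
print(card_license(card))
```

For production use, the `huggingface_hub` library's `ModelCard.load(repo_id).data` exposes the parsed metadata and avoids hand-parsing YAML.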