llamatelemetry Binaries v1.2.0

Pre-compiled CUDA binaries for llamatelemetry, a CUDA-first OpenTelemetry Python SDK for LLM inference observability using the gen_ai.* semantic conventions.

📦 Available Binaries

Version: v1.2.0
File: llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz
Size: 1.4 GB
Target Platform: Kaggle 2× Tesla T4, CUDA 12.5
SHA256: 4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c

🚀 Auto-Download (Recommended)

These binaries are downloaded automatically when you install and first import llamatelemetry:

# Install on Kaggle with GPU T4 × 2
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0

On the first import of llamatelemetry, the package will (see the sketch after this list):

  1. Detect your GPU (Tesla T4 required, SM 7.5+)
  2. Verify CUDA availability via llamatelemetry.require_cuda()
  3. Check for cached binaries in ~/.cache/llamatelemetry/
  4. Download from the Hugging Face CDN (this repo – fast, ~2–5 MB/s)
  5. Fallback to GitHub Releases if needed
  6. Verify SHA256 checksum: 4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c
  7. Extract the 13 binaries and supporting libraries into the package directory
  8. Configure environment variables
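
The sketch below mirrors steps 3–5 and 7 by hand: reuse the cache, download from the Hugging Face CDN, and extract. The cache path and file name come from this page; the variable names and extraction target are illustrative, not the package's internal code. Checksum verification is covered under "Verify Checksum" below.

import pathlib
import tarfile

from huggingface_hub import hf_hub_download

CACHE_DIR = pathlib.Path.home() / ".cache" / "llamatelemetry"

# Steps 3–5: hf_hub_download reuses a cached copy if one exists,
# otherwise it fetches the archive from the Hugging Face CDN.
archive = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-binaries",
    filename="v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz",
    cache_dir=str(CACHE_DIR),
)

# Step 7: unpack the binaries and shared libraries
# (illustrative target directory; the SDK extracts into its own package directory)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(CACHE_DIR / "v1.2.0-extracted")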

📥 Manual Download

Using huggingface_hub

from huggingface_hub import hf_hub_download

binary_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-binaries",
    filename="v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz",
    cache_dir="/kaggle/working/cache"
)
print(f"Downloaded to: {binary_path}")

Direct Download URL

wget https://huggingface.co/waqasm86/llamatelemetry-binaries/resolve/main/v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz

Verify Checksum

# Download checksum file
wget https://huggingface.co/waqasm86/llamatelemetry-binaries/resolve/main/v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz.sha256

# Verify
sha256sum -c llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz.sha256
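
If sha256sum is not available, the same check can be done from Python with the standard library. The expected digest below is the one published on this page; the archive is assumed to be in the current directory.

import hashlib

EXPECTED = "4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c"

# Hash the archive in 1 MiB chunks to avoid loading 1.4 GB into memory
sha = hashlib.sha256()
with open("llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

print("OK" if sha.hexdigest() == EXPECTED else "CHECKSUM MISMATCH")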

📊 Build Information

SDK Version: 1.2.0
CUDA Version: 12.5
Compute Capability: SM 7.5 (Tesla T4)
llama.cpp Version: b7760 (commit 388ce82)
Build Date: 2026-02-20
Target Platform: Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM)
Binaries Included: 13 (llama-server, llama-cli, llama-bench, etc.)
Libraries: CUDA shared libraries + dependencies

🔧 What's Inside

The binary bundle contains:

Executables (13 binaries)

  • llama-server – OpenAI-compatible API server
  • llama-cli – CLI inference tool
  • llama-bench – Benchmarking utility
  • llama-quantize – Model quantization tool
  • llama-tokenize – Tokenizer utility
  • And 8 more utilities

Shared Libraries

  • CUDA 12.5 shared libraries
  • cuBLAS, NCCL dependencies
  • llama.cpp runtime libraries
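
To confirm what a downloaded bundle contains before extracting it, listing the archive members is enough. A small sketch, assuming the tarball sits in the current directory:

import tarfile

ARCHIVE = "llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz"

with tarfile.open(ARCHIVE, "r:gz") as tar:
    names = [m.name for m in tar.getmembers() if m.isfile()]

# Rough split: shared libraries vs. executables
libs = [n for n in names if ".so" in n]
bins = [n for n in names if ".so" not in n]

print(f"{len(bins)} executables, {len(libs)} shared libraries")
for name in sorted(bins):
    print(" ", name)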

🆕 What's New in v1.2.0

  • GenAI semantic conventions – gen_ai.* OTel attributes (legacy llm.* attributes removed)
  • GenAI metrics – 5 histogram instruments: token usage, operation duration, TTFT, TPOT, active requests
  • CUDA hardening – require_cuda(), detect_cuda(), check_gpu_compatibility() (see the sketch after this list)
  • strict_operation_names – validates gen_ai.operation.name values
  • PermissionError safety – handles restricted PATH entries in CI/container environments
  • BenchmarkRunner – structured latency-phase benchmarking (prefill/decode/TTFT/TPOT)
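
A minimal sketch of the CUDA hardening helpers named above. Only the function names are taken from this page; the return values of detect_cuda() and check_gpu_compatibility() are assumptions and may differ from the actual SDK.

import llamatelemetry

# Hard requirement: raises RuntimeError with details if no usable CUDA GPU is found
llamatelemetry.require_cuda()

# Softer probes (assumed here to return inspectable results rather than raise)
print("CUDA:", llamatelemetry.detect_cuda())
print("GPU compatible (SM 7.5+):", llamatelemetry.check_gpu_compatibility())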

💡 Quick Integration

import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient

# 1. Enforce CUDA (v1.2.0)
llamatelemetry.require_cuda()

# 2. Initialize with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# 3. Start server (dual T4)
config = ServerConfig(
    model_path="/kaggle/input/model/gemma-3-4b-Q4_K_M.gguf",
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# 4. Instrumented inference
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

🔗 Links

  • GitHub repository: https://github.com/llamatelemetry/llamatelemetry
  • Binary hosting (this repo): https://huggingface.co/waqasm86/llamatelemetry-binaries

🎯 Supported Platforms

  • Kaggle Notebooks – 2× Tesla T4 (SM 7.5), CUDA 12.5 – ✅ Supported
  • Google Colab – Tesla T4 (SM 7.5), CUDA 12.x – 🔄 Planned
  • Local Workstation – RTX 3000+/4000+, CUDA 12.x+ – 🔄 Planned
  • GPUs below SM 7.5 – any CUDA version – ❌ Not supported

📄 License

MIT License – see LICENSE

🆘 Troubleshooting

CUDA Not Available

import llamatelemetry
llamatelemetry.require_cuda()  # Raises RuntimeError with details

Ensure the Kaggle notebook has the GPU T4 × 2 accelerator enabled: Settings → Accelerator → GPU T4 x2
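
To fail soft instead of letting the import path crash, you can catch the error yourself. A small sketch (the exact exception message is up to the SDK):

import llamatelemetry

try:
    llamatelemetry.require_cuda()
except RuntimeError as err:
    # On Kaggle this usually means the notebook is running without the
    # GPU T4 x2 accelerator; enable it and restart the session.
    print(f"CUDA check failed: {err}")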

Binary Download Fails

  1. Check that internet access is enabled in the Kaggle notebook settings
  2. Retry the import: import llamatelemetry (automatic retry logic)
  3. Manual download: use hf_hub_download() as shown above
  4. GitHub fallback: the same archive is available from GitHub Releases

PermissionError in Containers

llamatelemetry v1.2.0 handles PermissionError and OSError on PATH scanning gracefully – no action required.


Maintained by: waqasm86 · Version: 1.2.0 · Last Updated: 2026-02-20 · Status: Active
