llamatelemetry Binaries v1.2.0
Pre-compiled CUDA binaries for llamatelemetry, a CUDA-first OpenTelemetry Python SDK for LLM inference observability using the `gen_ai.*` semantic conventions.
Available Binaries
| Version | File | Size | Target Platform | SHA256 |
|---|---|---|---|---|
| v1.2.0 | llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz | 1.4 GB | Kaggle 2× Tesla T4, CUDA 12.5 | 4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c |
Auto-Download (Recommended)
These binaries are automatically downloaded when you install llamatelemetry:
```bash
# Install on Kaggle with GPU T4 × 2
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```
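To confirm the pinned tag was installed, you can check the installed distribution version before importing anything. A minimal check; it assumes the distribution is published under the name `llamatelemetry`:

```python
from importlib.metadata import version

# Should print 1.2.0 if the @v1.2.0 tag above was installed correctly
print(version("llamatelemetry"))
```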
On the first `import llamatelemetry`, the package will:
- Detect your GPU (Tesla T4 required, SM 7.5+)
- Verify CUDA availability via `llamatelemetry.require_cuda()`
- Check for cached binaries in `~/.cache/llamatelemetry/`
- Download from the HuggingFace CDN (this repo; fast, ~2–5 MB/s)
- Fall back to GitHub Releases if needed
- Verify the SHA256 checksum `4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c` (the cache-and-checksum steps are sketched below)
- Extract 13 binaries + libraries to the package directory
- Configure environment variables
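If you want to see what the loader is doing, the cache-and-checksum portion of this flow is easy to reproduce by hand. The sketch below is not llamatelemetry's actual implementation; it assumes the archive is cached under `~/.cache/llamatelemetry/` with the filename listed in the table above and streams it through `hashlib` to compare against the published digest:

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c"
# Assumed cache layout: the archive sits directly under ~/.cache/llamatelemetry/
archive = Path.home() / ".cache" / "llamatelemetry" / "llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz"

if not archive.exists():
    print("No cached archive found; the loader would download it on first import.")
else:
    digest = hashlib.sha256()
    with archive.open("rb") as f:
        # Stream in 1 MiB chunks so the 1.4 GB archive is never fully in memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    ok = digest.hexdigest() == EXPECTED_SHA256
    print(f"{archive.name}: {'checksum OK' if ok else 'CHECKSUM MISMATCH'}")
```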
Manual Download
Using `huggingface_hub`
```python
from huggingface_hub import hf_hub_download

binary_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-binaries",
    filename="v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz",
    cache_dir="/kaggle/working/cache",
)
print(f"Downloaded to: {binary_path}")
```
Direct Download URL
wget https://huggingface.co/waqasm86/llamatelemetry-binaries/resolve/main/v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz
Verify Checksum
```bash
# Download checksum file
wget https://huggingface.co/waqasm86/llamatelemetry-binaries/resolve/main/v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz.sha256

# Verify
sha256sum -c llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz.sha256
```
Build Information
| Property | Value |
|---|---|
| SDK Version | 1.2.0 |
| CUDA Version | 12.5 |
| Compute Capability | SM 7.5 (Tesla T4) |
| llama.cpp Version | b7760 (commit 388ce82) |
| Build Date | 2026-02-20 |
| Target Platform | Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM) |
| Binaries Included | 13 (llama-server, llama-cli, llama-bench, etc.) |
| Libraries | CUDA shared libraries + dependencies |
What's Inside
The binary bundle contains:
Executables (13 binaries)
- `llama-server` – OpenAI-compatible API server
- `llama-cli` – CLI inference tool
- `llama-bench` – Benchmarking utility
- `llama-quantize` – Model quantization tool
- `llama-tokenize` – Tokenizer utility
- And 8 more utilities
Shared Libraries
- CUDA 12.5 shared libraries
- cuBLAS, NCCL dependencies
- llama.cpp runtime libraries
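The bundled executables can also be run on their own, which is useful for sanity-checking the build outside the SDK. The sketch below makes assumptions not documented here: that the archive was extracted to `/kaggle/working/llamatelemetry-binaries` with the shared libraries next to the executables, and it uses only standard llama.cpp flags (`-m`, `--port`, `-ngl`, `--tensor-split`):

```python
import os
import subprocess

BUNDLE = "/kaggle/working/llamatelemetry-binaries"  # example extraction path

env = os.environ.copy()
# Point the dynamic linker at the bundled CUDA / llama.cpp shared libraries
env["LD_LIBRARY_PATH"] = f"{BUNDLE}:{env.get('LD_LIBRARY_PATH', '')}"

proc = subprocess.Popen(
    [
        f"{BUNDLE}/llama-server",
        "-m", "/kaggle/input/model/gemma-3-4b-Q4_K_M.gguf",
        "--port", "8080",
        "-ngl", "99",                 # offload all layers to the GPUs
        "--tensor-split", "0.5,0.5",  # split weights across the two T4s
    ],
    env=env,
)
```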
What's New in v1.2.0
- GenAI semantic conventions – `gen_ai.*` OTel attributes (legacy `llm.*` removed); see the attribute sketch after this list
- GenAI metrics – 5 histogram instruments: token usage, operation duration, TTFT, TPOT, active requests
- CUDA hardening – `require_cuda()`, `detect_cuda()`, `check_gpu_compatibility()`
- `strict_operation_names` – validates `gen_ai.operation.name` values
- PermissionError safety – handles restricted PATH entries in CI/container environments
- `BenchmarkRunner` – structured latency-phase benchmarking (prefill/decode/TTFT/TPOT)
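For orientation, the `gen_ai.*` names above come from the OpenTelemetry GenAI semantic conventions and can be set on any span. The sketch below uses the plain OpenTelemetry SDK rather than llamatelemetry's instrumentation, so the span name and attribute values are purely illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("genai-demo")

# GenAI semconv recommends span names of the form "{operation} {model}"
with tracer.start_as_current_span("chat gemma-3-4b") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gemma-3-4b")
    span.set_attribute("gen_ai.usage.input_tokens", 42)     # illustrative values
    span.set_attribute("gen_ai.usage.output_tokens", 512)
```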
Quick Integration
```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient

# 1. Enforce CUDA (v1.2.0)
llamatelemetry.require_cuda()

# 2. Initialize with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# 3. Start server (dual T4)
config = ServerConfig(
    model_path="/kaggle/input/model/gemma-3-4b-Q4_K_M.gguf",
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# 4. Instrumented inference
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
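Because `llama-server` exposes an OpenAI-compatible HTTP API, the endpoint can also be exercised with a plain HTTP client, which is a quick way to confirm the server is healthy independently of the instrumented client. A minimal sketch, assuming `server.url` is the HTTP base URL used above:

```python
import requests

resp = requests.post(
    f"{server.url}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "One sentence on tensor cores."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```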
Links
- GitHub Repository: https://github.com/llamatelemetry/llamatelemetry
- GitHub Releases: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- Models Repository: https://huggingface.co/waqasm86/llamatelemetry-models
- Kaggle Guide: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- Integration Guide: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
Supported Platforms
| Platform | GPU | CUDA | Status |
|---|---|---|---|
| Kaggle Notebooks | 2× Tesla T4 (SM 7.5) | 12.5 | Supported |
| Google Colab | Tesla T4 (SM 7.5) | 12.x | Planned |
| Local Workstation | RTX 3000+/4000+ | 12.x+ | Planned |
| GPUs < SM 7.5 | Any | Any | Not supported |
License
MIT License – see LICENSE.
Troubleshooting
CUDA Not Available
```python
import llamatelemetry

llamatelemetry.require_cuda()  # Raises RuntimeError with details
```
Ensure the Kaggle notebook has the GPU T4 × 2 accelerator enabled: Settings → Accelerator → GPU T4 x2
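To see what `require_cuda()` is reacting to, you can query the GPU inventory directly. A minimal sketch that shells out to `nvidia-smi`; note that the `compute_cap` query field needs a reasonably recent driver:

```python
import subprocess

# On a correctly configured Kaggle "GPU T4 x2" session this should list two
# Tesla T4 entries with compute_cap 7.5 and ~15 GB of memory each.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap,memory.total", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```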
Binary Download Fails
- Check that internet access is enabled in the Kaggle notebook settings
- Retry the import: `import llamatelemetry` (automatic retry logic)
- Manual download: use `hf_hub_download()` as shown above
- GitHub fallback: available from GitHub Releases
PermissionError in Containers
llamatelemetry v1.2.0 handles PermissionError and OSError during PATH scanning gracefully; no action is required.
Maintained by: waqasm86 · Version: 1.2.0 · Last Updated: 2026-02-20 · Status: Active