llamatelemetry Binaries v1.2.0

Pre-compiled CUDA binaries for llamatelemetry, a CUDA-first OpenTelemetry Python SDK for LLM inference observability using the gen_ai.* semantic conventions.

📦 Available Binaries

Version: v1.2.0
File: llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz
Size: 1.4 GB
Target Platform: Kaggle 2× Tesla T4, CUDA 12.5
SHA256: 4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c

🚀 Auto-Download (Recommended)

These binaries are downloaded automatically when you install and first import llamatelemetry:

# Install on Kaggle with GPU T4 × 2
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0

On the first import of llamatelemetry, the package will (see the sketch after this list):

  1. Detect your GPU (Tesla T4 required, SM 7.5+)
  2. Verify CUDA availability via llamatelemetry.require_cuda()
  3. Check for cached binaries in ~/.cache/llamatelemetry/
  4. Download from the Hugging Face CDN (this repo – fast, ~2–5 MB/s)
  5. Fallback to GitHub Releases if needed
  6. Verify SHA256 checksum: 4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c
  7. Extract the 13 binaries and supporting libraries into the package directory
  8. Configure environment variables
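
The sketch below mirrors steps 3–5 and 7 by hand: reuse the cache, download from the Hugging Face CDN, and extract. The cache path and file name come from this page; the variable names and extraction target are illustrative, not the package's internal code. Checksum verification is covered under "Verify Checksum" below.

import pathlib
import tarfile

from huggingface_hub import hf_hub_download

CACHE_DIR = pathlib.Path.home() / ".cache" / "llamatelemetry"

# Steps 3–5: hf_hub_download reuses a cached copy if one exists,
# otherwise it fetches the archive from the Hugging Face CDN.
archive = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-binaries",
    filename="v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz",
    cache_dir=str(CACHE_DIR),
)

# Step 7: unpack the binaries and shared libraries
# (illustrative target directory; the SDK extracts into its own package directory)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(CACHE_DIR / "v1.2.0-extracted")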

📥 Manual Download

Using huggingface_hub

from huggingface_hub import hf_hub_download

binary_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-binaries",
    filename="v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz",
    cache_dir="/kaggle/working/cache"
)
print(f"Downloaded to: {binary_path}")

Direct Download URL

wget https://huggingface.co/waqasm86/llamatelemetry-binaries/resolve/main/v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz

Verify Checksum

# Download checksum file
wget https://huggingface.co/waqasm86/llamatelemetry-binaries/resolve/main/v1.2.0/llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz.sha256

# Verify
sha256sum -c llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz.sha256
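
If sha256sum is not available, the same check can be done from Python with the standard library. The expected digest below is the one published on this page; the archive is assumed to be in the current directory.

import hashlib

EXPECTED = "4af586c4d97c093c1d6e0db5a46b3d472cd1edf4b0d172511be1a4537a288d8c"

# Hash the archive in 1 MiB chunks to avoid loading 1.4 GB into memory
sha = hashlib.sha256()
with open("llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

print("OK" if sha.hexdigest() == EXPECTED else "CHECKSUM MISMATCH")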

📊 Build Information

SDK Version: 1.2.0
CUDA Version: 12.5
Compute Capability: SM 7.5 (Tesla T4)
llama.cpp Version: b7760 (commit 388ce82)
Build Date: 2026-02-20
Target Platform: Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM)
Binaries Included: 13 (llama-server, llama-cli, llama-bench, etc.)
Libraries: CUDA shared libraries + dependencies

🔧 What's Inside

The binary bundle contains:

Executables (13 binaries)

  • llama-server – OpenAI-compatible API server
  • llama-cli – CLI inference tool
  • llama-bench – Benchmarking utility
  • llama-quantize – Model quantization tool
  • llama-tokenize – Tokenizer utility
  • And 8 more utilities

Shared Libraries

  • CUDA 12.5 shared libraries
  • cuBLAS, NCCL dependencies
  • llama.cpp runtime libraries
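
To confirm what a downloaded bundle contains before extracting it, listing the archive members is enough. A small sketch, assuming the tarball sits in the current directory:

import tarfile

ARCHIVE = "llamatelemetry-v1.2.0-cuda12-kaggle-t4x2.tar.gz"

with tarfile.open(ARCHIVE, "r:gz") as tar:
    names = [m.name for m in tar.getmembers() if m.isfile()]

# Rough split: shared libraries vs. executables
libs = [n for n in names if ".so" in n]
bins = [n for n in names if ".so" not in n]

print(f"{len(bins)} executables, {len(libs)} shared libraries")
for name in sorted(bins):
    print(" ", name)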

🆕 What's New in v1.2.0

  • GenAI semantic conventions – gen_ai.* OTel attributes (legacy llm.* attributes removed)
  • GenAI metrics – 5 histogram instruments: token usage, operation duration, TTFT, TPOT, active requests
  • CUDA hardening – require_cuda(), detect_cuda(), check_gpu_compatibility() (see the sketch after this list)
  • strict_operation_names – validates gen_ai.operation.name values
  • PermissionError safety – handles restricted PATH entries in CI/container environments
  • BenchmarkRunner – structured latency-phase benchmarking (prefill/decode/TTFT/TPOT)
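
A minimal sketch of the CUDA hardening helpers named above. Only the function names are taken from this page; the return values of detect_cuda() and check_gpu_compatibility() are assumptions and may differ from the actual SDK.

import llamatelemetry

# Hard requirement: raises RuntimeError with details if no usable CUDA GPU is found
llamatelemetry.require_cuda()

# Softer probes (assumed here to return inspectable results rather than raise)
print("CUDA:", llamatelemetry.detect_cuda())
print("GPU compatible (SM 7.5+):", llamatelemetry.check_gpu_compatibility())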

💡 Quick Integration

import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient

# 1. Enforce CUDA (v1.2.0)
llamatelemetry.require_cuda()

# 2. Initialize with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# 3. Start server (dual T4)
config = ServerConfig(
    model_path="/kaggle/input/model/gemma-3-4b-Q4_K_M.gguf",
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# 4. Instrumented inference
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

🔗 Links

  • GitHub repository: https://github.com/llamatelemetry/llamatelemetry
  • Binary hosting (this repo): https://huggingface.co/waqasm86/llamatelemetry-binaries

🎯 Supported Platforms

  • Kaggle Notebooks – 2× Tesla T4 (SM 7.5), CUDA 12.5 – ✅ Supported
  • Google Colab – Tesla T4 (SM 7.5), CUDA 12.x – 🔄 Planned
  • Local Workstation – RTX 3000+/4000+, CUDA 12.x+ – 🔄 Planned
  • GPUs below SM 7.5 – any CUDA version – ❌ Not supported

📄 License

MIT License – see LICENSE

🆘 Troubleshooting

CUDA Not Available

import llamatelemetry
llamatelemetry.require_cuda()  # Raises RuntimeError with details

Ensure the Kaggle notebook has the GPU T4 × 2 accelerator enabled: Settings → Accelerator → GPU T4 x2
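
To fail soft instead of letting the import path crash, you can catch the error yourself. A small sketch (the exact exception message is up to the SDK):

import llamatelemetry

try:
    llamatelemetry.require_cuda()
except RuntimeError as err:
    # On Kaggle this usually means the notebook is running without the
    # GPU T4 x2 accelerator; enable it and restart the session.
    print(f"CUDA check failed: {err}")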

Binary Download Fails

  1. Check that internet access is enabled in the Kaggle notebook settings
  2. Retry the import: import llamatelemetry (automatic retry logic)
  3. Manual download: use hf_hub_download() as shown above
  4. GitHub fallback: the same archive is available from GitHub Releases

PermissionError in Containers

llamatelemetry v1.2.0 handles PermissionError and OSError on PATH scanning gracefully – no action required.


Maintained by: waqasm86 · Version: 1.2.0 · Last Updated: 2026-02-20 · Status: Active
