
Helion-V1.5-XL Deployment Guide

Table of Contents

  1. Quick Start
  2. System Requirements
  3. Installation Methods
  4. Configuration
  5. Deployment Architectures
  6. Performance Optimization
  7. Monitoring and Logging
  8. Scaling Strategies
  9. Security Best Practices
  10. Troubleshooting
  11. Production Checklist

Quick Start

Minimal Setup (5 minutes)

# Install dependencies
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate

# Load and run model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'DeepXR/Helion-V1.5-XL'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

prompt = 'Explain machine learning in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"

System Requirements

Hardware Requirements

Minimum Configuration

  • GPU: NVIDIA GPU with at least 12GB VRAM (e.g., RTX 3090, RTX 4080)
  • RAM: 32GB system RAM
  • Storage: 50GB free space
  • CPU: 8-core processor (Intel Xeon or AMD EPYC recommended)
  • Precision: INT4 quantization required

Recommended Configuration

  • GPU: NVIDIA A100 (40GB/80GB) or H100
  • RAM: 64GB system RAM
  • Storage: 200GB SSD (NVMe preferred)
  • CPU: 16+ core processor
  • Network: 10Gbps for distributed setups
  • Precision: BF16 for optimal quality

Production Configuration

  • GPU: 2x A100 80GB or 1x H100 80GB
  • RAM: 128GB+ system RAM
  • Storage: 500GB NVMe SSD
  • CPU: 32+ core processor
  • Network: 25Gbps+ with low latency
  • Redundancy: Load balancer + multiple replicas

Software Requirements

Operating System: Ubuntu 20.04+, Rocky Linux 8+, or similar
Python: 3.8 - 3.11
CUDA: 11.8 or 12.1+
cuDNN: 8.9+
NVIDIA Driver: 525+

Compatibility Matrix

| Component    | Minimum | Recommended | Latest Tested |
|--------------|---------|-------------|---------------|
| PyTorch      | 2.0.0   | 2.1.0       | 2.1.2         |
| Transformers | 4.35.0  | 4.36.0      | 4.37.0        |
| CUDA         | 11.8    | 12.1        | 12.3          |
| Python       | 3.8     | 3.10        | 3.11          |
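
The following snippet is a quick sanity check (a minimal sketch using only PyTorch and Transformers) that prints the locally installed versions so they can be compared against the matrix above:

# check_env.py -- verify the local stack against the compatibility matrix
import sys
import torch
import transformers

print(f"Python:         {sys.version.split()[0]}")
print(f"PyTorch:        {torch.__version__}")
print(f"Transformers:   {transformers.__version__}")
print(f"CUDA build:     {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")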

Installation Methods

Method 1: Standard Installation

# Create virtual environment
python -m venv helion-env
source helion-env/bin/activate  # On Windows: helion-env\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0

# Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import transformers; print(f'Transformers version: {transformers.__version__}')"

Method 2: Docker Deployment

# Dockerfile
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch and transformers
RUN pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0

# Copy application code
WORKDIR /app
COPY . /app

# Set environment variables
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache

# Run inference server
CMD ["python3", "inference_server.py"]

Build and run the image (host shell commands, not part of the Dockerfile):

docker build -t helion-v15-xl .
docker run --gpus all -p 8000:8000 helion-v15-xl

Method 3: Kubernetes Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helion-v15-xl
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helion-v15-xl
  template:
    metadata:
      labels:
        app: helion-v15-xl
    spec:
      containers:
      - name: helion
        image: deepxr/helion-v15-xl:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "64Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: 1
            memory: "48Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: "DeepXR/Helion-V1.5-XL"
        - name: PRECISION
          value: "bfloat16"
        volumeMounts:
        - name: model-cache
          mountPath: /cache
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: helion-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: helion-v15-xl

Method 4: vLLM for Production

# Install vLLM for optimized serving
pip install vllm

# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model DeepXR/Helion-V1.5-XL \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
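
The server exposes an OpenAI-compatible HTTP API. A minimal client sketch (assuming the openai Python package is installed and the default host/port from the command above):

# query_vllm.py -- call the vLLM OpenAI-compatible endpoint started above
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="DeepXR/Helion-V1.5-XL",
    prompt="Explain machine learning in simple terms:",
    max_tokens=256,
    temperature=0.7,
)
print(completion.choices[0].text)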

Configuration

Environment Variables

# Model configuration
export MODEL_ID="DeepXR/Helion-V1.5-XL"
export MODEL_PRECISION="bfloat16"
export MAX_SEQUENCE_LENGTH=8192
export CACHE_DIR="/path/to/cache"

# Performance tuning
export CUDA_VISIBLE_DEVICES=0,1
export OMP_NUM_THREADS=8
export TOKENIZERS_PARALLELISM=true

# Memory optimization
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/helion.log"
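
A server process can read these settings at startup; a short sketch (the variable names match the exports above, defaults are illustrative):

# settings.py -- pick up deployment settings from the environment
import os

MODEL_ID = os.getenv("MODEL_ID", "DeepXR/Helion-V1.5-XL")
MODEL_PRECISION = os.getenv("MODEL_PRECISION", "bfloat16")
MAX_SEQUENCE_LENGTH = int(os.getenv("MAX_SEQUENCE_LENGTH", "8192"))
CACHE_DIR = os.getenv("CACHE_DIR", "/tmp/helion_cache")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")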

Configuration File (config.yaml)

model:
  model_id: "DeepXR/Helion-V1.5-XL"
  precision: "bfloat16"
  device_map: "auto"
  load_in_4bit: false
  load_in_8bit: false
  
generation:
  max_new_tokens: 512
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
  do_sample: true
  
server:
  host: "0.0.0.0"
  port: 8000
  workers: 4
  timeout: 120
  max_batch_size: 32
  
cache:
  enabled: true
  directory: "/tmp/helion_cache"
  max_size_gb: 100
  
safety:
  content_filtering: true
  pii_detection: true
  rate_limiting: true
  max_requests_per_minute: 60
  
monitoring:
  enabled: true
  metrics_port: 9090
  log_level: "INFO"
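
One way to consume this file at startup (a sketch assuming PyYAML is installed and the file is saved as config.yaml):

# load_config.py -- drive model loading and generation defaults from config.yaml
import yaml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

dtype = getattr(torch, cfg["model"]["precision"])  # e.g. "bfloat16" -> torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(
    cfg["model"]["model_id"],
    torch_dtype=dtype,
    device_map=cfg["model"]["device_map"],
)
tokenizer = AutoTokenizer.from_pretrained(cfg["model"]["model_id"])

# Generation settings can later be passed as **cfg["generation"] to model.generate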

Deployment Architectures

Architecture 1: Single Instance (Development)

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────┐
│   FastAPI   │
│   Server    │
└──────┬──────┘
       │
       v
┌─────────────┐
│   Model     │
│  (1x A100)  │
└─────────────┘

Use Case: Development, testing, low-traffic applications

Setup:

# server.py
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000

Architecture 2: Load Balanced (Production)

                ┌─────────────┐
                │Load Balancer│
                └──────┬──────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        v              v              v
   ┌────────┐    ┌────────┐    ┌────────┐
   │Instance│    │Instance│    │Instance│
   │   1    │    │   2    │    │   3    │
   └────────┘    └────────┘    └────────┘
        │              │              │
        └──────────────┼──────────────┘
                       │
                       v
                ┌─────────────┐
                │   Redis     │
                │   Cache     │
                └─────────────┘

Use Case: Production applications with high availability
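
The Redis tier in this layout is typically used as a response cache so that repeated prompts skip the GPU. A minimal caching sketch (assuming the redis package is installed; generate_text is a hypothetical helper that wraps model.generate):

# response_cache.py -- cache generated responses in Redis, keyed by a prompt hash
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 3600

def cached_generate(prompt: str) -> str:
    key = "helion:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode("utf-8")          # cache hit: skip the model entirely
    response = generate_text(prompt)        # hypothetical helper calling model.generate
    cache.setex(key, CACHE_TTL_SECONDS, response)
    return response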

Architecture 3: Distributed Inference (High Throughput)

                    ┌──────────────┐
                    │  API Gateway │
                    └──────┬───────┘
                           │
                    ┌──────┴───────┐
                    │ Job Scheduler│
                    └──────┬───────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        v                  v                  v
   ┌─────────┐        ┌─────────┐        ┌─────────┐
   │ GPU 0-1 │        │ GPU 2-3 │        │ GPU 4-5 │
   │ Tensor  │        │ Tensor  │        │ Tensor  │
   │Parallel │        │Parallel │        │Parallel │
   └─────────┘        └─────────┘        └─────────┘

Use Case: Very high throughput, batch processing

Setup with Ray Serve:

import ray
import torch
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer

ray.init()
serve.start()

@serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 1})
class HelionModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "DeepXR/Helion-V1.5-XL",
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")
    
    async def __call__(self, request):
        prompt = await request.json()
        inputs = self.tokenizer(prompt["text"], return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        return {"response": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}

HelionModel.deploy()  # Ray Serve 1.x API; on Serve 2.x use serve.run(HelionModel.bind())

Performance Optimization

1. Quantization

# 8-bit Quantization
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit Quantization (maximum memory savings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
# Pass this config to from_pretrained via quantization_config=, exactly as in the 8-bit example

2. Flash Attention

# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

3. Compilation with torch.compile

# Compile model for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

4. KV Cache Optimization

# Reuse the KV cache for faster generation
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    past_key_values=past_key_values  # Optional: cache carried over from a previous generate/forward pass
)

5. Batching

# Process multiple prompts in a batch
# Causal LM tokenizers often lack a pad token; set one and pad on the left for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode all outputs
responses = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]

Performance Benchmarks by Configuration

| Configuration | Tokens/sec | Latency (ms) | Memory (GB) | Cost Efficiency |
|---------------|-----------:|-------------:|------------:|-----------------|
| A100 BF16     | 47.3       | 21.1         | 34.2        | Baseline        |
| A100 INT8     | 89.6       | 11.2         | 17.8        | 1.9x faster     |
| A100 INT4     | 134.2      | 7.5          | 10.4        | 2.8x faster     |
| H100 BF16     | 78.1       | 12.8         | 34.2        | 1.65x faster    |
| H100 INT4     | 218.7      | 4.6          | 10.4        | 4.6x faster     |
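
A rough way to reproduce the tokens/sec column on your own hardware (a sketch that times a single unbatched generation with the already-loaded model and tokenizer):

# throughput_check.py -- measure rough decode throughput for the loaded model
import time
import torch

prompt = "Explain machine learning in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

_ = model.generate(**inputs, max_new_tokens=16)   # warm-up run
torch.cuda.synchronize()

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec, {1000 * elapsed / new_tokens:.1f} ms/token")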

Monitoring and Logging

Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metrics
request_count = Counter('helion_requests_total', 'Total requests')
request_duration = Histogram('helion_request_duration_seconds', 'Request duration')
active_requests = Gauge('helion_active_requests', 'Active requests')
token_count = Counter('helion_tokens_generated', 'Tokens generated')
error_count = Counter('helion_errors_total', 'Total errors', ['error_type'])

# Start metrics server
start_http_server(9090)
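
These metrics can then be updated inside the request handler; a sketch of how the FastAPI endpoint from Architecture 1 might be instrumented (the handler body is illustrative):

# Instrument the /generate endpoint with the metrics defined above
@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    request_count.inc()
    active_requests.inc()
    try:
        with request_duration.time():
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=max_tokens)
        token_count.inc(outputs.shape[-1] - inputs["input_ids"].shape[-1])
        return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
    except Exception as exc:
        error_count.labels(error_type=type(exc).__name__).inc()
        raise
    finally:
        active_requests.dec()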

Structured Logging

import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

Health Check Endpoint

@app.get("/health")
async def health_check():
    try:
        # Check model is loaded
        assert model is not None
        # Check GPU is available
        assert torch.cuda.is_available()
        # Quick inference test
        test_input = tokenizer("test", return_tensors="pt").to(model.device)
        _ = model.generate(**test_input, max_new_tokens=1)
        return {"status": "healthy", "timestamp": datetime.utcnow().isoformat()}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}, 503

Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "Helion-V1.5-XL Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{"expr": "rate(helion_requests_total[1m])"}]
      },
      {
        "title": "Average Latency",
        "targets": [{"expr": "rate(helion_request_duration_seconds_sum[5m]) / rate(helion_request_duration_seconds_count[5m])"}]
      },
      {
        "title": "GPU Utilization",
        "targets": [{"expr": "nvidia_gpu_utilization"}]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [{"expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100"}]
      }
    ]
  }
}

Scaling Strategies

Horizontal Scaling

# Using the Kubernetes Horizontal Pod Autoscaler
kubectl autoscale deployment helion-v15-xl \
  --min=2 \
  --max=10 \
  --cpu-percent=70

# kubectl autoscale only supports a CPU target; for memory-based scaling,
# define an autoscaling/v2 HorizontalPodAutoscaler manifest instead.

Vertical Scaling

| Traffic Level          | Configuration          | Instances |
|------------------------|------------------------|-----------|
| Low (< 10 req/s)       | 1x A100 40GB, INT8     | 1         |
| Medium (10-50 req/s)   | 1x A100 80GB, BF16     | 2-3       |
| High (50-200 req/s)    | 2x A100 80GB, BF16     | 4-6       |
| Very High (200+ req/s) | Multiple H100 clusters | 10+       |

Request Queuing

import asyncio
from asyncio import Queue

request_queue = Queue(maxsize=100)
batch_size = 8

async def batch_processor():
    while True:
        batch = []
        for _ in range(batch_size):
            try:
                item = await asyncio.wait_for(request_queue.get(), timeout=0.1)
                batch.append(item)
            except asyncio.TimeoutError:
                break
        
        if batch:
            # Process batch
            prompts = [item["prompt"] for item in batch]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=256)
            
            # Return results
            for item, output in zip(batch, outputs):
                item["future"].set_result(tokenizer.decode(output, skip_special_tokens=True))

# Start the background task once the event loop is running (e.g. at FastAPI startup)
@app.on_event("startup")
async def start_batch_processor():
    asyncio.create_task(batch_processor())
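
A request handler can then enqueue work and await the result; a sketch of how an endpoint might hand prompts to the batcher (request/response shapes are illustrative):

# Enqueue a prompt for the batch processor and wait for its result
@app.post("/generate")
async def generate(prompt: str):
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    await request_queue.put({"prompt": prompt, "future": future})
    response = await future    # resolved by batch_processor via future.set_result(...)
    return {"response": response}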

Security Best Practices

1. API Authentication

import os

from fastapi import HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials != os.getenv("API_TOKEN"):
        raise HTTPException(status_code=401, detail="Invalid authentication")
    return credentials.credentials

@app.post("/generate")
async def generate(prompt: str, token: str = Security(verify_token)):
    # Process request
    pass

2. Rate Limiting

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("60/minute")
async def generate(request: Request, prompt: str):
    # Process request
    pass

3. Input Validation

from pydantic import BaseModel, Field, validator

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=8000)
    max_tokens: int = Field(512, ge=1, le=2048)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    
    @validator('prompt')
    def validate_prompt(cls, v):
        # Basic blocklist check (patterns must be lowercase to match v.lower())
        if any(bad in v.lower() for bad in ['<script>', 'drop table']):
            raise ValueError('Invalid prompt content')
        return v

4. Content Filtering Integration

from safeguard_filters import ContentSafetyFilter, RefusalGenerator

safety_filter = ContentSafetyFilter()
refusal_gen = RefusalGenerator()

@app.post("/generate")
async def generate(request: GenerationRequest):
    # Check input safety
    is_safe, violations = safety_filter.check_input(request.prompt)
    if not is_safe:
        return {"error": refusal_gen.generate_refusal(violations[0])}
    
    # Generate response
    outputs = model.generate(...)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Check output safety
    is_safe, violations = safety_filter.check_output(response)
    if not is_safe:
        response = safety_filter.redact_pii(response)
    
    return {"response": response}

Troubleshooting

Common Issues and Solutions

Issue 1: Out of Memory (OOM)

Symptoms: CUDA out of memory error

Solutions:

# Solution 1: Use quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,  # or load_in_4bit=True
    device_map="auto"
)

# Solution 2: Reduce batch size
# Use batch_size=1 for inference

# Solution 3: Reduce context length
outputs = model.generate(**inputs, max_new_tokens=256)  # Instead of 512

# Solution 4: Clear cache
torch.cuda.empty_cache()

Issue 2: Slow Inference

Symptoms: High latency, low throughput

Solutions:

# Solution 1: Enable Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2"
)

# Solution 2: Use compilation
model = torch.compile(model)

# Solution 3: Use vLLM
# Install: pip install vllm
# Run with vLLM server (much faster)

# Solution 4: Batch requests
# Process multiple requests together

Issue 3: Model Not Loading

Symptoms: Download errors, corruption

Solutions:

# Clear cache
rm -rf ~/.cache/huggingface/

# Download manually
huggingface-cli download DeepXR/Helion-V1.5-XL

# Check disk space
df -h

# Verify CUDA installation
nvidia-smi

Issue 4: Quality Degradation with Quantization

Solutions:

  • Use INT8 instead of INT4
  • Calibrate quantization with representative data
  • Use double quantization: bnb_4bit_use_double_quant=True

Debugging Commands

# Check GPU status
nvidia-smi

# Monitor GPU usage
watch -n 1 nvidia-smi

# Check Python packages
pip list | grep -E "torch|transformers"

# Test CUDA
python -c "import torch; print(torch.cuda.is_available())"

# Memory profiling
python -m memory_profiler your_script.py

# Performance profiling
python -m cProfile -o output.prof your_script.py

Production Checklist

Pre-Deployment

  • Hardware requirements verified
  • Dependencies installed and tested
  • Model downloaded and loaded successfully
  • Inference tested with sample prompts
  • Performance benchmarks meet requirements
  • Memory usage within acceptable limits
  • Safety filters configured and tested
  • API authentication implemented
  • Rate limiting configured
  • Input validation in place
  • Error handling implemented
  • Logging configured
  • Monitoring dashboards set up
  • Health check endpoints working
  • Load testing completed
  • Security audit passed
  • Documentation complete

Post-Deployment

  • Monitor error rates
  • Track latency metrics
  • Monitor GPU utilization
  • Check memory usage trends
  • Review safety violation logs
  • Analyze user feedback
  • Update model if needed
  • Scale based on load
  • Regular security updates
  • Backup configurations
  • Disaster recovery tested
  • Performance optimization ongoing

Maintenance Schedule

| Task                       | Frequency | Responsibility   |
|----------------------------|-----------|------------------|
| Check error logs           | Daily     | DevOps           |
| Review performance metrics | Daily     | ML Engineers     |
| Security updates           | Weekly    | Security Team    |
| Model evaluation           | Monthly   | Data Science     |
| Capacity planning          | Monthly   | Infrastructure   |
| Disaster recovery drill    | Quarterly | All Teams        |
| Full system audit          | Annually  | External Auditor |

Additional Resources

Documentation

Support Channels

  • GitHub Issues: For bug reports and feature requests
  • Community Forum: For general questions and discussions
  • Enterprise Support: For production deployments

Example Projects

  • REST API Server: /examples/rest_api
  • Streaming Interface: /examples/streaming
  • Batch Processing: /examples/batch_processing
  • Fine-tuning: /examples/fine_tuning

Version History

| Version | Date       | Changes                   |
|---------|------------|---------------------------|
| 1.0.0   | 2024-11-01 | Initial release           |
| 1.0.1   | 2024-11-15 | Performance optimizations |
| 1.1.0   | 2024-12-01 | Flash Attention 2 support |

Last Updated: 2024-11-10

Maintained By: DeepXR Engineering Team