# Helion-V1.5-XL Deployment Guide

## Table of Contents

1. [Quick Start](#quick-start)
2. [System Requirements](#system-requirements)
3. [Installation Methods](#installation-methods)
4. [Configuration](#configuration)
5. [Deployment Architectures](#deployment-architectures)
6. [Performance Optimization](#performance-optimization)
7. [Monitoring and Logging](#monitoring-and-logging)
8. [Scaling Strategies](#scaling-strategies)
9. [Security Best Practices](#security-best-practices)
10. [Troubleshooting](#troubleshooting)
11. [Production Checklist](#production-checklist)

---

## Quick Start

### Minimal Setup (5 minutes)

```bash
# Install dependencies (quote the specifiers so the shell does not treat ">" as a redirect)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate

# Load and run model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'DeepXR/Helion-V1.5-XL'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

prompt = 'Explain machine learning in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
```

---

## System Requirements

### Hardware Requirements

#### Minimum Configuration
- **GPU**: NVIDIA GPU with 12GB VRAM (e.g., RTX 3090, RTX 4080)
- **RAM**: 32GB system RAM
- **Storage**: 50GB free space
- **CPU**: 8-core processor (Intel Xeon or AMD EPYC recommended)
- **Precision**: INT4 quantization required

#### Recommended Configuration
- **GPU**: NVIDIA A100 (40GB/80GB) or H100
- **RAM**: 64GB system RAM
- **Storage**: 200GB SSD (NVMe preferred)
- **CPU**: 16+ core processor
- **Network**: 10Gbps for distributed setups
- **Precision**: BF16 for optimal quality

#### Production Configuration
- **GPU**: 2x A100 80GB or 1x H100 80GB
- **RAM**: 128GB+ system RAM
- **Storage**: 500GB NVMe SSD
- **CPU**: 32+ core processor
- **Network**: 25Gbps+ with low latency
- **Redundancy**: Load balancer + multiple replicas

### Software Requirements

```
Operating System: Ubuntu 20.04+, Rocky Linux 8+, or similar
Python: 3.8 - 3.11
CUDA: 11.8 or 12.1+
cuDNN: 8.9+
NVIDIA Driver: 525+
```

### Compatibility Matrix

| Component | Minimum | Recommended | Latest Tested |
|-----------|---------|-------------|---------------|
| PyTorch | 2.0.0 | 2.1.0 | 2.1.2 |
| Transformers | 4.35.0 | 4.36.0 | 4.37.0 |
| CUDA | 11.8 | 12.1 | 12.3 |
| Python | 3.8 | 3.10 | 3.11 |

---

## Installation Methods

### Method 1: Standard Installation

```bash
# Create virtual environment
python -m venv helion-env
source helion-env/bin/activate  # On Windows: helion-env\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0

# Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import transformers; print(f'Transformers version: {transformers.__version__}')"
```

### Method 2: Docker Deployment

```dockerfile
# Dockerfile
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch and transformers
RUN pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0

# Copy application code
WORKDIR /app
COPY . /app

# Set environment variables
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache

# Run inference server
CMD ["python3", "inference_server.py"]
```
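The Dockerfile's `CMD` launches `inference_server.py`, which is not included in this guide. A minimal sketch of what such a script could look like is shown below, assuming a FastAPI app bound to port 8000 to match the exposed port; note that `fastapi` and `uvicorn` would also need to be added to the image's `pip3 install` line for this sketch to run.

```python
# inference_server.py -- minimal sketch (the filename comes from the Dockerfile CMD above)
import torch
import uvicorn
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

MODEL_ID = "DeepXR/Helion-V1.5-XL"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

if __name__ == "__main__":
    # Bind to 0.0.0.0:8000 so `docker run -p 8000:8000` can reach the server
    uvicorn.run(app, host="0.0.0.0", port=8000)
```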
```bash
# Build and run
docker build -t helion-v15-xl .
docker run --gpus all -p 8000:8000 helion-v15-xl
```

### Method 3: Kubernetes Deployment

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helion-v15-xl
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helion-v15-xl
  template:
    metadata:
      labels:
        app: helion-v15-xl
    spec:
      containers:
      - name: helion
        image: deepxr/helion-v15-xl:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "64Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: 1
            memory: "48Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: "DeepXR/Helion-V1.5-XL"
        - name: PRECISION
          value: "bfloat16"
        volumeMounts:
        - name: model-cache
          mountPath: /cache
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: helion-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: helion-v15-xl
```

### Method 4: vLLM for Production

```bash
# Install vLLM for optimized serving
pip install vllm

# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model DeepXR/Helion-V1.5-XL \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```

---

## Configuration

### Environment Variables

```bash
# Model configuration
export MODEL_ID="DeepXR/Helion-V1.5-XL"
export MODEL_PRECISION="bfloat16"
export MAX_SEQUENCE_LENGTH=8192
export CACHE_DIR="/path/to/cache"

# Performance tuning
export CUDA_VISIBLE_DEVICES=0,1
export OMP_NUM_THREADS=8
export TOKENIZERS_PARALLELISM=true

# Memory optimization
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/helion.log"
```

### Configuration File (config.yaml)

```yaml
model:
  model_id: "DeepXR/Helion-V1.5-XL"
  precision: "bfloat16"
  device_map: "auto"
  load_in_4bit: false
  load_in_8bit: false

generation:
  max_new_tokens: 512
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
  do_sample: true

server:
  host: "0.0.0.0"
  port: 8000
  workers: 4
  timeout: 120
  max_batch_size: 32

cache:
  enabled: true
  directory: "/tmp/helion_cache"
  max_size_gb: 100

safety:
  content_filtering: true
  pii_detection: true
  rate_limiting: true
  max_requests_per_minute: 60

monitoring:
  enabled: true
  metrics_port: 9090
  log_level: "INFO"
```

---

## Deployment Architectures

### Architecture 1: Single Instance (Development)

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────┐
│   FastAPI   │
│   Server    │
└──────┬──────┘
       │
       v
┌─────────────┐
│    Model    │
│  (1x A100)  │
└─────────────┘
```

**Use Case**: Development, testing, low-traffic applications

**Setup**:

```python
# server.py
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
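With the server above running, the endpoint can be exercised from a short client script. Because the handler declares plain scalar parameters, FastAPI reads them as query parameters; the snippet below (which assumes the `requests` library is installed) therefore passes them via `params`.

```python
import requests

# The /generate handler declares scalar parameters, so FastAPI expects them as query params
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Explain machine learning in simple terms:", "max_tokens": 128},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```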
### Architecture 2: Load Balanced (Production)

```
              ┌─────────────┐
              │Load Balancer│
              └──────┬──────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      v              v              v
 ┌────────┐     ┌────────┐     ┌────────┐
 │Instance│     │Instance│     │Instance│
 │   1    │     │   2    │     │   3    │
 └────────┘     └────────┘     └────────┘
      │              │              │
      └──────────────┼──────────────┘
                     │
                     v
              ┌─────────────┐
              │    Redis    │
              │    Cache    │
              └─────────────┘
```

**Use Case**: Production applications with high availability

### Architecture 3: Distributed Inference (High Throughput)

```
                ┌──────────────┐
                │ API Gateway  │
                └──────┬───────┘
                       │
                ┌──────┴───────┐
                │ Job Scheduler│
                └──────┬───────┘
                       │
    ┌──────────────────┼──────────────────┐
    │                  │                  │
    v                  v                  v
┌─────────┐      ┌─────────┐      ┌─────────┐
│ GPU 0-1 │      │ GPU 2-3 │      │ GPU 4-5 │
│ Tensor  │      │ Tensor  │      │ Tensor  │
│Parallel │      │Parallel │      │Parallel │
└─────────┘      └─────────┘      └─────────┘
```

**Use Case**: Very high throughput, batch processing

**Setup with Ray Serve**:

```python
import ray
import torch
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer

ray.init()
serve.start()

@serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 1})
class HelionModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "DeepXR/Helion-V1.5-XL",
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

    async def __call__(self, request):
        prompt = await request.json()
        inputs = self.tokenizer(prompt["text"], return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        return {"response": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}

HelionModel.deploy()
```

---

## Performance Optimization

### 1. Quantization

```python
# 8-bit Quantization
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit Quantization (maximum memory savings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
```

### 2. Flash Attention

```python
# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```

### 3. Compilation with torch.compile

```python
# Compile model for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")
```

### 4. KV Cache Optimization

```python
# Use cache for faster generation
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    past_key_values=past_key_values  # Reuse cache from a previous call made with use_cache=True
)
```
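To check whether any of the optimizations above actually pays off on your hardware, generation can be timed directly. The sketch below assumes `model` and `tokenizer` are already loaded as in the earlier snippets; the helper name `measure_throughput` is ours, not part of any library.

```python
import time

import torch

def measure_throughput(model, tokenizer, prompt, max_new_tokens=256, warmup=1, runs=3):
    """Average generated tokens per second over `runs` timed generations."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    for _ in range(warmup):  # warm up CUDA kernels (and torch.compile, if used)
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    generated = 0
    for _ in range(runs):
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        generated += out.shape[-1] - inputs["input_ids"].shape[-1]
    torch.cuda.synchronize()
    return generated / (time.perf_counter() - start)

print(f"{measure_throughput(model, tokenizer, 'Explain machine learning:'):.1f} tokens/sec")
```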
### 5. Batching

```python
# Process multiple prompts in batch
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]

# Decoder-only models should be left-padded for batched generation,
# and a pad token must be defined
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode all outputs
responses = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
```

### Performance Benchmarks by Configuration

| Configuration | Tokens/sec | Latency (ms/token) | Memory (GB) | Speedup vs. Baseline |
|---------------|------------|--------------------|-------------|----------------------|
| A100 BF16 | 47.3 | 21.1 | 34.2 | Baseline |
| A100 INT8 | 89.6 | 11.2 | 17.8 | 1.9x faster |
| A100 INT4 | 134.2 | 7.5 | 10.4 | 2.8x faster |
| H100 BF16 | 78.1 | 12.8 | 34.2 | 1.65x faster |
| H100 INT4 | 218.7 | 4.6 | 10.4 | 4.6x faster |

---

## Monitoring and Logging

### Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metrics
request_count = Counter('helion_requests_total', 'Total requests')
request_duration = Histogram('helion_request_duration_seconds', 'Request duration')
active_requests = Gauge('helion_active_requests', 'Active requests')
token_count = Counter('helion_tokens_generated', 'Tokens generated')
error_count = Counter('helion_errors_total', 'Total errors', ['error_type'])

# Start metrics server
start_http_server(9090)
```

### Structured Logging

```python
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

### Health Check Endpoint

```python
from fastapi.responses import JSONResponse

@app.get("/health")
async def health_check():
    try:
        # Check model is loaded
        assert model is not None
        # Check GPU is available
        assert torch.cuda.is_available()
        # Quick inference test
        test_input = tokenizer("test", return_tensors="pt").to(model.device)
        _ = model.generate(**test_input, max_new_tokens=1)
        return {"status": "healthy", "timestamp": datetime.utcnow().isoformat()}
    except Exception as e:
        # Returning a bare tuple does not set the status code in FastAPI;
        # use an explicit JSONResponse instead
        return JSONResponse(status_code=503, content={"status": "unhealthy", "error": str(e)})
```

### Grafana Dashboard Configuration

```json
{
  "dashboard": {
    "title": "Helion-V1.5-XL Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{"expr": "rate(helion_requests_total[1m])"}]
      },
      {
        "title": "Average Latency",
        "targets": [{"expr": "rate(helion_request_duration_seconds_sum[5m]) / rate(helion_request_duration_seconds_count[5m])"}]
      },
      {
        "title": "GPU Utilization",
        "targets": [{"expr": "nvidia_gpu_utilization"}]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [{"expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100"}]
      }
    ]
  }
}
```

---

## Scaling Strategies

### Horizontal Scaling

```bash
# Using Kubernetes HPA (CPU-based; memory- or GPU-based scaling
# requires an autoscaling/v2 HorizontalPodAutoscaler manifest)
kubectl autoscale deployment helion-v15-xl \
  --min=2 \
  --max=10 \
  --cpu-percent=70
```

### Vertical Scaling

| Traffic Level | Configuration | Instances |
|---------------|---------------|-----------|
| Low (< 10 req/s) | 1x A100 40GB, INT8 | 1 |
| Medium (10-50 req/s) | 1x A100 80GB, BF16 | 2-3 |
| High (50-200 req/s) | 2x A100 80GB, BF16 | 4-6 |
| Very High (200+ req/s) | Multiple H100 clusters | 10+ |

### Request Queuing

```python
import asyncio
from asyncio import Queue, create_task

request_queue = Queue(maxsize=100)
batch_size = 8

async def batch_processor():
    while True:
        batch = []
        for _ in range(batch_size):
            try:
                item = await asyncio.wait_for(request_queue.get(), timeout=0.1)
                batch.append(item)
            except asyncio.TimeoutError:
                break

        if batch:
            # Process batch
            prompts = [item["prompt"] for item in batch]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=256)

            # Return results
            for item, output in zip(batch, outputs):
                item["future"].set_result(tokenizer.decode(output, skip_special_tokens=True))

# Start background task (call this from a running event loop, e.g. a FastAPI startup handler)
create_task(batch_processor())
```
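The worker above resolves a `future` stored on each queued item, but the enqueuing side is not shown. A minimal sketch of how an endpoint might submit work to the queue and await the result follows; the endpoint path and handler name are illustrative, not part of this guide's API, and the snippet relies on `request_queue` and `batch_processor` defined above.

```python
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/generate-queued")  # illustrative endpoint, not defined elsewhere in this guide
async def generate_queued(prompt: str):
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    try:
        # Reject immediately instead of blocking when the queue is full
        request_queue.put_nowait({"prompt": prompt, "future": future})
    except asyncio.QueueFull:
        raise HTTPException(status_code=503, detail="Server busy, try again later")
    # batch_processor() sets the decoded text as the future's result
    return {"response": await future}
```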
---

## Security Best Practices

### 1. API Authentication

```python
import os

from fastapi import HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials != os.getenv("API_TOKEN"):
        raise HTTPException(status_code=401, detail="Invalid authentication")
    return credentials.credentials

@app.post("/generate")
async def generate(prompt: str, token: str = Security(verify_token)):
    # Process request
    pass
```

### 2. Rate Limiting

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("60/minute")
async def generate(request: Request, prompt: str):
    # Process request
    pass
```

### 3. Input Validation

```python
from pydantic import BaseModel, Field, validator

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=8000)
    max_tokens: int = Field(512, ge=1, le=2048)
    temperature: float = Field(0.7, ge=0.0, le=2.0)

    @validator('prompt')
    def validate_prompt(cls, v):
        # Check for malicious content
        if any(bad in v.lower() for bad in ['