# Helion-V1.5-XL Deployment Guide

## Table of Contents

1. [Quick Start](#quick-start)
2. [System Requirements](#system-requirements)
3. [Installation Methods](#installation-methods)
4. [Configuration](#configuration)
5. [Deployment Architectures](#deployment-architectures)
6. [Performance Optimization](#performance-optimization)
7. [Monitoring and Logging](#monitoring-and-logging)
8. [Scaling Strategies](#scaling-strategies)
9. [Security Best Practices](#security-best-practices)
10. [Troubleshooting](#troubleshooting)
11. [Production Checklist](#production-checklist)

---

## Quick Start

### Minimal Setup (5 minutes)

```bash
# Install dependencies
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate

# Load and run model
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'DeepXR/Helion-V1.5-XL'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

prompt = 'Explain machine learning in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
```

---

## System Requirements

### Hardware Requirements

#### Minimum Configuration
- **GPU**: NVIDIA GPU with at least 12GB VRAM (e.g., RTX 3090, RTX 4080)
- **RAM**: 32GB system RAM
- **Storage**: 50GB free space
- **CPU**: 8-core processor (Intel Xeon or AMD EPYC recommended)
- **Precision**: INT4 quantization required

#### Recommended Configuration
- **GPU**: NVIDIA A100 (40GB/80GB) or H100
- **RAM**: 64GB system RAM
- **Storage**: 200GB SSD (NVMe preferred)
- **CPU**: 16+ core processor
- **Network**: 10Gbps for distributed setups
- **Precision**: BF16 for optimal quality

#### Production Configuration
- **GPU**: 2x A100 80GB or 1x H100 80GB
- **RAM**: 128GB+ system RAM
- **Storage**: 500GB NVMe SSD
- **CPU**: 32+ core processor
- **Network**: 25Gbps+ with low latency
- **Redundancy**: Load balancer + multiple replicas
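
As a rough sizing aid, the sketch below checks which precision is likely to fit on a given GPU. The memory figures are the ones reported in the Performance Benchmarks table of this guide; the 10% headroom and the function names `fits_on_gpu` and `best_precision` are assumptions for illustration. Real usage also grows with batch size and context length, so treat the result as a starting point.

```python
# Memory figures (GB) as reported in the Performance Benchmarks table of this guide.
MEMORY_GB = {"bf16": 34.2, "int8": 17.8, "int4": 10.4}


def fits_on_gpu(vram_gb: float, precision: str, headroom: float = 0.10) -> bool:
    """Return True if the reported footprint plus headroom fits in vram_gb.

    headroom is an assumed safety margin for activations and the KV cache.
    """
    required = MEMORY_GB[precision] * (1.0 + headroom)
    return required <= vram_gb


def best_precision(vram_gb: float):
    """Pick the highest-precision option that fits, or None if nothing does."""
    for precision in ("bf16", "int8", "int4"):
        if fits_on_gpu(vram_gb, precision):
            return precision
    return None


for vram in (12, 24, 40, 80):
    print(f"{vram} GB -> {best_precision(vram)}")
```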

### Software Requirements

```
Operating System: Ubuntu 20.04+, Rocky Linux 8+, or similar
Python: 3.8 - 3.11
CUDA: 11.8 or 12.1+
cuDNN: 8.9+
NVIDIA Driver: 525+
```

### Compatibility Matrix

| Component | Minimum | Recommended | Latest Tested |
|-----------|---------|-------------|---------------|
| PyTorch | 2.0.0 | 2.1.0 | 2.1.2 |
| Transformers | 4.35.0 | 4.36.0 | 4.37.0 |
| CUDA | 11.8 | 12.1 | 12.3 |
| Python | 3.8 | 3.10 | 3.11 |
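
One way to check an environment against the Minimum column is sketched below. It uses only the standard library; the helper names (`parse_version`, `check_minimums`, `installed_versions`) are illustrative, and the version parser is deliberately simple (it drops local suffixes such as `+cu121`).

```python
import re
from importlib import metadata

# Minimum versions from the compatibility matrix above.
MINIMUMS = {"torch": "2.0.0", "transformers": "4.35.0"}


def parse_version(text: str) -> tuple:
    """Turn '2.1.0+cu121' into (2, 1, 0); build and local suffixes are ignored."""
    core = re.match(r"\d+(?:\.\d+)*", text)
    if core is None:
        raise ValueError(f"unrecognized version: {text!r}")
    parts = tuple(int(part) for part in core.group().split("."))
    # Pad to three components so '2.0' compares equal to '2.0.0'.
    return parts + (0,) * max(0, 3 - len(parts))


def check_minimums(installed: dict) -> list:
    """Return human-readable problems; an empty list means all minimums are met."""
    problems = []
    for name, minimum in MINIMUMS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name} is not installed (need >= {minimum})")
        elif parse_version(found) < parse_version(minimum):
            problems.append(f"{name} {found} is older than {minimum}")
    return problems


def installed_versions() -> dict:
    """Look up installed distribution versions; missing packages map to None."""
    versions = {}
    for name in MINIMUMS:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(check_minimums(installed_versions()) or "all minimum versions satisfied")
```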

---

## Installation Methods

### Method 1: Standard Installation

```bash
# Create virtual environment
python -m venv helion-env
source helion-env/bin/activate  # On Windows: helion-env\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0

# Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import transformers; print(f'Transformers version: {transformers.__version__}')"
```

### Method 2: Docker Deployment

```dockerfile
# Dockerfile
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch and transformers
RUN pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install transformers==4.36.0 accelerate==0.24.0 bitsandbytes==0.41.0

# Copy application code
WORKDIR /app
COPY . /app

# Set environment variables
ENV TRANSFORMERS_CACHE=/app/cache
ENV HF_HOME=/app/cache

# Run inference server
CMD ["python3", "inference_server.py"]
```

```bash
# Build and run
docker build -t helion-v15-xl .
docker run --gpus all -p 8000:8000 helion-v15-xl
```

### Method 3: Kubernetes Deployment

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helion-v15-xl
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helion-v15-xl
  template:
    metadata:
      labels:
        app: helion-v15-xl
    spec:
      containers:
      - name: helion
        image: deepxr/helion-v15-xl:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "64Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: 1
            memory: "48Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_ID
          value: "DeepXR/Helion-V1.5-XL"
        - name: PRECISION
          value: "bfloat16"
        volumeMounts:
        - name: model-cache
          mountPath: /cache
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: helion-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: helion-v15-xl
```

### Method 4: vLLM for Production

```bash
# Install vLLM for optimized serving
pip install vllm

# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model DeepXR/Helion-V1.5-XL \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```
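
The server started above speaks an OpenAI-compatible HTTP API. A minimal standard-library client might look like the following sketch; it assumes the server is reachable at `http://localhost:8000` and serves the `/v1/completions` route, and the helper names are illustrative.

```python
import json
import urllib.request

# Assumed address: the vLLM OpenAI-compatible server on localhost, port 8000.
BASE_URL = "http://localhost:8000"


def build_completion_request(prompt: str, max_tokens: int = 256,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST request to the /v1/completions route."""
    payload = {
        "model": "DeepXR/Helion-V1.5-XL",  # must match the --model argument
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Explain machine learning in simple terms:")` returns the generated text.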

---

## Configuration

### Environment Variables

```bash
# Model configuration
export MODEL_ID="DeepXR/Helion-V1.5-XL"
export MODEL_PRECISION="bfloat16"
export MAX_SEQUENCE_LENGTH=8192
export CACHE_DIR="/path/to/cache"

# Performance tuning
export CUDA_VISIBLE_DEVICES=0,1
export OMP_NUM_THREADS=8
export TOKENIZERS_PARALLELISM=true

# Memory optimization
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/helion.log"
```
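
An application can read these variables once at startup and fail fast on bad values. The sketch below is one possible shape, not part of the Helion codebase: the `Settings` class, the `load_settings` helper, and the set of accepted precisions are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_id: str
    precision: str
    max_sequence_length: int
    cache_dir: str
    log_level: str


def load_settings(env=None) -> Settings:
    """Read the variables defined above, falling back to this guide's example values."""
    env = os.environ if env is None else env
    precision = env.get("MODEL_PRECISION", "bfloat16")
    # Assumed whitelist; extend it if the server supports other dtypes.
    if precision not in {"bfloat16", "float16", "float32"}:
        raise ValueError(f"unsupported MODEL_PRECISION: {precision}")
    return Settings(
        model_id=env.get("MODEL_ID", "DeepXR/Helion-V1.5-XL"),
        precision=precision,
        max_sequence_length=int(env.get("MAX_SEQUENCE_LENGTH", "8192")),
        cache_dir=env.get("CACHE_DIR", "/path/to/cache"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_settings({"MODEL_PRECISION": "bfloat16", "MAX_SEQUENCE_LENGTH": "4096"})
print(settings)
```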

### Configuration File (config.yaml)

```yaml
model:
  model_id: "DeepXR/Helion-V1.5-XL"
  precision: "bfloat16"
  device_map: "auto"
  load_in_4bit: false
  load_in_8bit: false
  
generation:
  max_new_tokens: 512
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
  do_sample: true
  
server:
  host: "0.0.0.0"
  port: 8000
  workers: 4
  timeout: 120
  max_batch_size: 32
  
cache:
  enabled: true
  directory: "/tmp/helion_cache"
  max_size_gb: 100
  
safety:
  content_filtering: true
  pii_detection: true
  rate_limiting: true
  max_requests_per_minute: 60
  
monitoring:
  enabled: true
  metrics_port: 9090
  log_level: "INFO"
```
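
Before starting the server, it can help to sanity-check the parsed file. The following sketch validates a few of the fields above on a plain mapping (the shape a YAML loader would return); the specific rules and the `validate_config` name are assumptions, not checks the server is known to perform.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of problems found in a parsed config.yaml mapping."""
    problems = []

    model = cfg.get("model", {})
    if model.get("load_in_4bit") and model.get("load_in_8bit"):
        problems.append("model: load_in_4bit and load_in_8bit are mutually exclusive")

    gen = cfg.get("generation", {})
    if not 0.0 < gen.get("top_p", 1.0) <= 1.0:
        problems.append("generation: top_p must be in (0, 1]")
    if gen.get("temperature", 1.0) < 0.0:
        problems.append("generation: temperature must be non-negative")
    if gen.get("max_new_tokens", 1) < 1:
        problems.append("generation: max_new_tokens must be at least 1")

    server = cfg.get("server", {})
    if not 1 <= server.get("port", 8000) <= 65535:
        problems.append("server: port must be between 1 and 65535")
    return problems


# Example mapping mirroring a subset of the file above.
example = {
    "model": {"load_in_4bit": False, "load_in_8bit": False},
    "generation": {"max_new_tokens": 512, "temperature": 0.7, "top_p": 0.9},
    "server": {"port": 8000},
}
print(validate_config(example))
```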

---

## Deployment Architectures

### Architecture 1: Single Instance (Development)

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────┐
│   FastAPI   │
│   Server    │
└──────┬──────┘
       │
       v
┌─────────────┐
│   Model     │
│  (1x A100)  │
└─────────────┘
```

**Use Case**: Development, testing, low-traffic applications

**Setup**:
```python
# server.py
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```

### Architecture 2: Load Balanced (Production)

```
                ┌─────────────┐
                │Load Balancer│
                └──────┬──────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        v              v              v
   ┌────────┐    ┌────────┐    ┌────────┐
   │Instance│    │Instance│    │Instance│
   │   1    │    │   2    │    │   3    │
   └────────┘    └────────┘    └────────┘
        │              │              │
        └──────────────┼──────────────┘
                       │
                       v
                ┌─────────────┐
                │   Redis     │
                │   Cache     │
                └─────────────┘
```

**Use Case**: Production applications with high availability
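
The diagram does not say what the Redis cache stores. One plausible use is caching responses for repeated prompts, sketched below with an in-memory dictionary standing in for Redis. The key scheme, class name, and `fake_generate` placeholder are all assumptions; response caching is only meaningful when generation is deterministic (for example, sampling disabled).

```python
import hashlib
import json


def cache_key(prompt: str, params: dict) -> str:
    """Derive a stable key from the prompt and generation parameters."""
    canonical = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return "helion:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory stand-in; a shared Redis instance would replace the dict."""

    def __init__(self):
        self._store = {}

    def get_or_generate(self, prompt, params, generate):
        key = cache_key(prompt, params)
        if key not in self._store:
            self._store[key] = generate(prompt, params)
        return self._store[key]


calls = []


def fake_generate(prompt, params):
    # Placeholder for a real model call; records how often it runs.
    calls.append(prompt)
    return f"response to: {prompt}"


cache = ResponseCache()
params = {"temperature": 0.0, "max_new_tokens": 64}
cache.get_or_generate("Hello", params, fake_generate)
cache.get_or_generate("Hello", params, fake_generate)  # served from the cache
print(len(calls))
```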

### Architecture 3: Distributed Inference (High Throughput)

```
                    ┌──────────────┐
                    │  API Gateway │
                    └──────┬───────┘
                           │
                    ┌──────┴───────┐
                    │ Job Scheduler│
                    └──────┬───────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        v                  v                  v
   ┌─────────┐        ┌─────────┐        ┌─────────┐
   │ GPU 0-1 │        │ GPU 2-3 │        │ GPU 4-5 │
   │ Tensor  │        │ Tensor  │        │ Tensor  │
   │Parallel │        │Parallel │        │Parallel │
   └─────────┘        └─────────┘        └─────────┘
```

**Use Case**: Very high throughput, batch processing

**Setup with Ray Serve**:
```python
import ray
import torch
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer

ray.init()

@serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 1})
class HelionModel:
    def __init__(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            "DeepXR/Helion-V1.5-XL",
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained("DeepXR/Helion-V1.5-XL")
    
    async def __call__(self, request):
        prompt = await request.json()
        inputs = self.tokenizer(prompt["text"], return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=512)
        return {"response": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Ray Serve 2.x replaces serve.start() and .deploy() with serve.run() on a bound deployment
serve.run(HelionModel.bind())
```

---

## Performance Optimization

### 1. Quantization

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit Quantization

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit Quantization (Maximum memory savings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
```

### 2. Flash Attention

```python
# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V1.5-XL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```

### 3. Compilation with torch.compile

```python
# Compile model for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")
```

### 4. KV Cache Optimization

```python
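# The KV cache is on by default; use_cache=True makes the intent explicit.
# Reusing a cache across separate generate() calls requires a past_key_values
# object from an earlier call, and support for that varies by Transformers
# version, so it is omitted here.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
)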
```

### 5. Batching

```python
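# Decoder-only models should be left-padded so generation continues from real tokens
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS when no pad token is defined

# Process multiple prompts in batch
prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)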

# Decode all outputs
responses = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
```
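
For longer prompt lists, the inputs are usually split into fixed-size batches so that each `generate` call stays within GPU memory. A minimal sketch of that splitting step follows; the `chunked` helper is illustrative, and the right batch size depends on prompt length and available VRAM.

```python
from typing import Iterator, List, Sequence


def chunked(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


prompts = [f"Prompt {i}" for i in range(1, 8)]
for batch in chunked(prompts, batch_size=3):
    # Each batch would be tokenized and passed to model.generate as shown above.
    print(batch)
```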

### Performance Benchmarks by Configuration

| Configuration | Tokens/sec | Latency (ms/token) | Memory (GB) | Speedup vs. A100 BF16 |
|---------------|------------|--------------------|-------------|-----------------------|
| A100 BF16 | 47.3 | 21.1 | 34.2 | Baseline |
| A100 INT8 | 89.6 | 11.2 | 17.8 | 1.9x faster |
| A100 INT4 | 134.2 | 7.5 | 10.4 | 2.8x faster |
| H100 BF16 | 78.1 | 12.8 | 34.2 | 1.65x faster |
| H100 INT4 | 218.7 | 4.6 | 10.4 | 4.6x faster |
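
The columns in this table are related by simple arithmetic: the latency column matches 1000 divided by tokens per second, and the last column matches each row's throughput divided by the A100 BF16 baseline. The sketch below reproduces both and adds an estimate of decode time for a given output length; the function names are illustrative, and the estimate ignores prompt processing and batching effects.

```python
# Tokens/sec values copied from the table above.
TOKENS_PER_SEC = {
    "A100 BF16": 47.3,
    "A100 INT8": 89.6,
    "A100 INT4": 134.2,
    "H100 BF16": 78.1,
    "H100 INT4": 218.7,
}


def ms_per_token(tokens_per_sec: float) -> float:
    """Per-token latency in milliseconds for a single stream."""
    return 1000.0 / tokens_per_sec


def speedup(config: str, baseline: str = "A100 BF16") -> float:
    """Throughput ratio relative to the baseline configuration."""
    return TOKENS_PER_SEC[config] / TOKENS_PER_SEC[baseline]


def generation_seconds(config: str, new_tokens: int) -> float:
    """Estimated decode time for new_tokens tokens, ignoring prompt processing."""
    return new_tokens / TOKENS_PER_SEC[config]


for name, tps in TOKENS_PER_SEC.items():
    print(f"{name}: {ms_per_token(tps):.1f} ms/token, {speedup(name):.2f}x")
```

For example, `generation_seconds("A100 BF16", 512)` gives roughly 10.8 seconds for a 512-token response.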

---

## Monitoring and Logging

### Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metrics
request_count = Counter('helion_requests_total', 'Total requests')
request_duration = Histogram('helion_request_duration_seconds', 'Request duration')
active_requests = Gauge('helion_active_requests', 'Active requests')
token_count = Counter('helion_tokens_generated', 'Tokens generated')
error_count = Counter('helion_errors_total', 'Total errors', ['error_type'])

# Start metrics server
start_http_server(9090)
```

### Structured Logging

```python
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
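
Request-level fields such as a request ID or latency can be attached through the standard `extra` argument of the logging calls. The sketch below is a variant of the formatter above that emits a few such fields; the field names (`request_id`, `latency_ms`, `tokens_generated`) and the logger name are assumptions chosen for illustration.

```python
import io
import json
import logging


class JSONFormatterWithContext(logging.Formatter):
    """Variant of the formatter above that also emits selected `extra` fields."""

    CONTEXT_FIELDS = ("request_id", "latency_ms", "tokens_generated")  # assumed names

    def format(self, record):
        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via extra={...} become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                log_data[field] = getattr(record, field)
        return json.dumps(log_data)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatterWithContext())
logger = logging.getLogger("helion.requests")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "generation complete",
    extra={"request_id": "req-123", "latency_ms": 840, "tokens_generated": 64},
)
print(stream.getvalue().strip())
```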

### Health Check Endpoint

```python
from fastapi.responses import JSONResponse

@app.get("/health")
async def health_check():
    try:
        # Check model is loaded
        assert model is not None
        # Check GPU is available
        assert torch.cuda.is_available()
        # Quick inference test
        test_input = tokenizer("test", return_tensors="pt").to(model.device)
        _ = model.generate(**test_input, max_new_tokens=1)
        return {"status": "healthy", "timestamp": datetime.utcnow().isoformat()}
    except Exception as e:
        # FastAPI does not honor Flask-style (body, status) tuples, so return an explicit 503
        return JSONResponse(status_code=503, content={"status": "unhealthy", "error": str(e)})
```

### Grafana Dashboard Configuration

```json
{
  "dashboard": {
    "title": "Helion-V1.5-XL Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{"expr": "rate(helion_requests_total[1m])"}]
      },
      {
        "title": "Average Latency",
        "targets": [{"expr": "rate(helion_request_duration_seconds_sum[5m]) / rate(helion_request_duration_seconds_count[5m])"}]
      },
      {
        "title": "GPU Utilization",
        "targets": [{"expr": "nvidia_gpu_utilization"}]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [{"expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100"}]
      }
    ]
  }
}
```

---

## Scaling Strategies

### Horizontal Scaling

```bash
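# Using Kubernetes HPA
# kubectl autoscale has no --memory-percent flag; memory targets are normally
# declared in an autoscaling/v2 HorizontalPodAutoscaler manifest instead
kubectl autoscale deployment helion-v15-xl \
  --min=2 \
  --max=10 \
  --cpu-percent=70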
```

### Vertical Scaling

| Traffic Level | Configuration | Instances |
|---------------|---------------|-----------|
| Low (< 10 req/s) | 1x A100 40GB, INT8 | 1 |
| Medium (10-50 req/s) | 1x A100 80GB, BF16 | 2-3 |
| High (50-200 req/s) | 2x A100 80GB, BF16 | 4-6 |
| Very High (200+ req/s) | Multiple H100 clusters | 10+ |
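
The table can be turned into a simple lookup for capacity planning scripts. In the sketch below, the tiers are copied from the table; treating each boundary value (10, 50, 200 req/s) as belonging to the higher tier is an assumption, since the table's ranges overlap at those points.

```python
# (exclusive upper bound in req/s, configuration, instances) copied from the table above.
SIZING_TIERS = [
    (10, "1x A100 40GB, INT8", "1"),
    (50, "1x A100 80GB, BF16", "2-3"),
    (200, "2x A100 80GB, BF16", "4-6"),
    (float("inf"), "Multiple H100 clusters", "10+"),
]


def recommend(requests_per_second: float) -> tuple:
    """Return (configuration, instances) for an expected request rate."""
    if requests_per_second < 0:
        raise ValueError("request rate cannot be negative")
    for upper_bound, configuration, instances in SIZING_TIERS:
        if requests_per_second < upper_bound:
            return configuration, instances
    # Only reached for an infinite rate; fall back to the largest tier.
    return SIZING_TIERS[-1][1:]


for rate in (5, 30, 120, 500):
    print(rate, recommend(rate))
```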

### Request Queuing

```python
from asyncio import Queue, create_task
import asyncio

request_queue = Queue(maxsize=100)
batch_size = 8

async def batch_processor():
    while True:
        batch = []
        for _ in range(batch_size):
            try:
                item = await asyncio.wait_for(request_queue.get(), timeout=0.1)
                batch.append(item)
            except asyncio.TimeoutError:
                break
        
        if batch:
            # Process batch
            prompts = [item["prompt"] for item in batch]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=256)
            
            # Return results
            for item, output in zip(batch, outputs):
                item["future"].set_result(tokenizer.decode(output, skip_special_tokens=True))

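# Start the background task once the event loop is running;
# calling create_task() at import time fails with "no running event loop"
@app.on_event("startup")
async def start_batch_processor():
    # Keep a reference so the task is not garbage-collected
    app.state.batch_task = create_task(batch_processor())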
```
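
The processor above expects each queue item to carry a `prompt` and a `future`, but the producer side is not shown. The self-contained sketch below illustrates that handshake; the model call is replaced by a placeholder that upper-cases the prompt, and the function names (`batch_worker`, `submit`) are illustrative.

```python
import asyncio


async def batch_worker(queue: asyncio.Queue, batch_size: int = 8) -> None:
    """Drain up to batch_size items, 'process' them, and resolve their futures."""
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        for item in batch:
            # Placeholder for tokenizer + model.generate on the whole batch.
            item["future"].set_result(item["prompt"].upper())


async def submit(queue: asyncio.Queue, prompt: str) -> str:
    """Enqueue one prompt and wait for the worker to deliver its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put({"prompt": prompt, "future": future})
    return await future


async def main() -> list:
    queue = asyncio.Queue(maxsize=100)
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(*(submit(queue, p) for p in ["a", "b", "c"]))
    worker.cancel()  # stop the background worker once all results are in
    return results


results = asyncio.run(main())
print(results)
```

In a FastAPI handler, the body of `submit` would run inside the request coroutine, so each client waits only for its own future while the worker batches requests behind the scenes.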

---

## Security Best Practices

### 1. API Authentication

```python
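import os
import secrets

from fastapi import HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    expected = os.getenv("API_TOKEN", "")
    supplied = credentials.credentials
    # Constant-time comparison; an unset API_TOKEN rejects every request
    if not expected or not secrets.compare_digest(supplied.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="Invalid authentication")
    return credentials.credentials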

@app.post("/generate")
async def generate(prompt: str, token: str = Security(verify_token)):
    # Process request
    pass
```

### 2. Rate Limiting

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("60/minute")
async def generate(request: Request, prompt: str):
    # Process request
    pass
```

### 3. Input Validation

```python
from pydantic import BaseModel, Field, validator

class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=8000)
    max_tokens: int = Field(512, ge=1, le=2048)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    
    @validator('prompt')
    def validate_prompt(cls, v):
        # Reject a small denylist of suspicious substrings (compared in lowercase)
        if any(bad in v.lower() for bad in ['<script>', 'drop table']):
            raise ValueError('Invalid prompt content')
        return v
```

### 4. Content Filtering Integration

```python
from safeguard_filters import ContentSafetyFilter, RefusalGenerator

safety_filter = ContentSafetyFilter()
refusal_gen = RefusalGenerator()

@app.post("/generate")
async def generate(request: GenerationRequest):
    # Check input safety
    is_safe, violations = safety_filter.check_input(request.prompt)
    if not is_safe:
        return {"error": refusal_gen.generate_refusal(violations[0])}
    
    # Generate response
    outputs = model.generate(...)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Check output safety
    is_safe, violations = safety_filter.check_output(response)
    if not is_safe:
        response = safety_filter.redact_pii(response)
    
    return {"response": response}
```

---

## Troubleshooting

### Common Issues and Solutions

#### Issue 1: Out of Memory (OOM)

**Symptoms**: CUDA out of memory error

**Solutions**:
```python
# Solution 1: Use quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,  # or load_in_4bit=True
    device_map="auto"
)

# Solution 2: Reduce batch size
# Use batch_size=1 for inference

# Solution 3: Reduce generation length (fewer new tokens means a smaller KV cache)
outputs = model.generate(**inputs, max_new_tokens=256)  # Instead of 512

# Solution 4: Clear cache
torch.cuda.empty_cache()
```

#### Issue 2: Slow Inference

**Symptoms**: High latency, low throughput

**Solutions**:
```python
# Solution 1: Enable Flash Attention (requires half precision and the flash-attn package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Solution 2: Use compilation
model = torch.compile(model)

# Solution 3: Use vLLM
# Install: pip install vllm
# Run with vLLM server (much faster)

# Solution 4: Batch requests
# Process multiple requests together
```

#### Issue 3: Model Not Loading

**Symptoms**: Download errors, corruption

**Solutions**:
```bash
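# Clear the cached copy of this model
# (removing all of ~/.cache/huggingface/ would also delete every other cached model)
rm -rf ~/.cache/huggingface/hub/models--DeepXR--Helion-V1.5-XL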

# Download manually
huggingface-cli download DeepXR/Helion-V1.5-XL

# Check disk space
df -h

# Verify CUDA installation
nvidia-smi
```

#### Issue 4: Quality Degradation with Quantization

**Solutions**:
- Use INT8 instead of INT4
- Calibrate quantization with representative data
- For 4-bit loads, use the NF4 quant type with a BF16 compute dtype; `bnb_4bit_use_double_quant=True` mainly reduces memory and is not a quality fix

### Debugging Commands

```bash
# Check GPU status
nvidia-smi

# Monitor GPU usage
watch -n 1 nvidia-smi

# Check Python packages
pip list | grep -E "torch|transformers"

# Test CUDA
python -c "import torch; print(torch.cuda.is_available())"

# Memory profiling
python -m memory_profiler your_script.py

# Performance profiling
python -m cProfile -o output.prof your_script.py
```

---

## Production Checklist

### Pre-Deployment

- [ ] Hardware requirements verified
- [ ] Dependencies installed and tested
- [ ] Model downloaded and loaded successfully
- [ ] Inference tested with sample prompts
- [ ] Performance benchmarks meet requirements
- [ ] Memory usage within acceptable limits
- [ ] Safety filters configured and tested
- [ ] API authentication implemented
- [ ] Rate limiting configured
- [ ] Input validation in place
- [ ] Error handling implemented
- [ ] Logging configured
- [ ] Monitoring dashboards set up
- [ ] Health check endpoints working
- [ ] Load testing completed
- [ ] Security audit passed
- [ ] Documentation complete

### Post-Deployment

- [ ] Monitor error rates
- [ ] Track latency metrics
- [ ] Monitor GPU utilization
- [ ] Check memory usage trends
- [ ] Review safety violation logs
- [ ] Analyze user feedback
- [ ] Update model if needed
- [ ] Scale based on load
- [ ] Regular security updates
- [ ] Backup configurations
- [ ] Disaster recovery tested
- [ ] Performance optimization ongoing

### Maintenance Schedule

| Task | Frequency | Responsibility |
|------|-----------|----------------|
| Check error logs | Daily | DevOps |
| Review performance metrics | Daily | ML Engineers |
| Security updates | Weekly | Security Team |
| Model evaluation | Monthly | Data Science |
| Capacity planning | Monthly | Infrastructure |
| Disaster recovery drill | Quarterly | All Teams |
| Full system audit | Annually | External Auditor |

---

## Additional Resources

### Documentation
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [PyTorch Documentation](https://pytorch.org/docs)
- [CUDA Programming Guide](https://docs.nvidia.com/cuda/)

### Support Channels
- GitHub Issues: For bug reports and feature requests
- Community Forum: For general questions and discussions
- Enterprise Support: For production deployments

### Example Projects
- REST API Server: `/examples/rest_api`
- Streaming Interface: `/examples/streaming`
- Batch Processing: `/examples/batch_processing`
- Fine-tuning: `/examples/fine_tuning`

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2024-11-01 | Initial release |
| 1.0.1 | 2024-11-15 | Performance optimizations |
| 1.1.0 | 2024-12-01 | Flash Attention 2 support |

---

**Last Updated**: 2024-11-10

**Maintained By**: DeepXR Engineering Team