# Deployment Configuration Guide

## Critical Issues and Solutions

### 1. Cache Directory Permissions

**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`

**Solution**: The code now automatically detects when it is running in Docker and falls back to `/tmp/huggingface_cache` (see the detection sketch after the Dockerfile fix below). The Dockerfile must still create that directory with write permissions.

**Dockerfile Fix**:
```dockerfile
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
```
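
For reference, the detection logic can be sketched roughly as follows. This is an illustrative sketch, not the actual contents of the project's loader; the `/.dockerenv` check is one common heuristic for container detection:

```python
import os

def resolve_cache_dir() -> str:
    """Return a writable Hugging Face cache directory."""
    # Respect an explicit override first.
    explicit = os.getenv("HF_HOME")
    if explicit:
        return explicit
    # /.dockerenv is created by the Docker runtime, so its presence is a
    # common (if imperfect) signal that we are inside a container.
    if os.path.exists("/.dockerenv"):
        return "/tmp/huggingface_cache"
    # Outside Docker, fall back to the library's usual default location.
    return os.path.expanduser("~/.cache/huggingface")

cache_dir = resolve_cache_dir()
os.makedirs(cache_dir, exist_ok=True)
os.environ.setdefault("HF_HOME", cache_dir)
```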

### 2. User ID Issues

**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`

**Solution**: Run the container as a user that actually exists in the image's `/etc/passwd`, or create that user in the Dockerfile.

**Option A - Run as root (only where your runtime allows it)**:
```dockerfile
# Running as root sidesteps the passwd lookup entirely, but note that
# Docker Spaces run containers as UID 1000, not root, so this option
# does not apply there. Either way, keep cache directories writable.
```

**Option B - Create user in Dockerfile**:
```dockerfile
RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
```

**For Hugging Face Spaces**: Docker Spaces run the container as UID 1000 without a matching passwd entry (which is exactly why the error above names UID 1000), so Option B is the reliable fix there. The snippet below shows where the error comes from.
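
The error itself comes from Python's `pwd` module: `pwd.getpwuid()` raises `KeyError` when the current UID has no `/etc/passwd` entry. If application code needs a user name, a defensive lookup avoids the crash entirely. A minimal sketch (not the project's actual code):

```python
import os
import pwd

def current_user() -> str:
    """Best-effort user name that tolerates a missing passwd entry."""
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        # No /etc/passwd entry for this UID (common in containers).
        return os.environ.get("USER", f"uid-{os.getuid()}")
```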

### 3. HuggingFace Token Configuration

**Problem**: Gated repository access errors

**Solution**: Set HF_TOKEN in Hugging Face Spaces secrets.

**Steps**:
1. Go to your Space → Settings → Repository secrets
2. Add `HF_TOKEN` with your Hugging Face access token
3. Token should have read access to gated models

**Verify Token**:
```bash
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
```
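
The same check works from Python via `huggingface_hub` (installed alongside `transformers`); `whoami()` raises if the token is invalid, and `model_info()` raises a gated-repo error if access has not been granted yet:

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
print("Token belongs to:", api.whoami()["name"])
print("Gated model visible:", api.model_info("Qwen/Qwen2.5-7B-Instruct").id)
```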

### 4. GPU Tensor Device Placement

**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`

**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.

**Code Fix**: Already implemented in `src/local_model_loader.py`: `device_map="auto"` is used only when quantization is enabled; otherwise the model is moved to the target device explicitly (sketched below).
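
As a reference for what that pattern looks like, here is a simplified sketch (not the actual `local_model_loader.py` code; the 4-bit config stands in for the loader's real settings):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
use_quantization = torch.cuda.is_available()  # placeholder for the loader's flag

if use_quantization:
    # Quantized weights rely on accelerate's dispatch, so device_map="auto"
    # is appropriate here.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
else:
    # Non-quantized: load normally, then move the whole model explicitly.
    # Mixing device_map="auto" in this path is what triggers the meta-device error.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.to(device)
```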

### 5. Model Selection for Testing

**Current Models**:
- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)

**For Testing Without Gated Models**:
Update `src/models_config.py` to use non-gated models:
```python
"reasoning_primary": {
    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
    ...
}
```

## Recommended Dockerfile Updates

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true

# Expose port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1

# Run with Gunicorn (note: each worker is a separate process; if the model
# loads in-process, multiple workers multiply GPU memory use)
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
```
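
The `HEALTHCHECK` above assumes the app exposes `/api/health`. If yours does not yet, a minimal Flask route along these lines satisfies it (a sketch only; the real route belongs in `flask_api_standalone.py`):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/health")
def health():
    # Keep this cheap: Docker's HEALTHCHECK calls it every 30 seconds.
    return jsonify(status="ok"), 200
```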

## Hugging Face Spaces Configuration

### Required Secrets:
1. `HF_TOKEN` - Your Hugging Face access token (for gated models)

### Environment Variables (Optional):
- `HF_HOME` - defaults to `/tmp/huggingface_cache` when Docker is detected
- `TRANSFORMERS_CACHE` - defaults to `/tmp/huggingface_cache` when Docker is detected

### Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
- Memory: At least 8GB RAM
- Disk: 20GB+ for model cache

## Verification Steps

1. **Check Cache Directory**:
   ```bash
   ls -la /tmp/huggingface_cache
   # Should show writable directory
   ```

2. **Check HF Token**:
   ```python
   import os
   print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
   ```

3. **Check GPU**:
   ```python
   import torch
   print("CUDA available:", torch.cuda.is_available())
   print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
   ```

4. **Test Model Loading**:
   - Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
   - Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
   - Check logs for: `✓ Model loaded successfully`

## Troubleshooting

### Issue: Still getting permission errors
**Fix**: Ensure Dockerfile creates cache directory with 777 permissions

### Issue: Gated repository errors persist
**Fix**: 
1. Verify HF_TOKEN is set in Spaces secrets
2. Visit model page and request access
3. Wait for approval (usually instant)
4. Use fallback model (Phi-3-mini) until access granted

### Issue: Tensor device errors
**Fix**: The loader now handles this: if quantized loading fails, it retries without quantization and uses explicit device placement (sketched below)
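
In sketch form (illustrative only, not the loader's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
try:
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
except Exception as exc:  # e.g. bitsandbytes missing or device mixups
    print(f"Quantized load failed ({exc}); retrying without quantization")
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
```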

### Issue: Model too large for GPU
**Fix**: 
- Code automatically falls back to no quantization if bitsandbytes fails
- Consider using smaller model (Phi-3-mini) for testing
- Check GPU memory: `nvidia-smi` (or from Python; see the snippet below)
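
For the Python route, `torch.cuda.mem_get_info()` reports free and total device memory in bytes:

```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```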

## Quick Start Checklist

- [ ] HF_TOKEN set in Spaces secrets
- [ ] Dockerfile creates cache directory with proper permissions
- [ ] GPU detected (check logs)
- [ ] Cache directory writable (check logs)
- [ ] Model access granted (or using non-gated fallback)
- [ ] No tensor device errors (check logs)

## Next Steps

1. Update Dockerfile with cache directory creation
2. Set HF_TOKEN in Spaces secrets
3. Request access to gated models (Qwen)
4. Test with fallback model first (Phi-3-mini)
5. Monitor logs for successful model loading