# Deployment Configuration Guide

## Critical Issues and Solutions

### 1. Cache Directory Permissions

**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`

**Solution**: The code now automatically detects when it is running in Docker and falls back to `/tmp/huggingface_cache` (see the detection sketch after the Dockerfile fix below). The Dockerfile must still create that directory with write permissions.

**Dockerfile Fix**:
```dockerfile
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
```
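
For reference, the detection logic can be sketched roughly as follows. This is an illustrative sketch, not the actual contents of the project's loader; the `/.dockerenv` check is one common heuristic for container detection:

```python
import os

def resolve_cache_dir() -> str:
    """Return a writable Hugging Face cache directory."""
    # Respect an explicit override first.
    explicit = os.getenv("HF_HOME")
    if explicit:
        return explicit
    # /.dockerenv is created by the Docker runtime, so its presence is a
    # common (if imperfect) signal that we are inside a container.
    if os.path.exists("/.dockerenv"):
        return "/tmp/huggingface_cache"
    # Outside Docker, fall back to the library's usual default location.
    return os.path.expanduser("~/.cache/huggingface")

cache_dir = resolve_cache_dir()
os.makedirs(cache_dir, exist_ok=True)
os.environ.setdefault("HF_HOME", cache_dir)
```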

### 2. User ID Issues

**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`

**Solution**: Run the container as a user that actually exists in the image's `/etc/passwd`, or create that user in the Dockerfile.

**Option A - Run as root (only where your runtime allows it)**:
```dockerfile
# Running as root sidesteps the passwd lookup entirely, but note that
# Docker Spaces run containers as UID 1000, not root, so this option
# does not apply there. Either way, keep cache directories writable.
```

**Option B - Create user in Dockerfile**:
```dockerfile
RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
```

**For Hugging Face Spaces**: Docker Spaces run the container as UID 1000 without a matching passwd entry (which is exactly why the error above names UID 1000), so Option B is the reliable fix there. The snippet below shows where the error comes from.
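
The error itself comes from Python's `pwd` module: `pwd.getpwuid()` raises `KeyError` when the current UID has no `/etc/passwd` entry. If application code needs a user name, a defensive lookup avoids the crash entirely. A minimal sketch (not the project's actual code):

```python
import os
import pwd

def current_user() -> str:
    """Best-effort user name that tolerates a missing passwd entry."""
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        # No /etc/passwd entry for this UID (common in containers).
        return os.environ.get("USER", f"uid-{os.getuid()}")
```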

### 3. HuggingFace Token Configuration

**Problem**: Gated repository access errors

**Solution**: Set HF_TOKEN in Hugging Face Spaces secrets.

**Steps**:
1. Go to your Space → Settings → Repository secrets
2. Add `HF_TOKEN` with your Hugging Face access token
3. Token should have read access to gated models

**Verify Token**:
```bash
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
```
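
The same check works from Python via `huggingface_hub` (installed alongside `transformers`); `whoami()` raises if the token is invalid, and `model_info()` raises a gated-repo error if access has not been granted yet:

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
print("Token belongs to:", api.whoami()["name"])
print("Gated model visible:", api.model_info("Qwen/Qwen2.5-7B-Instruct").id)
```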

### 4. GPU Tensor Device Placement

**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`

**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.

**Code Fix**: Already implemented in `src/local_model_loader.py`: `device_map="auto"` is used only when quantization is enabled; otherwise the model is moved to the target device explicitly (sketched below).
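
As a reference for what that pattern looks like, here is a simplified sketch (not the actual `local_model_loader.py` code; the 4-bit config stands in for the loader's real settings):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
use_quantization = torch.cuda.is_available()  # placeholder for the loader's flag

if use_quantization:
    # Quantized weights rely on accelerate's dispatch, so device_map="auto"
    # is appropriate here.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
else:
    # Non-quantized: load normally, then move the whole model explicitly.
    # Mixing device_map="auto" in this path is what triggers the meta-device error.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.to(device)
```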

### 5. Model Selection for Testing

**Current Models**:
- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)

**For Testing Without Gated Models**:
Update `src/models_config.py` to use non-gated models:
```python
"reasoning_primary": {
    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
    ...
}
```

## Recommended Dockerfile Updates

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true

# Expose port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1

# Run with Gunicorn (note: each worker is a separate process; if the model
# loads in-process, multiple workers multiply GPU memory use)
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
```
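
The `HEALTHCHECK` above assumes the app exposes `/api/health`. If yours does not yet, a minimal Flask route along these lines satisfies it (a sketch only; the real route belongs in `flask_api_standalone.py`):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/health")
def health():
    # Keep this cheap: Docker's HEALTHCHECK calls it every 30 seconds.
    return jsonify(status="ok"), 200
```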

## Hugging Face Spaces Configuration

### Required Secrets:
1. `HF_TOKEN` - Your Hugging Face access token (for gated models)

### Environment Variables (Optional):
- `HF_HOME` - defaults to `/tmp/huggingface_cache` when Docker is detected
- `TRANSFORMERS_CACHE` - defaults to `/tmp/huggingface_cache` when Docker is detected

### Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
- Memory: At least 8GB RAM
- Disk: 20GB+ for model cache

## Verification Steps

1. **Check Cache Directory**:
   ```bash
   ls -la /tmp/huggingface_cache
   # Should show writable directory
   ```

2. **Check HF Token**:
   ```python
   import os
   print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
   ```

3. **Check GPU**:
   ```python
   import torch
   print("CUDA available:", torch.cuda.is_available())
   print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
   ```

4. **Test Model Loading**:
   - Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
   - Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
   - Check logs for: `✓ Model loaded successfully`

## Troubleshooting

### Issue: Still getting permission errors
**Fix**: Ensure Dockerfile creates cache directory with 777 permissions

### Issue: Gated repository errors persist
**Fix**: 
1. Verify HF_TOKEN is set in Spaces secrets
2. Visit model page and request access
3. Wait for approval (usually instant)
4. Use fallback model (Phi-3-mini) until access granted

### Issue: Tensor device errors
**Fix**: The loader now handles this: if quantized loading fails, it retries without quantization and uses explicit device placement (sketched below)
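
In sketch form (illustrative only, not the loader's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
try:
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
except Exception as exc:  # e.g. bitsandbytes missing or device mixups
    print(f"Quantized load failed ({exc}); retrying without quantization")
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
```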

### Issue: Model too large for GPU
**Fix**: 
- Code automatically falls back to no quantization if bitsandbytes fails
- Consider using smaller model (Phi-3-mini) for testing
- Check GPU memory: `nvidia-smi` (or from Python; see the snippet below)
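
For the Python route, `torch.cuda.mem_get_info()` reports free and total device memory in bytes:

```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```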

## Quick Start Checklist

- [ ] HF_TOKEN set in Spaces secrets
- [ ] Dockerfile creates cache directory with proper permissions
- [ ] GPU detected (check logs)
- [ ] Cache directory writable (check logs)
- [ ] Model access granted (or using non-gated fallback)
- [ ] No tensor device errors (check logs)

## Next Steps

1. Update Dockerfile with cache directory creation
2. Set HF_TOKEN in Spaces secrets
3. Request access to gated models (Qwen)
4. Test with fallback model first (Phi-3-mini)
5. Monitor logs for successful model loading