Helion-V2.0-Thinking Quickstart Guide
Get started with Helion-V2.0-Thinking in minutes.
Installation
Basic Installation
pip install transformers torch accelerate pillow requests
Full Installation (with all features)
pip install -r requirements.txt
GPU Requirements
- Minimum: 24GB VRAM (RTX 4090, A5000)
- Recommended: 40GB+ VRAM (A100, H100)
- Quantized (8-bit): 16GB VRAM
- Quantized (4-bit): 12GB VRAM
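To see which tier your GPU falls into, you can query the available VRAM with PyTorch (a quick sanity check, assuming a CUDA-capable GPU is visible to the driver):
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected")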
Quick Examples
1. Basic Text Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "DeepXR/Helion-V2.0-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
prompt = "What is artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
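The call above uses the default decoding settings. For more varied output, sampling parameters can be passed directly to generate(); the values below are illustrative starting points rather than tuned recommendations:
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # higher = more varied, lower = more deterministic
    top_p=0.9         # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))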
2. Image Understanding
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
image = Image.open("photo.jpg")
prompt = "What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
3. Using the Inference Script
# Interactive chat mode
python inference.py --interactive
# With image analysis
python inference.py --image photo.jpg --prompt "Describe this image"
# Run demos
python inference.py --demo
# With quantization (saves memory)
python inference.py --interactive --load-in-4bit
4. With Safety Wrapper
from safety_wrapper import SafeHelionWrapper
# Initialize with safety features
wrapper = SafeHelionWrapper(
    model_name="DeepXR/Helion-V2.0-Thinking",
    enable_safety=True,
    enable_rate_limiting=True
)
# Safe generation
response = wrapper.generate(
    prompt="Explain photosynthesis",
    max_new_tokens=256
)
print(response)
5. Function Calling
import json
tools = [{
    "name": "calculator",
    "description": "Perform calculations",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string"}
        }
    }
}]
prompt = f"""Available tools: {json.dumps(tools)}
User: What is 125 * 48?
Assistant (respond with JSON):"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
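The snippet above only prints the raw completion. To actually use a tool you would parse the JSON the model emits and dispatch to the matching function. The exact output schema depends on how the model formats tool calls; the sketch below assumes an object with "name" and "arguments" keys and should be adapted to the format you observe:
import json

raw = tokenizer.decode(outputs[0], skip_special_tokens=True)
call_text = raw.split("Assistant (respond with JSON):")[-1].strip()

try:
    call = json.loads(call_text)  # assumed shape: {"name": ..., "arguments": {"expression": ...}}
    if call.get("name") == "calculator":
        # eval() is acceptable only for this toy calculator demo; validate inputs in real code
        result = eval(call["arguments"]["expression"])
        print(f"Tool result: {result}")
except json.JSONDecodeError:
    print("Model did not return valid JSON:", call_text)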
Memory-Efficient Options
8-bit Quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
4-bit Quantization
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
Running Benchmarks
# Full benchmark suite
python benchmark.py --model DeepXR/Helion-V2.0-Thinking
# Evaluation suite
python evaluate.py --model DeepXR/Helion-V2.0-Thinking
Common Use Cases
Chatbot
conversation = []
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    conversation.append({"role": "user", "content": user_input})
    prompt = "\n".join([
        f"{msg['role'].capitalize()}: {msg['content']}"
        for msg in conversation
    ]) + "\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = response.split("Assistant:")[-1].strip()
    conversation.append({"role": "assistant", "content": response})
    print(f"Assistant: {response}")
Document Analysis
# Read long document
with open("document.txt", "r") as f:
    document = f.read()
prompt = f"""{document}
Please provide:
1. A summary of the main points
2. Key takeaways
3. Any recommendations
Summary:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
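Documents longer than the model's context window will be silently cut off. One simple guard is to truncate the document to a token budget before building the prompt; the 8192-token limit below is a placeholder, so check the model card for the actual context length:
# Truncate the document to a fixed token budget (placeholder value)
doc_ids = tokenizer(document, truncation=True, max_length=8192)["input_ids"]
document = tokenizer.decode(doc_ids, skip_special_tokens=True)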
Code Generation
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Returns sorted in descending order
Include type hints and docstring."""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3  # Lower temperature for code
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Troubleshooting
Out of Memory
- Use quantization (4-bit or 8-bit)
- Reduce max_new_tokens (see the sketch after this list)
- Enable gradient checkpointing
- Use smaller batch sizes
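Applied to the earlier generation examples, the first two points look roughly like this (the token budget is illustrative):
import torch

# Smaller generation budget per request
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Release cached GPU memory between requests
del outputs, inputs
torch.cuda.empty_cache()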
Slow Performance
- Enable Flash Attention 2: attn_implementation="flash_attention_2" (see the sketch after this list)
- Use GPU if available
- Reduce context length
- Use quantization
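A minimal sketch of loading with Flash Attention 2, assuming the flash-attn package is installed and the GPU supports it (the argument name applies to recent transformers releases):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V2.0-Thinking",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)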
Installation Issues
# Update pip
pip install --upgrade pip
# Install from scratch
pip uninstall transformers torch
pip install transformers torch accelerate
# CUDA issues
pip install torch --index-url https://download.pytorch.org/whl/cu121
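# Verify the install and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"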
Next Steps
- Read the full README.md for detailed documentation
- Check out inference.py for more examples
- Review safety_wrapper.py for safety features
- Run benchmark.py to test performance
- See evaluate.py for quality metrics
Support
For issues and questions:
- Check the Hugging Face model page
- Review existing issues
- Submit a new issue with details
License
Apache 2.0 - See LICENSE file for details