Helion-V2.0-Thinking Quickstart Guide
Get started with Helion-V2.0-Thinking in minutes.
Installation
Basic Installation
pip install transformers torch accelerate pillow requests
Full Installation (with all features)
pip install -r requirements.txt
GPU Requirements
- Minimum: 24GB VRAM (RTX 4090, A5000)
- Recommended: 40GB+ VRAM (A100, H100)
- Quantized (8-bit): 16GB VRAM
- Quantized (4-bit): 12GB VRAM
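To see which tier your GPU falls into, you can query the available VRAM with PyTorch (a quick sanity check, assuming a CUDA-capable GPU is visible to the driver):
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected")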
Quick Examples
1. Basic Text Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "DeepXR/Helion-V2.0-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
prompt = "What is artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
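The call above uses the default decoding settings. For more varied output, sampling parameters can be passed directly to generate(); the values below are illustrative starting points rather than tuned recommendations:
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # higher = more varied, lower = more deterministic
    top_p=0.9         # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))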
2. Image Understanding
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
image = Image.open("photo.jpg")
prompt = "What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
3. Using the Inference Script
# Interactive chat mode
python inference.py --interactive
# With image analysis
python inference.py --image photo.jpg --prompt "Describe this image"
# Run demos
python inference.py --demo
# With quantization (saves memory)
python inference.py --interactive --load-in-4bit
4. With Safety Wrapper
from safety_wrapper import SafeHelionWrapper
# Initialize with safety features
wrapper = SafeHelionWrapper(
    model_name="DeepXR/Helion-V2.0-Thinking",
    enable_safety=True,
    enable_rate_limiting=True
)
# Safe generation
response = wrapper.generate(
    prompt="Explain photosynthesis",
    max_new_tokens=256
)
print(response)
5. Function Calling
import json
tools = [{
    "name": "calculator",
    "description": "Perform calculations",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string"}
        }
    }
}]
prompt = f"""Available tools: {json.dumps(tools)}
User: What is 125 * 48?
Assistant (respond with JSON):"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
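The snippet above only prints the raw completion. To actually use a tool you would parse the JSON the model emits and dispatch to the matching function. The exact output schema depends on how the model formats tool calls; the sketch below assumes an object with "name" and "arguments" keys and should be adapted to the format you observe:
import json

raw = tokenizer.decode(outputs[0], skip_special_tokens=True)
call_text = raw.split("Assistant (respond with JSON):")[-1].strip()

try:
    call = json.loads(call_text)  # assumed shape: {"name": ..., "arguments": {"expression": ...}}
    if call.get("name") == "calculator":
        # eval() is acceptable only for this toy calculator demo; validate inputs in real code
        result = eval(call["arguments"]["expression"])
        print(f"Tool result: {result}")
except json.JSONDecodeError:
    print("Model did not return valid JSON:", call_text)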
Memory-Efficient Options
8-bit Quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
4-bit Quantization
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
Running Benchmarks
# Full benchmark suite
python benchmark.py --model DeepXR/Helion-V2.0-Thinking
# Evaluation suite
python evaluate.py --model DeepXR/Helion-V2.0-Thinking
Common Use Cases
Chatbot
conversation = []
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    conversation.append({"role": "user", "content": user_input})
    prompt = "\n".join([
        f"{msg['role'].capitalize()}: {msg['content']}"
        for msg in conversation
    ]) + "\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = response.split("Assistant:")[-1].strip()
    conversation.append({"role": "assistant", "content": response})
    print(f"Assistant: {response}")
Document Analysis
# Read long document
with open("document.txt", "r") as f:
    document = f.read()
prompt = f"""{document}
Please provide:
1. A summary of the main points
2. Key takeaways
3. Any recommendations
Summary:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
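Documents longer than the model's context window will be silently cut off. One simple guard is to truncate the document to a token budget before building the prompt; the 8192-token limit below is a placeholder, so check the model card for the actual context length:
# Truncate the document to a fixed token budget (placeholder value)
doc_ids = tokenizer(document, truncation=True, max_length=8192)["input_ids"]
document = tokenizer.decode(doc_ids, skip_special_tokens=True)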
Code Generation
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Returns sorted in descending order
Include type hints and docstring."""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3  # Lower temperature for code
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Troubleshooting
Out of Memory
- Use quantization (4-bit or 8-bit)
- Reduce max_new_tokens (see the sketch after this list)
- Enable gradient checkpointing
- Use smaller batch sizes
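Applied to the earlier generation examples, the first two points look roughly like this (the token budget is illustrative):
import torch

# Smaller generation budget per request
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Release cached GPU memory between requests
del outputs, inputs
torch.cuda.empty_cache()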
Slow Performance
- Enable Flash Attention 2: attn_implementation="flash_attention_2" (see the sketch after this list)
- Use GPU if available
- Reduce context length
- Use quantization
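A minimal sketch of loading with Flash Attention 2, assuming the flash-attn package is installed and the GPU supports it (the argument name applies to recent transformers releases):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V2.0-Thinking",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)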
Installation Issues
# Update pip
pip install --upgrade pip
# Install from scratch
pip uninstall transformers torch
pip install transformers torch accelerate
# CUDA issues
pip install torch --index-url https://download.pytorch.org/whl/cu121
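# Verify the install and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"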
Next Steps
- Read the full README.md for detailed documentation
- Check out inference.py for more examples
- Review safety_wrapper.py for safety features
- Run benchmark.py to test performance
- See evaluate.py for quality metrics
Support
For issues and questions:
- Check the Hugging Face model page
- Review existing issues
- Submit a new issue with details
License
Apache 2.0 - See LICENSE file for details