# Helion-V2.0-Thinking Quickstart Guide

Get started with Helion-V2.0-Thinking in minutes.

## Installation

### Basic Installation

```bash
pip install transformers torch accelerate pillow requests
```

### Full Installation (with all features)

```bash
pip install -r requirements.txt
```

### GPU Requirements

- **Minimum**: 24GB VRAM (RTX 4090, A5000)
- **Recommended**: 40GB+ VRAM (A100, H100)
- **Quantized (8-bit)**: 16GB VRAM
- **Quantized (4-bit)**: 12GB VRAM

## Quick Examples

### 1. Basic Text Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "DeepXR/Helion-V2.0-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "What is artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 2. Image Understanding

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# model_name as defined in the previous example
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

image = Image.open("photo.jpg")
prompt = "What is in this image?"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

### 3. Using the Inference Script

```bash
# Interactive chat mode
python inference.py --interactive

# With image analysis
python inference.py --image photo.jpg --prompt "Describe this image"

# Run demos
python inference.py --demo

# With quantization (saves memory)
python inference.py --interactive --load-in-4bit
```

### 4. With Safety Wrapper

```python
from safety_wrapper import SafeHelionWrapper

# Initialize with safety features
wrapper = SafeHelionWrapper(
    model_name="DeepXR/Helion-V2.0-Thinking",
    enable_safety=True,
    enable_rate_limiting=True
)

# Safe generation
response = wrapper.generate(
    prompt="Explain photosynthesis",
    max_new_tokens=256
)
print(response)
```

### 5. Function Calling

```python
import json

tools = [{
    "name": "calculator",
    "description": "Perform calculations",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string"}
        }
    }
}]

prompt = f"""Available tools: {json.dumps(tools)}

User: What is 125 * 48?
Assistant (respond with JSON):"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
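The example above prints the raw completion. As a minimal follow-up sketch, the generated text can be parsed and dispatched to the named tool; the exact JSON shape checked here (a `name` key plus a `parameters` object) is an assumption for illustration, not a documented output format of the model.

```python
import json

# Sketch only. Assumes `tokenizer`, `inputs`, and `outputs` from the
# function-calling example above; the JSON keys below are assumptions.
generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],  # keep only the newly generated tokens
    skip_special_tokens=True,
)

try:
    call = json.loads(generated.strip())
except json.JSONDecodeError:
    call = None  # model did not return valid JSON

if call and call.get("name") == "calculator":
    expression = call.get("parameters", {}).get("expression", "")
    print(f"Model requested: calculator({expression!r})")
    # Dispatch to a real calculator here (e.g. a safe arithmetic parser);
    # calling eval() on model output is not recommended.
```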
## Memory-Efficient Options

### 8-bit Quantization

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
```

### 4-bit Quantization

```python
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
```

## Running Benchmarks

```bash
# Full benchmark suite
python benchmark.py --model DeepXR/Helion-V2.0-Thinking

# Evaluation suite
python evaluate.py --model DeepXR/Helion-V2.0-Thinking
```

## Common Use Cases

### Chatbot

```python
conversation = []

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break

    conversation.append({"role": "user", "content": user_input})

    prompt = "\n".join([
        f"{msg['role'].capitalize()}: {msg['content']}"
        for msg in conversation
    ]) + "\nAssistant:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = response.split("Assistant:")[-1].strip()

    conversation.append({"role": "assistant", "content": response})
    print(f"Assistant: {response}")
```

### Document Analysis

```python
# Read long document
with open("document.txt", "r") as f:
    document = f.read()

prompt = f"""{document}

Please provide:
1. A summary of the main points
2. Key takeaways
3. Any recommendations

Summary:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Code Generation

```python
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Returns sorted in descending order
Include type hints and docstring."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3  # Lower temperature for code
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Troubleshooting

### Out of Memory

1. Use quantization (4-bit or 8-bit)
2. Reduce `max_new_tokens`
3. Enable gradient checkpointing
4. Use smaller batch sizes

### Slow Performance

1. Enable Flash Attention 2: `use_flash_attention_2=True` (see the sketch below)
2. Use GPU if available
3. Reduce context length
4. Use quantization
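For the first item above, a minimal loading sketch follows, assuming the `flash-attn` package is installed and the GPU supports it. Recent transformers releases spell the option `attn_implementation="flash_attention_2"` (the older `use_flash_attention_2=True` flag enables the same feature), and Flash Attention 2 expects half-precision weights such as `torch.bfloat16`.

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: load with Flash Attention 2. Requires `pip install flash-attn`
# and a supported GPU; loading fails if the package is missing.
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V2.0-Thinking",
    torch_dtype=torch.bfloat16,               # FA2 needs fp16/bf16 weights
    attn_implementation="flash_attention_2",  # newer form of use_flash_attention_2=True
    device_map="auto",
)
```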
### Installation Issues

```bash
# Update pip
pip install --upgrade pip

# Install from scratch
pip uninstall transformers torch
pip install transformers torch accelerate

# CUDA issues
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

## Next Steps

- Read the full [README.md](README.md) for detailed documentation
- Check out [inference.py](inference.py) for more examples
- Review [safety_wrapper.py](safety_wrapper.py) for safety features
- Run [benchmark.py](benchmark.py) to test performance
- See [evaluate.py](evaluate.py) for quality metrics

## Support

For issues and questions:

- Check the Hugging Face model page
- Review existing issues
- Submit a new issue with details

## License

Apache 2.0 - See LICENSE file for details