# Helion-V2.0-Thinking Quickstart Guide

Get started with Helion-V2.0-Thinking in minutes.

## Installation

### Basic Installation

```bash
pip install transformers torch accelerate pillow requests
```

### Full Installation (with all features)

```bash
pip install -r requirements.txt
```

### GPU Requirements

- **Minimum**: 24GB VRAM (RTX 4090, A5000)
- **Recommended**: 40GB+ VRAM (A100, H100)
- **Quantized (8-bit)**: 16GB VRAM
- **Quantized (4-bit)**: 12GB VRAM

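Not sure which tier your hardware falls into? Here is a minimal check with PyTorch (a convenience snippet, not part of the project):

```python
import torch

# Report the name and total VRAM of the first CUDA device, if any.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected")
```
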
## Quick Examples

### 1. Basic Text Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "DeepXR/Helion-V2.0-Thinking"

# torch_dtype="auto" picks a suitable precision; device_map="auto" spreads
# layers across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "What is artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

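For interactive use you can stream tokens to the terminal as they are produced instead of waiting for the full completion. `TextStreamer` ships with Transformers, so no extra install is needed:

```python
from transformers import TextStreamer

# Print tokens as they are generated; skip_prompt hides the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Explain quantum entanglement in two sentences.", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=256, streamer=streamer)
```
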
### 2. Image Understanding

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# Reuses model_name from the previous example; the processor handles both
# text tokenization and image preprocessing.
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

image = Image.open("photo.jpg")
prompt = "What is in this image?"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

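Since `requests` and `pillow` are already installed, the image can also come from a URL instead of a local file (the URL below is only a placeholder):

```python
import requests
from PIL import Image

# Fetch the image over HTTP and open it directly from the response stream.
url = "https://example.com/photo.jpg"  # placeholder URL
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text="What is in this image?", images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
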
### 3. Using the Inference Script

```bash
# Interactive chat mode
python inference.py --interactive

# With image analysis
python inference.py --image photo.jpg --prompt "Describe this image"

# Run demos
python inference.py --demo

# With quantization (saves memory)
python inference.py --interactive --load-in-4bit
```

### 4. With Safety Wrapper

```python
from safety_wrapper import SafeHelionWrapper

# Initialize with safety features
wrapper = SafeHelionWrapper(
    model_name="DeepXR/Helion-V2.0-Thinking",
    enable_safety=True,
    enable_rate_limiting=True
)

# Safe generation
response = wrapper.generate(
    prompt="Explain photosynthesis",
    max_new_tokens=256
)
print(response)
```

### 5. Function Calling

```python
import json

tools = [{
    "name": "calculator",
    "description": "Perform calculations",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string"}
        }
    }
}]

prompt = f"""Available tools: {json.dumps(tools)}
User: What is 125 * 48?
Assistant (respond with JSON):"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect; a low value
# keeps the JSON output stable.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

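The exact reply format depends on how the model was trained for tool use, so the parsing below is only a sketch; it assumes a reply shaped like `{"name": "calculator", "arguments": {"expression": "..."}}`:

```python
import json

def run_calculator(expression: str) -> float:
    # Toy dispatcher for the "calculator" tool declared above.
    # eval() is fine for this arithmetic demo; never use it on untrusted input.
    return eval(expression, {"__builtins__": {}}, {})

raw = tokenizer.decode(outputs[0], skip_special_tokens=True)
reply = raw.split("Assistant (respond with JSON):")[-1].strip()

try:
    call = json.loads(reply)  # assumed shape: {"name": ..., "arguments": {...}}
    if call.get("name") == "calculator":
        print("Tool result:", run_calculator(call["arguments"]["expression"]))
except (json.JSONDecodeError, KeyError):
    print("Model did not return a parseable tool call:", reply)
```
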
## Memory-Efficient Options

### 8-bit Quantization

```python
from transformers import BitsAndBytesConfig

# Requires the bitsandbytes package: pip install bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
```

### 4-bit Quantization

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
```

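To confirm the savings, Transformers can report how much memory the loaded weights occupy (weights only, not activations or the KV cache):

```python
# get_memory_footprint() returns bytes; convert to GiB for readability.
print(f"Model weights: {model.get_memory_footprint() / 1024**3:.1f} GB")
```
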
## Running Benchmarks

```bash
# Full benchmark suite
python benchmark.py --model DeepXR/Helion-V2.0-Thinking

# Evaluation suite
python evaluate.py --model DeepXR/Helion-V2.0-Thinking
```

## Common Use Cases

### Chatbot

```python
conversation = []

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break

    conversation.append({"role": "user", "content": user_input})

    prompt = "\n".join([
        f"{msg['role'].capitalize()}: {msg['content']}"
        for msg in conversation
    ]) + "\nAssistant:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # The decoded text includes the prompt; keep only what follows the final
    # "Assistant:" marker.
    response = response.split("Assistant:")[-1].strip()

    conversation.append({"role": "assistant", "content": response})
    print(f"Assistant: {response}")
```

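Manually joining role-prefixed strings works, but if the model repository ships a chat template you can let the tokenizer build the prompt and decode only the newly generated tokens (a sketch, assuming such a template is provided):

```python
# Build the prompt from the structured conversation via the chat template.
input_ids = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)

# Decode only the tokens generated after the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(f"Assistant: {response}")
```
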
### Document Analysis

```python
# Read long document
with open("document.txt", "r") as f:
    document = f.read()

prompt = f"""{document}
Please provide:
1. A summary of the main points
2. Key takeaways
3. Any recommendations
Summary:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

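If the document is longer than the model's context window, a simple token-count split keeps each request within budget. The 4096-token budget below is a placeholder, not the model's actual limit:

```python
# Split the document into chunks that fit an assumed per-request token budget.
MAX_PROMPT_TOKENS = 4096  # placeholder; set this to the model's real context length

token_ids = tokenizer(document, add_special_tokens=False)["input_ids"]
chunks = [
    tokenizer.decode(token_ids[i:i + MAX_PROMPT_TOKENS])
    for i in range(0, len(token_ids), MAX_PROMPT_TOKENS)
]

# Summarize each chunk separately; the partial summaries can then be merged.
summaries = []
for chunk in chunks:
    prompt = f"{chunk}\nSummarize the main points of the passage above:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    summaries.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
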
### Code Generation

```python
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Removes duplicates
3. Returns sorted in descending order
Include type hints and docstring."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3  # Lower temperature for code
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Troubleshooting

### Out of Memory

1. Use quantization (4-bit or 8-bit)
2. Reduce `max_new_tokens`
3. Enable gradient checkpointing (only relevant when fine-tuning)
4. Use smaller batch sizes

### Slow Performance

1. Enable Flash Attention 2 by loading the model with `attn_implementation="flash_attention_2"` (see the snippet after this list)
2. Use a GPU if available
3. Reduce the context length
4. Use quantization

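For example, Flash Attention 2 can be requested at load time; it needs the `flash-attn` package and a supported GPU:

```python
import torch
from transformers import AutoModelForCausalLM

# attn_implementation="flash_attention_2" requires flash-attn to be installed.
model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V2.0-Thinking",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```
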
### Installation Issues

```bash
# Update pip
pip install --upgrade pip

# Install from scratch
pip uninstall transformers torch
pip install transformers torch accelerate

# CUDA issues
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

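After reinstalling, a quick check confirms that PyTorch was built with CUDA support and can see the GPU:

```bash
# Should print the torch version followed by "True" if CUDA is usable
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```
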
## Next Steps

- Read the full [README.md](README.md) for detailed documentation
- Check out [inference.py](inference.py) for more examples
- Review [safety_wrapper.py](safety_wrapper.py) for safety features
- Run [benchmark.py](benchmark.py) to test performance
- See [evaluate.py](evaluate.py) for quality metrics

## Support

For issues and questions:

- Check the Hugging Face model page
- Review existing issues
- Submit a new issue with details

## License

Apache 2.0 - See LICENSE file for details