SmolVLM2-2.2B-Instruct-Agentic-GUI-GGUF
GGUF quantizations of smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI for use with llama.cpp.
This model is a fine-tuned version of SmolVLM2-2.2B-Instruct trained on the aguvis-stage-2 dataset (~630K GUI interaction examples) for on-screen GUI element detection and interaction. Given a screenshot and a task description, it outputs normalized [0, 1] coordinates for click targets.
Model Files
| File | Size | Description |
|---|---|---|
| SmolVLM2-2.2B-Instruct-Agentic-GUI-F16.gguf | 3.4 GB | Full precision (FP16) text model |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-Q8_0.gguf | 1.8 GB | 8-bit quantized text model |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf | 1.0 GB | 4-bit quantized text model (recommended for mobile) |
| SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf | 832 MB | Vision encoder (SigLIP, full precision, always required) |
You need one text model + the mmproj. Pick a text model based on your hardware, then always download the mmproj alongside it.
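As a rough illustration of that choice, the sketch below picks the largest text model whose file, together with the mmproj, fits a given size budget. The sizes are copied from the table above; `pick_files` and `budget_gb` are hypothetical names, and file size is only a lower bound on memory use, since context and runtime overhead are not counted.

```python
# File sizes in GB, copied from the Model Files table (largest text model first).
TEXT_MODELS = [
    ("SmolVLM2-2.2B-Instruct-Agentic-GUI-F16.gguf", 3.4),
    ("SmolVLM2-2.2B-Instruct-Agentic-GUI-Q8_0.gguf", 1.8),
    ("SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf", 1.0),
]
# The vision encoder is always required (832 MB).
MMPROJ = ("SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf", 0.832)


def pick_files(budget_gb: float) -> list[str]:
    """Return [text_model, mmproj] for the largest text model within the budget.

    Hypothetical helper: it compares file sizes only, which underestimates
    real memory use.
    """
    for name, size_gb in TEXT_MODELS:
        if size_gb + MMPROJ[1] <= budget_gb:
            return [name, MMPROJ[0]]
    raise ValueError(f"no text model plus mmproj fits in {budget_gb} GB")


print(pick_files(2.0))
```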
Capabilities
- click(x, y) - Click on a UI element at normalized coordinates
- type(text) - Type text at the current cursor position
- scroll(x, y, direction) - Scroll in a given direction
- drag(x1, y1, x2, y2) - Drag from one position to another
- key(key_name) - Press a keyboard key
All coordinates are normalized to [0, 1] range where (0, 0) is top-left and (1, 1) is bottom-right.
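To act on a prediction, the normalized values have to be mapped back to the pixel grid of the actual screen. The following is a minimal sketch of that mapping; `to_pixels` is a hypothetical helper, and the clamping and the choice to scale by `size - 1` are assumptions rather than part of the model's specification.

```python
def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1] coordinates to integer pixel coordinates."""
    # Clamp so slightly out-of-range predictions still land on the screen.
    x_norm = min(max(x_norm, 0.0), 1.0)
    y_norm = min(max(y_norm, 0.0), 1.0)
    # Scale by (size - 1) so that 1.0 maps to the last valid pixel index.
    return round(x_norm * (width - 1)), round(y_norm * (height - 1))


# A prediction of (0.491, 0.073) on a 1920x1080 screen:
print(to_pixels(0.491, 0.073, 1920, 1080))  # (942, 79)
```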
Quick Start with llama.cpp
Prerequisites
Build llama.cpp with multimodal support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_BUILD_COMMON=ON
cmake --build build -j
Download Models
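# Download the Q4_K_M text model and the vision encoder (mmproj)
huggingface-cli download ahmadw/SmolVLM2-2.2B-Agentic-GUI-GGUF \
SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
--local-dir models/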
Run Server
./build/bin/llama-server \
-m models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
--mmproj models/SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
-c 4096 -ngl 99 --port 8888 \
--chat-template smolvlm
Run CLI
./build/bin/llama-mtmd-cli \
-m models/SmolVLM2-2.2B-Instruct-Agentic-GUI-Q4_K_M.gguf \
--mmproj models/SmolVLM2-2.2B-Instruct-Agentic-GUI-mmproj-f16.gguf \
-c 4096 -ngl 99 \
--image screenshot.png \
-p "Click on the search button"
Prompt Format
This model uses the SmolVLM idefics-style chat template (NOT ChatML). llama.cpp has built-in support via --chat-template smolvlm.
The raw template structure:
<|im_start|>System: {system_prompt}<end_of_utterance>
User:<image>{task_instruction}<end_of_utterance>
Assistant:
Key details:
- <|im_start|> is the BOS token and appears only once
- <end_of_utterance> terminates each turn (not <|im_end|>)
- <image> is replaced by the vision encoder's image tokens
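When you use llama-server with --chat-template smolvlm, the template is applied for you. For cases where the prompt has to be assembled by hand, the raw structure can be sketched as below. `build_prompt` is a hypothetical helper, and whether a given runtime expects the literal `<image>` placeholder is an assumption to verify against that runtime.

```python
def build_prompt(system_prompt: str, task_instruction: str) -> str:
    """Assemble a single-turn prompt in the raw SmolVLM layout."""
    return (
        # BOS token appears exactly once, at the very start.
        f"<|im_start|>System: {system_prompt}<end_of_utterance>\n"
        # <image> marks where the vision encoder's tokens are inserted.
        f"User:<image>{task_instruction}<end_of_utterance>\n"
        # Left open so the model continues with its answer.
        "Assistant:"
    )


print(build_prompt("You are a helpful assistant.", "Click on the search button"))
```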
System Prompt
You are a helpful assistant that can interact with a computer screen.
You can use the following tools to interact with the screen:
- click(start_x, start_y) - Click on a specific position on the screen.
- type(text) - Type a string of text.
- scroll(start_x, start_y, direction) - Scroll in a direction.
- key(key_name) - Press a specific key.
- drag(start_x, start_y, end_x, end_y) - Drag from one position to another.
- wait(seconds) - Wait for a specified number of seconds.
Important guidelines:
- All coordinates are normalized to [0, 1] range, where (0, 0) is the
top-left corner of the screen and (1, 1) is the bottom-right corner.
- Coordinates should be the center of the element you want to interact with.
Example Output
Given a screenshot and the instruction "Click the Settings button", the model outputs:
click(x=0.491, y=0.073)
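An agent loop needs to turn that string into a structured action. Here is a minimal parsing sketch, assuming the output is a single call with keyword arguments as in the example above. Only the click example is documented here, so the argument style for other actions (for instance a quoted `text=` value) is an assumption, and `parse_action` is a hypothetical helper.

```python
import re

# name(...) wrapping a comma-separated list of key=value pairs.
ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
# A value is a quoted string or a run of characters up to the next comma.
ARG_RE = re.compile(r"(\w+)\s*=\s*('[^']*'|\"[^\"]*\"|[^,]+)")


def parse_action(output: str) -> tuple[str, dict[str, object]]:
    """Parse e.g. 'click(x=0.491, y=0.073)' into ('click', {'x': 0.491, 'y': 0.073})."""
    match = ACTION_RE.match(output)
    if match is None:
        raise ValueError(f"not an action call: {output!r}")
    name, arg_text = match.groups()
    args: dict[str, object] = {}
    for key, raw in ARG_RE.findall(arg_text):
        raw = raw.strip()
        if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
            args[key] = raw[1:-1]  # quoted string: strip the quotes
        else:
            try:
                args[key] = float(raw)  # numeric value such as a coordinate
            except ValueError:
                args[key] = raw  # leave anything else as plain text
    return name, args


print(parse_action("click(x=0.491, y=0.073)"))
```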
Image Preprocessing
Resize images so that the longest edge is 1152 pixels, preserving the aspect ratio. The SigLIP vision encoder works on 384px tiles, and three tiles span the longest edge (3 x 384 = 1152px).
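The target dimensions follow from a single scale factor. The sketch below computes them without any imaging library; `target_size` is a hypothetical helper, and rounding to the nearest pixel is an assumption. Whether the runtime already performs this resize internally is not stated here.

```python
def target_size(width: int, height: int, longest_edge: int = 1152) -> tuple[int, int]:
    """Return (width, height) scaled so the longest edge equals `longest_edge`."""
    # One factor for both axes keeps the aspect ratio intact.
    scale = longest_edge / max(width, height)
    # Guard against a zero-sized edge on extremely elongated images.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(target_size(1920, 1080))  # (1152, 648)
print(target_size(1080, 2400))  # (518, 1152)
```

With Pillow installed, the result can be passed to `Image.resize` to produce the resized screenshot.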
API Usage (Server Mode)
# Send a screenshot with a task
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} },
{"type": "text", "text": "Click on the search bar"}
]
}
],
"max_tokens": 128,
"temperature": 0
}'
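The same request can be sent from Python using only the standard library. This is a sketch under a few assumptions: the server from the Run Server step is listening on port 8888, the screenshot is a PNG, and the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`). `build_payload` and `ask` are hypothetical helper names.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes: bytes, task: str) -> dict:
    """Build the JSON body used in the curl example, with the image as a data URL."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": task},
                ],
            }
        ],
        "max_tokens": 128,
        "temperature": 0,
    }


def ask(image_path: str, task: str, base_url: str = "http://localhost:8888") -> str:
    """POST a screenshot and a task to the server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), task)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed response shape: OpenAI-style chat completion.
    return body["choices"][0]["message"]["content"]
```

For example, `ask("screenshot.png", "Click on the search bar")` would return the model's action string once the server is running.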
Technical Details
| Property | Value |
|---|---|
| Architecture | SmolVLM (idefics3-based) |
| Parameters | 2.2B total (1.7B text backbone + SigLIP vision encoder) |
| Text Backbone | SmolLM2-1.7B-Instruct |
| Vision Encoder | SigLIP-SO400M-patch14-384 |
| Image Resolution | 1152px longest edge (3x384 tiles) |
| Context Length | 4096 tokens |
| Coordinate Format | Normalized [0, 1] float |
| Training Data | aguvis-stage-2 (~630K GUI examples) |
| Original Format | Safetensors (BF16) |
| Quantization Method | llama.cpp convert_hf_to_gguf.py |
Conversion Details
These GGUFs were converted directly from the source model weights using llama.cpp's convert_hf_to_gguf.py. Third-party pre-quantized GGUFs (e.g., from automated quantization services) were found to produce incorrect output (pixel coordinates instead of normalized [0,1] coordinates), likely due to missing fine-tuned layer weights during conversion.
License
Apache 2.0 (inherited from the base model SmolVLM2-2.2B-Instruct)
Credits
- Base Model: HuggingFaceTB/SmolVLM2-2.2B-Instruct
- Fine-Tuned Model: smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI
- Training Dataset: smolagents/aguvis-stage-2
- Inference Engine: llama.cpp
- Paper: SmolVLM: Redefining small and efficient multimodal models (Marafioti et al., 2025)