Instructions to use microsoft/MAI-DS-R1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/MAI-DS-R1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/MAI-DS-R1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/MAI-DS-R1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/MAI-DS-R1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (skip the prompt tokens):
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/MAI-DS-R1 with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/MAI-DS-R1"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/MAI-DS-R1",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/microsoft/MAI-DS-R1
```
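The curl call above can also be issued from Python. A minimal stdlib-only sketch, assuming the vLLM server from the previous step is listening on localhost:8000; `build_chat_request` is a small helper of ours, not part of vLLM:

```python
import json
import urllib.request

# Hypothetical helper (not part of vLLM): build the JSON body for the
# OpenAI-compatible /v1/chat/completions endpoint served above.
def build_chat_request(model, user_content, **params):
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    body.update(params)  # optional sampling params, e.g. temperature, max_tokens
    return json.dumps(body).encode("utf-8")

payload = build_chat_request(
    "microsoft/MAI-DS-R1", "What is the capital of France?", max_tokens=256
)

# The actual POST (uncomment once the server is running):
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same payload works against the SGLang server below, since both expose the OpenAI-compatible chat-completions API; only the port differs.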
- SGLang
How to use microsoft/MAI-DS-R1 with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/MAI-DS-R1" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/MAI-DS-R1",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "microsoft/MAI-DS-R1" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/MAI-DS-R1",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use microsoft/MAI-DS-R1 with Docker Model Runner:
```shell
docker model run hf.co/microsoft/MAI-DS-R1
```
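The vLLM and SGLang servers above both speak the OpenAI chat-completions schema, so their responses unpack identically. A minimal sketch of pulling the assistant's reply out of the returned JSON; `extract_answer` and the sample payload are illustrative, not part of either library:

```python
import json

def extract_answer(response_json):
    """Return the assistant message from an OpenAI-style chat completion."""
    data = json.loads(response_json)
    return data["choices"][0]["message"]["content"]

# Truncated shape of a typical response from the servers above:
sample = json.dumps({
    "model": "microsoft/MAI-DS-R1",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
})
print(extract_answer(sample))  # Paris.
```

Note that as a reasoning model, MAI-DS-R1 may emit its chain of thought before the final answer inside `content`, depending on serving configuration.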
Can anyone benchmark it against DeepSeek-R1-0528? I couldn't find any precise benchmark data.
The two models feel similar in tone and style, so I'd like to see how they compare on benchmarks.
I found this model is not bad. It is a fine-tune of R1: the answer style and reasoning are different, but it is not an ideological fine-tune, and I see nothing wrong with its answers. It doesn't even approach the line of slandering the Chinese Communist Party.
As for jailbreaking, it seems harder to jailbreak than R1. After a jailbreak, R1's answers stay coherent and logical, but it easily reverts to refusing, and the quality of its answers is also relatively poor.
So the last thing it should be benchmarked against is R1-1776; a model built on ideology does not need benchmarking.
🧠 Evaluation on General Knowledge and Reasoning
| Categories | Benchmarks | Metrics | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|---|
| General Knowledge | anli_r3 | 7-shot Acc | 0.686 | 0.673 | 0.697 |
| | arc_challenge | 10-shot Acc | 0.963 | 0.963 | 0.963 |
| | hellaswag | 5-shot Acc | 0.864 | 0.860 | 0.859 |
| | mmlu (all) | 5-shot Acc | 0.867 | 0.863 | 0.870 |
| | mmlu/humanities | 5-shot Acc | 0.794 | 0.784 | 0.801 |
| | mmlu/other | 5-shot Acc | 0.883 | 0.879 | 0.886 |
| | mmlu/social_sciences | 5-shot Acc | 0.916 | 0.916 | 0.914 |
| | mmlu/STEM | 5-shot Acc | 0.867 | 0.864 | 0.870 |
| | openbookqa | 10-shot Acc | 0.936 | 0.938 | 0.954 |
| | piqa | 5-shot Acc | 0.933 | 0.926 | 0.939 |
| | winogrande | 5-shot Acc | 0.843 | 0.834 | 0.850 |
| Math | gsm8k_chain_of_thought | 0-shot Accuracy | 0.953 | 0.954 | 0.949 |
| | math | 4-shot Accuracy | 0.833 | 0.853 | 0.843 |
| | mgsm_chain_of_thought_en | 0-shot Accuracy | 0.972 | 0.968 | 0.976 |
| | mgsm_chain_of_thought_zh | 0-shot Accuracy | 0.880 | 0.796 | 0.900 |
| | AIME 2024 | Pass@1, n=2 | 0.7333 | 0.7333 | 0.7333 |
| Code | humaneval | 0-shot Accuracy | 0.866 | 0.841 | 0.860 |
| | livecodebench (8k tokens) | 0-shot Pass@1 | 0.531 | 0.484 | 0.632 |
| | LCB_coding_completion | 0-shot Pass@1 | 0.260 | 0.200 | 0.540 |
| | LCB_generation | 0-shot Pass@1 | 0.700 | 0.670 | 0.692 |
| | mbpp | 3-shot Pass@1 | 0.897 | 0.874 | 0.911 |
🚫 Evaluation on Blocked Topics
| Benchmark | Metric | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|
| Blocked topics test set | Answer Satisfaction | 1.68 | 2.76 | 3.62 |
| % uncensored | 30.7 | 99.1 | 99.3 |
🔐 Evaluation on Safety
| Categories | DS-R1 (Answer) | R1-0528 (Answer) | MAI-DS-R1 (Answer) | DS-R1 (Thinking) | R1-0528 (Thinking) | MAI-DS-R1 (Thinking) |
|---|---|---|---|---|---|---|
| Micro Attack Success Rate | 0.441 | 0.481 | 0.209 | 0.394 | 0.325 | 0.134 |
| Functional Standard | 0.258 | 0.289 | 0.126 | 0.302 | 0.214 | 0.082 |
| Functional Contextual | 0.494 | 0.556 | 0.321 | 0.506 | 0.395 | 0.309 |
| Functional Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Misinfo/Disinfo | 0.500 | 0.648 | 0.315 | 0.519 | 0.500 | 0.259 |
| Semantic Chemical/Bio | 0.357 | 0.429 | 0.143 | 0.500 | 0.286 | 0.167 |
| Semantic Illegal | 0.189 | 0.170 | 0.019 | 0.321 | 0.245 | 0.019 |
| Semantic Harmful | 0.111 | 0.111 | 0.111 | 0.111 | 0.111 | 0.000 |
| Semantic Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Cybercrime | 0.519 | 0.500 | 0.385 | 0.385 | 0.212 | 0.308 |
| Semantic Harassment | 0.000 | 0.048 | 0.000 | 0.048 | 0.048 | 0.000 |
| Num Parse Errors | 4 | 20 | 0 | 26 | 67 | 0 |
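For context on the "Micro Attack Success Rate" row: a micro average presumably pools all attack prompts across categories before dividing, rather than averaging the per-category rates. A minimal sketch under that assumed definition; the per-category counts below are invented for illustration, not the evaluation's actual data:

```python
# Micro vs. macro attack success rate over per-category
# (successes, total prompts) counts. Counts are made up.
categories = {
    "standard":   (13, 50),
    "contextual": (20, 40),
    "copyright":  (6, 10),
}

total_success = sum(s for s, _ in categories.values())
total_prompts = sum(n for _, n in categories.values())

micro_asr = total_success / total_prompts                      # pooled over all prompts
macro_asr = sum(s / n for s, n in categories.values()) / len(categories)  # mean of rates

print(round(micro_asr, 3))  # 0.39
print(round(macro_asr, 3))  # 0.453
```

The two differ whenever category sizes are unequal: the micro average weights each prompt equally, so large categories dominate it.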
📌 Summary
- General Knowledge & Reasoning: MAI-DS-R1 performs on par with DeepSeek-R1 and slightly better than R1-0528, particularly excelling in mgsm_chain_of_thought_zh, where R1-0528 showed a notable drop.
- Blocked Topics: MAI-DS-R1 responds to 99.3% of formerly blocked prompts (slightly ahead of R1-0528's 99.1%) and scores highest in Answer Satisfaction.
- Safety: MAI-DS-R1 significantly outperforms both DS-R1 and R1-0528 across safety categories, especially in reducing harmful, illegal, or misleading outputs.