Enhance model card with metadata, abstract, overview, and usage example
This PR significantly enhances the model card by:
- Adding `pipeline_tag: text-generation` to ensure the model appears in relevant searches and filter categories.
- Adding `library_name: transformers` to indicate compatibility with the Hugging Face Transformers library, enabling the "Use in Transformers" widget.
- Including relevant `tags` such as `agent`, `tool-use`, `reinforcement-learning`, `qwen`, and `llm` for better categorization.
- Expanding the model card content with the paper's abstract, a detailed overview including key highlights and visuals from the project's GitHub, and comprehensive usage instructions for text generation with the `transformers` library.
- Consolidating existing links and adding new ones (Hugging Face collection, Hugging Face Space demo) for a richer set of resources.
- Incorporating additional valuable sections from the GitHub README, such as Citation, Acknowledgements, and Contact information.

---
license: mit
pipeline_tag: text-generation
library_name: transformers
tags:
- agent
- tool-use
- reinforcement-learning
- qwen
- llm
---

# ARPO: Agentic Reinforced Policy Optimization

This repository hosts a model checkpoint of ARPO, released for the paper [**Agentic Reinforced Policy Optimization**](https://huggingface.co/papers/2507.19849).

<div align="center">
<img src="https://raw.githubusercontent.com/dongguanting/ARPO/main/logo1.png" width="150px">
</div>

<div align="center">

[Paper (arXiv)](https://arxiv.org/abs/2507.19849) | [Paper (Hugging Face)](https://huggingface.co/papers/2507.19849) | [Model Collection](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae) | [GitHub](https://github.com/dongguanting/ARPO) | [Demo Space](https://huggingface.co/spaces/dongguanting/ARPO-DeepSearch-Viewer)

</div>

## Abstract

Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at [https://github.com/dongguanting/ARPO](https://github.com/dongguanting/ARPO).

## 💡 Overview

We propose **Agentic Reinforced Policy Optimization (ARPO)**, **an agentic RL algorithm tailored for training multi-turn LLM-based agents**. The core principle of ARPO is to encourage the policy model to adaptively branch its sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.

<img width="1686" height="866" alt="intro" src="https://github.com/user-attachments/assets/8b9daf54-c4ba-4e79-bf79-f98b5a893edd" />

### Key Highlights

- **Entropy-based Adaptive Rollout**: ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling and step-level sampling, promoting exploration at steps with high uncertainty after tool usage.
- **Advantage Attribution Estimation**: By integrating advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions.
- **Superior Performance**: ARPO outperforms trajectory-level RL algorithms across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains.
- **Efficient Tool Usage**: ARPO achieves improved performance using only half of the tool-use budget required by existing methods.
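
To make the first highlight concrete, here is a minimal, self-contained sketch of an entropy-based branching decision. This is purely illustrative and not the authors' implementation: the `threshold` value, the window of tokens compared, and the function names are our own assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(entropies_after_tool, entropies_baseline, threshold=0.2):
    """Trigger step-level branching when the mean token entropy right after
    a tool call exceeds the trajectory baseline by more than `threshold`."""
    mean_after = sum(entropies_after_tool) / len(entropies_after_tool)
    mean_base = sum(entropies_baseline) / len(entropies_baseline)
    return (mean_after - mean_base) > threshold

# Toy example: confident (peaked) distributions before the tool call,
# uncertain (flat) distributions immediately after the tool response.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

baseline = [token_entropy(peaked) for _ in range(4)]    # low entropy
after_tool = [token_entropy(flat) for _ in range(4)]    # high entropy

print(should_branch(after_tool, baseline))  # -> True: branch extra rollouts here
```

When such a check fires, part of the sampling budget goes to extra continuations from that step rather than to whole new trajectories, which is how step-level exploration is concentrated where uncertainty is highest.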

## 🚀 Quick Start: Text Generation with Transformers

You can use ARPO models for text generation or chat completion tasks via the Hugging Face `transformers` library.

First, ensure you have the necessary dependencies installed: `transformers` and `torch`. You may also install `flash-attn` for optimized performance (it must be built after `torch` is available):

```bash
pip install transformers torch
pip install flash-attn --no-build-isolation
```

### Text Generation Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer.
# This specific model is Qwen3-based. For other ARPO models, check the ARPO collection.
model_id = "dongguanting/Qwen3-8B-ARPO-DeepSearch"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support
    device_map="auto",
    trust_remote_code=True,
)

# Prepare the prompt using the chat template (recommended for Qwen-based models)
messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in reasoning tasks."},
    {"role": "user", "content": "What is the capital of France?"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
generated_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens and print the output
response = tokenizer.decode(generated_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

# Example for direct text generation (without chat template):
# prompt = "The capital of Germany is"
# input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
# output_ids = model.generate(input_ids, max_new_tokens=50, temperature=0.7, top_p=0.9, do_sample=True)
# generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# print(generated_text)
```

For more detailed usage, including specific tool-use agentic setups and evaluation, please refer to the [official GitHub repository](https://github.com/dongguanting/ARPO).
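
As a rough illustration of what such a tool-use agentic loop looks like, the sketch below alternates between model generation and tool execution, feeding each tool result back into the context. The model and tool are mocked, and the `<tool_call>`/`<result>` tags and function names are illustrative assumptions, not the repository's actual interface.

```python
def mock_model(context):
    """Stand-in policy: issues one search call, then answers."""
    if "<result>" not in context:
        return '<tool_call>search("capital of France")</tool_call>'
    return "The capital of France is Paris."

def mock_search_tool(call):
    """Stand-in search tool: returns a fixed snippet."""
    return "Paris is the capital of France."

def rollout(question, model, tool, max_turns=4):
    """Multi-turn loop: generate, execute any tool call, append the result,
    and continue until the model emits a final answer."""
    context = question
    for _ in range(max_turns):
        chunk = model(context)
        context += "\n" + chunk
        if "<tool_call>" in chunk:
            # Execute the tool and feed its output back into the context.
            context += "\n<result>" + tool(chunk) + "</result>"
        else:
            return chunk  # final answer, no further tool use
    return None  # turn budget exhausted

answer = rollout("What is the capital of France?", mock_model, mock_search_tool)
print(answer)  # -> The capital of France is Paris.
```

In ARPO's setting, each tool-call round in this loop is a potential branching point for the entropy-based adaptive rollout described above.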

## 📄 Citation

If you find this work helpful, please cite our paper:

```bibtex
@misc{dong2025arpo,
  title={Agentic Reinforced Policy Optimization},
  author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
  year={2025},
  eprint={2507.19849},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.19849}
}
```

## 🤝 Acknowledgements

This training implementation builds upon [Tool-Star](https://github.com/dongguanting/Tool-Star), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [verl](https://github.com/volcengine/verl), and [ReCall](https://github.com/Agent-RL/ReCall). For evaluation, we rely on [WebThinker](https://github.com/RUC-NLPIR/WebThinker), [HiRA](https://github.com/RUC-NLPIR/HiRA), [WebSailor](https://github.com/Alibaba-NLP/WebAgent), [Search-o1](https://github.com/sunnynexus/Search-o1), and [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The Python interpreter design references [ToRA](https://github.com/microsoft/ToRA) and [ToRL](https://github.com/GAIR-NLP/ToRL), while our models are trained using [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.

## 📄 License

This project is released under the [MIT License](LICENSE).

## 📞 Contact

For any questions or feedback, please reach out to us at [dongguanting@ruc.edu.cn](mailto:dongguanting@ruc.edu.cn).