nielsr HF Staff committed on
Commit
9fd46d2
·
verified ·
1 Parent(s): 99e8c96

Enhance model card with metadata, abstract, overview, and usage example


This PR significantly enhances the model card by:
- Adding `pipeline_tag: text-generation` to ensure the model appears in relevant searches and filter categories.
- Adding `library_name: transformers` to indicate compatibility with the Hugging Face Transformers library, enabling the "Use in Transformers" widget.
- Including relevant `tags` such as `agent`, `tool-use`, `reinforcement-learning`, `qwen`, and `llm` for better categorization.
- Expanding the model card content with the paper's abstract, a detailed overview including key highlights and visuals from the project's GitHub, and comprehensive usage instructions for text generation with the `transformers` library.
- Consolidating existing links and adding new ones (Hugging Face collection, Hugging Face Space demo) for a richer set of resources.
- Incorporating additional valuable sections from the GitHub README, such as Citation, Acknowledgements, and Contact information.
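
The metadata described above lives in the YAML front matter at the top of `README.md`. As a rough way to check what a card declares, the sketch below parses that block with the standard library only. The helper name `parse_front_matter` is hypothetical, and the parser assumes the simple shape used in this PR (flat keys plus one-level lists); it is not a general YAML parser.

```python
def parse_front_matter(readme_text):
    """Parse simple model-card front matter: flat `key: value` pairs and one-level lists."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the front matter
            break
        if not stripped:
            continue
        if stripped.startswith("- ") and isinstance(metadata.get(current_key), list):
            # List item belonging to the most recent key (e.g. `tags:`)
            metadata[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            # A key with no inline value starts a list
            metadata[current_key] = value if value else []
    return metadata


# Front matter as added by this PR
README = """---
license: mit
pipeline_tag: text-generation
library_name: transformers
tags:
- agent
- tool-use
- reinforcement-learning
- qwen
- llm
---

# ARPO: Agentic Reinforced Policy Optimization
"""

meta = parse_front_matter(README)
print(meta["pipeline_tag"], meta["library_name"], meta["tags"])
```

For real cards with nested or quoted values, a proper YAML parser would be the safer choice.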

Files changed (1)
  1. README.md +124 -4
README.md CHANGED
@@ -1,11 +1,131 @@
  ---
  license: mit
  ---

- The model checkpoint of ARPO:

- Arxiv: https://arxiv.org/abs/2507.19849

- HF paper: https://huggingface.co/papers/2507.19849

- Github: https://github.com/dongguanting/ARPO

  ---
  license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - agent
+ - tool-use
+ - reinforcement-learning
+ - qwen
+ - llm
  ---

+ # ARPO: Agentic Reinforced Policy Optimization

+ The model checkpoint of ARPO is released for the paper [**Agentic Reinforced Policy Optimization**](https://huggingface.co/papers/2507.19849).

+ <div align="center">
+ <img src="https://github.com/dongguanting/ARPO/blob/main/logo1.png?raw=true" width="150px">
+ </div>

+ <div align="center">
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-b5212f.svg?logo=arxiv)](https://arxiv.org/abs/2507.19849)
+ [![Paper](https://img.shields.io/badge/Paper-Hugging%20Face-yellow?logo=huggingface)](https://huggingface.co/papers/2507.19849)
+ [![Model Collection](https://img.shields.io/badge/Model-Hugging%20Face-blue?logo=huggingface)](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)
+ [![Dataset Collection](https://img.shields.io/badge/Dataset-Hugging%20Face-blue?logo=huggingface)](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-blue?logo=github)](https://github.com/dongguanting/ARPO)
+ [![Demo](https://img.shields.io/badge/Demo-Hugging%20Face%20Space-orange?logo=huggingface)](https://huggingface.co/spaces/dongguanting/ARPO-DeepSearch-Viewer)
+ </div>
+
+ ## Abstract
+ Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
+
+ ## 💡 Overview
+
+ We propose **Agentic Reinforced Policy Optimization (ARPO)**, **an agentic RL algorithm tailored for training multi-turn LLM-based agents**. The core principle of ARPO is to encourage the policy model to adaptively branch its sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.
+
+ <img width="1686" height="866" alt="intro" src="https://github.com/user-attachments/assets/8b9daf54-c4ba-4e79-bf79-f98b5a893edd" />
+
+ ### Key Highlights
+ - **Entropy-based Adaptive Rollout**: ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling and step-level sampling, promoting exploration at steps with high uncertainty after tool usage.
+ - **Advantage Attribution Estimation**: By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions.
+ - **Superior Performance**: Achieves improved performance across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains.
+ - **Efficient Tool Usage**: Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods.
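
To make the entropy-based rollout idea in the highlights more concrete, here is a toy sketch of one way such a branching decision could look: measure the average token entropy before and right after a tool call, and branch extra samples when the increase is large. This is not the authors' implementation. The function names, the fixed `threshold`, and the toy distributions are all assumptions made for illustration; see the paper and repository for the actual mechanism.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average token entropy over a list of next-token distributions."""
    if not distributions:
        return 0.0
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def should_branch(initial_dists, post_tool_dists, threshold=0.2):
    """Hypothetical rule: branch when entropy rises by more than `threshold` after a tool call."""
    delta = mean_entropy(post_tool_dists) - mean_entropy(initial_dists)
    return delta > threshold, delta


# Toy distributions over a 3-token vocabulary (made-up numbers).
confident = [[0.9, 0.05, 0.05]]      # before the tool call
uncertain = [[1 / 3, 1 / 3, 1 / 3]]  # right after the tool response
branch, delta = should_branch(confident, uncertain)
print(branch, round(delta, 3))
```

In this toy setup the uniform distribution after the tool response has much higher entropy than the confident one before it, so the rule chooses to branch. A real system would compute these distributions from model logits and tune the branching criterion.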
+
+ ## 🏃 Quick Start: Text Generation with Transformers
+
+ You can use ARPO models for text generation or chat completion tasks via the Hugging Face `transformers` library.
+
+ First, install the dependencies: `transformers`, `torch`, and `accelerate` (required for `device_map="auto"`). `flash-attn` is optional and may improve performance:
+ ```bash
+ pip install transformers torch accelerate
+ pip install flash-attn --no-build-isolation  # optional
+ ```
+
+ ### Text Generation Example
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Load the model and tokenizer
+ # This specific model is Qwen3-based. For other ARPO models, check the ARPO collection.
+ model_id = "dongguanting/Qwen3-8B-ARPO-DeepSearch"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # Use torch.float16 on GPUs without bfloat16 support
+     device_map="auto",  # Requires the `accelerate` package
+     trust_remote_code=True,
+ )
+
+ # Prepare your prompt using the chat template (recommended for Qwen-based models)
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant specialized in reasoning tasks."},
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ # Tokenize and generate; passing the full encoding also supplies the attention mask
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ generated_ids = model.generate(
+     **inputs,
+     max_new_tokens=100,
+     temperature=0.7,
+     do_sample=True,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+
+ # Decode only the newly generated tokens (skip the prompt) and print the output
+ prompt_length = inputs.input_ids.shape[1]
+ response = tokenizer.decode(generated_ids[0][prompt_length:], skip_special_tokens=True)
+ print(response)
+
+ # Example for direct text generation (without chat template):
+ # prompt = "The capital of Germany is"
+ # input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
+ # output_ids = model.generate(input_ids, max_new_tokens=50, temperature=0.7, top_p=0.9, do_sample=True)
+ # generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ # print(generated_text)
+ ```
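
For a decoder-only model, `generate` returns the prompt tokens followed by the continuation, which is why the example above slices off the prompt length before decoding. The sketch below shows the same slicing on plain Python lists; the helper name `strip_prompt` and the token IDs are made up for illustration.

```python
def strip_prompt(generated_ids, prompt_length):
    """Return only the tokens produced after the prompt."""
    return generated_ids[prompt_length:]


prompt_ids = [101, 7, 42]            # hypothetical prompt token IDs
generated = prompt_ids + [9, 13, 2]  # decoder-only output: prompt followed by continuation
new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Without this step, the decoded response would repeat the system and user messages before the model's answer.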
+
+ For more detailed usage, including specific tool-use agentic setups and evaluation, please refer to the [official GitHub repository](https://github.com/dongguanting/ARPO).
+
+ ## 📄 Citation
+
+ If you find this work helpful, please cite our paper:
+ ```bibtex
+ @misc{dong2025arpo,
+   title={Agentic Reinforced Policy Optimization},
+   author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
+   year={2025},
+   eprint={2507.19849},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2507.19849},
+ }
+ ```
+
+ ## 🤝 Acknowledgements
+
+ This training implementation builds upon [Tool-Star](https://github.com/dongguanting/Tool-Star), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [verl](https://github.com/volcengine/verl) and [ReCall](https://github.com/Agent-RL/ReCall). For evaluation, we rely on [WebThinker](https://github.com/RUC-NLPIR/WebThinker), [HiRA](https://github.com/RUC-NLPIR/HiRA), [WebSailor](https://github.com/Alibaba-NLP/WebAgent), [Search-o1](https://github.com/sunnynexus/Search-o1), and [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The Python interpreter design references [ToRA](https://github.com/microsoft/ToRA) and [ToRL](https://github.com/GAIR-NLP/ToRL), while our models are trained using [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
+
+ ## 📄 License
+
+ This project is released under the [MIT License](LICENSE).
+
+ ## 📞 Contact
+
+ For any questions or feedback, please reach out to us at [dongguanting@ruc.edu.cn](mailto:dongguanting@ruc.edu.cn).