Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (#5)
- Improve model card: Add `transformers` library, `any-to-any` pipeline tag, and update links (7bed6717f1ce2cf58039e45882af3c9394cc356d)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: any-to-any
 ---
 
 <div align="center">
@@ -8,7 +10,7 @@ license: apache-2.0
 
 <div align="center" style="line-height: 1;">
 <a href="https://github.com/stepfun-ai/Step-Audio2" target="_blank"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-StepFun-white?logo=github&logoColor=white"/></a>  
-<a href="https://stepfun.com/" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a>  
+<a href="https://www.stepfun.com/docs/en/step-audio2" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-StepFun-white?logo=StepFun&logoColor=white"/></a>  
 <a href="https://x.com/StepFun_ai" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-StepFun-white?logo=x&logoColor=white"/></a>  
 <a href="https://discord.com/invite/XHheP5Fn" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-StepFun-white?logo=discord&logoColor=white"/></a>
 </div>
@@ -24,7 +26,7 @@ license: apache-2.0
 ## Introduction
 
 
-Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
+Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation, presented in the paper [Step-Audio 2 Technical Report](https://huggingface.co/papers/2507.16632).
 
 - **Advanced Speech and Audio Understanding**: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
 
@@ -32,7 +34,7 @@ Step-Audio 2 is an end-to-end multi-modal large language model designed for indu
 
 - **Tool Calling and Multimodal RAG**: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
 
-- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://
+- **State-of-the-Art Performance**: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and [Technical Report](https://huggingface.co/papers/2507.16632)).
 
 + **Open-source**: [Step-Audio 2 mini](https://huggingface.co/stepfun-ai/Step-Audio-2-mini) and [Step-Audio 2 mini Base](https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base) are released under [Apache 2.0](LICENSE) license.
 
@@ -198,6 +200,7 @@ CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A ind
 <td align="center">7.01</td>
 <td align="center">2.68</td>
 <td align="center"><strong>2.53</strong></td>
+<td align="center">2.53</td>
 </tr>
 <tr>
 <td align="left">KeSpeech phase1</td>
@@ -758,7 +761,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
 <td align="center"><strong>83.32</strong></td>
 <td align="center"><strong>91.05</strong></td>
 <td align="center"><strong>75.45</strong></td>
-<
+<align="center"><strong>86.08</strong></td>
 <td align="center">68.25</td>
 <td align="center">74.78</td>
 <td align="center"><strong>63.18</strong></td>
@@ -838,7 +841,7 @@ URO-Bench. U. R. O. stands for understanding, reasoning, and oral conversation,
 <td align="center">60.12</td>
 <td align="center">77.65</td>
 <td align="center">61.25</td>
-<
+<align="center">58.79</td>
 <td align="center">61.94</td>
 <td align="center">63.80</td>
 </tr>
@@ -866,4 +869,4 @@ The model and code in the repository is licensed under [Apache 2.0](LICENSE) Lic
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2507.16632},
 }
-```
+```
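The first hunk of this change adds `library_name` and `pipeline_tag` to the YAML front matter of README.md. As a rough illustration of how that metadata can be read back, here is a minimal sketch that parses flat `key: value` pairs from a leading `---` block. The helper name `read_front_matter` and the sample string `CARD_AFTER_COMMIT` are hypothetical and not part of the repository; the sketch deliberately ignores nested YAML, which a real YAML parser would handle.

```python
def read_front_matter(card_text: str) -> dict[str, str]:
    """Return flat `key: value` pairs from a leading `---` ... `---` block.

    Hypothetical helper for illustration only: it does not handle nested
    YAML, lists, or quoting.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # the card does not start with front matter
    metadata: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return metadata  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as no front matter


# Sample text mirroring the first lines of README.md after this commit.
CARD_AFTER_COMMIT = """\
---
license: apache-2.0
library_name: transformers
pipeline_tag: any-to-any
---

<div align="center">
"""

metadata = read_front_matter(CARD_AFTER_COMMIT)
print(metadata["library_name"], metadata["pipeline_tag"])
```

A check like this could be used to confirm that both new keys are present alongside the existing `license` entry.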