---
license: apache-2.0
pipeline_tag: audio-text-to-text
language:
- en
- zh
base_model:
- Qwen/Qwen3-8B-Base
- openai/whisper-large-v3
---
MuFun is the model proposed in [Advancing the Foundation Model for Music Understanding](https://arxiv.org/abs/2508.01178).

## Usage

Audio processing packages such as mutagen and torchaudio need to be installed first (for example, `pip install mutagen torchaudio`):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Base'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
device='cuda'
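# trust_remote_code is required because the model runs custom audio-text modeling code from the repo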
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
model.to(device)
# single audio
# during inference each audio (converted to a sequence of embeddings) is placed at the position of its <audio> tag in the prompt
aud="/path/to/your/song.mp3"
inp="\n<audio>Can you listen to this song and tell me its lyrics?"
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# multiple audios
# for multiple songs, each is placed in the corresponding <audio> tag in the prompt
aud=["/path/to/your/song1.mp3", "/path/to/your/song2.mp3"]
inp="\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# analyze only a specific segment of audio using the segs parameter
# format is [start_time, end_time] (in seconds); for multiple audios pass one entry per file, e.g. [[0,30],[60,90]] or [None,[0,30.0]] (None uses the full audio)
aud="/path/to/your/song.mp3"
inp="\n<audio>How is the rhythm of this music clip?"
res=model.chat(prompt=inp, audio_files=aud, segs=[0,30.0], tokenizer=tokenizer)
print(res)
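
# (illustrative) segs with multiple audios: one entry per <audio> tag, None keeps the full audio
aud=["/path/to/your/song1.mp3", "/path/to/your/song2.mp3"]
inp="\n<audio> Clip one. <audio> Clip two. Which clip has a faster tempo?"
res=model.chat(prompt=inp, audio_files=aud, segs=[[0,30],None], tokenizer=tokenizer)
print(res)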

# setting audio_files=None also works, though using it as a text-only model is not recommended
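# e.g. res=model.chat(prompt="What are the elements of music?", audio_files=None, tokenizer=tokenizer)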
```
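
Since `model.chat` takes one audio path (or a list of paths) per call, it can be looped over a folder for simple batch annotation. A minimal sketch, reusing the `model` and `tokenizer` loaded above; the folder path and prompt are illustrative:

```python
from pathlib import Path

music_dir = Path("/path/to/your/music")  # illustrative folder of mp3 files
for song in sorted(music_dir.glob("*.mp3")):
    # one chat call per file, same API as the single-audio example above
    res = model.chat(prompt="\n<audio>Describe the genre and mood of this song.",
                     audio_files=str(song), tokenizer=tokenizer)
    print(song.name, res)
```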

## Citation

```bibtex
@misc{jiang2025advancingfoundationmodelmusic,
      title={Advancing the Foundation Model for Music Understanding},
      author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
      year={2025},
      eprint={2508.01178},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2508.01178},
}
```