Yi3852 committed
Commit df3c3ac · verified · 1 Parent(s): 4a3f22e

Update README.md

Files changed (1): README.md (+58 -3)
README.md CHANGED (the previous README contained only the YAML front matter with license: apache-2.0)
---
license: apache-2.0
pipeline_tag: audio-text-to-text
language:
- en
- zh
base_model:
- Qwen/Qwen3-8B-Base
- openai/whisper-large-v3
---
MuFun is the model proposed in [Advancing the Foundation Model for Music Understanding](https://arxiv.org/abs/2508.01178).

## Usage

Some audio processing packages, such as mutagen and torchaudio, need to be installed first.
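For example, with pip (the card does not pin exact versions, and transformers is assumed as a prerequisite for the snippet below):

```bash
pip install mutagen torchaudio transformers
```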
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'Yi3852/MuFun-Base'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
device = 'cuda'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
model.to(device)

# single audio
# during inference the audio (converted to a sequence of embeddings) is placed at the position of the <audio> tag in the prompt
aud = "/path/to/your/song.mp3"
inp = "\n<audio>Can you listen to this song and tell me its lyrics?"
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# multiple audios
# each song is placed at its corresponding <audio> tag in the prompt
aud = ["/path/to/your/song1.mp3", "/path/to/your/song2.mp3"]
inp = "\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# analyze only a specific segment of audio using the segs parameter
# the format is [start_time, end_time] in seconds; for multiple audios, pass one entry per file,
# e.g. [[0, 30], [60, 90]] or [None, [0, 30.0]] (a multi-audio sketch follows this block)
aud = "/path/to/your/song.mp3"
inp = "\n<audio>How is the rhythm of this music clip?"
res = model.chat(prompt=inp, audio_files=aud, segs=[0, 30.0], tokenizer=tokenizer)
print(res)

# setting audio_files=None also works, but using this as a text-only model is not recommended
```
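Building on the `segs` comment above, here is a minimal sketch of passing one segment per file when several audios are given. It reuses `model` and `tokenizer` from the snippet above, the file paths are placeholders, and a per-file `None` is assumed to leave that file untrimmed, as the `[None, [0, 30.0]]` example suggests:

```python
# compare the first 30 seconds of song1 with the 60-90 second span of song2
# (placeholder paths; segs entries line up one-to-one with audio_files)
aud = ["/path/to/your/song1.mp3", "/path/to/your/song2.mp3"]
inp = "\n<audio> This is clip A. <audio> This is clip B. Which clip has a faster tempo?"
res = model.chat(prompt=inp, audio_files=aud, segs=[[0, 30.0], [60.0, 90.0]], tokenizer=tokenizer)
print(res)
```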

## Citation

```bibtex
@misc{jiang2025advancingfoundationmodelmusic,
      title={Advancing the Foundation Model for Music Understanding},
      author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
      year={2025},
      eprint={2508.01178},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2508.01178},
}
```