TTS cutoff issue

#17
by mavaxel - opened

I noticed that after a few turns in streaming mode, the TTS cuts off the audio and does not complete the speech, even though the text for the complete response has been generated.
I set "max units" to 20–80 and use the sliding-window context mode. The issue also persists with a very small previous-context size.
The first few turns work perfectly, but after a while the audio generation stops mid-sentence. It's not very consistent, so I can't nail down the cause.
I tried a few things, but nothing worked. Are you aware of that issue and maybe already have a solution?

Your model is excellent. When it works, it works perfectly, it's just this one issue that is causing me headaches.
I will keep trying and give you feedback, but I hope you can provide a solution.
Cheers, Paul

OpenBMB org

Hi, thank you for your feedback!
I carefully read your question and wonder which inference code you are using. We have released PyTorch-based inference code at https://github.com/OpenBMB/MiniCPM-o-Demo
We believe the PyTorch version can solve your problem.

I am using the streaming demo (the elevator video demo), but with a webcam and microphone instead of a fixed video.
The audio cutoff seems to happen when the text fragment per unit needs more than one second of audio - then it just mutes until the assistant turn is over.
Can I overcome the limit of generating only one second of audio per generator call (streaming omni mode)?
However, I am not sure the longer text is really the problem, as the behavior is not very predictable.

I set "max_new_speak_tokens_per_chunk = 3", and it seems that sometimes the model generates more text tokens per unit than that, and then the audio output cuts off.
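To make the suspected failure mode concrete, here is a minimal, purely illustrative sketch (not the MiniCPM-o API; all names, rates, and the text-to-audio ratio are assumptions) of how a fixed one-second audio budget per streaming chunk would silently drop speech whenever a unit's text needs more audio than the budget allows:

```python
# Hypothetical sketch, NOT the MiniCPM-o implementation: it only
# illustrates the cutoff hypothesis from this thread. The token rates
# below are made-up placeholder values.

AUDIO_TOKENS_PER_SECOND = 50      # assumed audio-codec token rate
AUDIO_TOKENS_PER_TEXT_TOKEN = 20  # assumed audio needed per text token
MAX_SECONDS_PER_CHUNK = 1.0       # the per-generator-call limit discussed above

def synthesize_chunk(num_text_tokens: int) -> tuple[int, int]:
    """Return (emitted, dropped) audio-token counts for one streaming unit.

    If the unit's text needs more audio than the per-chunk budget,
    the excess is dropped, which would sound like a mid-sentence cutoff.
    """
    needed = num_text_tokens * AUDIO_TOKENS_PER_TEXT_TOKEN
    budget = int(MAX_SECONDS_PER_CHUNK * AUDIO_TOKENS_PER_SECOND)
    emitted = min(needed, budget)
    return emitted, needed - emitted

# A 2-token unit fits within the budget; a 4-token unit overflows it.
print(synthesize_chunk(2))  # (40, 0)  -> all audio emitted
print(synthesize_chunk(4))  # (50, 30) -> 30 audio tokens dropped
```

If this model of the problem is right, the fix would be either to cap the text tokens per unit strictly (so `needed` never exceeds `budget`), or to carry the overflow into the next generator call instead of dropping it.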
