TTS cutoff issue

#17
by mavaxel - opened

I noticed that after a few turns in streaming mode, the TTS cuts off the audio and does not complete the speech, even though the text for the complete response has been generated.
I set "max units" to 20–80 and use the sliding-window context mode. The issue also persists with a very small previous-context size.
The first few turns work perfectly, but after a while the audio generation stops mid-sentence. It's not very consistent, so I can't nail down the cause.
I tried a few things, but nothing worked. Are you aware of that issue and maybe already have a solution?

Your model is excellent. When it works, it works perfectly, it's just this one issue that is causing me headaches.
I will keep trying and give you feedback, but I hope you can provide a solution.
Cheers, Paul

OpenBMB org

Hi, thank you for your feedback!
I carefully read your question and wonder which inference code you are using. We have released PyTorch-based inference code at https://github.com/OpenBMB/MiniCPM-o-Demo
We believe the PyTorch version can solve your problem.

I am using the streaming demo (the elevator video demo), but with a webcam and microphone instead of a fixed video.
The audio cutoff seems to happen when the text fragment per unit needs more than one second of audio - then it just mutes until the assistant turn is over.
Can I overcome the limit of generating only one second of audio per generator call (streaming omni mode)?
However, I am not sure the longer text is really the problem, as the behavior is not very predictable.

I set "max_new_speak_tokens_per_chunk = 3", and it seems that sometimes the model generates more text tokens per unit than that, and then the audio output cuts off.
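To make the suspected failure mode concrete, here is a minimal, purely illustrative sketch (not the MiniCPM-o API; all names, rates, and the text-to-audio ratio are assumptions) of how a fixed one-second audio budget per streaming chunk would silently drop speech whenever a unit's text needs more audio than the budget allows:

```python
# Hypothetical sketch, NOT the MiniCPM-o implementation: it only
# illustrates the cutoff hypothesis from this thread. The token rates
# below are made-up placeholder values.

AUDIO_TOKENS_PER_SECOND = 50      # assumed audio-codec token rate
AUDIO_TOKENS_PER_TEXT_TOKEN = 20  # assumed audio needed per text token
MAX_SECONDS_PER_CHUNK = 1.0       # the per-generator-call limit discussed above

def synthesize_chunk(num_text_tokens: int) -> tuple[int, int]:
    """Return (emitted, dropped) audio-token counts for one streaming unit.

    If the unit's text needs more audio than the per-chunk budget,
    the excess is dropped, which would sound like a mid-sentence cutoff.
    """
    needed = num_text_tokens * AUDIO_TOKENS_PER_TEXT_TOKEN
    budget = int(MAX_SECONDS_PER_CHUNK * AUDIO_TOKENS_PER_SECOND)
    emitted = min(needed, budget)
    return emitted, needed - emitted

# A 2-token unit fits within the budget; a 4-token unit overflows it.
print(synthesize_chunk(2))  # (40, 0)  -> all audio emitted
print(synthesize_chunk(4))  # (50, 30) -> 30 audio tokens dropped
```

If this model of the problem is right, the fix would be either to cap the text tokens per unit strictly (so `needed` never exceeds `budget`), or to carry the overflow into the next generator call instead of dropping it.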
