This is not a usable release. There is no voice cloning or presets.
Will add presets and voice cloning in the coming days, stay tuned
Why would you release something half-baked when you could just wait one week and release it all at once?
Just be grateful that you have an actual open source TTS model that can compete with ElevenLabs, bro.
Just clone the kugelaudio space and you'll have everything. I did it that way, and after some minor troubleshooting it works, including the cloning. I also quantized it to 8-bit (well, Claude did; I think it's one line of code). But yes, the model folder here isn't particularly capable for any real applications on its own.
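For what it's worth, the "one line" 8-bit quantization mentioned above is most likely the standard transformers + bitsandbytes route. A hedged sketch, assuming the model loads through transformers and bitsandbytes is installed; the model path is a placeholder, not the actual repo id:

```python
from transformers import AutoModel, BitsAndBytesConfig

# Load the model with weights quantized to 8-bit on the fly.
# "path/to/kugelaudio-model" is a placeholder; substitute the real
# local folder or Hub repo id.
model = AutoModel.from_pretrained(
    "path/to/kugelaudio-model",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate place layers on available devices
)
```

This roughly halves VRAM use compared to fp16 at a small quality cost; whether it works here depends on the model's architecture being supported by bitsandbytes.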
I wanted to make it work with a bigger LLM, but I've given up; I don't think it's possible without being a really goated AI/Python guy or retraining it.
BTW, has anyone found any cool inflection/emotion commands? The VibeVoice ones just get read aloud lol
Microsoft did the bulk of the work here; these guys just took government money, rented a few GPUs for a week, and fine-tuned the model on a public YouTube dataset.
Then they forked the VibeVoice repo, replaced some class names, and removed voice cloning (if you check the git commits, you can see that the earlier ones still have the voice cloning code).
It really isn't that impressive.
You can use the Wan2GP tool to run this model, and cloning works perfectly. The only limitations are that the voices speed up too much if your audio is longer than 15 seconds, and there's a hard limit around 40 seconds.
I tweaked the Wan2GP code to split long paragraphs into smaller chunks and splice them back together in the final output.
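A minimal sketch of that chunk-and-splice idea, assuming each chunk stays short enough for the model's ~15-second sweet spot. `synthesize` stands in for whatever Wan2GP actually calls internally; the helper names and the character cap are my own, not from the Wan2GP code:

```python
import re
import numpy as np

def split_into_chunks(text, max_chars=200):
    """Split text on sentence boundaries, packing consecutive
    sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # start a new chunk if adding this sentence would overflow
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long(text, synthesize, max_chars=200):
    """Synthesize each chunk separately and splice the waveforms
    back together. Plain concatenation; a fancier version might
    crossfade at the joins to hide seams."""
    pieces = [synthesize(chunk) for chunk in split_into_chunks(text, max_chars)]
    return np.concatenate(pieces)
```

Usage: `synthesize_long(long_text, model_tts_fn)` where `model_tts_fn` takes a string and returns a 1-D audio array.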
How did you overcome the breaks at the end of sentences?
You can modify the tail_tokens variable behavior in kugelaudio_inference.py to make sentence endings cut off quicker or linger longer. A bit of coding involved.
Any news on the presets and voice cloning?