This is not a usable release. There is no voice cloning or presets.
Will add presets and voice cloning in the coming days, stay tuned
Why would you release something half-baked when you could just wait one week and release it all at once?
Just be grateful that you have an actual open source TTS model that can compete with ElevenLabs, bro.
Just clone the kugelaudio space and you'll have everything. I did it that way, and after some minor troubleshooting it works, including the cloning. I also quantized it to 8-bit (well, Claude did; I think it's one line of code). But yes, the model folder here isn't particularly capable for any real applications on its own.
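For what it's worth, the "one line" 8-bit quantization mentioned above is most likely the standard transformers + bitsandbytes route. A hedged sketch, assuming the model loads through transformers and bitsandbytes is installed; the model path is a placeholder, not the actual repo id:

```python
from transformers import AutoModel, BitsAndBytesConfig

# Load the model with weights quantized to 8-bit on the fly.
# "path/to/kugelaudio-model" is a placeholder; substitute the real
# local folder or Hub repo id.
model = AutoModel.from_pretrained(
    "path/to/kugelaudio-model",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate place layers on available devices
)
```

This roughly halves VRAM use compared to fp16 at a small quality cost; whether it works here depends on the model's architecture being supported by bitsandbytes.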
I wanted to make it work with a bigger LLM, but I've given up; I don't think it's possible without being a really goated AI/Python guy or retraining it.
BTW, has anyone found any cool inflection/emotion commands? The VibeVoice ones just get read aloud lol
Microsoft did the bulk of the work here; these guys just took government money, rented a few GPUs for a week, and fine-tuned the model on a public YouTube dataset.
Then they forked the VibeVoice repo, replaced some class names, and removed voice cloning (if you check the git commits, you can see that the earlier ones still have the voice cloning code).
It really isn't that impressive.
You can use the Wan2GP tool to run this model, and cloning works perfectly. The only limitations are that the voices speed up too much if your audio is longer than 15 seconds, and there's a hard limit around 40 seconds.
I tweaked the Wan2GP code to split long paragraphs into smaller chunks and splice them back together in the final output.
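A minimal sketch of that chunk-and-splice idea, assuming each chunk stays short enough for the model's ~15-second sweet spot. `synthesize` stands in for whatever Wan2GP actually calls internally; the helper names and the character cap are my own, not from the Wan2GP code:

```python
import re
import numpy as np

def split_into_chunks(text, max_chars=200):
    """Split text on sentence boundaries, packing consecutive
    sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # start a new chunk if adding this sentence would overflow
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long(text, synthesize, max_chars=200):
    """Synthesize each chunk separately and splice the waveforms
    back together. Plain concatenation; a fancier version might
    crossfade at the joins to hide seams."""
    pieces = [synthesize(chunk) for chunk in split_into_chunks(text, max_chars)]
    return np.concatenate(pieces)
```

Usage: `synthesize_long(long_text, model_tts_fn)` where `model_tts_fn` takes a string and returns a 1-D audio array.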
How did you overcome the breaks at the end of sentences?
You can modify the tail_tokens variable behavior in kugelaudio_inference.py to make sentence endings cut off quicker or linger longer. A bit of coding involved.
Any news on the presets and voice cloning?