Details?

#1
by yukiarimo - opened

Hello!

What is the codec used? Training code? Also, there’s no 48 kHz, stop lying. It’s called upsampling!

The codec used is BiCodec. It splits audio into semantic tokens and global (acoustic) tokens.

Training code will come soon. It's pretty simple, since it's a standard LLM-based TTS model. And yes, I use upsampling instead of training the BiCodec model natively at 48 kHz, since upsampling is incredibly fast and the quality is pretty good.

I could maybe do native 48 kHz BiCodec training in the future, for maximum speed and quality, if this project gets enough interest. I just didn't do it because it takes several days of testing and then several more days of training.
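To make the upsampling point concrete, here is a minimal sketch of what integer-factor upsampling does to a signal. The thread does not say which upsampler the model uses or what the codec's native rate is, so the 16 kHz source rate and the naive linear interpolation below are assumptions chosen for illustration only. The result has three times as many samples, but every original sample is still there unchanged and nothing new is added above the original bandwidth.

```python
import math


def upsample_linear(samples, factor):
    """Upsample by an integer factor with linear interpolation.

    Illustrative only: this is NOT the upsampler used by the model
    discussed in this thread (which is not specified).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not samples:
        return []
    out = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        # Emit the original sample, then factor - 1 interpolated points.
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    # Hold the last sample so the length is exactly len(samples) * factor.
    out.extend([samples[-1]] * factor)
    return out


SRC_RATE = 16_000  # assumed native codec rate, not stated in the thread
DST_RATE = 48_000
factor = DST_RATE // SRC_RATE

# One second of a 440 Hz tone at the assumed source rate.
one_second = [math.sin(2 * math.pi * 440 * n / SRC_RATE) for n in range(SRC_RATE)]
upsampled = upsample_linear(one_second, factor)

print(len(one_second), "->", len(upsampled))
```

A production resampler would use a band-limited filter rather than linear interpolation, and a learned upsampler can additionally try to predict the missing high band.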

Theoretically, you can fine-tune only the decoder, so the input and tokens stay the same but the output is 48 kHz.
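A rough way to see why this works: the token rate is fixed by the encoder side, and the output sample rate is just the token rate times the number of samples the decoder emits per token frame. The sketch below is bookkeeping only, not training code. `CodecShape`, the 16 kHz input rate, and the hop sizes are hypothetical values picked for illustration, not figures confirmed in this thread.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CodecShape:
    """Hypothetical description of a codec's framing (not a real API)."""
    input_rate_hz: int  # sample rate the encoder consumes
    encoder_hop: int    # input samples consumed per token frame
    decoder_hop: int    # output samples produced per token frame

    @property
    def token_rate_hz(self):
        return self.input_rate_hz // self.encoder_hop

    @property
    def output_rate_hz(self):
        return self.token_rate_hz * self.decoder_hop


# Assumed baseline: 16 kHz in, 16 kHz out, 50 token frames per second.
base = CodecShape(input_rate_hz=16_000, encoder_hop=320, decoder_hop=320)

# Only the decoder changes: it now emits 3x as many samples per frame.
decoder_only = replace(base, decoder_hop=960)

print(base.token_rate_hz, base.output_rate_hz)
print(decoder_only.token_rate_hz, decoder_only.output_rate_hz)
```

Because the encoder and tokenizer are untouched, an LLM trained on the existing tokens would not need retraining. Only the decoder would have to learn the mapping from those tokens to 48 kHz audio.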

Also, have you tried using phonemes? I want to build a phoneme-based LLM TTS with LJSpeech.

@yukiarimo Yep, good idea. This is similar to NandemoGHS's xcodec2 model.

This model doesn't support phonemes because of its base model, but yes, I'm experimenting with a custom, smaller LLM TTS model that I can train from scratch with phonemes, audio events, etc. It will also use an improved version of LayaCodec, so it will be even faster and come with a permissive license. However, that will take some time.
