Details?

by yukiarimo - opened 16 days ago

Discussion

yukiarimo

16 days ago

Hello!

What codec is used? Is the training code available? Also, this isn’t native 48 kHz, stop lying. It’s called upsampling!

YatharthS

Owner 16 days ago

The codec used is BiCodec; it splits audio into semantic tokens and global (acoustic) tokens.

Training code will come soon; it’s pretty simple since it’s a standard LLM-based TTS model. And yes, I use upsampling instead of training the BiCodec model natively at 48 kHz, since it’s incredibly fast and gives pretty good quality.

If this project gets enough interest, I could maybe do native 48 kHz BiCodec training in the future for maximum speed and quality. I just didn’t do it since it takes several days of testing and then several more days of training.
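To make the upsampling point concrete, here is a minimal sketch of what upsampling by a factor of 3 does to a signal. The 16 kHz source rate and the linear interpolation are assumptions for illustration only; the thread does not say which upsampler the model actually uses, and `upsample_linear` is a hypothetical helper, not part of any library.

```python
def upsample_linear(samples, factor):
    """Upsample by an integer factor using linear interpolation.

    Illustrative stand-in only: every output value is computed from the
    original samples, so the higher rate comes from interpolation rather
    than from audio recorded at that rate.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not samples:
        return []
    out = []
    # Between each pair of neighbors, emit the left sample plus
    # (factor - 1) evenly spaced interpolated points.
    for left, right in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(left + (right - left) * k / factor)
    out.append(samples[-1])  # keep the final original sample
    return out


# Assumed rates: 16 kHz in, 48 kHz out, so the factor is 3.
SOURCE_RATE_HZ = 16_000
TARGET_RATE_HZ = 48_000
factor = TARGET_RATE_HZ // SOURCE_RATE_HZ

low_rate = [0.0, 3.0, 0.0]
high_rate = upsample_linear(low_rate, factor)
print(len(low_rate), "->", len(high_rate), "samples")
```

A native 48 kHz codec would instead be trained on audio recorded at 48 kHz, which is the longer route described above.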

yukiarimo

15 days ago

Theoretically, you can fine-tune only the decoder, so the input and tokens stay the same, but the output is 48 kHz.

Also, have you tried using phonemes? I want to build a phoneme-based LLM TTS with LJSpeech.
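The arithmetic behind this idea can be sketched as follows. The token rate of 50 per second and the 16 kHz input rate are assumptions, since neither is stated in this thread, and `samples_per_token` is a hypothetical helper. The point is only that the token sequence is unchanged while the decoder has to emit three times as many samples per token.

```python
def samples_per_token(sample_rate_hz, tokens_per_second):
    """Number of waveform samples one token has to cover."""
    if sample_rate_hz % tokens_per_second != 0:
        raise ValueError("sample rate must be a multiple of the token rate")
    return sample_rate_hz // tokens_per_second


TOKENS_PER_SECOND = 50    # assumption, not stated in the thread
INPUT_RATE_HZ = 16_000    # assumption: rate the encoder was trained on
OUTPUT_RATE_HZ = 48_000   # target rate for the fine-tuned decoder

# The encoder side is untouched, so it still consumes this many samples per token.
encoder_hop = samples_per_token(INPUT_RATE_HZ, TOKENS_PER_SECOND)

# Only the decoder changes: it must now produce this many samples per token.
decoder_hop = samples_per_token(OUTPUT_RATE_HZ, TOKENS_PER_SECOND)

print(f"encoder: {encoder_hop} samples/token, decoder: {decoder_hop} samples/token")
print(f"decoder upsampling ratio vs. encoder: {decoder_hop // encoder_hop}x")
```

Under these assumptions the LLM and its vocabulary would not need retraining, because the tokens it predicts are the same ones as before.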

YatharthS

Owner 15 days ago

@yukiarimo Yep, good idea; this is similar to NandemoGHS's xcodec2 model.

This model doesn't support phonemes because of its base model, but yes, I'm experimenting with a custom, smaller LLM TTS model that I can train from scratch with phonemes, audio events, etc. It will also use an improved version of LayaCodec, so it will be even faster, and it will have a permissive license. However, that will take some time.
