SNAC output quality with high-pitch speech

#23
by matthen - opened

This is more of an issue with the SNAC model itself.

I'm finding that for some high-pitched speech, the decoded output sounds like the voice is sort of cracking. Below I have attached an example waveform, the result of snac encode/decode, and a screenshot of the spectrograms.

To be clear, this is not using Orpheus, just the SNAC model on its own as a vocoder. However, when I fine-tune Orpheus on waveforms like this, it ends up producing the same kind of artefact.

I'm curious if this is something that anyone has seen before? And if there are any ideas for how to improve performance, maybe some pre-processing on the waveform that could help?
I can also try to remove training examples that exceed a certain f0 pitch...
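To make the f0 filtering idea concrete, here is a rough sketch of what I have in mind. It is only an illustration: the autocorrelation pitch estimate is deliberately crude, the helper names (`estimate_f0`, `median_f0`, `keep_example`) are made up for this post, and the 400 Hz cutoff is an arbitrary placeholder, not a known limit of SNAC.

```python
import statistics


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=1000.0, min_corr=0.5):
    """Return a rough f0 estimate in Hz for one frame, or None if it looks unvoiced."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]  # remove DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return None  # silent frame
    # Only search lags that correspond to the allowed pitch range.
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(n - 1, int(sample_rate / f_min))
    best_lag, best_corr = None, min_corr
    for lag in range(lag_min, lag_max + 1):
        # Energy-normalised autocorrelation; it shrinks with lag, which
        # favours the shortest period over its multiples.
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else None


def median_f0(samples, sample_rate, frame_len=1024, hop=512):
    """Median f0 over all frames judged voiced, or None if there are none."""
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        f0 = estimate_f0(samples[start:start + frame_len], sample_rate)
        if f0 is not None:
            voiced.append(f0)
    return statistics.median(voiced) if voiced else None


def keep_example(samples, sample_rate, max_f0_hz=400.0):
    """Keep a clip unless its median f0 exceeds the (placeholder) cutoff."""
    f0 = median_f0(samples, sample_rate)
    # Clips with no voiced frames are kept, since there is nothing to judge.
    return f0 is None or f0 <= max_f0_hz
```

Here `samples` is a plain list of mono float samples. For real speech a proper pitch tracker (for example `librosa.pyin`) would be much more robust than this, and the cutoff would need to be tuned by listening to what SNAC actually struggles with.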

Thanks!

input file:

encoded then decoded with hubertsiuzdak/snac_24khz:

spectrograms (I circled the part in the decoded output where the formants look disconnected):
image.png

Yep, I get the exact same thing, particularly with higher-pitched female voices, and often when they say "sure" (one of my tests has the models generate something like "Sure, I'll help you review this document").
The same goes for a lot of Japanese female voices; there's no point even trying to train Orpheus with them.

I get the same artifacts with your sample; see Gapeleon/snac_test

FWIW, the 32 kHz SNAC sounds better for your sample (in case you just want to use it as a vocoder): 32 kHz version

Thanks Gapeleon!
I tweeted about this and Elias confirmed he is aware of it. Apparently they are working on fixing it in the next release:

https://x.com/eliasfiz/status/1940511419147210956?s=46&t=qmPyWuzPKCugMlCPqjWkAw
