TTS Exploration Part 1

Dec 15, 2024

i’ve been tinkering with TTS for a while now. i wanted something better than google assistant: something private, personal (and maybe actually useful, unlike gemini).

the missing piece was quality local models. dealing with STT and LLMs was easy, but TTS? not quite there.

i could’ve used cloud services, but that costs money (arguably less than i wasted on this, but oh well), and besides, i just wanted local. my hardware. my data. no more feeding the machine.

nothing really fit just right, so i built my own.

and so i tested every open model, taking notes on what they did right. some were good, some were not, but none were consistent enough for me, not even after fine-tuning on hours of data.

i will elaborate a bit more on the architecture and release the training code eventually. for now though, just progress. don’t want to overpromise and underdeliver.

F5-TTS came out mid-project. it works surprisingly well and is fairly similar to what i was building, but it’s still not up to my standard and is a lot more difficult to run.

the main reason i’m sticking with my model is the consistency and the fact that it’s really light. but of course, nothing good comes without drawbacks. in my case, the major problem is that it’s very limited. one language. one voice. but that’s just about enough for me.

i’m still working on it, still improving. more voices would be nice, and they’re planned. but consistency and quality come first.

here are some samples from some very early runs. more to come.