building a voice from scratch

(ethically) training a TTS model that actually runs locally without melting your GPU

i’ve been running local LLMs and STT for a while now. they work, they’re fast enough, and most importantly: no API keys, no rate limits, no telemetry. the missing piece was always TTS. everything i tried either sounded like a GPS from 2008 or needed a datacenter to run.

cloud TTS is obviously an option. but the latency is annoying, the pricing is death-by-a-thousand-requests, and there’s something deeply uncomfortable about feeding a cloud service every thought you want spoken out loud. so i started looking at what it would take to train something myself.

the landscape was… mixed. lots of demos that sound impressive in cherry-picked clips, but fall apart on arbitrary text. diffusion models with great quality that need 10 seconds to generate 2 seconds of audio. autoregressive models that hallucinate words mid-sentence. i spent a week benchmarking everything i could find, keeping notes on failure modes: this one drifts pitch after 10 words, that one can’t handle numbers, this other one needs 24GB VRAM minimum.

none of them were consistent enough for daily use. so i started sketching out what i actually wanted:

  • deterministic output (same text → same audio, with the same seed)
  • fast enough for real-time on modest hardware
  • doesn’t sound like a robot having a stroke
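
the first two bullets are measurable, so here is a rough sketch of how one could check them. everything in it is an assumption for illustration: `synthesize` is a hypothetical stand-in (seeded noise, not a real model), and `SAMPLE_RATE` is a guess, not what i actually used. the "speed" number is the real-time factor (generation time divided by audio duration), so the diffusion case mentioned earlier, 10 seconds to generate 2 seconds of audio, would be an RTF of 5.

```python
import hashlib
import math
import random
import time

SAMPLE_RATE = 22050  # assumed sample rate, purely illustrative


def synthesize(text, seed):
    """hypothetical stand-in for a TTS model: returns a list of float samples.

    a real model call would go here; this just produces seeded, text-length
    dependent samples so the harness has something to measure.
    """
    rng = random.Random(seed)
    n = SAMPLE_RATE * max(1, len(text.split())) // 4  # roughly 0.25 s per word
    return [math.sin(i * 0.05) * rng.uniform(0.9, 1.0) for i in range(n)]


def fingerprint(samples):
    """hash the samples so two runs can be compared cheaply."""
    h = hashlib.sha256()
    for s in samples:
        h.update(f"{s:.6f}".encode())
    return h.hexdigest()


def is_deterministic(text, seed=0):
    """same text + same seed should give byte-identical audio."""
    return fingerprint(synthesize(text, seed)) == fingerprint(synthesize(text, seed))


def real_time_factor(text, seed=0):
    """RTF = generation time / audio duration; below 1.0 is faster than real time."""
    start = time.perf_counter()
    samples = synthesize(text, seed)
    elapsed = time.perf_counter() - start
    return elapsed / (len(samples) / SAMPLE_RATE)
```

swapping the stand-in for an actual model call and running `is_deterministic` and `real_time_factor` over a fixed list of awkward sentences gives a crude but repeatable regression check.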

midway through this endeavour, F5-TTS was released. it’s genuinely impressive work, with similar architectural instincts to what i was messing around with, but still heavier than i wanted and a pain to run locally. i kept going with my own approach.

the extremely delusional goal i set for myself was near realtime on a thinkpad, and yes, i’m well aware that maybe i should’ve thought about it for a couple more seconds. but in reality i only needed it to do one thing well: one voice, one language. the goal was basically just to get something usable for now.

2024/10/24: i initially forgot to mention that i actually had another goal: doing this ethically, as in with ethically sourced data under a permissive license, plus data that i got explicit permission to use.

as one would expect, the hardest part is consistency. it didn’t take long to get somewhat decent quality for very generic speech.

the problem was everything else. getting it to pronounce abbreviations, numbers, all sorts of punctuation, and long-form text was a very hair-pulling experience. understandably, a lot of the struggle was dataset curation. a very boring lesson that keeps being true.

fun fact: numbers are really, really, REALLY hard. thankfully a lot of this was inference-side phonetic transformation hassle, but there are just so many types of numbers. in retrospect, it would’ve been easier to restrict what the llm could output instead.

but that did not happen, so now we deal with: cardinal numbers, signed numbers, real numbers, ordinal numbers, cardinal numbers followed by an s, roman numerals, fractions, sequences and so on (thankfully a community-created document for existing text-to-speech systems helped with this).
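
to give a feel for what that inference-side number handling looks like, here is a minimal sketch of a text normalizer covering a few of those categories (cardinals, signed and real numbers, ordinals, "cardinal + s", roman numerals, simple fractions). this is not my actual pipeline: the function names, the regex, and the choice of english wordings ("three fourths", "point two five") are all assumptions for illustration, and it deliberately ignores the ambiguous cases (years, ranges like 10-20, all-caps words that happen to look like roman numerals).

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def cardinal(n):
    """spell out a non-negative integer; past 999,999 fall back to digits."""
    if n >= 1_000_000:
        return " ".join(ONES[int(d)] for d in str(n))
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")


def ordinal(n):
    """turn the last word of the cardinal into its ordinal form."""
    head, space, last = cardinal(n).rpartition(" ")
    prefix, hyphen, tail = last.rpartition("-")  # handles "twenty-one"
    if tail in IRREGULAR:
        tail = IRREGULAR[tail]
    elif tail.endswith("y"):
        tail = tail[:-1] + "ieth"
    else:
        tail += "th"
    return head + space + prefix + hyphen + tail


def roman_to_int(s):
    """standard subtractive parse: a smaller value before a larger one is subtracted."""
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


# order matters: more specific patterns are tried before plain cardinals
TOKEN = re.compile(
    r"(?P<fraction>\b\d+/\d+\b)"
    r"|(?P<ordinal>\b\d+(?:st|nd|rd|th)\b)"
    r"|(?P<plural>\b\d+s\b)"
    r"|(?P<real>[-+]?\d+\.\d+)"
    r"|(?P<signed>[-+]\d+)"
    r"|(?P<cardinal>\b\d+\b)"
    r"|(?P<roman>\b[IVXLCDM]{2,}\b)"
)


def _spell(match):
    kind, text = match.lastgroup, match.group()
    if kind == "fraction":
        num, den = (int(p) for p in text.split("/"))
        if den == 2:
            unit = "half" if num == 1 else "halves"
        else:
            unit = ordinal(den) + ("" if num == 1 else "s")
        return f"{cardinal(num)} {unit}"
    if kind == "ordinal":
        return ordinal(int(text[:-2]))
    if kind == "plural":
        word = cardinal(int(text[:-1]))
        if word.endswith("y"):
            return word[:-1] + "ies"
        return word + ("es" if word.endswith("x") else "s")
    if kind in ("real", "signed"):
        sign = {"-": "minus ", "+": "plus "}.get(text[0], "")
        whole, _, frac = text.lstrip("+-").partition(".")
        words = sign + cardinal(int(whole))
        if frac:  # read decimals digit by digit
            words += " point " + " ".join(ONES[int(d)] for d in frac)
        return words
    if kind == "roman":
        return cardinal(roman_to_int(text))
    return cardinal(int(text))


def normalize(text):
    """replace every recognised number token with its spoken form."""
    return TOKEN.sub(_spell, text)
```

for example, `normalize("chapter XIV has 21 pages")` gives `"chapter fourteen has twenty-one pages"`. even this toy version shows why restricting the llm output would have been less work: every category needs its own rule, and the rules start fighting each other almost immediately.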

it’s reasonable enough to admit that the quality still isn’t amazing. it somewhat regularly fails to follow the progression of tone correctly, so the intonation can go up towards the end of a sentence instead of down. but it’s still surprisingly good for a hobby project.

some samples from early training runs are below. it is getting there, slowly..

[audio: four sample clips from early training runs]