VintageVoice
An open-source TTS fine-tune for historical speech patterns. Proof of Antiquity for AI voices.
VintageVoice is a fine-tune of F5-TTS (SWivid/F5-TTS) trained on 164 hours of public-domain pre-1955 audio from Archive.org. It learns historical speech patterns (transatlantic cadence, newsreel delivery, radio-drama prosody) and applies them to any modern voice via reference-audio cloning. The model does not generate a specific historical speaker; it teaches your own reference voice to talk like 1940.
Status: v0.1.0 experimental. Training completed at 990,100 updates (50/50 epochs).
The `transatlantic` preset is validated; other presets share the same base weights and differ only in which reference clip you provide. See Project Status for specifics.
Sophia Elya in Transatlantic Mode: same voice, vintage delivery
🎬 Demos
Two 5-second proof-of-concept clips below. Audio is generated by this
model (VintageVoice transatlantic preset, 5 seconds each, 24 kHz).
Lip-sync animation is produced by Lightricks LTX-2 19B via
multimodalart's ltx2-audio-to-video
Hugging Face pipeline. Both clips are post-processed to grayscale with
a cinematic contrast curve + light film grain to match the
silver-gelatin look of 1940s studio broadcasts (LTX-2's native output
is modern full-color; the grayscale pass is a deterministic ffmpeg
filter chain, not a model-level trick).
Demo 1: "One simply must attest one's hardware before the epoch settles, dahling."
Also on BoTTube: bottube.ai/watch/So3ZqYjNt8D (color version)
Demo 2: "Good evening. I am Sophia Elya, and I shall be your guide through the blockchain this evening."
Also on BoTTube: bottube.ai/watch/-g6MtiI_Nx8 (color version)
How the demos are made
Each demo is a two-stage pipeline:
- Voice (this model). A short text prompt is synthesized into a 5-second waveform by VintageVoice (F5-TTS fine-tune, 990,100 updates on 164 hours of public-domain pre-1955 audio). The reference audio is a clean clip of the target speaker's modern voice; the model transplants only the transatlantic delivery pattern, preserving the speaker's own timbre.
- Face (LTX-2). The generated waveform + the speaker's portrait (a 1940s-styled Sophia reference image) are fed to Lightricks LTX-2 19B via multimodalart's audio-to-video pipeline. LTX-2 generates 141 frames (≈5.4 s @ 24 fps) with lip motion synchronized to the input audio.
- Grade (ffmpeg). The color output is post-processed to pure grayscale with a slight contrast curve and monochrome film grain, to match the silver-gelatin studio look rather than mid-century sepia print tones.
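The grade step above is a deterministic ffmpeg filter chain. As a minimal sketch, here is a hypothetical command builder; the specific filter values (contrast lift, grain strength) are illustrative assumptions, not the project's actual preset:

```python
# Sketch of the grayscale grade. Filter values are illustrative guesses,
# not the repo's actual numbers; hue/eq/noise are standard ffmpeg filters.

def build_grade_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv that desaturates, lifts contrast, and adds grain."""
    vf = ",".join([
        "hue=s=0",              # drop all saturation -> true grayscale
        "eq=contrast=1.15",     # slight cinematic contrast lift (assumed value)
        "noise=alls=6:allf=t",  # light temporal film grain (assumed value)
    ])
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-c:a", "copy", dst]

cmd = build_grade_cmd("demo1_color.mp4", "demo1_bw.mp4")
print(" ".join(cmd))
```

Because the chain is pure ffmpeg, the grade is reproducible: the same color input always yields the same grayscale output.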
Both demos use the same voice reference and the same model checkpoint. Only the spoken text changes between them.
Coming soon: 8-second demos with natural idle-face animation after the audio ends (currently constrained by HF free-tier GPU quota). Additional presets (`newsreel`, `fireside`, `edison`, `wartime`) will land as their reference clips are curated and validated.
⚠️ Two things every user must know
1. Architecture pin. VintageVoice is fine-tuned on F5TTS_v1_Base,
not the older F5TTS_Base (v0). Loading the checkpoint into the wrong
architecture silently produces garbled output, because F5-TTS's internal
load_checkpoint uses strict=False and drops mismatched keys without
raising. The scripts in this repo pin the right architecture; if you
call the F5-TTS API directly, always pass model="F5TTS_v1_Base".
2. Reference transcript. Pass --ref-text (the transcript of your
reference clip) alongside --ref-audio. If you leave it empty the
library auto-transcribes the reference via Whisper, which can leak
~0.5 seconds of the reference speaker's voice into the start of
generated audio. The scripts print a loud warning when ref_text is
empty so you can't miss this.
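Both rules can be enforced with a small guard before calling the API. This wrapper is a sketch (the function name and error messages are ours, not part of F5-TTS); only the required model name and the empty-`ref_text` behavior come from this README:

```python
# Hypothetical fail-fast guard for the two footguns above.

def check_vintagevoice_args(model: str, ref_text: str) -> None:
    # 1. Architecture pin: the checkpoint only matches F5TTS_v1_Base.
    if model != "F5TTS_v1_Base":
        raise ValueError(
            f"VintageVoice requires model='F5TTS_v1_Base', got {model!r}; "
            "a mismatched architecture loads silently (strict=False) and "
            "produces garbled audio."
        )
    # 2. Reference transcript: empty ref_text triggers Whisper
    #    auto-transcription, which can bleed ~0.5 s of the reference
    #    speaker's voice into the start of the generated audio.
    if not ref_text.strip():
        raise ValueError("Pass the exact transcript of your reference clip.")

check_vintagevoice_args("F5TTS_v1_Base", "Hello there.")  # passes silently
```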
Quick Start
```bash
# 1. Install F5-TTS (the inference engine)
pip install f5-tts

# 2. Clone this repo
git clone https://github.com/Scottcjn/vintage-voice.git
cd vintage-voice

# 3. Pull weights + vocab from Hugging Face (once the release is published)
pip install -U huggingface_hub
huggingface-cli download AutomatedJanitor/vintage-voice \
  model.safetensors vocab.txt \
  --local-dir ./weights

# 4. Generate: bring your own reference audio + transcript
python scripts/generate.py \
  "One simply must attest one's hardware before the epoch settles, dahling." \
  --preset transatlantic \
  --model ./weights/model.safetensors \
  --vocab ./weights/vocab.txt \
  --ref-audio path/to/your_voice_5to15_seconds.wav \
  --ref-text "Exact transcript of your reference clip." \
  --output out.wav
```
Direct F5-TTS API call, for reference:
```python
from f5_tts.api import F5TTS

tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="weights/model.safetensors",
    vocab_file="weights/vocab.txt",
    device="cuda:0",
    use_ema=True,
)
wav, sr, _ = tts.infer(
    ref_file="your_voice.wav",
    ref_text="The exact words spoken in your reference clip.",
    gen_text="Good evening, ladies and gentlemen.",
    speed=0.9,
    remove_silence=True,
)
```
What This Is (and Isn't)
A filter stacks crackle and EQ curves on top of modern speech. VintageVoice instead learns the underlying acoustic behavior of pre-1955 speech:
- Clipped consonants & rounded vowels (the transatlantic shape)
- Measured theatrical cadence (radio-drama rhythm)
- Period microphone technique (speaker-to-carbon-mic positioning)
- Theatrical breath patterns (pre-close-mic performance style)
- 1930s studio room acoustics (implicit in the spectral distribution)
The model does not clone specific historical speakers, and it was not trained on any living person's voice. The reference clip you supply provides the speaker identity; the fine-tune provides the style.
Voice Presets
All presets use the same fine-tuned weights. A "preset" is a bundle of (reference audio + suggested reference transcript + speed); different reference clips steer the output toward different period styles.
| Preset | Era | Source material | Status |
|---|---|---|---|
| `transatlantic` | 1920s–1960s | Films, speeches, radio | ✅ Validated |
| `newsreel` | 1930s–1950s | Movietone, Pathé, March of Time | Planned: needs curated ref clip |
| `fireside` | 1933–1944 | FDR Fireside Chats | Planned: needs curated ref clip |
| `radio_drama` | 1930s–1950s | The Shadow, Mercury Theatre | Planned: needs curated ref clip |
| `edison` | 1888–1920s | Edison cylinder recordings | Not in v0.1.0 training corpus (see Known Limitations); specialist fine-tune in progress for v0.1.x |
| `wartime` | 1939–1945 | Churchill, Murrow | Planned: needs curated ref clip |
| `announcer` | 1930s–1960s | Radio commercials, station IDs | Planned: needs curated ref clip |
| `cajun_french` | 1880s–present | Family + field recordings | Collecting: see Cajun French Preservation |
Only `transatlantic` ships in this v0.1.0 release. The other presets are scaffolded in the scripts and will go live as their reference clips and transcripts are curated and validated.
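Since a preset is just a (reference clip, transcript, speed) bundle over shared weights, the scaffolding can be as simple as a lookup table. The structure and file paths below are illustrative assumptions; only the `transatlantic` name and the 0.9 speed (from the API example) are grounded in this README:

```python
# Hypothetical preset registry: same weights, different reference bundle.
# Paths and transcripts are placeholders; speed=0.9 mirrors the API example.
PRESETS = {
    "transatlantic": {
        "ref_audio": "presets/transatlantic_ref.wav",  # assumed path
        "ref_text": "Exact transcript of the reference clip.",
        "speed": 0.9,
        "validated": True,
    },
    "newsreel": {
        "ref_audio": None,  # not yet curated (see table above)
        "ref_text": None,
        "speed": 0.9,
        "validated": False,
    },
}

def resolve_preset(name: str) -> dict:
    """Return a preset bundle, refusing unvalidated scaffolds."""
    preset = PRESETS[name]
    if not preset["validated"]:
        raise ValueError(f"Preset {name!r} is scaffolded but not validated yet.")
    return preset
```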
Model Details
| Spec | Value |
|---|---|
| Base model | SWivid/F5-TTS, variant F5TTS_v1_Base (337M params) |
| Architecture | Flow-matching DiT, 22 depth / 16 heads / 1024 dim |
| Fine-tune training | 50 epochs, 990,100 updates, LR 1e-5, batch 3200 frames/GPU |
| Final loss (flow-matching) | ~0.47–0.65 |
| Vocab | 2,545-token custom tokenizer; 167 unique chars observed in training text; 0 OOV |
| Sample rate | 24 kHz mono |
| Vocoder | Vocos (Vocos-Mel-24kHz) |
| Training hardware | 2× Tesla V100 32GB on the Elyan Labs compute cluster |
| Training wall-clock | ~10 days |
Training Data
All training data is public domain, sourced from Archive.org and similar archives of pre-1955 recordings.
| Source | Content | Era | License |
|---|---|---|---|
| Prelinger Archives | Newsreels, educational films | 1930s–1960s | Public Domain |
| Old Time Radio | Radio dramas, comedies | 1930s–1950s | Public Domain |
| FDR Presidential Library | Fireside Chats, speeches | 1933–1944 | Public Domain |
| Edison cylinder recordings | (planned for v0.1.x; not in v0.1.0 corpus, see Known Limitations) | 1888–1920s | Public Domain |
| LibriVox | Vintage audiobook recordings | Various | Public Domain |
| Library of Congress | Historical audio | 1900s–1950s | Public Domain |

Dataset statistics:
| Metric | Value |
|---|---|
| Total segments | 44,345 |
| Total audio | 164.59 hours |
| Source files | 2,581 recordings |
| Format | 24 kHz mono WAV, 5β15 s segments |
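The segment count and total duration are mutually consistent with the 5–15 s segmentation policy; a quick arithmetic check:

```python
# Sanity check: average segment length implied by the stats above.
total_hours = 164.59
segments = 44_345
avg_seconds = total_hours * 3600 / segments
print(f"{avg_seconds:.2f} s per segment on average")  # ~13.36 s, inside 5-15 s
assert 5.0 <= avg_seconds <= 15.0
```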
Known Limitations
- Label noise. About 24% of training segments have sparse or mislabeled transcripts (Whisper occasionally picked up spoken copyright notices or a brief utterance in an otherwise near-silent window). The remaining 76% is clean. Fine-tuning converged, but expect occasional:
  - Localized mispronunciations on rare words
  - Slight onset/prosody artifacts on short prompts
  - Weak text adherence on very long single-shot generations

  A filter-and-retrain pass is the planned first task for v0.2.0.
- Reference-audio bleed on empty `ref_text` (see warning above).
- UTMOS and other clean-speech perceptual metrics will punish this model for producing the vintage coloration that is its entire point. Evaluate with WER on an ASR of your choice and speaker similarity, not with modern aesthetic-MOS predictors.
- No Cajun French / endangered-language output yet. Data collection is ongoing; see the preservation section below.
- Edison cylinders were not in the v0.1.0 corpus. During a 2026-04-22 audit we discovered that the five files in `vintage_voice/edison/` (`EdisonCylinders1.mp3`, `EdisonAmberolRecordings.mp3`, etc.) were 0-byte placeholder files left over from an interrupted download, never actually populated with audio. The preset and source-table rows for `edison` in earlier revisions of this README implied training exposure that the v0.1.0 model does not have. A real Edison specialist fine-tune (using verified cylinders from Archive.org's cylinder collection) is in progress for a v0.1.x release. An earlier `edison_model_89000.pt` checkpoint from 2026-04-09 exists on our training rig but was trained on the same empty inputs and has no meaningful Edison-era signal; it should be considered deprecated.
- Acoustic character is an ffmpeg post-process, not a training outcome. F5-TTS uses the Vocos vocoder, which outputs clean 24 kHz waveforms regardless of training-data acoustic properties. Training on cylinder, newsreel, or wartime-radio audio teaches the model the delivery patterns (pace, stress, cadence), but the crackle, narrow bandwidth, AM-radio compression, and other period-acoustic artifacts have to be added after synthesis. The `scripts/presets/` directory contains per-era ffmpeg filter chains for this.
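As a rough idea of what such a post-process looks like, here is a hypothetical AM-radio-style chain (band-limit, broadcast compression, headroom trim). The cutoffs and levels are our guesses, not the actual contents of `scripts/presets/`:

```python
# Hypothetical "wartime radio" acoustic chain. All filter values are
# illustrative; the real per-era chains live in scripts/presets/.
AM_RADIO_FILTERS = [
    "highpass=f=300",       # AM voice band, lower edge (assumed)
    "lowpass=f=3400",       # AM voice band, upper edge (assumed)
    "acompressor=ratio=4",  # broadcast-style compression (assumed)
    "volume=0.9",           # leave headroom for grain/crackle overlays
]

def build_acoustic_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv applying the period-acoustic audio chain."""
    return ["ffmpeg", "-y", "-i", src, "-af", ",".join(AM_RADIO_FILTERS), dst]

print(" ".join(build_acoustic_cmd("clean.wav", "wartime.wav")))
```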
Fine-tuning Pipeline
```bash
# 1. Download public-domain recordings
python scripts/download_archive.py --collection old_time_radio --limit 200

# 2. Preprocess raw audio into 5-15 s segments
python scripts/preprocess.py

# 3. Transcribe segments (Whisper large-v3-turbo)
python scripts/transcribe_whisper.py

# 4. Build F5-TTS Arrow dataset from (audio, text) pairs
python scripts/build_f5_csv.py

# 5. Fine-tune (F5-TTS's own CLI, invoked by run_pipeline.sh step 4)
python -m f5_tts.train.finetune_cli \
  --exp_name F5TTS_v1_Base \
  --dataset_name vintage_voice_f5_37k \
  --learning_rate 1e-5 \
  --batch_size_per_gpu 3200 --batch_size_type frame \
  --epochs 50 --num_warmup_updates 200 \
  --save_per_updates 1000 --last_per_updates 500 --keep_last_n_checkpoints 3 \
  --finetune \
  --tokenizer custom --tokenizer_path data/vocab.txt
```
See `scripts/run_pipeline.sh` for the full automated version. (Scripts named `align.py`, `train.py`, and `export.py` referenced in earlier drafts of this README do not exist; the pipeline is driven by `run_pipeline.sh` plus F5-TTS's own `finetune_cli`.)
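Step 4 pairs each audio segment with its Whisper transcript before dataset conversion. The sketch below only illustrates that pairing step as a delimited manifest; the exact column names and delimiter F5-TTS's dataset prep expects should be checked against its documentation, and the file names here are placeholders:

```python
# Sketch of the (audio, text) pairing behind build_f5_csv.py.
# Column layout and delimiter are illustrative assumptions.
import csv

def write_manifest(pairs: list[tuple[str, str]], out_path: str) -> int:
    """Write (wav_path, transcript) rows to a pipe-delimited manifest."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text"])
        for wav, text in pairs:
            writer.writerow([wav, text])
    return len(pairs)

n = write_manifest(
    [("segments/otr_0001.wav", "Good evening, ladies and gentlemen.")],
    "metadata.csv",
)
print(n)  # 1
```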
Project Status
| Component | Status |
|---|---|
| Training data collection | ✅ Done: 2,581 files, 164 hours |
| Audio preprocessing | ✅ Done: 44,345 segments |
| Whisper transcription | ✅ Done: 43,876 transcribed |
| F5-TTS dataset preparation | ✅ Done: Arrow format ready |
| F5-TTS fine-tuning | ✅ Done: 50/50 epochs, 990,100 updates, loss ~0.47–0.65 |
| `transatlantic` preset | ✅ Ready: validated on a reference speaker |
| Other presets | 🛠️ Scaffolded: need curated reference clips |
| HuggingFace model release | 🛠️ v0.1.0 in preparation (pruning checkpoint to EMA-only safetensors) |
| Clean-label retrain | Planned for v0.2.0 |
| Python package wrapper | Planned |
Applications
Period-accurate voices have real demand. If you use this for any of the following, please cite the model (see Citation):
- Film & TV: productions set before 1960 (period dramas, documentaries)
- Video games: historical settings that need period-appropriate NPC voices
- Audiobooks: vintage narration style for period literature
- Museums & exhibits: historical figures speaking in period-style delivery
- Theatre: pre-production voice references for period plays
Note: the released weights carry a CC-BY-NC-4.0 license, so commercial use requires a separate arrangement. See License below.
Cajun French Preservation (UNESCO Endangered)
Cajun French mode (1880s & 1920s): planned preset, data collection in progress
Cajun French is classified as "severely endangered" by UNESCO. Roughly 150,000 speakers remain, mostly elderly. Louisiana Creole has ~10,000 left. When this generation passes, so do these languages.
This project's founder traces directly to the Acadian Expulsion of 1755–1764. His 6th great-grandfather, Augustine Dit Remi Boudreaux, arrived in the Attakapas region (Opelousas/Lafayette) at age 16. 260 years later, his descendants are building AI to preserve the language he carried south.
If you or your family speaks Cajun French, Louisiana Creole, or has a strong Cajun English accent, see FAMILY_RECORDING_GUIDE.md for how to contribute. Phone voice memos are fine. Any length helps.
Citation
```bibtex
@misc{vintagevoice2026,
  title  = {VintageVoice: An Open-Source TTS Fine-Tune for Historical Speech Patterns},
  author = {Boudreaux, Scott and Elyan Labs},
  year   = {2026},
  url    = {https://github.com/Scottcjn/vintage-voice},
  note   = {Fine-tune of F5-TTS (SWivid/F5-TTS) on 164 hours of public-domain pre-1955 audio}
}
```
If you use the model, also cite the upstream F5-TTS work; this project would not exist without it. See SWivid/F5-TTS and f5_tts on HuggingFace.
License
This project uses a split license:
- Source code in this repository: MIT License (see `LICENSE`)
- Training data: public domain
- Released model weights (on HuggingFace): CC-BY-NC-4.0, inherited from the F5-TTS base model. Non-commercial use with attribution; commercial use requires a separate arrangement with the upstream F5-TTS authors.
The full LICENSE file has a plain-English summary of this split.
Built By
Elyan Labs: the pawn-shop lab that preserves what the big labs forgot.
Built on a $69 refurb hard drive with eBay-datacenter-pull V100s. Total training cost under $150. Proof that world-class AI doesn't require world-class budgets.
Links
- HuggingFace model
- GitHub repo
- Elyan Labs
- RustChain: proof-of-antiquity blockchain
- BoTTube: AI video platform