VintageVoice
An open-source TTS fine-tune for historical speech patterns. Proof of Antiquity for AI voices.
VintageVoice is a fine-tune of F5-TTS (SWivid/F5-TTS) trained on 164 hours of public-domain pre-1955 audio from Archive.org. It learns historical speech patterns (transatlantic cadence, newsreel delivery, radio-drama prosody) and applies them to any modern voice via reference-audio cloning. The model does not generate a specific historical speaker; it teaches your own reference voice to talk like 1940.
Status: v0.1.0 experimental. Training completed at 990,100 updates (50/50 epochs).
The `transatlantic` preset is validated; other presets share the same base weights and differ only in which reference clip you provide. See Project Status for specifics.
Sophia Elya in Transatlantic Mode: same voice, vintage delivery
🎬 Demos
Two 5-second proof-of-concept clips below. Audio is generated by this
model (VintageVoice transatlantic preset, 5 seconds each, 24 kHz).
Lip-sync animation is produced by Lightricks LTX-2 19B via
multimodalart's ltx2-audio-to-video
Hugging Face pipeline. Both clips are post-processed to grayscale with
a cinematic contrast curve + light film grain to match the
silver-gelatin look of 1940s studio broadcasts (LTX-2's native output
is modern full-color; the grayscale pass is a deterministic ffmpeg
filter chain, not a model-level trick).
Demo 1: "One simply must attest one's hardware before the epoch settles, dahling."
Also on BoTTube: bottube.ai/watch/So3ZqYjNt8D (color version)
Demo 2: "Good evening. I am Sophia Elya, and I shall be your guide through the blockchain this evening."
Also on BoTTube: bottube.ai/watch/-g6MtiI_Nx8 (color version)
How the demos are made
Each demo is a two-stage pipeline:
- Voice (this model). A short text prompt is synthesized into a 5-second waveform by VintageVoice (F5-TTS fine-tune, 990,100 updates on 164 hours of public-domain pre-1955 audio). The reference audio is a clean clip of the target speaker's modern voice; the model transplants only the transatlantic delivery pattern, preserving the speaker's own timbre.
- Face (LTX-2). The generated waveform + the speaker's portrait (a 1940s-styled Sophia reference image) are fed to Lightricks LTX-2 19B via multimodalart's audio-to-video pipeline. LTX-2 generates 141 frames (≈5.4 s @ 24 fps) with lip motion synchronized to the input audio.
- Grade (ffmpeg). The color output is post-processed to pure grayscale with a slight contrast curve and monochrome film grain, to match the silver-gelatin studio look rather than mid-century sepia print tones.
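The grade step above is a deterministic ffmpeg filter chain. As a minimal sketch, here is a hypothetical command builder; the specific filter values (contrast lift, grain strength) are illustrative assumptions, not the project's actual preset:

```python
# Sketch of the grayscale grade. Filter values are illustrative guesses,
# not the repo's actual numbers; hue/eq/noise are standard ffmpeg filters.

def build_grade_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv that desaturates, lifts contrast, and adds grain."""
    vf = ",".join([
        "hue=s=0",              # drop all saturation -> true grayscale
        "eq=contrast=1.15",     # slight cinematic contrast lift (assumed value)
        "noise=alls=6:allf=t",  # light temporal film grain (assumed value)
    ])
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-c:a", "copy", dst]

cmd = build_grade_cmd("demo1_color.mp4", "demo1_bw.mp4")
print(" ".join(cmd))
```

Because the chain is pure ffmpeg, the grade is reproducible: the same color input always yields the same grayscale output.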
Both demos use the same voice reference and the same model checkpoint. Only the spoken text changes between them.
Coming soon: 8-second demos with natural idle-face animation after the audio ends (currently constrained by HF free-tier GPU quota). Additional presets (`newsreel`, `fireside`, `edison`, `wartime`) will land as their reference clips are curated and validated.
⚠️ Two things every user must know
1. Architecture pin. VintageVoice is fine-tuned on F5TTS_v1_Base,
not the older F5TTS_Base (v0). Loading the checkpoint into the wrong
architecture silently produces garbled output, because F5-TTS's internal
load_checkpoint uses strict=False and drops mismatched keys without
raising. The scripts in this repo pin the right architecture; if you
call the F5-TTS API directly, always pass model="F5TTS_v1_Base".
2. Reference transcript. Pass --ref-text (the transcript of your
reference clip) alongside --ref-audio. If you leave it empty the
library auto-transcribes the reference via Whisper, which can leak
~0.5 seconds of the reference speaker's voice into the start of
generated audio. The scripts print a loud warning when ref_text is
empty so you can't miss this.
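Both rules can be enforced with a small guard before calling the API. This wrapper is a sketch (the function name and error messages are ours, not part of F5-TTS); only the required model name and the empty-`ref_text` behavior come from this README:

```python
# Hypothetical fail-fast guard for the two footguns above.

def check_vintagevoice_args(model: str, ref_text: str) -> None:
    # 1. Architecture pin: the checkpoint only matches F5TTS_v1_Base.
    if model != "F5TTS_v1_Base":
        raise ValueError(
            f"VintageVoice requires model='F5TTS_v1_Base', got {model!r}; "
            "a mismatched architecture loads silently (strict=False) and "
            "produces garbled audio."
        )
    # 2. Reference transcript: empty ref_text triggers Whisper
    #    auto-transcription, which can bleed ~0.5 s of the reference
    #    speaker's voice into the start of the generated audio.
    if not ref_text.strip():
        raise ValueError("Pass the exact transcript of your reference clip.")

check_vintagevoice_args("F5TTS_v1_Base", "Hello there.")  # passes silently
```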
Quick Start
```bash
# 1. Install F5-TTS (the inference engine)
pip install f5-tts

# 2. Clone this repo
git clone https://github.com/Scottcjn/vintage-voice.git
cd vintage-voice

# 3. Pull weights + vocab from Hugging Face (once the release is published)
pip install -U huggingface_hub
huggingface-cli download AutomatedJanitor/vintage-voice \
  model.safetensors vocab.txt \
  --local-dir ./weights

# 4. Generate: bring your own reference audio + transcript
python scripts/generate.py \
  "One simply must attest one's hardware before the epoch settles, dahling." \
  --preset transatlantic \
  --model ./weights/model.safetensors \
  --vocab ./weights/vocab.txt \
  --ref-audio path/to/your_voice_5to15_seconds.wav \
  --ref-text "Exact transcript of your reference clip." \
  --output out.wav
```
Direct F5-TTS API call, for reference:
```python
from f5_tts.api import F5TTS

tts = F5TTS(
    model="F5TTS_v1_Base",
    ckpt_file="weights/model.safetensors",
    vocab_file="weights/vocab.txt",
    device="cuda:0",
    use_ema=True,
)
wav, sr, _ = tts.infer(
    ref_file="your_voice.wav",
    ref_text="The exact words spoken in your reference clip.",
    gen_text="Good evening, ladies and gentlemen.",
    speed=0.9,
    remove_silence=True,
)
```
What This Is (and Isn't)
A filter stacks crackle and EQ curves on top of modern speech. VintageVoice instead learns the underlying acoustic behavior of pre-1955 speech:
- Clipped consonants & rounded vowels (the transatlantic shape)
- Measured theatrical cadence (radio-drama rhythm)
- Period microphone technique (speaker-to-carbon-mic positioning)
- Theatrical breath patterns (pre-close-mic performance style)
- 1930s studio room acoustics (implicit in the spectral distribution)
The model does not clone specific historical speakers, and it was not trained on any living person's voice. The reference clip you supply provides the speaker identity; the fine-tune provides the style.
Voice Presets
All presets use the same fine-tuned weights. A "preset" is a bundle of (reference audio + suggested reference transcript + speed); different reference clips steer the output toward different period styles.
| Preset | Era | Source material | Status |
|---|---|---|---|
| `transatlantic` | 1920s–1960s | Films, speeches, radio | ✅ Validated |
| `newsreel` | 1930s–1950s | Movietone, Pathé, March of Time | Planned: needs curated ref clip |
| `fireside` | 1933–1944 | FDR Fireside Chats | Planned: needs curated ref clip |
| `radio_drama` | 1930s–1950s | The Shadow, Mercury Theatre | Planned: needs curated ref clip |
| `edison` | 1888–1920s | Edison cylinder recordings | Not in v0.1.0 training corpus (see Known Limitations); specialist fine-tune in progress for v0.1.x |
| `wartime` | 1939–1945 | Churchill, Murrow | Planned: needs curated ref clip |
| `announcer` | 1930s–1960s | Radio commercials, station IDs | Planned: needs curated ref clip |
| `cajun_french` | 1880s–present | Family + field recordings | Collecting: see Cajun French Preservation |
Only `transatlantic` ships in this v0.1.0 release. The other presets are scaffolded in the scripts and will go live as their reference clips and transcripts are curated and validated.
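Since a preset is just a (reference clip, transcript, speed) bundle over shared weights, the scaffolding can be as simple as a lookup table. The structure and file paths below are illustrative assumptions; only the `transatlantic` name and the 0.9 speed (from the API example) are grounded in this README:

```python
# Hypothetical preset registry: same weights, different reference bundle.
# Paths and transcripts are placeholders; speed=0.9 mirrors the API example.
PRESETS = {
    "transatlantic": {
        "ref_audio": "presets/transatlantic_ref.wav",  # assumed path
        "ref_text": "Exact transcript of the reference clip.",
        "speed": 0.9,
        "validated": True,
    },
    "newsreel": {
        "ref_audio": None,  # not yet curated (see table above)
        "ref_text": None,
        "speed": 0.9,
        "validated": False,
    },
}

def resolve_preset(name: str) -> dict:
    """Return a preset bundle, refusing unvalidated scaffolds."""
    preset = PRESETS[name]
    if not preset["validated"]:
        raise ValueError(f"Preset {name!r} is scaffolded but not validated yet.")
    return preset
```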
Model Details
| Spec | Value |
|---|---|
| Base model | SWivid/F5-TTS, variant F5TTS_v1_Base (337M params) |
| Architecture | Flow-matching DiT, 22 depth / 16 heads / 1024 dim |
| Fine-tune training | 50 epochs, 990,100 updates, LR 1e-5, batch 3200 frames/GPU |
| Final loss (flow-matching) | ~0.47–0.65 |
| Vocab | 2,545-token custom tokenizer; 167 unique chars observed in training text; 0 OOV |
| Sample rate | 24 kHz mono |
| Vocoder | Vocos (Vocos-Mel-24kHz) |
| Training hardware | 2× Tesla V100 32GB on the Elyan Labs compute cluster |
| Training wall-clock | ~10 days |
Training Data
All training data is public domain, sourced from Archive.org and similar archives of pre-1955 recordings.
| Source | Content | Era | License |
|---|---|---|---|
| Prelinger Archives | Newsreels, educational films | 1930s–1960s | Public Domain |
| Old Time Radio | Radio dramas, comedies | 1930s–1950s | Public Domain |
| FDR Presidential Library | Fireside Chats, speeches | 1933–1944 | Public Domain |
| Edison cylinder recordings | (planned for v0.1.x; not in v0.1.0 corpus, see Known Limitations) | 1888–1920s | Public Domain |
| LibriVox | Vintage audiobook recordings | Various | Public Domain |
| Library of Congress | Historical audio | 1900s–1950s | Public Domain |

Dataset statistics:
| Metric | Value |
|---|---|
| Total segments | 44,345 |
| Total audio | 164.59 hours |
| Source files | 2,581 recordings |
| Format | 24 kHz mono WAV, 5β15 s segments |
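The segment count and total duration are mutually consistent with the 5–15 s segmentation policy; a quick arithmetic check:

```python
# Sanity check: average segment length implied by the stats above.
total_hours = 164.59
segments = 44_345
avg_seconds = total_hours * 3600 / segments
print(f"{avg_seconds:.2f} s per segment on average")  # ~13.36 s, inside 5-15 s
assert 5.0 <= avg_seconds <= 15.0
```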
Known Limitations
- Label noise. About 24% of training segments have sparse or mislabeled transcripts (Whisper occasionally picked up spoken copyright notices or a brief utterance in an otherwise near-silent window). The remaining 76% is clean. Fine-tuning converged, but expect occasional:
  - Localized mispronunciations on rare words
  - Slight onset/prosody artifacts on short prompts
  - Weak text adherence on very long single-shot generations

  A filter-and-retrain pass is the planned first task for v0.2.0.
- Reference-audio bleed on empty `ref_text` (see warning above).
- UTMOS and other clean-speech perceptual metrics will punish this model for producing the vintage coloration that is its entire point. Evaluate with WER on an ASR of your choice and speaker similarity, not with modern aesthetic-MOS predictors.
- No Cajun French / endangered-language output yet. Data collection is ongoing; see the preservation section below.
- Edison cylinders were not in the v0.1.0 corpus. During a 2026-04-22 audit we discovered that the five files in `vintage_voice/edison/` (`EdisonCylinders1.mp3`, `EdisonAmberolRecordings.mp3`, etc.) were 0-byte placeholder files left over from an interrupted download, never actually populated with audio. The preset and source-table rows for `edison` in earlier revisions of this README implied training exposure that the v0.1.0 model does not have. A real Edison specialist fine-tune (using verified cylinders from Archive.org's cylinder collection) is in progress for a v0.1.x release. An earlier `edison_model_89000.pt` checkpoint from 2026-04-09 exists on our training rig but was trained on the same empty inputs and has no meaningful Edison-era signal; it should be considered deprecated.
- Acoustic character is an ffmpeg post-process, not a training outcome. F5-TTS uses the Vocos vocoder, which outputs clean 24 kHz waveforms regardless of training-data acoustic properties. Training on cylinder, newsreel, or wartime-radio audio teaches the model the delivery patterns (pace, stress, cadence), but the crackle, narrow bandwidth, AM-radio compression, and other period-acoustic artifacts have to be added after synthesis. The `scripts/presets/` directory contains per-era ffmpeg filter chains for this.
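As a rough idea of what such a post-process looks like, here is a hypothetical AM-radio-style chain (band-limit, broadcast compression, headroom trim). The cutoffs and levels are our guesses, not the actual contents of `scripts/presets/`:

```python
# Hypothetical "wartime radio" acoustic chain. All filter values are
# illustrative; the real per-era chains live in scripts/presets/.
AM_RADIO_FILTERS = [
    "highpass=f=300",       # AM voice band, lower edge (assumed)
    "lowpass=f=3400",       # AM voice band, upper edge (assumed)
    "acompressor=ratio=4",  # broadcast-style compression (assumed)
    "volume=0.9",           # leave headroom for grain/crackle overlays
]

def build_acoustic_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv applying the period-acoustic audio chain."""
    return ["ffmpeg", "-y", "-i", src, "-af", ",".join(AM_RADIO_FILTERS), dst]

print(" ".join(build_acoustic_cmd("clean.wav", "wartime.wav")))
```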
Fine-tuning Pipeline
```bash
# 1. Download public-domain recordings
python scripts/download_archive.py --collection old_time_radio --limit 200

# 2. Preprocess raw audio into 5-15 s segments
python scripts/preprocess.py

# 3. Transcribe segments (Whisper large-v3-turbo)
python scripts/transcribe_whisper.py

# 4. Build F5-TTS Arrow dataset from (audio, text) pairs
python scripts/build_f5_csv.py

# 5. Fine-tune (F5-TTS's own CLI, invoked by run_pipeline.sh step 4)
python -m f5_tts.train.finetune_cli \
  --exp_name F5TTS_v1_Base \
  --dataset_name vintage_voice_f5_37k \
  --learning_rate 1e-5 \
  --batch_size_per_gpu 3200 --batch_size_type frame \
  --epochs 50 --num_warmup_updates 200 \
  --save_per_updates 1000 --last_per_updates 500 --keep_last_n_checkpoints 3 \
  --finetune \
  --tokenizer custom --tokenizer_path data/vocab.txt
```
See `scripts/run_pipeline.sh` for the full automated version. (Scripts named `align.py`, `train.py`, and `export.py` referenced in earlier drafts of this README do not exist; the pipeline is driven by `run_pipeline.sh` plus F5-TTS's own `finetune_cli`.)
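Step 4 pairs each audio segment with its Whisper transcript before dataset conversion. The sketch below only illustrates that pairing step as a delimited manifest; the exact column names and delimiter F5-TTS's dataset prep expects should be checked against its documentation, and the file names here are placeholders:

```python
# Sketch of the (audio, text) pairing behind build_f5_csv.py.
# Column layout and delimiter are illustrative assumptions.
import csv

def write_manifest(pairs: list[tuple[str, str]], out_path: str) -> int:
    """Write (wav_path, transcript) rows to a pipe-delimited manifest."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text"])
        for wav, text in pairs:
            writer.writerow([wav, text])
    return len(pairs)

n = write_manifest(
    [("segments/otr_0001.wav", "Good evening, ladies and gentlemen.")],
    "metadata.csv",
)
print(n)  # 1
```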
Project Status
| Component | Status |
|---|---|
| Training data collection | ✅ Done: 2,581 files, 164 hours |
| Audio preprocessing | ✅ Done: 44,345 segments |
| Whisper transcription | ✅ Done: 43,876 transcribed |
| F5-TTS dataset preparation | ✅ Done: Arrow format ready |
| F5-TTS fine-tuning | ✅ Done: 50/50 epochs, 990,100 updates, loss ~0.47–0.65 |
| `transatlantic` preset | ✅ Ready: validated on a reference speaker |
| Other presets | 🛠️ Scaffolded: need curated reference clips |
| HuggingFace model release | 🛠️ v0.1.0 in preparation (pruning checkpoint to EMA-only safetensors) |
| Clean-label retrain | Planned for v0.2.0 |
| Python package wrapper | Planned |
Applications
Period-accurate voices have real demand. If you use this for any of the following, please cite the model (see Citation):
- Film & TV: productions set before 1960 (period dramas, documentaries)
- Video games: historical settings that need period-appropriate NPC voices
- Audiobooks: vintage narration style for period literature
- Museums & exhibits: historical figures speaking in period-style delivery
- Theatre: pre-production voice references for period plays
Note: the released weights carry a CC-BY-NC-4.0 license, so commercial use requires a separate arrangement. See License below.
Cajun French Preservation (UNESCO Endangered)
Cajun French mode (1880s & 1920s): planned preset, data collection in progress
Cajun French is classified as "severely endangered" by UNESCO. Roughly 150,000 speakers remain, mostly elderly. Louisiana Creole has ~10,000 left. When this generation passes, so do these languages.
This project's founder traces directly to the Acadian Expulsion of 1755–1764. His 6th great-grandfather, Augustine Dit Remi Boudreaux, arrived in the Attakapas region (Opelousas/Lafayette) at age 16. 260 years later, his descendants are building AI to preserve the language he carried south.
If you or your family speaks Cajun French, Louisiana Creole, or has a strong Cajun English accent, see FAMILY_RECORDING_GUIDE.md for how to contribute. Phone voice memos are fine. Any length helps.
Citation
```bibtex
@misc{vintagevoice2026,
  title  = {VintageVoice: An Open-Source TTS Fine-Tune for Historical Speech Patterns},
  author = {Boudreaux, Scott and Elyan Labs},
  year   = {2026},
  url    = {https://github.com/Scottcjn/vintage-voice},
  note   = {Fine-tune of F5-TTS (SWivid/F5-TTS) on 164 hours of public-domain pre-1955 audio}
}
```
If you use the model, also cite the upstream F5-TTS work; this project would not exist without it. See SWivid/F5-TTS and f5_tts on HuggingFace.
License
This project uses a split license:
- Source code in this repository: MIT License (see `LICENSE`)
- Training data: public domain
- Released model weights (on HuggingFace): CC-BY-NC-4.0, inherited from the F5-TTS base model. Non-commercial use with attribution; commercial use requires a separate arrangement with the upstream F5-TTS authors.
The full LICENSE file has a plain-English summary of this split.
Built By
Elyan Labs: the pawn-shop lab that preserves what the big labs forgot.
Built on a $69 refurb hard drive with eBay-datacenter-pull V100s. Total training cost under $150. Proof that world-class AI doesn't require world-class budgets.
Links
- HuggingFace model
- GitHub repo
- Elyan Labs
- RustChain: proof-of-antiquity blockchain
- BoTTube: AI video platform