# Trelis Chorus v1
Need a voice model for your domain? Trelis builds custom ASR, TTS, and voice agent pipelines for specialist verticals (legal, medical, finance, construction) and low-resource languages. Enquire or book a consultation →
Trelis Chorus v1 is a multi-speaker fine-tune of `openai/whisper-large-v3-turbo`. Given a mono audio clip of up to 30 s containing two people speaking (possibly overlapping), it returns a separate transcript for each speaker — with timestamps — by conditioning the decoder on a `<|speaker1|>` or `<|speaker2|>` token.
## Typical use cases
- Meeting transcription. Two-person calls, interviews, or podcast segments where speakers overlap. Chorus returns a separate timestamped transcript per speaker without an upstream diarization step.
- Clean single-speaker transcripts from imperfect isolation. If you already have a close-mic recording of the speaker you care about but there's audible cross-talk from the other party, running Chorus with the `<|speaker1|>` token (first-to-speak in the clip) gives a transcript that omits the other speaker's words; this is cleaner than vanilla Whisper, which mixes both.
## Highlights
- Single forward pass per speaker (two passes for a full 2-speaker transcript); `speaker1` is the first to speak, `speaker2` is the second.
- Native timestamp emission — segments are emitted as `<|start|> text <|end|>` pairs
- Handles overlapping speech up to ~80% overlap in our AMI evaluation
- Keeps vanilla Whisper's output style (mixed-case English with punctuation, acronyms preserved)
`speaker1` = the first speaker to begin talking in the clip; `speaker2` = the other.
Illustrative example of templating for training and inference:

```
<|speaker1|>First speaker utterances<|speaker2|>Second speaker utterances
```
During training and inference, the model runs a separate forward pass for each speaker. It learns to associate only the first speaker's audio (including later turns) with the `speaker1` token, and the second speaker's speech with the `speaker2` token.
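The two passes differ only in the speaker token at the end of the decoder prompt. A minimal sketch of the conditioning prefix each pass receives (the token order matches the Transformers example further down this card; `decoder_prompt` is an illustrative helper, not part of the model API):

```python
# Hypothetical sketch: assembling the per-speaker decoder prompt.
# Chorus runs one decode per speaker; only the final conditioning token differs.
def decoder_prompt(speaker: str) -> str:
    """Return the special-token prefix that conditions the decoder."""
    assert speaker in ("speaker1", "speaker2")
    return f"<|startoftranscript|><|en|><|transcribe|><|{speaker}|>"

prompts = [decoder_prompt(s) for s in ("speaker1", "speaker2")]
print(prompts[0])  # <|startoftranscript|><|en|><|transcribe|><|speaker1|>
```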
## Benchmark
Evaluated on Trelis/ami-2speaker-test — 50 real AMI meeting clips reconstructed as 2-speaker audio (average 20s, average overlap ~30%, up to 78%).
| Metric | Speaker 1 | Speaker 2 | Mean |
|---|---|---|---|
| CER | 8.58% | 10.12% | 9.35% |
| CMER (bounded) | 8.32% | 9.69% | 9.00% |
CER is the standard character error rate computed via `whisper_normalizer` + `jiwer`. CMER = (S+D+I) / (H+S+D+I), bounded to [0, 1]; it is more stable than CER for conversational speech with heavy deletions/insertions.
Per-row predictions are browsable at Trelis/chorus-v1-ami-2speaker-test-preds.
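The bounded CMER formula above can be sketched in pure Python. The card computes it via `jiwer`'s character-level alignment; this illustrative version counts hits (H), substitutions (S), deletions (D), and insertions (I) with a standard edit-distance DP:

```python
# Minimal pure-Python sketch of bounded CMER = (S+D+I) / (H+S+D+I).
def cmer(ref: str, hyp: str) -> float:
    m, n = len(ref), len(hyp)
    # dp[i][j] = (S, D, I, H) counts for the best alignment of ref[:i] / hyp[:j]
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        dp[i][0] = (0, i, 0, 0)  # all deletions
    for j in range(1, n + 1):
        dp[0][j] = (0, 0, j, 0)  # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cands = []
            s, d, ins, h = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                cands.append((s, d, ins, h + 1))   # hit
            else:
                cands.append((s + 1, d, ins, h))   # substitution
            s, d, ins, h = dp[i - 1][j]
            cands.append((s, d + 1, ins, h))       # deletion
            s, d, ins, h = dp[i][j - 1]
            cands.append((s, d, ins + 1, h))       # insertion
            dp[i][j] = min(cands, key=lambda t: t[0] + t[1] + t[2])
    S, D, I, H = dp[m][n]
    total = H + S + D + I
    return (S + D + I) / total if total else 0.0

print(cmer("hello", "hello"))  # 0.0
```

Note that a pure deletion case like `cmer("ab", "")` yields exactly 1.0, whereas CER with heavy insertions can exceed 1 — that is the sense in which CMER is bounded.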
## Usage via Hosted API (Trelis Router)
Chorus is available as a hosted GPU endpoint on Trelis Router — no GPU setup, handles long audio end-to-end (VAD chunking + cross-chunk speaker clustering).
- Model id: `trelis/chorus-v1` · Base URL: `https://router.trelis.com/api/v1`
- Files up to 100 MB · Output: `json` / `srt` / `vtt` / `text`
```shell
curl -X POST https://router.trelis.com/api/v1/transcribe \
  -H "Authorization: Bearer $TRELIS_ROUTER_API_KEY" \
  -F model=trelis/chorus-v1 \
  -F file=@meeting.wav
```
## Open Weights Usage
### Transformers
```python
import torch
import soundfile as sf
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL = "Trelis/Chorus-v1"

processor = WhisperProcessor.from_pretrained(MODEL)
model = (
    WhisperForConditionalGeneration
    .from_pretrained(MODEL, dtype=torch.float16)
    .to("cuda")
    .eval()
)
model.generation_config.predict_timestamps = True
model.generation_config.max_initial_timestamp_index = 1500  # allow up to 30s

tok = processor.tokenizer
ids = {n: tok.convert_tokens_to_ids(t) for n, t in [
    ("en", "<|en|>"), ("transcribe", "<|transcribe|>"),
    ("speaker1", "<|speaker1|>"), ("speaker2", "<|speaker2|>"),
]}

arr, sr = sf.read("your_clip.wav")  # 16 kHz mono, <= 30 s
assert sr == 16_000
feats = processor.feature_extractor(
    [arr], sampling_rate=16_000, return_tensors="pt"
).input_features.to("cuda").half()

# One generate() call per speaker, conditioned via the forced decoder ids
for name in ["speaker1", "speaker2"]:
    forced = [[1, ids["en"]], [2, ids["transcribe"]], [3, ids[name]]]
    with torch.no_grad():
        out = model.generate(
            feats, forced_decoder_ids=forced,
            return_timestamps=True, max_new_tokens=444,
        )
    print(f"{name}: {tok.decode(out[0], skip_special_tokens=True)}")
```
Important: `max_initial_timestamp_index=1500` is required — without it, HF's default caps the first emitted timestamp at 1.0 s, which breaks Speaker 2 when they start talking later in the clip.
### vLLM
```python
import soundfile as sf
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

MODEL = "Trelis/Chorus-v1"

llm = LLM(model=MODEL, dtype="float16", max_model_len=448, enforce_eager=True)
sampling = SamplingParams(temperature=0, max_tokens=200)

arr, sr = sf.read("your_clip.wav")  # 16 kHz mono, <= 30 s
assert sr == 16_000

tok = llm.get_tokenizer()
for name in ["speaker1", "speaker2"]:
    # Same conditioning prefix as the Transformers example, as raw token ids
    prefix = [
        tok.convert_tokens_to_ids("<|startoftranscript|>"),
        tok.convert_tokens_to_ids("<|en|>"),
        tok.convert_tokens_to_ids("<|transcribe|>"),
        tok.convert_tokens_to_ids(f"<|{name}|>"),
    ]
    r = llm.generate(
        prompts=[TokensPrompt(
            prompt_token_ids=prefix,
            multi_modal_data={"audio": (arr, 16_000)},
        )],
        sampling_params=sampling,
    )
    print(f"{name}: {r[0].outputs[0].text}")
```
vLLM output includes literal `<|N.NN|>` timestamp tokens in the returned text (parse with the regex `<\|(\d+\.\d+)\|>`); Transformers strips them via `skip_special_tokens=True`.
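For example, the literal timestamp tokens in vLLM output can be split into (start, end, text) segments with a small helper (a sketch; the pairing assumes the `<|start|> text <|end|>` segment structure described in Highlights):

```python
import re

# Sketch: turn raw output like "<|0.00|> Hello there.<|2.40|><|3.10|> Yes.<|4.00|>"
# into (start_s, end_s, text) tuples, assuming alternating start/end timestamp
# tokens around each segment.
TS = re.compile(r"<\|(\d+\.\d+)\|>")

def parse_segments(raw: str):
    parts = TS.split(raw)      # alternating: text, ts, text, ts, ...
    times = parts[1::2]        # the captured timestamps, in order
    texts = parts[2::2]        # the text following each timestamp
    segments = []
    for i in range(0, len(times) - 1, 2):
        start, end = float(times[i]), float(times[i + 1])
        segments.append((start, end, texts[i].strip()))
    return segments

print(parse_segments("<|0.00|> Hello there.<|2.40|><|3.10|> Yes.<|4.00|>"))
# [(0.0, 2.4, 'Hello there.'), (3.1, 4.0, 'Yes.')]
```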
### Whisper.cpp
See Trelis/Chorus-v1-GGML for ggml quants and a modified `whisper-cli`.
## Limitations
- 2 speakers only. Not trained on 3+ speakers; behaviour is undefined there.
- English only. Not trained on other languages; Whisper's multilingual capability is retained in the encoder, but decoder prompts use `<|en|>`.
- 30-second audio window (Whisper's fixed mel-spectrogram input). Chunk longer audio upstream.
- Conversational/meeting speech is the training target; very noisy recordings or heavy music background are out of distribution.
- `speaker1` is defined as first-to-speak; if both start simultaneously, the assignment is arbitrary.
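The upstream chunking for the 30-second window can be sketched as index arithmetic over the waveform (illustrative only; the overlap value is an assumption, and the hosted endpoint's VAD-based chunking is more sophisticated):

```python
# Slice a long 16 kHz waveform into <= 30 s windows with a small overlap so
# words at chunk boundaries are not cut. Pure index arithmetic for illustration.
SR = 16_000
WINDOW_S = 30.0
OVERLAP_S = 1.0  # illustrative choice, not from the card

def chunk_bounds(n_samples: int, sr: int = SR,
                 window_s: float = WINDOW_S, overlap_s: float = OVERLAP_S):
    """Return (start, end) sample indices covering the whole clip."""
    win = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    bounds = []
    start = 0
    while start < n_samples:
        bounds.append((start, min(start + win, n_samples)))
        if start + win >= n_samples:
            break
        start += step
    return bounds

# e.g. a 70 s clip yields three overlapping <= 30 s windows
print(chunk_bounds(70 * SR))
```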
## Training
Fine-tuned with LoRA (rank 32, α 16, rsLoRA) on ~17k rows: 10k speaker-leak-filtered, synthetically mixed VoxPopuli pairs + 7k real AMI IHM windows reconstructed as 2-speaker audio. Base checkpoint: `openai/whisper-large-v3-turbo`. One epoch on an H100, ~90 min.
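The LoRA hyperparameters above map onto a `peft` configuration roughly like the following (a sketch only; the card does not state the training library or the adapted modules, so `target_modules` here is an assumption):

```python
from peft import LoraConfig

# Sketch: rank, alpha, and rsLoRA come from the card; target_modules does not
# and is an assumption for illustration.
lora_config = LoraConfig(
    r=32,            # LoRA rank (from the card)
    lora_alpha=16,   # alpha (from the card)
    use_rslora=True, # rank-stabilized LoRA scaling (from the card)
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # assumption
)
```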
## License
Apache 2.0.
## Citation
If you use Chorus in research, please cite the AMI corpus (for the eval) and OpenAI Whisper (for the base model):
```bibtex
@inproceedings{carletta2006ami,
  title={The AMI Meeting Corpus: A Pre-announcement},
  author={Carletta, Jean and others},
  booktitle={Machine Learning for Multimodal Interaction},
  year={2006}
}

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}
```