| Field | Response |
|---|---|
| Intended Task/Domain: | Speaker Diarization (Speaker Tagging in Speech Recognition) |
| Model Type: | FastConformer Encoder, Transformer Encoder, and RNNT Decoder |
| Intended Users: | Developers of conversational AI applications that transcribe speech to text for multiple speakers. |
| Output: | Text with speaker tags |
| Describe how the model works: | The model incorporates a novel mechanism, the Arrival-Order Speaker Cache (AOSC). This cache management technique dynamically adjusts each speaker’s cache size, prioritizing the speech frames most valuable to cache. The model is fine-tuned with increased weighting on far-field datasets to perform better for meeting-style speech. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations & Mitigation: | This model can detect up to four speakers; performance degrades in recordings with five or more speakers. The model was trained on publicly available English speech datasets. As a result, it is not suitable for non-English audio. Performance may also degrade on out-of-domain data, such as recordings in noisy conditions. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Concatenated minimum-permutation word error rate (cpWER) and time-constrained minimum-permutation word error rate (tcpWER) |
| Potential Known Risks: | Transcription accuracy may degrade in the presence of background noise. Punctuation and capitalization may also contain errors. |
| Licensing: | GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement. |
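To make the Arrival-Order Speaker Cache description above concrete, here is a minimal illustrative sketch, not the released implementation: class and method names (`ArrivalOrderSpeakerCache`, `add`, the per-frame `score`) are assumptions for illustration. It shows the two properties the model card names: speakers are tracked in arrival order, and a fixed total frame budget is dynamically re-split among speakers, keeping each speaker's most valuable frames.

```python
from collections import OrderedDict


class ArrivalOrderSpeakerCache:
    """Illustrative sketch of an arrival-order speaker cache (hypothetical
    API, not the model's actual implementation): speakers are keyed in the
    order they first appear, and a fixed total frame budget is split
    dynamically among them."""

    def __init__(self, total_budget):
        self.total_budget = total_budget
        # arrival-ordered mapping: speaker id -> list of (score, frame)
        self.speakers = OrderedDict()

    def add(self, spk_id, frame, score):
        # new speakers are appended in arrival order
        if spk_id not in self.speakers:
            self.speakers[spk_id] = []
        self.speakers[spk_id].append((score, frame))
        self._rebalance()

    def _rebalance(self):
        # dynamically adjust each speaker's share of the fixed budget
        share = self.total_budget // max(len(self.speakers), 1)
        for frames in self.speakers.values():
            if len(frames) > share:
                # evict the frames judged least valuable to cache
                frames.sort(key=lambda sf: sf[0], reverse=True)
                del frames[share:]
```

When a new speaker arrives, every existing speaker's share shrinks and its lowest-scoring frames are evicted, which matches the card's claim that the cache "prioritizes the speech frames most valuable to cache."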
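The cpWER metric listed under Performance Metrics can be sketched in a few lines. This is a simplified reference computation under the metric's standard definition (function names are mine): each speaker's words are concatenated, and the reported WER is the minimum over all assignments of hypothesis speakers to reference speakers. tcpWER additionally applies word-level time constraints during matching, which this sketch omits.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(prev + (r != h), d[j] + 1, d[j - 1] + 1)
    return d[-1]


def cpwer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER: concatenate each speaker's
    words, try every mapping of hypothesis streams onto reference streams,
    and report the lowest total word error rate."""
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))  # pad so both sides have n streams
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs) or 1
    best = min(
        sum(edit_distance(r, hyps[j]) for r, j in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best / total_ref_words
```

Because the minimum is taken over speaker permutations, a hypothesis that swaps speaker labels but transcribes every word correctly still scores a cpWER of zero; speaker-attribution errors surface only when words end up in the wrong speaker's stream.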