|
|
--- |
|
|
license: cc-by-4.0 |
|
|
language: |
|
|
- ja |
|
|
pipeline_tag: feature-extraction |
|
|
tags: |
|
|
- streaming |
|
|
- NeMo |
|
|
- PyTorch |
|
|
- Automatic Speech Recognition |
|
|
- FastConformer |
|
|
- CTC |
|
|
- hybrid |
|
|
datasets: |
|
|
- mozilla-foundation/common_voice_23 |
|
|
|
|
|
model-index: |
|
|
- name: Fast_Transducer-CTC_ctc-0.1b-ja |
|
|
results: |
|
|
- task: |
|
|
name: Automatic Speech Recognition |
|
|
type: automatic-speech-recognition |
|
|
dataset: |
|
|
name: JSUT basic5000 |
|
|
type: japanese-asr/ja_asr.jsut_basic5000 |
|
|
split: test |
|
|
args: |
|
|
language: ja |
|
|
metrics: |
|
|
- name: Test CER |
|
|
type: cer |
|
|
value: 10.18 |
|
|
- task: |
|
|
name: Automatic Speech Recognition |
|
|
type: automatic-speech-recognition |
|
|
dataset: |
|
|
name: Mozilla Common Voice 16.1 |
|
|
type: mozilla-foundation/common_voice_16_1 |
|
|
config: ja |
|
|
split: test |
|
|
args: |
|
|
language: ja |
|
|
metrics: |
|
|
- name: Test CER |
|
|
type: cer |
|
|
value: 19.0 |
|
|
|
|
|
--- |
|
|
|
|
|
# Streaming FastConformer-Hybrid Large (Ja) |
|
|
This collection contains large-size versions of cache-aware FastConformer-Hybrid models (around 120M parameters) trained on Japanese speech. These models are trained for streaming ASR with a look-ahead of 1040 ms, which can be used for low-latency streaming applications. The model is a hybrid with both Transducer and CTC decoders.
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
These models are cache-aware versions of Hybrid FastConformer, trained for streaming ASR. You may find more information on cache-aware models here: [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#cache-aware-streaming-conformer).
|
|
The models are trained with multiple look-aheads, which enables them to support different latencies.
|
|
To learn how to switch between different look-aheads, see the documentation on cache-aware models.
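As an illustrative sketch (not taken from this card), loading a hybrid cache-aware checkpoint with NeMo, selecting a decoder, and picking a look-ahead might look like the following. The checkpoint name and the attention context sizes are placeholders, not values published for this model:

```python
# Hypothetical usage sketch, assuming a NeMo cache-aware hybrid checkpoint.
import nemo.collections.asr as nemo_asr

# Placeholder model id: substitute the actual checkpoint for this card.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="<model-id>"
)

# Hybrid models expose both decoders; select the CTC head here.
asr_model.change_decoding_strategy(decoder_type="ctc")

# Models trained with multiple look-aheads let you choose the attention
# context (and therefore the latency) at inference time. The sizes below
# are illustrative, not the ones this model was trained with.
asr_model.encoder.set_default_att_context_size([70, 13])

transcriptions = asr_model.transcribe(["sample.wav"])
```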
|
|
|
|
|
### Datasets |
|
|
|
|
|
The model in this collection is trained on two datasets comprising approximately 20,000 hours of Japanese speech:
|
|
|
|
|
- Mozilla Common Voice Ja (v23.0)
|
|
- AsrSet_Ja |
|
|
|
|
|
## Performance |
|
|
|
|
|
The following table summarizes the performance of this model in terms of Character Error Rate (CER%). |
|
|
|
|
|
In the CER calculation, punctuation marks and non-alphabetic characters are removed, and numbers are converted to words using the `num2words` library.
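As a minimal, self-contained sketch of how such a score can be computed: CER is the character-level edit distance between hypothesis and reference, divided by the reference length. The regex normalization below is a simplified stand-in for the full pipeline (the `num2words` step is omitted):

```python
import re


def normalize(text: str) -> str:
    """Strip punctuation and other non-word characters (simplified
    stand-in for the normalization described above)."""
    return re.sub(r"[^\w]", "", text)


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution
            prev = cur
    return dp[-1]


def cer(ref: str, hyp: str) -> float:
    """CER in percent over normalized strings."""
    ref, hyp = normalize(ref), normalize(hyp)
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)
```

For example, `cer("こんにちは、世界！", "こんにちわ世界")` scores one substitution over seven reference characters after punctuation is stripped.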
|
|
|
|
|
|**Version**|**Decoder**|**JSUT basic5000**|**MCV 8.0 test**|**MCV 16.1 dev**|**MCV 16.1 test**| |
|
|
|:---:|:---:|:---:|:---:|:---:|:---:| |
|
|
| 1.1.0 | CTC | 10.18 | 10.53 | 14.47 | 19.0 | |