|
|
--- |
|
|
license: cc-by-4.0 |
|
|
language: |
|
|
- ja |
|
|
pipeline_tag: feature-extraction |
|
|
tags: |
|
|
- streaming |
|
|
- NeMo |
|
|
- PyTorch |
|
|
- Automatic Speech Recognition |
|
|
- FastConformer |
|
|
- CTC |
|
|
- hybrid |
|
|
datasets: |
|
|
- mozilla-foundation/common_voice_23 |
|
|
|
|
|
model-index: |
|
|
- name: Fast_Transducer-CTC_ctc-0.1b-ja |
|
|
results: |
|
|
- task: |
|
|
name: Automatic Speech Recognition |
|
|
type: automatic-speech-recognition |
|
|
dataset: |
|
|
name: JSUT basic5000 |
|
|
type: japanese-asr/ja_asr.jsut_basic5000 |
|
|
split: test |
|
|
args: |
|
|
language: ja |
|
|
metrics: |
|
|
- name: Test CER |
|
|
type: cer |
|
|
value: 10.18 |
|
|
- task: |
|
|
name: Automatic Speech Recognition |
|
|
type: automatic-speech-recognition |
|
|
dataset: |
|
|
name: Mozilla Common Voice 16.1 |
|
|
type: mozilla-foundation/common_voice_16_1 |
|
|
config: ja |
|
|
split: test |
|
|
args: |
|
|
language: ja |
|
|
metrics: |
|
|
- name: Test CER |
|
|
type: cer |
|
|
value: 19.0 |
|
|
|
|
|
--- |
|
|
|
|
|
# Streaming FastConformer-Hybrid Large (Ja) |
|
|
This collection contains large-size versions of cache-aware FastConformer-Hybrid models (around 120M parameters) trained on Japanese speech. These models are trained for streaming ASR with a look-ahead of 1040 ms, which can be used for low-latency streaming applications. The model is a hybrid with both Transducer and CTC decoders.
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
These models are cache-aware versions of Hybrid FastConformer, trained for streaming ASR. You may find more information on cache-aware models here: [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#cache-aware-streaming-conformer).
|
|
The models are trained with multiple look-aheads, which enables them to support different latencies.
|
|
To learn how to switch between different look-aheads, see the documentation on cache-aware models.
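As an illustrative sketch (not taken from this card), loading a hybrid cache-aware checkpoint with NeMo, selecting a decoder, and picking a look-ahead might look like the following. The checkpoint name and the attention context sizes are placeholders, not values published for this model:

```python
# Hypothetical usage sketch, assuming a NeMo cache-aware hybrid checkpoint.
import nemo.collections.asr as nemo_asr

# Placeholder model id: substitute the actual checkpoint for this card.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="<model-id>"
)

# Hybrid models expose both decoders; select the CTC head here.
asr_model.change_decoding_strategy(decoder_type="ctc")

# Models trained with multiple look-aheads let you choose the attention
# context (and therefore the latency) at inference time. The sizes below
# are illustrative, not the ones this model was trained with.
asr_model.encoder.set_default_att_context_size([70, 13])

transcriptions = asr_model.transcribe(["sample.wav"])
```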
|
|
|
|
|
### Datasets |
|
|
|
|
|
The model in this collection is trained on two datasets comprising approximately 20,000 hours of Japanese speech:
|
|
|
|
|
- Mozilla Common Voice Ja (v23.0)
|
|
- AsrSet_Ja |
|
|
|
|
|
## Performance |
|
|
|
|
|
The following table summarizes the performance of this model in terms of Character Error Rate (CER%). |
|
|
|
|
|
In the CER calculation, punctuation marks and non-alphabetic characters are removed, and numbers are converted to words using the `num2words` library.
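As a minimal, self-contained sketch of how such a score can be computed: CER is the character-level edit distance between hypothesis and reference, divided by the reference length. The regex normalization below is a simplified stand-in for the full pipeline (the `num2words` step is omitted):

```python
import re


def normalize(text: str) -> str:
    """Strip punctuation and other non-word characters (simplified
    stand-in for the normalization described above)."""
    return re.sub(r"[^\w]", "", text)


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution
            prev = cur
    return dp[-1]


def cer(ref: str, hyp: str) -> float:
    """CER in percent over normalized strings."""
    ref, hyp = normalize(ref), normalize(hyp)
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)
```

For example, `cer("こんにちは、世界！", "こんにちわ世界")` scores one substitution over seven reference characters after punctuation is stripped.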
|
|
|
|
|
|**Version**|**Decoder**|**JSUT basic5000**|**MCV 8.0 test**|**MCV 16.1 dev**|**MCV 16.1 test**| |
|
|
|:---:|:---:|:---:|:---:|:---:|:---:| |
|
|
| 1.1.0 | CTC | 10.18 | 10.53 | 14.47 | 19.0 | |