---
tags:
- text2text-generation
- Chinese
- seq2seq
- grammar
language: zh
license: apache-2.0
---
# Real-Learner-BART-CGEC

This is a Chinese grammatical error correction (CGEC) model based on [Chinese BART-large](https://huggingface.co/fnlp/bart-large-chinese).
It is trained on learner CGEC data from HSK and Lang8 (about 1.3M).
More details can be found in our [GitHub repository](https://github.com/HillZhang1999/NaSGEC) and the [paper](https://arxiv.org/pdf/2305.16023.pdf).

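## Usage

Install the dependencies with `pip install transformers torch` (PyTorch is required because the example requests `pt` tensors), then run: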

```python
from transformers import BertTokenizer, BartForConditionalGeneration

# The checkpoint pairs a BERT-style tokenizer with a BART seq2seq model.
tokenizer = BertTokenizer.from_pretrained("HillZhang/real_learner_bart_CGEC")
model = BartForConditionalGeneration.from_pretrained("HillZhang/real_learner_bart_CGEC")

# Example inputs, each containing an error to correct.
sentences = ["北京是中国的都。", "他说：”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天，我非常开开心。"]
encoded_input = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

# BertTokenizer adds token_type_ids, which BART's generate() does not accept.
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]

output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
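
Strings decoded with `BertTokenizer` usually come back with a space between tokens, so a small post-processing step is often useful before showing corrections to users. The following is a minimal sketch, not part of the released code: the helper names `strip_token_spaces` and `list_edits` are hypothetical, the `decoded` string is a hand-written stand-in rather than real model output, and the space removal assumes the text contains no meaningful spaces (for example, embedded English phrases).

```python
import difflib


def strip_token_spaces(decoded: str) -> str:
    """Remove all whitespace between decoded tokens (assumes Chinese-only text)."""
    return "".join(decoded.split())


def list_edits(source: str, corrected: str):
    """Return (operation, source_span, corrected_span) for each changed span."""
    matcher = difflib.SequenceMatcher(a=source, b=corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hand-written stand-in for one decoded string; not produced by running the model.
decoded = "北 京 是 中 国 的 首 都 。"
corrected = strip_token_spaces(decoded)
print(corrected)
print(list_edits("北京是中国的都。", corrected))
```

Comparing at the character level keeps the sketch dependency-free; a word-level or span-level alignment may be preferable when reporting edits for evaluation.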

## Citation

```
@inproceedings{zhang-etal-2023-nasgec,
    title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
    author = "Zhang, Yue and
      Zhang, Bo and
      Jiang, Haochen and
      Li, Zhenghua and
      Li, Chen and
      Huang, Fei and
      Zhang, Min",
    booktitle = "Findings of ACL",
    year = "2023"
}
```