---
tags:
- text2text-generation
- Chinese
- seq2seq
- grammar
language: zh
license: apache-2.0
---
# Pseudo-Native-BART-CGEC

This is a Chinese grammatical error correction (CGEC) model based on [Chinese BART-large](https://huggingface.co/fnlp/bart-large-chinese).
It is trained on HSK and Lang8 learner CGEC data (about 1.3M).
More details can be found in our [GitHub repository](https://github.com/HillZhang1999/NaSGEC) and the [paper](https://arxiv.org/pdf/2305.16023.pdf).

## Usage

pip install transformers

```
from transformers import BertTokenizer, BartForConditionalGeneration

# Load the tokenizer and the model from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("HillZhang/real_learner_bart_CGEC")
model = BartForConditionalGeneration.from_pretrained("HillZhang/real_learner_bart_CGEC")

# Tokenize a batch of sentences that contain grammatical errors
encoded_input = tokenizer(["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"], return_tensors="pt", padding=True, truncation=True)
# The BART model does not accept token_type_ids, so drop them if present
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]
# Generate the corrected sentences and decode them back to text
output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
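
Decoding with `BertTokenizer` commonly returns Chinese text with a space between every character (for example `北 京 是 中 国 的 首 都 。`). If that happens with this model, a small post-processing step can restore normal text. The helper below is a minimal sketch, not part of the model or the NaSGEC repository: the name `join_cjk_tokens` is hypothetical, and it assumes that any whitespace next to a non-ASCII character is a tokenization artifact, while whitespace between two ASCII characters (such as English words) should be kept.

```python
import re

# Whitespace that has a non-ASCII character on its left or on its right.
# Assumption: such whitespace was introduced by token-level decoding.
_CJK_GAP = re.compile(r"(?<=[^\x00-\x7f])\s+|\s+(?=[^\x00-\x7f])")


def join_cjk_tokens(decoded):
    """Remove decoding spaces around non-ASCII characters in each string.

    `decoded` is a list of strings, e.g. the result of
    `tokenizer.batch_decode(output, skip_special_tokens=True)`.
    """
    return [_CJK_GAP.sub("", text).strip() for text in decoded]
```

With the snippet above, the last line could become `print(join_cjk_tokens(tokenizer.batch_decode(output, skip_special_tokens=True)))`. Spaces between ASCII tokens are left untouched so that embedded English words are not glued together.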

## Citation

```
@inproceedings{zhang-etal-2023-nasgec,
    title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
    author = "Zhang, Yue and
      Zhang, Bo and
      Jiang, Haochen and
      Li, Zhenghua and
      Li, Chen and
      Huang, Fei and
      Zhang, Min",
    booktitle = "Findings of ACL",
    year = "2023"
}
```