PhoBERT: Pre-trained language models for Vietnamese
Paper: [arXiv:2003.00744](https://arxiv.org/abs/2003.00744)
Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ("Pho", i.e. "Phở", is a popular food in Vietnam).
The general architecture and experimental results of PhoBERT can be found in our EMNLP-2020 Findings paper:
```bibtex
@article{phobert,
  title   = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author  = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  journal = {Findings of EMNLP},
  year    = {2020}
}
```
Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.
For further information or requests, please go to PhoBERT's homepage!
To use PhoBERT with `transformers`, install the library (for example, from source):

```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install --upgrade .
```

| Model | #params | Arch. | Pre-training data |
|---|---|---|---|
| `vinai/phobert-base` | 135M | base | 20GB of texts |
| `vinai/phobert-large` | 370M | large | 20GB of texts |
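As a quick sanity check on the table above, the parameter count of a checkpoint can be computed directly once it is downloaded; a minimal sketch:

```python
from transformers import AutoModel

# Load the base checkpoint and count its parameters.
# The total should come out near the 135M listed in the table above.
model = AutoModel.from_pretrained("vinai/phobert-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"vinai/phobert-base: {n_params / 1e6:.0f}M parameters")
```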
```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are now tuples

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
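As the comment above notes, the input must already be word-segmented before tokenization. A minimal sketch of one way to do this with VnCoreNLP's RDRSegmenter via the `py_vncorenlp` package; the package name, download path, and API calls shown here are assumptions, so check PhoBERT's homepage for the recommended setup:

```python
# pip install py_vncorenlp  (assumed package; see PhoBERT's homepage)
import py_vncorenlp

# Download the VnCoreNLP model files to a local directory (hypothetical path).
py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")

# Load only the word-segmentation annotator.
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"],
                                      save_dir="/absolute/path/to/vncorenlp")

# Segment raw text: multi-syllable words are joined with underscores,
# matching the word-segmented input format PhoBERT expects.
sentences = rdrsegmenter.word_segment("Tôi là sinh viên trường đại học Công nghệ.")
print(sentences)  # e.g. ['Tôi là sinh_viên trường đại_học Công_nghệ .']
```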