	---
	language: es
	library_name: transformers
	license: apache-2.0
	tags:
	- roberta
	- spanish
	- scientific
	- fill-mask
	---

	# Sci-BETO-base

	Sci-BETO is a domain-specific RoBERTa encoder pretrained entirely on Spanish scientific texts.

	---

	## Model Description

	Sci-BETO-base is a transformer-based encoder following the RoBERTa architecture (125M parameters).
	It was pretrained from scratch using byte-level BPE tokenization on a large corpus of Spanish open-access scientific publications, including theses, dissertations, and peer-reviewed papers from Colombian universities and international repositories.

	The model was designed to capture scientific discourse, terminology, and abstract reasoning patterns typical of research documents in economics, engineering, medicine, and the social sciences.

	| Property | Value |
	|-----------|--------|
	| Architecture | RoBERTa-base |
	| Parameters | ~125M |
	| Vocabulary size | 50,262 |
	| Tokenizer | Byte-Level BPE (trained from scratch) |
	| Pretraining objective | Masked Language Modeling (MLM) |
	| Pretraining steps | 85K |
	| Max sequence length | 512 tokens |
	| Framework | Transformers |
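
The parameter count in the table can be sanity-checked with a short back-of-the-envelope calculation. This is a minimal sketch that assumes the standard RoBERTa-base dimensions (hidden size 768, 12 layers, feed-forward size 3,072, 514 position embeddings), none of which are stated in this card; only the vocabulary size of 50,262 comes from the table. The function name `roberta_base_param_count` is hypothetical.

```python
def roberta_base_param_count(vocab_size=50_262, hidden=768, layers=12,
                             ffn=3_072, max_positions=514):
    """Estimate the parameters of a RoBERTa-base encoder with an MLM head.

    All defaults except vocab_size are assumed standard RoBERTa-base values.
    """
    # Embeddings: token + position + one token-type row + LayerNorm (weight, bias).
    embeddings = vocab_size * hidden + max_positions * hidden + hidden + 2 * hidden
    # Self-attention: query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Each layer also has two LayerNorms (weight and bias each).
    layer = attention + feed_forward + 2 * (2 * hidden)
    # MLM head: dense + LayerNorm + output bias (decoder weights tied to embeddings).
    mlm_head = (hidden * hidden + hidden) + 2 * hidden + vocab_size
    return embeddings + layers * layer + mlm_head


total = roberta_base_param_count()
print(f"Estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to the ~125M reported above, with roughly a third of the parameters in the token embedding matrix.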

	---

	## Pretraining Data

	The pretraining corpus includes over 11 billion tokens from Spanish academic and scientific sources:

	- Open-access repositories of Colombian universities (Universidad de los Andes, Universidad Nacional, Universidad Javeriana, Universidad del Rosario).
	- CORE API and institutional repositories (theses, dissertations, working papers).
	- Tax statutes in Colombia.

	The final dataset covers multiple disciplines (economics, medicine, engineering, humanities), ensuring representation across scientific domains.

	| Source | # Documents | # Words (deduplicated) | Percentage (%) |
	|--------------------------------|----------------:|----------------------------:|-------------------:|
	| Universidad de los Andes | 33,858 | 365,752,780 | 3.23 |
	| Universidad Nacional | 44,686 | 537,022,975 | 4.75 |
	| CORE API | 2,181,689 | 9,624,189,002 | 85.10 |
	| Universidad del Rosario | 22,404 | 183,356,109 | 1.62 |
	| Universidad Javeriana | 25,624 | 323,918,445 | 2.86 |
	| Tax Statutes in Colombia | 392 | 13,924,060 | 0.12 |
	| Extra | 2 | 261,131,453 | 2.31 |
	| Total | 2,308,655 | 11,309,295,824 | 100.00 |
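
The percentage column can be reproduced from the word counts. The sketch below copies the per-source counts from the table and recomputes each share; the variable names are illustrative only.

```python
# Deduplicated word counts per source, copied from the table above.
word_counts = {
    "Universidad de los Andes": 365_752_780,
    "Universidad Nacional": 537_022_975,
    "CORE API": 9_624_189_002,
    "Universidad del Rosario": 183_356_109,
    "Universidad Javeriana": 323_918_445,
    "Tax Statutes in Colombia": 13_924_060,
    "Extra": 261_131_453,
}

# Total computed from the rows rather than taken from the table.
total_words = sum(word_counts.values())

# Share of each source, rounded to two decimals as in the table.
shares = {
    source: round(100 * count / total_words, 2)
    for source, count in word_counts.items()
}

for source, share in shares.items():
    print(f"{source:26s} {share:6.2f} %")
```

The recomputed shares match the table. Summing the rows gives 11,309,294,824 words, which is 1,000 fewer than the listed total; the difference is too small to affect the rounded percentages.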

	---

	## Benchmarks

	Sci-BETO was fine-tuned and benchmarked across multiple downstream tasks, both general-domain and scientific:

	| Dataset | Metric | Sci-BETO Large | Sci-BETO Base | BETO | BERTIN |
	|---------------------|----------------|-------------------:|------------------:|----------:|------------:|
	| WikiCAT | F1 (macro) | 0.7738 | 0.7583 | 0.7624 | 0.7598 |
	| PAWS-X (es) | F1 (macro) | 0.9148 | 0.8794 | 0.8985 | 0.8961 |
	| PharmaCoNER | F1 (micro) | 0.8959 | 0.8733 | 0.8845 | 0.8802 |
	| CANTEMIST | F1 (micro) | 0.8809 | 0.8784 | 0.8954 | 0.8956 |
	| NLI (ESNLI-R) | F1 (micro) | n/a | n/a | n/a | n/a |
	| BanRep (JEL) | Exact Match | 0.6116 | 0.6043 | 0.5933 | 0.5807 |
	| Rosario | F1 (macro) | 0.9203 | 0.9194 | 0.9079 | 0.9121 |
	| Econ-IE | F1 (micro) | 0.5256 | 0.5158 | 0.5199 | 0.4992 |


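	Sci-BETO Large obtains the best score on six of the seven datasets with reported results, trailing BETO and BERTIN only on CANTEMIST. Sci-BETO Base outperforms both general-domain baselines on BanRep (JEL) and Rosario, falls between BERTIN and BETO on Econ-IE, and trails both on WikiCAT, PAWS-X, PharmaCoNER, and CANTEMIST. No results are reported for NLI (ESNLI-R).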
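
To make the comparison in the benchmark table easier to audit, the following sketch tallies which model scores highest on each dataset. The scores are copied from the table; the NLI row is left out because it reports no results, and the variable names are illustrative.

```python
# Scores copied from the benchmark table (NLI omitted: no results reported).
scores = {
    "WikiCAT": {"Sci-BETO Large": 0.7738, "Sci-BETO Base": 0.7583, "BETO": 0.7624, "BERTIN": 0.7598},
    "PAWS-X (es)": {"Sci-BETO Large": 0.9148, "Sci-BETO Base": 0.8794, "BETO": 0.8985, "BERTIN": 0.8961},
    "PharmaCoNER": {"Sci-BETO Large": 0.8959, "Sci-BETO Base": 0.8733, "BETO": 0.8845, "BERTIN": 0.8802},
    "CANTEMIST": {"Sci-BETO Large": 0.8809, "Sci-BETO Base": 0.8784, "BETO": 0.8954, "BERTIN": 0.8956},
    "BanRep (JEL)": {"Sci-BETO Large": 0.6116, "Sci-BETO Base": 0.6043, "BETO": 0.5933, "BERTIN": 0.5807},
    "Rosario": {"Sci-BETO Large": 0.9203, "Sci-BETO Base": 0.9194, "BETO": 0.9079, "BERTIN": 0.9121},
    "Econ-IE": {"Sci-BETO Large": 0.5256, "Sci-BETO Base": 0.5158, "BETO": 0.5199, "BERTIN": 0.4992},
}

# Best-scoring model per dataset.
best = {dataset: max(row, key=row.get) for dataset, row in scores.items()}

# Number of datasets on which each model ranks first.
wins = {}
for model in best.values():
    wins[model] = wins.get(model, 0) + 1

for dataset, model in best.items():
    print(f"{dataset:14s} {model}")
print(wins)
```

The metrics differ across datasets (macro F1, micro F1, exact match), so the tally compares models within each row only and does not average across rows.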

	---

	## Intended Use

	- Research and experimentation in Spanish scientific NLP.
	- Downstream fine-tuning for:
	  - Text classification (scientific or academic domains)
	  - Named Entity Recognition (NER)
	  - Semantic similarity and paraphrase detection
	  - Knowledge extraction from academic documents
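
As an illustration of the text classification use case, the sketch below attaches a classification head to the pretrained encoder. The label names and the helpers `build_label_maps` and `load_classifier` are hypothetical and not part of this repository; `num_labels`, `id2label`, and `label2id` are standard configuration arguments accepted by `from_pretrained` in Transformers. The `transformers` import is placed inside the function so the label helper can run without downloading the model.

```python
def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = {i: name for i, name in enumerate(labels)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_classifier(labels, checkpoint="Flaglab/Sci-BETO-base"):
    """Load the encoder with a freshly initialized classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical label set, for illustration only.
example_labels = ["economics", "engineering", "medicine", "social sciences"]
id2label, label2id = build_label_maps(example_labels)
```

Calling `load_classifier(example_labels)` returns a tokenizer and a model whose classification head is randomly initialized, so the model needs fine-tuning on labeled data before its predictions are meaningful.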

	---

	## Limitations

	- The model may underperform on highly informal or non-academic Spanish (e.g., social media).
	- It is not designed for generative tasks (e.g., text completion, chat).
	- The model is biased toward academic register and Latin American Spanish variants.
	- The pretraining corpus excludes English and bilingual data.

	---

	## Example Usage

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("Flaglab/Sci-BETO-base")
	model = AutoModelForMaskedLM.from_pretrained("Flaglab/Sci-BETO-base")

	# Use the tokenizer's own mask token; a literal "[mask]" is not recognized as a mask.
	text = f"El Banco de la República va a subir las {tokenizer.mask_token} de interés."
	inputs = tokenizer(text, return_tensors="pt")
	outputs = model(**inputs)
	logits = outputs.logits

	# Locate the masked position and decode the highest-scoring token.
	masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
	predicted_token = tokenizer.decode(logits[0, masked_index].argmax(dim=-1))
	print("Predicted token:", predicted_token)