RoBERTa MLM Pretrained Model

This model was pretrained using the Masked Language Modeling (MLM) objective on multilingual text data.

Model Description

This is a RoBERTa-style transformer model pretrained from scratch with the Masked Language Modeling objective. By learning to predict masked tokens in its input sequences, the model builds general-purpose representations of language patterns and semantics.

Model Architecture:

  • Hidden Layers: 6
  • Hidden Dimensions: 512
  • Attention Heads: 8
  • Maximum Sequence Length: 640
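
For reference, these dimensions map onto a Hugging Face RobertaConfig roughly as sketched below. The vocabulary size and intermediate (feed-forward) dimension are not stated in this card, so those values are placeholders.

from transformers import RobertaConfig

config = RobertaConfig(
    num_hidden_layers=6,
    hidden_size=512,
    num_attention_heads=8,
    # Hugging Face's RoBERTa implementation typically reserves two extra
    # position slots (padding offset), so this is often max length + 2.
    max_position_embeddings=642,
    vocab_size=52_000,        # placeholder; actual tokenizer size not stated
    intermediate_size=2048,   # placeholder; commonly 4 x hidden_size
)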

Training Details

Training Data

  • Dataset: dstilesr/glotlid-balanced-train
  • Version: 2025.09.101615
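
The dataset is hosted on the Hugging Face Hub and can be loaded with the datasets library, as in the sketch below. How the version string above maps to a Hub revision or configuration is not documented here, so the snippet simply loads the default configuration.

from datasets import load_dataset

# Load the pretraining corpus (default configuration and splits)
dataset = load_dataset("dstilesr/glotlid-balanced-train")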

Training Hyperparameters

  • Epochs: 2
  • Batch Size: 112
  • Learning Rate: 0.00014
  • Optimizer: adamw_torch_fused
  • MLM Probability: 0.2
  • Weight Decay: 0.0
  • Warmup Steps: 128
  • Gradient Accumulation Steps: 2
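
For reference, these settings map onto the Hugging Face Trainer roughly as sketched below. This is not the exact training script: the output directory is a placeholder, and whether the batch size is per device or global is not stated.

from transformers import (
    AutoTokenizer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("dstilesr/glotlid-pretrained-roberta")

# Randomly masks 20% of input tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,
)

training_args = TrainingArguments(
    output_dir="roberta-mlm-pretraining",   # placeholder
    num_train_epochs=2,
    per_device_train_batch_size=112,        # listed batch size
    gradient_accumulation_steps=2,
    learning_rate=1.4e-4,
    weight_decay=0.0,
    warmup_steps=128,
    optim="adamw_torch_fused",
)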

Framework

  • Library: Transformers (Hugging Face)
  • Training Framework: PyTorch

Usage

Masked Language Modeling

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("dstilesr/glotlid-pretrained-roberta")
model = AutoModelForMaskedLM.from_pretrained("dstilesr/glotlid-pretrained-roberta")

# Example: fill in the masked token
text = "The capital of France is <mask>."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the most likely token at the <mask> position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))

Fine-tuning for Sequence Classification

This pretrained model can be fine-tuned for downstream tasks like sequence classification:

from transformers import AutoModelForSequenceClassification

num_classes = 2  # set to the number of labels in your task

# A new classification head is initialized on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "dstilesr/glotlid-pretrained-roberta",
    num_labels=num_classes,
    ignore_mismatched_sizes=True,
)
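
A minimal fine-tuning sketch with the Trainer API follows, reusing the tokenizer loaded earlier and assuming a labeled datasets.Dataset named labeled_dataset with "text" and "label" columns (hypothetical names); the hyperparameters shown are illustrative.

from transformers import Trainer, TrainingArguments, DataCollatorWithPadding

def tokenize(batch):
    # Truncate to the model's maximum sequence length
    return tokenizer(batch["text"], truncation=True, max_length=640)

tokenized = labeled_dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-seq-cls", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()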

Intended Use

This model is intended as a pretrained base for a variety of NLP tasks:

  • Fine-tuning for text classification
  • Fine-tuning for sequence labeling
  • Feature extraction for downstream tasks (see the sketch after this list)
  • Transfer learning for low-resource languages
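
For feature extraction, the encoder can be loaded without the MLM head and its final hidden states used as token or sentence features. A minimal sketch using mean pooling:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dstilesr/glotlid-pretrained-roberta")
encoder = AutoModel.from_pretrained("dstilesr/glotlid-pretrained-roberta")

inputs = tokenizer("Some text to embed.", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 512)

# Mean-pool over tokens to get a single sentence embedding
embedding = hidden.mean(dim=1)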

Limitations

  • Maximum input length is 640 tokens
  • Performance depends on similarity between pretraining and downstream data
  • May require task-specific fine-tuning for optimal performance