Model Card for ele-sage/distilbert-base-uncased-name-splitter
This is a token classification model fine-tuned from distilbert-base-uncased to parse and split full human names into their first name and last name components.
The model is trained to recognize multi-word first names (e.g., "Mary Anne") and multi-word last names (e.g., "van der Sar"). It is also robust to the order of names, having been trained on three formats:
- FirstName LastName
- LastName, FirstName
- LastName FirstName
It identifies tokens with the following labels: B-FNAME (Beginning of a First Name), I-FNAME (Inside a First Name), B-LNAME (Beginning of a Last Name), I-LNAME (Inside a Last Name), and O (Outside, for tokens like commas).
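For illustration, the swapped-order example "Wolf, Michel Konstantin" would receive token-level labels roughly as follows. This is a sketch only; the uncased WordPiece tokenizer may split words into sub-tokens, in which case continuation pieces carry I- labels.

```python
# Illustrative token/label alignment for "Wolf, Michel Konstantin"
# (sketch only; actual sub-word tokenization may differ).
example = [
    ("wolf",       "B-LNAME"),  # beginning of the last name
    (",",          "O"),        # separator, outside any name
    ("michel",     "B-FNAME"),  # beginning of the multi-word first name
    ("konstantin", "I-FNAME"),  # continuation of the first name
]
```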
Model Details
Model Description
- Developed by: ele-sage
- Model type: distilbert-for-token-classification
- Language(s) (NLP): English, French
- License: MIT
- Finetuned from model: distilbert/distilbert-base-uncased
Uses
Direct Use
This model is intended to be used for Named Entity Recognition (NER), specifically for parsing full name strings into structured FirstName and LastName components. It can be used directly with the ner (or token-classification) pipeline in the 🤗 Transformers library.
```python
from transformers import pipeline

# Use the 'ner' pipeline with an aggregation strategy to automatically group B-I tags
name_parser = pipeline("ner", model="ele-sage/distilbert-base-uncased-name-splitter", aggregation_strategy="simple")

# Example 1: Standard Order
name_parser("Alonso Sarmiento Martinez")
# Expected Output:
# [{'entity_group': 'FNAME', 'score': ..., 'word': 'Alonso'},
#  {'entity_group': 'LNAME', 'score': ..., 'word': 'Sarmiento Martinez'}]

# Example 2: Swapped Order with Comma
name_parser("Wolf, Michel Konstantin")
# Expected Output:
# [{'entity_group': 'LNAME', 'score': ..., 'word': 'Wolf'},
#  {'entity_group': 'FNAME', 'score': ..., 'word': 'Michel Konstantin'}]
```
Out-of-Scope Use
This model is highly specialized. It should not be used for:
- General NER: It is not trained to find locations, organizations, dates, etc.
- Parsing Non-Name Text: The model will produce nonsensical and unpredictable results if given text that is not a person's name.
- Classifying Companies: This model is not a classifier; it assumes its input is a person's name. For identifying companies vs. persons, the ele-sage/distilbert-base-uncased-name-classifier model should be used first (see the sketch below).
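If inputs may contain company names, one option is to gate this splitter behind the classifier mentioned above. The sketch below assumes the classifier works with the standard text-classification pipeline and returns a "Person"/"Company" label; the exact label strings are an assumption, so check that model's card.

```python
from transformers import pipeline

# Companion classifier (assumed to return "Person" / "Company" labels; verify on its card).
is_person = pipeline(
    "text-classification",
    model="ele-sage/distilbert-base-uncased-name-classifier",
)

# The name splitter described in this card.
name_parser = pipeline(
    "ner",
    model="ele-sage/distilbert-base-uncased-name-splitter",
    aggregation_strategy="simple",
)

def parse_if_person(text: str):
    """Split a string into name parts only if the classifier thinks it is a person."""
    verdict = is_person(text)[0]
    if verdict["label"] == "Person":  # assumed label string
        return name_parser(text)
    return None  # likely a company or other non-name input

print(parse_if_person("Saskatchewan Intl Freight"))  # expected: None
print(parse_if_person("Michel Konstantin Wolf"))     # expected: FNAME/LNAME spans
```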
Bias, Risks, and Limitations
- Data Source Bias: The primary training data is derived from a dataset of Canadian names, which itself is sourced from a Facebook data leak. It is therefore heavily biased towards North American and European name structures.
- Ambiguity: For culturally ambiguous names without a comma (e.g., "Kelly Glenn"), the model makes a statistical guess based on the frequencies learned during training. This guess can be incorrect.
- Noise: The original dataset contains a significant amount of noise and non-name entries. While a multi-stage cleaning process was applied, some noise may persist, and the model's behavior on such inputs is not guaranteed.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import pipeline

# Load the pipeline.
# aggregation_strategy="simple" is recommended to automatically group multi-word names.
name_parser = pipeline(
    "ner",
    model="ele-sage/distilbert-base-uncased-name-splitter",
    aggregation_strategy="simple"
)

# --- Examples ---
names = [
    "Michel Konstantin Wolf",
    "Bunting Roon, Doralie",
    "Sarmiento Martinez Alonso"  # Ambiguous case without comma
]

for name in names:
    parsed_name = name_parser(name)
    print(f"Input: '{name}'")
    print(f"  -> Parsed: {parsed_name}\n")
```
Training Details
Training Data
The model was trained on a custom-curated dataset derived from two primary sources:
- Person Names Source: A large CSV file of over 3.4 million Canadian names, originally sourced from a Facebook data leak.
- Company Names Source: The public data file from the Quebec Enterprise Register, used to train the classifier that cleaned the person name data.
Training Procedure
The dataset was curated and processed through a multi-stage pipeline before training:
Preprocessing
1. AI-Powered Cleaning: The raw person name CSV (3.4M rows) was first processed by the ele-sage/distilbert-base-uncased-name-classifier model. Any entry that the classifier identified as a "Company" or had low confidence of being a "Person" was discarded. This step removed a significant amount of non-name noise (e.g., "The Bahais of Prince George", "Saskatchewan Intl Freight").
2. Rule-Based Filtering: The AI-cleaned data was further filtered to:
   - Remove entries where the first and last names were identical.
   - Remove entries containing characters outside of a pre-defined set of English, French, and common punctuation characters.
3. Data Augmentation: The final, cleaned dataset was augmented to create three different name formats to improve robustness (see the sketch after this section):
   - 50% of the data was kept in the standard FirstName LastName order.
   - 25% was converted to the LastName, FirstName format, with the comma token tagged as O.
   - 25% was converted to the ambiguous LastName FirstName format (without a comma).
The final dataset was then shuffled to ensure all formats were present in each training batch.
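As a rough sketch of the augmentation and shuffling described above (illustrative only, not the author's actual preprocessing script; `records` is a hypothetical list of cleaned (first_name, last_name) pairs):

```python
import random

def augment(records, seed=0):
    """Build the three training formats from cleaned (first_name, last_name) pairs."""
    rng = random.Random(seed)
    examples = []
    for first, last in records:
        r = rng.random()
        if r < 0.50:                   # 50%: standard "FirstName LastName"
            examples.append(f"{first} {last}")
        elif r < 0.75:                 # 25%: "LastName, FirstName" (comma tagged O)
            examples.append(f"{last}, {first}")
        else:                          # 25%: ambiguous "LastName FirstName"
            examples.append(f"{last} {first}")
    rng.shuffle(examples)              # shuffle so every batch mixes all formats
    return examples
```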
Training Hyperparameters
- Framework: Transformers Trainer
- Training regime: bf16 mixed precision
- Epochs: 8 (configured maximum)
- Batch Size: 1024 (per_device_train_batch_size)
- Optimizer: AdamW
- Learning Rate: 2e-5
- LR Scheduler: Linear decay
- Warmup Steps: 250
- Weight Decay: 0.01
- Evaluation Strategy: Every 1000 steps
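For reference, these hyperparameters map onto a Transformers TrainingArguments configuration roughly as follows. This is a sketch: output_dir is a placeholder, and the evaluation argument is named eval_strategy in recent Transformers releases (evaluation_strategy in older ones).

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above (placeholder output_dir).
training_args = TrainingArguments(
    output_dir="name-splitter",       # placeholder, not the author's actual path
    num_train_epochs=8,               # configured maximum
    per_device_train_batch_size=1024,
    optim="adamw_torch",              # AdamW optimizer
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=250,
    weight_decay=0.01,
    bf16=True,                        # bf16 mixed precision
    eval_strategy="steps",            # "evaluation_strategy" in older versions
    eval_steps=1000,
)
```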
Evaluation
Metrics
The model's performance is evaluated using standard token classification metrics, with F1-Score being the primary metric. F1 is chosen over Accuracy because it provides a balanced measure of Precision and Recall, which is crucial for imbalanced NER tasks where the "haystack" of non-entity tokens is much larger than the "needles" (the name tokens).
- Precision: Of all the tokens the model labeled as a name part, how many were correct?
- Recall: Of all the actual name parts in the data, how many did the model find?
- F1-Score: The harmonic mean of Precision and Recall.
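As an illustration of how such metrics are typically computed for token classification (the card does not include the exact evaluation script), a library such as seqeval compares predicted and reference tag sequences:

```python
# Illustrative only: entity-level precision/recall/F1 with seqeval on a toy example.
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-LNAME", "O", "B-FNAME", "I-FNAME"]]  # reference tags for "Wolf , Michel Konstantin"
y_pred = [["B-LNAME", "O", "B-FNAME", "I-FNAME"]]  # model predictions

print(precision_score(y_true, y_pred))  # 1.0 on this toy example
print(recall_score(y_true, y_pred))     # 1.0
print(f1_score(y_true, y_pred))         # 1.0
```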
Results
The final model was selected based on the highest F1-score achieved on the validation set during training.
| Metric | Value |
|---|---|
| eval_f1 | 0.957 |
| eval_precision | 0.957 |
| eval_recall | 0.957 |
| eval_accuracy | 0.962 |
| eval_loss | 0.106 |
This result was achieved at Step 20,000 of the training run, after which performance on the validation set began to plateau.