Model Card for ele-sage/distilbert-base-uncased-name-splitter

This is a token classification model fine-tuned from distilbert-base-uncased that splits full human names into their first-name and last-name components.

The model is trained to recognize multi-word first names (e.g., "Mary Anne") and multi-word last names (e.g., "van der Sar"). It is also robust to the order of names, having been trained on three formats:

  • FirstName LastName
  • LastName, FirstName
  • LastName FirstName

It identifies tokens with the following labels: B-FNAME (Beginning of a First Name), I-FNAME (Inside a First Name), B-LNAME (Beginning of a Last Name), I-LNAME (Inside a Last Name), and O (Outside, for tokens like commas).
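
For example, using the card's own swapped-order example below, the input "Wolf, Michel Konstantin" would be tagged token by token as:

Wolf        → B-LNAME
,           → O
Michel      → B-FNAME
Konstantin  → I-FNAME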

Model Details

Model Description

  • Developed by: ele-sage
  • Model type: distilbert-for-token-classification
  • Language(s) (NLP): English, French
  • License: MIT
  • Finetuned from model: distilbert/distilbert-base-uncased

Uses

Direct Use

This model is intended to be used for Named Entity Recognition (NER), specifically for parsing full name strings into structured FirstName and LastName components. It can be used directly with the ner (or token-classification) pipeline in the 🤗 Transformers library.

from transformers import pipeline

# Use the 'ner' pipeline with an aggregation strategy to automatically group B-I tags
name_parser = pipeline("ner", model="ele-sage/distilbert-base-uncased-name-splitter", aggregation_strategy="simple")

# Example 1: Standard Order
name_parser("Alonso Sarmiento Martinez")
# Expected Output:
# [{'entity_group': 'FNAME', 'score': ..., 'word': 'Alonso'},
#  {'entity_group': 'LNAME', 'score': ..., 'word': 'Sarmiento Martinez'}]

# Example 2: Swapped Order with Comma
name_parser("Wolf, Michel Konstantin")
# Expected Output:
# [{'entity_group': 'LNAME', 'score': ..., 'word': 'Wolf'},
#  {'entity_group': 'FNAME', 'score': ..., 'word': 'Michel Konstantin'}]
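
If you need flat first/last strings rather than the raw entity list, a small post-processing helper can collapse the pipeline output. The helper below is an illustrative sketch, not part of the model; it assumes the aggregation_strategy="simple" output format shown above.

def split_name(entities):
    # Collect aggregated pipeline entities into first/last name strings.
    # Entity group names FNAME/LNAME follow the label scheme above.
    parts = {"FNAME": [], "LNAME": []}
    for ent in entities:
        group = ent["entity_group"]
        if group in parts:
            parts[group].append(ent["word"])
    return {
        "first_name": " ".join(parts["FNAME"]),
        "last_name": " ".join(parts["LNAME"]),
    }

split_name(name_parser("Wolf, Michel Konstantin"))
# e.g. {'first_name': 'Michel Konstantin', 'last_name': 'Wolf'}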

Out-of-Scope Use

This model is highly specialized. It should not be used for:

  • General NER: It is not trained to find locations, organizations, dates, etc.
  • Parsing Non-Name Text: The model will produce nonsensical and unpredictable results if given text that is not a person's name.
  • Classifying Companies: This model is not a classifier. It assumes its input is a person's name. For identifying companies vs. persons, the ele-sage/distilbert-base-uncased-name-classifier model should be used first.
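
For that company-vs-person gate, a minimal sketch of the two-stage flow is shown below. The classifier's output label name ("Person") is an assumption here; check the classifier's own model card for its actual label set.

from transformers import pipeline

# Stage 1: decide whether the string is a person's name at all.
classifier = pipeline(
    "text-classification",
    model="ele-sage/distilbert-base-uncased-name-classifier"
)

# Stage 2: split only strings classified as person names.
name_parser = pipeline(
    "ner",
    model="ele-sage/distilbert-base-uncased-name-splitter",
    aggregation_strategy="simple"
)

def parse_if_person(text):
    label = classifier(text)[0]["label"]
    if label == "Person":  # assumed label name; verify against the classifier card
        return name_parser(text)
    return None  # likely a company; do not attempt to split it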

Bias, Risks, and Limitations

  • Data Source Bias: The primary training data is derived from a dataset of Canadian names, which itself is sourced from a Facebook data leak. It is therefore heavily biased towards North American and European name structures.
  • Ambiguity: For culturally ambiguous names without a comma (e.g., "Kelly Glenn"), the model makes a statistical guess based on the frequencies learned during training. This guess can be incorrect.
  • Noise: The original dataset contains a significant amount of noise and non-name entries. While a multi-stage cleaning process was applied, some noise may persist, and the model's behavior on such inputs is not guaranteed.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline

# Load the pipeline
# aggregation_strategy="simple" is recommended to automatically group multi-word names
name_parser = pipeline(
    "ner",
    model="ele-sage/distilbert-base-uncased-name-splitter",
    aggregation_strategy="simple"
)

# --- Examples ---
names = [
    "Michel Konstantin Wolf",
    "Bunting Roon, Doralie",
    "Sarmiento Martinez Alonso" # Ambiguous case without comma
]

for name in names:
    parsed_name = name_parser(name)
    print(f"Input: '{name}'")
    print(f"  -> Parsed: {parsed_name}\n")

Training Details

Training Data

The model was trained on a custom-curated dataset derived from two primary sources:

  1. Person Names Source: A large CSV file of over 3.4 million Canadian names, originally sourced from a Facebook data leak.
  2. Company Names Source: The public data file from the Quebec Enterprise Register, used to train the classifier that cleaned the person name data.

Training Procedure

The dataset was curated and processed through a multi-stage pipeline before training:

Preprocessing

  1. AI-Powered Cleaning: The raw person name CSV (3.4M rows) was first processed by the ele-sage/distilbert-base-uncased-name-classifier model. Any entry that the classifier identified as a "Company" or had low confidence of being a "Person" was discarded. This step removed a significant amount of non-name noise (e.g., "The Bahais of Prince George", "Saskatchewan Intl Freight").

  2. Rule-Based Filtering: The AI-cleaned data was further filtered to:

    • Remove entries where the first and last names were identical.
    • Remove entries containing characters outside of a pre-defined set of English, French, and common punctuation characters.

  3. Data Augmentation: The final, cleaned dataset was augmented to create three different name formats to improve robustness (a sketch of this step follows after the list):

    • 50% of the data was kept in the standard FirstName LastName order.
    • 25% was converted to the LastName, FirstName format, with a comma token tagged as O.
    • 25% was converted to the ambiguous LastName FirstName format (without a comma).

The final dataset was then shuffled to ensure all formats were present in each training batch.
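
As a rough sketch, the augmentation step can be pictured like this. The helper below is illustrative, not the original preprocessing code; the tag layout follows the B-/I- label scheme described earlier.

import random

def augment(first, last):
    # Sketch of the 50/25/25 augmentation described above.
    # Both names may be multi-word, so tags follow the B-/I- scheme.
    f_toks, l_toks = first.split(), last.split()
    f_tags = ["B-FNAME"] + ["I-FNAME"] * (len(f_toks) - 1)
    l_tags = ["B-LNAME"] + ["I-LNAME"] * (len(l_toks) - 1)
    r = random.random()
    if r < 0.50:    # FirstName LastName
        return f_toks + l_toks, f_tags + l_tags
    elif r < 0.75:  # LastName, FirstName (comma tagged O)
        return l_toks + [","] + f_toks, l_tags + ["O"] + f_tags
    else:           # LastName FirstName (ambiguous, no comma)
        return l_toks + f_toks, l_tags + f_tags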

Training Hyperparameters

  • Framework: Transformers Trainer
  • Training regime: bf16 mixed precision
  • Epochs: 8 (Configured Maximum)
  • Batch Size: 1024 (per_device_train_batch_size)
  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • LR Scheduler: Linear decay
  • Warmup Steps: 250
  • Weight Decay: 0.01
  • Evaluation Strategy: Every 1000 steps
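
These settings map onto the Transformers Trainer roughly as follows. This is a reconstruction from the list above, not the original training script; output_dir is a placeholder and any argument not listed is left at its default (AdamW is the Trainer's default optimizer).

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="name-splitter",  # placeholder
    num_train_epochs=8,
    per_device_train_batch_size=1024,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=250,
    weight_decay=0.01,
    bf16=True,
    eval_strategy="steps",       # evaluation_strategy in older Transformers versions
    eval_steps=1000,
)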

Evaluation

Metrics

The model's performance is evaluated using standard token classification metrics, with F1-Score being the primary metric. F1 is chosen over Accuracy because it provides a balanced measure of Precision and Recall, which is crucial for imbalanced NER tasks where the "haystack" of non-entity tokens is much larger than the "needles" (the name tokens).

  • Precision: Of all the tokens the model labeled as a name part, how many were correct?
  • Recall: Of all the actual name parts in the data, how many did the model find?
  • F1-Score: The harmonic mean of Precision and Recall.
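
In code, these metrics are typically computed with seqeval over the BIO tags, ignoring the -100 labels used for special and subword tokens. A sketch follows; the label order in label_list is an assumption and must match the model's id2label mapping.

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_list = ["O", "B-FNAME", "I-FNAME", "B-LNAME", "I-LNAME"]  # assumed id order

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Drop special/subword positions, which are labeled -100.
    true_preds = [[label_list[p] for p, l in zip(pred, lab) if l != -100]
                  for pred, lab in zip(predictions, labels)]
    true_labels = [[label_list[l] for p, l in zip(pred, lab) if l != -100]
                   for pred, lab in zip(predictions, labels)]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {"precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"]}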

Results

The final model was selected based on the highest F1-score achieved on the validation set during training.

Metric           Value
eval_f1          0.957
eval_precision   0.957
eval_recall      0.957
eval_accuracy    0.962
eval_loss        0.106

This result was achieved at Step 20,000 of the training run, after which performance on the validation set began to plateau.
