Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)
Model Description
This is a Linear Regression model trained for cookie classification. The model uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with both word n-grams (1-2) and character n-grams (3-5), combined with engineered name features to classify web cookies into 4 privacy categories:
- Class 0: Strictly Necessary
- Class 1: Functionality
- Class 2: Analytics
- Class 3: Advertising/Tracking
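Downstream code usually needs to turn predicted class indices back into category names. The snippet below is a minimal sketch of that mapping; `CATEGORY_NAMES` and `label_to_category` are illustrative names, not part of the released model.

```python
# Class index -> category name, as listed above.
CATEGORY_NAMES = {
    0: "Strictly Necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/Tracking",
}


def label_to_category(label):
    """Translate a predicted class index into its category name."""
    # int() also accepts NumPy integer types returned by predict().
    return CATEGORY_NAMES[int(label)]
```

For example, `label_to_category(2)` returns `"Analytics"`.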
Model Performance
The model achieves the following performance metrics on the test set:
| Class | Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| 0 | Strictly Necessary | 0.92 | 0.90 | 0.91 | 7987 |
| 1 | Functionality | 0.64 | 0.61 | 0.62 | 1663 |
| 2 | Analytics | 0.89 | 0.93 | 0.91 | 8536 |
| 3 | Advertising/Tracking | 0.92 | 0.91 | 0.92 | 10485 |
Overall Accuracy: 0.90 (90%)
Weighted Average:
- Precision: 0.90
- Recall: 0.90
- F1-Score: 0.90
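The weighted averages can be re-derived from the per-class rows as a sanity check. The sketch below uses only the rounded values from the table, so the results can differ from the reported figures in the third decimal place; the variable names are illustrative.

```python
# Per-class metrics copied from the table above: (precision, recall, f1, support).
REPORT = {
    "Strictly Necessary": (0.92, 0.90, 0.91, 7987),
    "Functionality": (0.64, 0.61, 0.62, 1663),
    "Analytics": (0.89, 0.93, 0.91, 8536),
    "Advertising/Tracking": (0.92, 0.91, 0.92, 10485),
}

# Total number of evaluated samples.
total_support = sum(row[3] for row in REPORT.values())


def weighted_average(metric_index):
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    weighted_sum = sum(row[metric_index] * row[3] for row in REPORT.values())
    return weighted_sum / total_support


weighted_precision = weighted_average(0)
weighted_recall = weighted_average(1)
weighted_f1 = weighted_average(2)

print(total_support)  # 28671
print(f"{weighted_precision:.3f} {weighted_recall:.3f} {weighted_f1:.3f}")
```

All three values come out just under 0.90, which matches the reported figures after rounding. For single-label multi-class problems the support-weighted recall equals overall accuracy, which is why both are reported as 0.90.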
Usage
Loading the Model
import joblib
# Load the trained classifier
model = joblib.load('LR_TFIDF+NAME.joblib')
# X_test is a placeholder for features built with the training-time pipeline
# (word TF-IDF, char TF-IDF, and engineered name features)
predictions = model.predict(X_test)
# Per-class scores: use predict_proba if the estimator provides it,
# otherwise fall back to decision_function (unnormalized scores)
if hasattr(model, "predict_proba"):
    scores = model.predict_proba(X_test)
else:
    scores = model.decision_function(X_test)
Using with Hugging Face Hub
from huggingface_hub import hf_hub_download
import joblib
# Download the model from Hugging Face Hub
model_path = hf_hub_download(
repo_id="aqibtahir/cookie-classifier-lr-tfidf",
filename="LR_TFIDF+NAME.joblib"
)
# Load the model
model = joblib.load(model_path)
# your_features is a placeholder for the feature matrix built with the same
# TF-IDF vectorizers and name feature extractor that were used in training
predictions = model.predict(your_features)
Training Details
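- Algorithm: Linear Regression (used for multi-class classification). The usage code relies on decision_function, which scikit-learn's LinearRegression does not implement, so the saved estimator is presumably a linear classifier; inspecting type(model) after loading shows the actual class.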
- Features:
- TF-IDF word n-grams (1-2), max_features=200,000
- TF-IDF char n-grams (3-5), max_features=200,000
- Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
- Number of Classes: 4 (Cookie privacy categories)
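- Training Samples: 28,671 samples (80/10/10 train/val/test split). Note that this figure equals the total support in the performance table (7,987 + 1,663 + 8,536 + 10,485 = 28,671), so it may describe the evaluated set rather than the training split alone.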
- Input: Cookie names (short text strings)
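The card lists the kinds of engineered name features but not their exact definitions. The sketch below shows one plausible way to compute a few of them. `TRACKER_TOKENS`, `KNOWN_PREFIXES`, and `name_features` are hypothetical, and the features used in training may differ.

```python
import re

# Hypothetical token and prefix lists; the card does not publish the ones used in training.
TRACKER_TOKENS = ("ga", "gid", "fbp", "utm", "track", "ad")
KNOWN_PREFIXES = ("__Secure-", "__Host-", "_")


def name_features(cookie_name):
    """Return a fixed-length list of numeric features for one cookie name."""
    lowered = cookie_name.lower()
    # Split on non-alphanumeric characters to get the tokens inside the name.
    tokens = [t for t in re.split(r"[^a-z0-9]+", lowered) if t]
    return [
        float(len(cookie_name)),                              # name length
        float(sum(ch.isdigit() for ch in cookie_name)),       # digit count
        float(sum(not ch.isalnum() for ch in cookie_name)),   # special character count
        float(any(tok in TRACKER_TOKENS for tok in tokens)),  # contains a tracker token
        float(cookie_name.startswith(KNOWN_PREFIXES)),        # starts with a known prefix
    ]
```

With these assumed lists, `name_features("_ga_123")` yields `[7.0, 3.0, 2.0, 1.0, 1.0]`: seven characters, three digits, two underscores, one tracker token match, and a leading underscore prefix.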
Limitations and Bias
- Class Imbalance: The Functionality category (Class 1) performs noticeably worse (F1-score: 0.62) than the other classes (0.91-0.92), likely because it is underrepresented: its support in the performance table is 1,663 samples, versus roughly 8,000-10,500 for each of the other classes.
- Preprocessing Required: The model requires the same TF-IDF vectorizers (word and char) and name feature engineering pipeline used during training.
- Domain Specificity: Model is trained specifically on cookie names and may not generalize to other text classification tasks.
- Cookie Name Format: Best performance on typical cookie naming patterns; unusual formats may affect accuracy.
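The preprocessing requirement above means the vectorizers are fitted once on training data and only applied at inference time. The sketch below illustrates that pattern with the n-gram ranges and max_features values from Training Details. The toy data, the `simple_name_features` helper, and the use of `LogisticRegression` as the classifier are assumptions made for illustration, not the author's published pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only; this is not the training set.
names = ["_ga", "_gid", "sessionid", "csrftoken", "lang_pref", "theme_pref", "_fbp", "ad_id"]
labels = [2, 2, 0, 0, 1, 1, 3, 3]

# Vectorizer settings taken from the Training Details list.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=200_000)


def simple_name_features(batch):
    """Hypothetical stand-in for the engineered name features: length and digit count."""
    rows = [[len(n), sum(ch.isdigit() for ch in n)] for n in batch]
    return csr_matrix(np.array(rows, dtype=float))


def build_features(batch, fit=False):
    """Stack word TF-IDF, char TF-IDF, and name features column-wise."""
    word = word_vec.fit_transform(batch) if fit else word_vec.transform(batch)
    char = char_vec.fit_transform(batch) if fit else char_vec.transform(batch)
    return hstack([word, char, simple_name_features(batch)]).tocsr()


# Fit the vectorizers and the classifier on the training names.
X_train = build_features(names, fit=True)
clf = LogisticRegression(max_iter=1000).fit(X_train, labels)

# At inference time, reuse the fitted vectorizers (transform only, no refitting).
X_new = build_features(["_ga_XYZ", "session_token"])
predictions = clf.predict(X_new)
```

Refitting the vectorizers on new data would change the vocabulary and column layout, so the classifier's learned weights would no longer line up with the features. Saving the fitted vectorizers alongside the classifier (for example with `joblib.dump`) is one way to keep them in sync.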
Intended Use
This model is intended for automated cookie classification to help with:
- Privacy compliance (GDPR, CCPA)
- Cookie consent management platforms
- Website privacy audits
- Cookie banner categorization
Requirements:
- Input must be cookie names (short text strings)
- Preprocessing must use the same TF-IDF vectorizers and name feature extraction
- Classification is limited to the 4 predefined cookie privacy categories
Citation
If you use this model, please cite appropriately and mention the training methodology.
Model Card Authors
Created on October 30, 2025