Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)

Model Description

This is a Linear Regression model trained for cookie classification. The model uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with both word n-grams (1-2) and character n-grams (3-5), combined with engineered name features to classify web cookies into 4 privacy categories:

Class 0: Strictly Necessary
Class 1: Functionality
Class 2: Analytics
Class 3: Advertising/Tracking
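
For readable output, predicted class indices can be mapped back to the category names listed above. This is a minimal sketch that assumes the model's predict() returns the integer indices 0-3; the names COOKIE_CATEGORIES and to_category_names are illustrative and not part of the published model.

```python
# Class indices and names as listed in this model card
COOKIE_CATEGORIES = {
    0: "Strictly Necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/Tracking",
}


def to_category_names(predictions):
    """Map an iterable of predicted class indices to category names.

    Assumes each prediction is an integer-like value in 0..3.
    """
    return [COOKIE_CATEGORIES[int(p)] for p in predictions]


print(to_category_names([0, 3, 2]))
```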

Model Performance

The model achieves the following performance metrics on the test set:

Class	Category	Precision	Recall	F1-Score	Support
0	Strictly Necessary	0.92	0.90	0.91	7987
1	Functionality	0.64	0.61	0.62	1663
2	Analytics	0.89	0.93	0.91	8536
3	Advertising/Tracking	0.92	0.91	0.92	10485

Overall Accuracy: 0.90 (90%)

Weighted Average:

Precision: 0.90
Recall: 0.90
F1-Score: 0.90
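
The weighted averages are support-weighted means of the per-class values. The sketch below recomputes them from the rounded figures in the table, so the results can differ from the reported 0.90 in the last digit; the variable names are illustrative.

```python
# (precision, recall, f1, support) per class, copied from the table above
report = {
    0: (0.92, 0.90, 0.91, 7987),
    1: (0.64, 0.61, 0.62, 1663),
    2: (0.89, 0.93, 0.91, 8536),
    3: (0.92, 0.91, 0.92, 10485),
}

total_support = sum(row[3] for row in report.values())


def weighted_average(metric_index):
    """Average one metric column, weighting each class by its support."""
    weighted_sum = sum(row[metric_index] * row[3] for row in report.values())
    return weighted_sum / total_support


weighted = {
    name: weighted_average(i)
    for i, name in enumerate(("precision", "recall", "f1"))
}

print("total support:", total_support)
for name, value in weighted.items():
    print(f"{name}: {value:.3f}")
```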

Usage

Loading the Model

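import joblib

# Load the trained estimator
model = joblib.load('LR_TFIDF+NAME.joblib')

# X_test must be the feature matrix built by the training pipeline
# (word TF-IDF + char TF-IDF + engineered name features), not raw cookie names
predictions = model.predict(X_test)

# Get per-class scores. Which method is available depends on the saved estimator:
# scikit-learn's LogisticRegression has predict_proba and decision_function,
# while LinearRegression has neither, so check before calling.
if hasattr(model, "predict_proba"):
    scores = model.predict_proba(X_test)
elif hasattr(model, "decision_function"):
    scores = model.decision_function(X_test)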

Using with Hugging Face Hub

from huggingface_hub import hf_hub_download
import joblib

# Download the model from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="aqibtahir/cookie-classifier-lr-tfidf",
    filename="LR_TFIDF+NAME.joblib"
)

# Load the model
model = joblib.load(model_path)

# Use the model for predictions (with preprocessed features)
# Note: You'll need the TF-IDF vectorizers and name feature extractor
predictions = model.predict(your_features)
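
The feature matrix passed to predict() has to be assembled from the same three blocks used in training. The following is a rough sketch of that assembly with scikit-learn and SciPy. The toy fit, the column order (word, char, name features), and the helper names simple_name_features and build_features are all assumptions for illustration; in practice the fitted vectorizers and the name feature extractor from training must be used instead.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def simple_name_features(names):
    """Hypothetical stand-in for the engineered name features (length, digit count)."""
    rows = [[len(n), sum(ch.isdigit() for ch in n)] for n in names]
    return csr_matrix(rows, dtype=float)


def build_features(names, word_vec, char_vec):
    """Stack word TF-IDF, char TF-IDF, and name features column-wise.

    The column order is an assumption; it must match the order used in training.
    """
    blocks = [
        word_vec.transform(names),
        char_vec.transform(names),
        simple_name_features(names),
    ]
    return hstack(blocks).tocsr()


# Toy fit for illustration only: real inference needs the vectorizers fitted during training
toy_names = ["_ga", "_gid", "JSESSIONID", "csrftoken", "_fbp", "lang"]
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000).fit(toy_names)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=200_000).fit(toy_names)

your_features = build_features(["_ga", "sessionid"], word_vec, char_vec)
print(your_features.shape)
```

Because the toy vectorizers are refit here, their vocabularies will not match the published model; the sketch only shows the shape of the pipeline.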

Training Details

Algorithm: Linear Regression (used for multi-class classification)
Features:
- TF-IDF word n-grams (1-2), max_features=200,000
- TF-IDF char n-grams (3-5), max_features=200,000
- Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
Number of Classes: 4 (Cookie privacy categories)
Dataset Split: 80/10/10 train/val/test; the test set contains 28,671 samples (the total support in the performance table)
Input: Cookie names (short text strings)
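
The engineered name features can be pictured as a small set of hand-written signals computed from each cookie name. The sketch below is illustrative only: the token list, the prefix list, and the feature names are hypothetical, since this card does not publish the exact features used in training.

```python
import re

# Hypothetical lists; the ones used in training are not given in this card
TRACKER_TOKENS = ("ga", "gid", "fbp", "utm", "ads", "track")
SECURE_PREFIXES = ("__Host-", "__Secure-")


def name_features(name):
    """Compute a few simple numeric features from a cookie name."""
    lowered = name.lower()
    return {
        "length": len(name),
        "digit_count": sum(ch.isdigit() for ch in name),
        "special_char_count": len(re.findall(r"[^A-Za-z0-9]", name)),
        # Naive substring match; a real extractor would likely tokenize first
        "has_tracker_token": int(any(tok in lowered for tok in TRACKER_TOKENS)),
        "starts_with_underscore": int(name.startswith("_")),
        "has_secure_prefix": int(name.startswith(SECURE_PREFIXES)),
        "ends_with_id": int(lowered.endswith("id")),
    }


print(name_features("_ga_1A2B3C"))
```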

Limitations and Bias

Class Imbalance: The Functionality category (Class 1) performs worse (F1-score: 0.62) than the other classes (F1-score: 0.91-0.92), likely because it is underrepresented: 1,663 test-set samples versus roughly 8,000-10,500 for each of the other classes.
Preprocessing Required: The model requires the same TF-IDF vectorizers (word and char) and name feature engineering pipeline used during training.
Domain Specificity: The model is trained specifically on cookie names and may not generalize to other text classification tasks.
Cookie Name Format: The model performs best on typical cookie naming patterns; unusual formats may reduce accuracy.
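
To see how much the weak Functionality class is masked by the headline numbers, compare the unweighted (macro) F1 with the support-weighted F1. This sketch uses the rounded per-class F1 scores and test-set support from the performance table; the variable names are illustrative.

```python
# (f1, test-set support) per category, copied from the performance table
f1_and_support = {
    "Strictly Necessary": (0.91, 7987),
    "Functionality": (0.62, 1663),
    "Analytics": (0.91, 8536),
    "Advertising/Tracking": (0.92, 10485),
}

total = sum(support for _, support in f1_and_support.values())

# Macro average treats every class equally, so the weak class counts for a quarter
macro_f1 = sum(f1 for f1, _ in f1_and_support.values()) / len(f1_and_support)

# Weighted average scales each class by its support, so small classes barely move it
weighted_f1 = sum(f1 * support for f1, support in f1_and_support.values()) / total

functionality_share = f1_and_support["Functionality"][1] / total

print(f"macro F1: {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
print(f"Functionality share of test set: {functionality_share:.1%}")
```

The gap between the two averages gives a quick indication of how unevenly the model performs across categories.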

Intended Use

This model is intended for automated cookie classification to help with:

Privacy compliance (GDPR, CCPA)
Cookie consent management platforms
Website privacy audits
Cookie banner categorization

Requirements:

Input must be cookie names (short text strings)
Preprocessing must use the same TF-IDF vectorizers and name feature extraction
Classification is limited to the 4 predefined cookie privacy categories

Citation

If you use this model, please cite appropriately and mention the training methodology.

Model Card Authors

Created on October 30, 2025
