Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)

Model Description

This is a Linear Regression model trained to classify web cookies, by name, into 4 privacy categories. Its input features combine TF-IDF (Term Frequency-Inverse Document Frequency) vectors over word n-grams (1-2) and character n-grams (3-5) with engineered name features. The categories are:

  • Class 0: Strictly Necessary
  • Class 1: Functionality
  • Class 2: Analytics
  • Class 3: Advertising/Tracking

Model Performance

The model achieves the following performance metrics on the test set:

Class  Category              Precision  Recall  F1-Score  Support
0      Strictly Necessary    0.92       0.90    0.91      7987
1      Functionality         0.64       0.61    0.62      1663
2      Analytics             0.89       0.93    0.91      8536
3      Advertising/Tracking  0.92       0.91    0.92      10485

Overall Accuracy: 0.90 (90%)

Weighted Average:

  • Precision: 0.90
  • Recall: 0.90
  • F1-Score: 0.90
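
The weighted averages can be cross-checked from the per-class rows. The sketch below recomputes them from the rounded table values, so the results only approximately match the reported 0.90. It also prints the unweighted (macro) average, which is not reported in this card and is shown only to illustrate how much the small Functionality class is diluted by support weighting. The names `report`, `weighted_average`, and `macro_average` are illustrative.

```python
# Per-class results copied from the table above:
# (precision, recall, f1, support)
report = {
    "Strictly Necessary":   (0.92, 0.90, 0.91, 7987),
    "Functionality":        (0.64, 0.61, 0.62, 1663),
    "Analytics":            (0.89, 0.93, 0.91, 8536),
    "Advertising/Tracking": (0.92, 0.91, 0.92, 10485),
}

total_support = sum(row[3] for row in report.values())


def weighted_average(metric_index):
    """Support-weighted mean of one column (0=precision, 1=recall, 2=f1)."""
    weighted_sum = sum(row[metric_index] * row[3] for row in report.values())
    return weighted_sum / total_support


def macro_average(metric_index):
    """Unweighted mean of one column; every class counts equally."""
    return sum(row[metric_index] for row in report.values()) / len(report)


print("total support:", total_support)
for name, idx in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: weighted={weighted_average(idx):.3f} "
          f"macro={macro_average(idx):.3f}")
```

For single-label classification, support-weighted recall equals overall accuracy, which is why the two figures agree in the summary above.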

Usage

Loading the Model

import joblib

# Load the trained classifier
model = joblib.load('LR_TFIDF+NAME.joblib')

# X_test must be a feature matrix built with the same TF-IDF vectorizers
# and name feature extractor used in training; raw cookie names will not work.
predictions = model.predict(X_test)

# Get per-class decision scores.
# Note: predict_proba is not available on every scikit-learn linear model;
# check hasattr(model, "predict_proba") before calling it.
# decision_function is used here instead.
scores = model.decision_function(X_test)
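
The raw decision scores can be turned into category names and a normalized confidence. The following is a minimal sketch, assuming the scores have shape (n_samples, 4) and that the columns follow class IDs 0 to 3 in order (check `model.classes_` to confirm). The softmax output is only a calibrated probability if the underlying estimator is a multinomial logistic model; otherwise treat it as a relative ranking. `CATEGORIES`, `softmax`, `decode`, and `example_scores` are illustrative names and values, not part of the released model.

```python
import numpy as np

# Class IDs and names as listed in the model description.
CATEGORIES = {
    0: "Strictly Necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/Tracking",
}


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def decode(scores):
    """Return (category name, normalized score) for each row of scores."""
    probs = softmax(scores)
    best = probs.argmax(axis=1)
    return [(CATEGORIES[int(k)], float(probs[i, k]))
            for i, k in enumerate(best)]


# Hypothetical scores for two cookies, standing in for
# model.decision_function(X_test).
example_scores = np.array([[2.0, 0.1, -1.0, -0.5],
                           [-1.2, -0.3, 0.4, 3.1]])

for name, p in decode(example_scores):
    print(f"{name}: {p:.2f}")
```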

Using with Hugging Face Hub

from huggingface_hub import hf_hub_download
import joblib

# Download the model from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="aqibtahir/cookie-classifier-lr-tfidf",
    filename="LR_TFIDF+NAME.joblib"
)

# Load the model
model = joblib.load(model_path)

# Use the model for predictions (with preprocessed features)
# Note: You'll need the TF-IDF vectorizers and name feature extractor
predictions = model.predict(your_features)

Training Details

  • Algorithm: Linear Regression (used for multi-class classification)
  • Features:
    • TF-IDF word n-grams (1-2), max_features=200,000
    • TF-IDF char n-grams (3-5), max_features=200,000
    • Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
  • Number of Classes: 4 (Cookie privacy categories)
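  • Training Samples: 28,671 samples reported (80/10/10 train/val/test split); the per-class support in the test-set table also sums to 28,671, so this figure may describe the evaluated set rather than the training portion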
  • Input: Cookie names (short text strings)
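
The feature setup listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original training code. The n-gram ranges and `max_features` values come from this card, but the token lists (`TRACKER_TOKENS`, `KNOWN_PREFIXES`), the suffix list, the exact set and order of name features, and the stacking order are hypothetical. The vectorizers here are fitted on a toy list of names purely to show how the three feature groups are combined into one matrix.

```python
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token lists; the ones used in training are not given here.
TRACKER_TOKENS = ("ga", "gid", "fbp", "utm", "track", "pixel")
KNOWN_PREFIXES = ("_ga", "_gid", "_fb", "__utm", "__cf")
KNOWN_SUFFIXES = ("id", "_session", "token")


def name_features(name):
    """Hand-crafted numeric features for one cookie name (illustrative)."""
    lower = name.lower()
    return [
        len(name),                                         # length
        sum(ch.isdigit() for ch in name),                  # digit count
        len(re.findall(r"[^A-Za-z0-9]", name)),            # special chars
        int(any(tok in lower for tok in TRACKER_TOKENS)),  # tracker token
        int(lower.startswith(KNOWN_PREFIXES)),             # known prefix
        int(lower.endswith(KNOWN_SUFFIXES)),               # known suffix
    ]


# Settings taken from the feature list above.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                           max_features=200_000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                           max_features=200_000)


def build_features(names, fit=False):
    """Stack word TF-IDF, char TF-IDF, and name features column-wise."""
    if fit:
        word = word_vec.fit_transform(names)
        char = char_vec.fit_transform(names)
    else:
        word = word_vec.transform(names)
        char = char_vec.transform(names)
    extra = csr_matrix(np.array([name_features(n) for n in names],
                                dtype=float))
    return hstack([word, char, extra]).tocsr()


# Toy data only; real use requires the vectorizers fitted during training.
train_names = ["_ga", "_gid", "PHPSESSID", "csrftoken", "_fbp", "lang_pref"]
X_train = build_features(train_names, fit=True)
X_new = build_features(["_ga_ABC123", "session_id"])
print(X_train.shape, X_new.shape)
```

The important property is that new names are transformed with the already fitted vectorizers (`fit=False`), so the column layout of `X_new` matches what the classifier saw during training.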

Limitations and Bias

  • Class Imbalance: The Functionality category (Class 1) performs noticeably worse (F1-score: 0.62) than the other classes (F1-score: 0.91-0.92), likely because it is underrepresented: its test-set support is 1,663, versus 7,987 to 10,485 for the other classes.
  • Preprocessing Required: The model requires the same TF-IDF vectorizers (word and char) and name feature engineering pipeline used during training.
  • Domain Specificity: The model is trained specifically on cookie names and may not generalize to other text classification tasks.
  • Cookie Name Format: Performance is best on typical cookie naming patterns; unusual formats may reduce accuracy.
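
Because the model depends on the exact training-time preprocessing, a cheap sanity check before predicting is to compare the width of the feature matrix with what the estimator was fitted on. The helper below is a sketch, assuming the loaded object exposes `n_features_in_`, as fitted scikit-learn estimators do; `check_feature_width` is a hypothetical name. A matching width does not prove the vocabularies are identical, but a mismatch does prove the pipeline differs. The stand-in objects at the end exist only to make the example runnable without the real model.

```python
from types import SimpleNamespace


def check_feature_width(model, X):
    """Compare X's column count with the model's fitted feature count.

    Returns True when they match, False when the model does not expose
    `n_features_in_` (so no check is possible), and raises ValueError
    on a mismatch.
    """
    expected = getattr(model, "n_features_in_", None)
    if expected is None:
        return False
    actual = X.shape[1]
    if actual != expected:
        raise ValueError(
            f"Feature matrix has {actual} columns but the model expects "
            f"{expected}; the vectorizers or name features probably "
            "differ from those used in training."
        )
    return True


# Stand-ins for a fitted model and a feature matrix.
fake_model = SimpleNamespace(n_features_in_=5)
fake_X = SimpleNamespace(shape=(2, 5))
print(check_feature_width(fake_model, fake_X))  # True
```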

Intended Use

This model is intended for automated cookie classification to help with:

  • Privacy compliance (GDPR, CCPA)
  • Cookie consent management platforms
  • Website privacy audits
  • Cookie banner categorization

Requirements:

  1. Input must be cookie names (short text strings)
  2. Preprocessing must use the same TF-IDF vectorizers and name feature extraction
  3. Classification is limited to the 4 predefined cookie privacy categories

Citation

If you use this model, please cite appropriately and mention the training methodology.

Model Card Authors

Created on October 30, 2025
