---
library_name: scikit-learn
tags:
- sklearn
- linear-regression
- text-classification
- tfidf
- cookie-classification
- privacy
- web-cookies
license: mit
---

# Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)

## Model Description

This is a Linear Regression model that classifies web cookies by name. Its input features combine TF-IDF (Term Frequency-Inverse Document Frequency) vectors over word n-grams (1-2) and character n-grams (3-5) with engineered name features. Each cookie is assigned to one of 4 privacy categories:

- **Class 0**: Strictly Necessary
- **Class 1**: Functionality
- **Class 2**: Analytics
- **Class 3**: Advertising/Tracking
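
To report predictions by name rather than by integer ID, the class list above can be held in a small lookup table. This is a minimal sketch; `CATEGORY_NAMES` and `label_for` are illustrative names and are not part of the released model.

```python
# Hypothetical helper: maps the integer class IDs listed above to category names.
CATEGORY_NAMES = {
    0: "Strictly Necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/Tracking",
}


def label_for(class_id):
    """Return the category name for a predicted class ID."""
    # int() also accepts NumPy integer scalars such as those returned by predict().
    return CATEGORY_NAMES[int(class_id)]


print([label_for(c) for c in [0, 3, 2]])
```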

## Model Performance

The model achieves the following performance metrics on the test set:

| Class | Category | Precision | Recall | F1-Score | Support |
|-------|----------|-----------|--------|----------|---------|
| 0     | Strictly Necessary | 0.92 | 0.90 | 0.91 | 7987 |
| 1     | Functionality      | 0.64 | 0.61 | 0.62 | 1663 |
| 2     | Analytics          | 0.89 | 0.93 | 0.91 | 8536 |
| 3     | Advertising/Tracking | 0.92 | 0.91 | 0.92 | 10485 |

**Overall Accuracy:** 0.90 (90%)

**Weighted Average:**
- Precision: 0.90
- Recall: 0.90
- F1-Score: 0.90
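
The weighted averages can be cross-checked against the per-class table, since each metric is weighted by the class support. The sketch below recomputes them from the rounded table values, so the results should agree with the reported 0.90 only to within rounding. The names `rows` and `weighted_average` are illustrative.

```python
# Per-class rows copied from the table above: (precision, recall, f1, support).
rows = {
    0: (0.92, 0.90, 0.91, 7987),
    1: (0.64, 0.61, 0.62, 1663),
    2: (0.89, 0.93, 0.91, 8536),
    3: (0.92, 0.91, 0.92, 10485),
}

total_support = sum(r[3] for r in rows.values())


def weighted_average(metric_index):
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    return sum(r[metric_index] * r[3] for r in rows.values()) / total_support


for name, idx in [("precision", 0), ("recall", 1), ("f1", 2)]:
    print(f"{name}: {weighted_average(idx):.3f}")
```

The support-weighted recall is the same quantity as overall accuracy, which is why the two reported figures match.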

## Usage

### Loading the Model

```python
import joblib
import numpy as np

# Load the model
model = joblib.load('LR_TFIDF+NAME.joblib')

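# X_test must be the feature matrix produced by the training-time pipeline
# (word TF-IDF + char TF-IDF + name features); raw cookie names are not accepted.
predictions = model.predict(X_test)

# Per-class scores: which scoring method exists depends on the estimator type,
# so check before calling instead of assuming one is available.
if hasattr(model, "predict_proba"):
    scores = model.predict_proba(X_test)
elif hasattr(model, "decision_function"):
    scores = model.decision_function(X_test)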
```

### Using with Hugging Face Hub

```python
from huggingface_hub import hf_hub_download
import joblib

# Download the model from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="aqibtahir/cookie-classifier-lr-tfidf",
    filename="LR_TFIDF+NAME.joblib"
)

# Load the model
model = joblib.load(model_path)

# Use the model for predictions (with preprocessed features)
# Note: You'll need the TF-IDF vectorizers and name feature extractor
predictions = model.predict(your_features)
```

## Training Details

- **Algorithm:** Linear Regression (used for multi-class classification)
- **Features:** 
  - TF-IDF word n-grams (1-2), max_features=200,000
  - TF-IDF char n-grams (3-5), max_features=200,000
  - Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
- **Number of Classes:** 4 (Cookie privacy categories)
- **Training Samples:** 28,671 samples (80/10/10 train/val/test split)
- **Input:** Cookie names (short text strings)
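
The engineered name features listed above can be illustrated with a few simple counts and flags. The sketch below is an assumption about what such features might look like, not the extractor used in training: `TRACKER_TOKENS`, `KNOWN_PREFIXES`, and `name_features` are hypothetical, and the real feature set and its ordering are not documented here.

```python
import re

# Assumed, illustrative token and prefix lists; the ones used in training are not published here.
TRACKER_TOKENS = ("ga", "gid", "fbp", "utm", "ads")
KNOWN_PREFIXES = ("_ga", "_gid", "_fbp", "__utm")


def name_features(cookie_name):
    """Return a fixed-order list of simple numeric features for one cookie name."""
    lowered = cookie_name.lower()
    # Split on anything that is not a lowercase letter or digit, dropping empty pieces.
    tokens = [t for t in re.split(r"[^a-z0-9]+", lowered) if t]
    return [
        float(len(cookie_name)),                             # name length
        float(sum(ch.isdigit() for ch in cookie_name)),      # digit count
        float(sum(not ch.isalnum() for ch in cookie_name)),  # special-character count
        float(any(t in TRACKER_TOKENS for t in tokens)),     # contains a tracker token
        float(lowered.startswith(KNOWN_PREFIXES)),           # starts with a known prefix
    ]


print(name_features("_ga_ABC123"))
print(name_features("PHPSESSID"))
```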

## Limitations and Bias

- **Class Imbalance**: The Functionality category (Class 1) performs worse (F1-score: 0.62) than the other classes (0.91-0.92), likely because it has far fewer samples (support of 1,663 in the evaluation table versus roughly 8,000-10,500 for each of the other classes).
- **Preprocessing Required**: The model requires the same TF-IDF vectorizers (word and char) and name feature engineering pipeline used during training.
- **Domain Specificity**: Model is trained specifically on cookie names and may not generalize to other text classification tasks.
- **Cookie Name Format**: The model performs best on typical cookie naming patterns; unusual formats may reduce accuracy.
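
The preprocessing requirement means the vectorizers are fitted once on the training names and only applied (`transform`, never `fit_transform`) at inference, so the feature columns line up with what the model saw. The sketch below illustrates this with toy data. The n-gram ranges and `max_features` values come from Training Details; all other settings are scikit-learn defaults, which is an assumption. `simple_name_features` is a hypothetical stand-in for the real name-feature extractor, and the block order in `hstack` is also assumed.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy cookie names for illustration only; this is not the training data.
train_names = ["_ga", "_gid", "PHPSESSID", "csrftoken", "_fbp", "lang"]
new_names = ["_ga_ABC123", "sessionid"]

# n-gram ranges and max_features follow Training Details; the rest are defaults (assumed).
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=200_000)


def simple_name_features(names):
    """Hypothetical stand-in for the engineered name features: length and digit count."""
    rows = [[len(n), sum(c.isdigit() for c in n)] for n in names]
    return csr_matrix(np.array(rows, dtype=float))


# Fit the vectorizers once, on training names only.
X_train = hstack([
    word_vec.fit_transform(train_names),
    char_vec.fit_transform(train_names),
    simple_name_features(train_names),
]).tocsr()

# At inference, reuse the fitted vectorizers so the column layout is identical.
X_new = hstack([
    word_vec.transform(new_names),
    char_vec.transform(new_names),
    simple_name_features(new_names),
]).tocsr()

print(X_train.shape, X_new.shape)
```

One option for keeping the pieces together is to persist the fitted vectorizers with `joblib.dump` next to the model file and load them with `joblib.load` before building features.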

## Intended Use

This model is intended for **automated cookie classification** to help with:

- Privacy compliance (GDPR, CCPA)
- Cookie consent management platforms
- Website privacy audits
- Cookie banner categorization

**Requirements:**

1. Input must be cookie names (short text strings)
2. Preprocessing must use the same TF-IDF vectorizers and name feature extraction
3. Classification is limited to the 4 predefined cookie privacy categories

## Citation

If you use this model, please cite appropriately and mention the training methodology.

## Model Card Authors

Created on October 30, 2025