+---
+library_name: scikit-learn
+tags:
+- sklearn
+- linear-regression
+- text-classification
+- tfidf
+- cookie-classification
+- privacy
+- web-cookies
+license: mit
+---
+# Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)
+## Model Description
+This is a Linear Regression model used as a classifier for web cookies. It combines TF-IDF (Term Frequency-Inverse Document Frequency) vectors over word n-grams (1-2) and character n-grams (3-5) with engineered cookie-name features, and assigns each cookie to one of 4 privacy categories:
+- **Class 0**: Strictly Necessary
+- **Class 1**: Functionality
+- **Class 2**: Analytics
+- **Class 3**: Advertising/Tracking
+## Model Performance
+The model achieves the following performance metrics on the test set:
+| Class | Category | Precision | Recall | F1-Score | Support |
+|-------|----------|-----------|--------|----------|---------|
+| 0     | Strictly Necessary | 0.92 | 0.90 | 0.91 | 7987 |
+| 1     | Functionality      | 0.64 | 0.61 | 0.62 | 1663 |
+| 2     | Analytics          | 0.89 | 0.93 | 0.91 | 8536 |
+| 3     | Advertising/Tracking | 0.92 | 0.91 | 0.92 | 10485 |
+
+**Overall Accuracy:** 0.90 (90%)
+
+**Weighted Average:**
+- Precision: 0.90
+- Recall: 0.90
+- F1-Score: 0.90
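The weighted averages can be cross-checked against the per-class rows. The sketch below copies the rounded values from the table and weights each metric by its class support; because the inputs are already rounded to two decimals, the recomputed figures only land within about 0.01 of the reported 0.90. The variable and function names are illustrative.

```python
# Per-class metrics copied from the table above: (precision, recall, f1, support)
report = {
    "Strictly Necessary": (0.92, 0.90, 0.91, 7987),
    "Functionality": (0.64, 0.61, 0.62, 1663),
    "Analytics": (0.89, 0.93, 0.91, 8536),
    "Advertising/Tracking": (0.92, 0.91, 0.92, 10485),
}

# Total number of test examples across all four classes
total_support = sum(row[3] for row in report.values())


def weighted_average(metric_index):
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    return sum(row[metric_index] * row[3] for row in report.values()) / total_support


weighted_precision = weighted_average(0)
weighted_recall = weighted_average(1)
weighted_f1 = weighted_average(2)

print(f"support={total_support}")
print(f"precision={weighted_precision:.4f} recall={weighted_recall:.4f} f1={weighted_f1:.4f}")
```

Support-weighted recall is the same quantity as overall accuracy, which is why those two reported numbers agree.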
+## Usage
+### Loading the Model
+```python
+import joblib
+
+# Load the trained model
+model = joblib.load('LR_TFIDF+NAME.joblib')
+
+# X_test must be a feature matrix built with the same TF-IDF vectorizers
+# and name feature extractor used during training (not defined here)
+predictions = model.predict(X_test)
+
+# Per-class scores: not every estimator exposes predict_proba,
+# so fall back to decision_function when it is missing
+if hasattr(model, "predict_proba"):
+    scores = model.predict_proba(X_test)
+elif hasattr(model, "decision_function"):
+    scores = model.decision_function(X_test)
+```
+### Using with Hugging Face Hub
+```python
+from huggingface_hub import hf_hub_download
+import joblib
+# Download the model from Hugging Face Hub
+model_path = hf_hub_download(
+    repo_id="aqibtahir/cookie-classifier-lr-tfidf",
+    filename="LR_TFIDF+NAME.joblib"
+)
+# Load the model
+model = joblib.load(model_path)
+# Predict on preprocessed features. `your_features` is a placeholder for a
+# matrix built with the training-time TF-IDF vectorizers and name feature extractor.
+predictions = model.predict(your_features)
+```
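To turn predictions into readable categories, the class ids listed in the Model Description can be mapped back to their names. This is a minimal sketch that assumes `model.predict` returns the integer ids 0-3; `CLASS_NAMES`, `label_predictions`, and the example ids are illustrative and not part of the released model.

```python
# Class ids and names as listed in the Model Description
CLASS_NAMES = {
    0: "Strictly Necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/Tracking",
}


def label_predictions(predictions):
    """Map integer class ids (assumed output of model.predict) to category names."""
    return [CLASS_NAMES[int(p)] for p in predictions]


# Hypothetical prediction output, used here instead of a real model call
example_predictions = [0, 3, 2, 1]
labels = label_predictions(example_predictions)
print(labels)
```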
+## Training Details
+- **Algorithm:** Linear Regression (used for multi-class classification)
+- **Features:**
+  - TF-IDF word n-grams (1-2), max_features=200,000
+  - TF-IDF char n-grams (3-5), max_features=200,000
+  - Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
+- **Number of Classes:** 4 (Cookie privacy categories)
+- **Training Samples:** 28,671 samples (80/10/10 train/val/test split); this count equals the total support in the performance table above
+- **Input:** Cookie names (short text strings)
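The feature layout described above can be sketched with scikit-learn and SciPy as follows. This is an illustration under assumptions, not the author's pipeline: the n-gram ranges and `max_features` values come from the list above, while `name_features`, `TRACKER_TOKENS`, the specific name features, and the sample cookie names are hypothetical. The vectorizers are fitted on a toy list here; real use requires the vectorizers fitted during training.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tracker tokens; the real list used in training is not published here
TRACKER_TOKENS = ("ga", "gid", "fbp", "utm")


def name_features(name):
    """Illustrative hand-built features for a single cookie name."""
    return [
        len(name),                                   # name length
        sum(ch.isdigit() for ch in name),            # digit count
        sum(not ch.isalnum() for ch in name),        # special-character count
        int(any(tok in name.lower() for tok in TRACKER_TOKENS)),  # tracker token present
        int(name.startswith("_")),                   # underscore prefix
    ]


# Settings taken from the Training Details list
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=200_000)

names = ["_ga", "session_id", "_fbp", "lang_pref"]  # toy examples

X_word = word_vec.fit_transform(names)
X_char = char_vec.fit_transform(names)
X_name = csr_matrix(np.array([name_features(n) for n in names], dtype=float))

# Concatenate the three blocks column-wise into one sparse feature matrix
X = hstack([X_word, X_char, X_name]).tocsr()
print(X.shape)
```

The column order of the concatenation matters: at inference time the blocks must be stacked in the same order as during training.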
+## Limitations and Bias
+- **Class Imbalance**: The Functionality category (Class 1) performs noticeably worse (F1-score: 0.62) than the other classes (F1-score: 0.91-0.92), likely because it has far fewer samples (support of 1,663 vs roughly 8,000-10,500 for each of the other classes).
+- **Preprocessing Required**: The model requires the same TF-IDF vectorizers (word and char) and name feature engineering pipeline used during training.
+- **Domain Specificity**: The model is trained specifically on cookie names and may not generalize to other text classification tasks.
+- **Cookie Name Format**: The model performs best on typical cookie naming patterns; unusual formats may reduce accuracy.
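One cheap guard against a preprocessing mismatch is to compare the width of the feature matrix with the number of features the model saw at fit time. The sketch below relies on the `n_features_in_` attribute that fitted scikit-learn estimators expose; `check_feature_width` is a hypothetical helper, and a `SimpleNamespace` stands in for the loaded model so the example runs on its own. A matching width is necessary but does not prove the vectorizers are the same ones used in training.

```python
from types import SimpleNamespace

import numpy as np


def check_feature_width(model, features):
    """Raise ValueError if the matrix width differs from what the model was fitted on."""
    expected = getattr(model, "n_features_in_", None)
    if expected is None:
        return None  # attribute unavailable; nothing to compare against
    actual = features.shape[1]
    if actual != expected:
        raise ValueError(f"model expects {expected} feature columns, got {actual}")
    return expected


# Stand-in for the loaded model (assumption: the real model exposes n_features_in_)
stand_in_model = SimpleNamespace(n_features_in_=6)

width = check_feature_width(stand_in_model, np.zeros((2, 6)))

try:
    check_feature_width(stand_in_model, np.zeros((2, 4)))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```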
+## Intended Use
+This model is intended for **automated cookie classification** to help with:
+- Privacy compliance (GDPR, CCPA)
+- Cookie consent management platforms
+- Website privacy audits
+- Cookie banner categorization
+
+**Requirements:**
+1. Input must be cookie names (short text strings)
+2. Preprocessing must use the same TF-IDF vectorizers and name feature extraction
+3. Classification is limited to the 4 predefined cookie privacy categories
+## Citation
+If you use this model, please cite appropriately and mention the training methodology.
+## Model Card Authors
+Created on October 30, 2025