---
library_name: scikit-learn
tags:
- sklearn
- linear-regression
- text-classification
- tfidf
- cookie-classification
- privacy
- web-cookies
license: mit
---

# Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)

## Model Description

This is a Linear Regression model trained for cookie classification. It combines TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, using both word n-grams (1-2) and character n-grams (3-5), with engineered name features to classify web cookies into 4 privacy categories:

- **Class 0**: Strictly Necessary
- **Class 1**: Functionality
- **Class 2**: Analytics
- **Class 3**: Advertising/Tracking

## Model Performance

The model achieves the following performance metrics on the test set:

| Class | Category | Precision | Recall | F1-Score | Support |
|-------|----------|-----------|--------|----------|---------|
| 0 | Strictly Necessary | 0.92 | 0.90 | 0.91 | 7987 |
| 1 | Functionality | 0.64 | 0.61 | 0.62 | 1663 |
| 2 | Analytics | 0.89 | 0.93 | 0.91 | 8536 |
| 3 | Advertising/Tracking | 0.92 | 0.91 | 0.92 | 10485 |

**Overall Accuracy:** 0.90 (90%)

**Weighted Average:**

- Precision: 0.90
- Recall: 0.90
- F1-Score: 0.90

## Usage

### Loading the Model

```python
import joblib

# Load the trained model
model = joblib.load('LR_TFIDF+NAME.joblib')

# X_test must be the preprocessed feature matrix (word TF-IDF, char TF-IDF,
# and name features) built with the same pipeline used during training.
predictions = model.predict(X_test)

# Per-class scores: not every estimator exposes predict_proba,
# so fall back to decision_function when it is available instead.
if hasattr(model, "predict_proba"):
    scores = model.predict_proba(X_test)
elif hasattr(model, "decision_function"):
    scores = model.decision_function(X_test)
```

### Using with Hugging Face Hub

```python
from huggingface_hub import hf_hub_download
import joblib

# Download the model from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="aqibtahir/cookie-classifier-lr-tfidf",
    filename="LR_TFIDF+NAME.joblib"
)

# Load the model
model = joblib.load(model_path)

# Use the model for predictions (with preprocessed features).
# Note: you'll need the TF-IDF vectorizers and name feature extractor.
predictions = model.predict(your_features)
```

## Training Details

- **Algorithm:** Linear Regression (used for multi-class classification)
- **Features:**
  - TF-IDF word n-grams (1-2), max_features=200,000
  - TF-IDF char n-grams (3-5), max_features=200,000
  - Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
- **Number of Classes:** 4 (Cookie privacy categories)
- **Training Samples:** 28,671 samples (80/10/10 train/val/test split)
- **Input:** Cookie names (short text strings)

## Limitations and Bias

- **Class Imbalance**: The Functionality category (Class 1) performs noticeably worse (F1-score: 0.62) than the other classes (F1-score: 0.91-0.92), likely because it is under-represented (1,663 test samples vs. roughly 8,000-10,500 for each of the other classes).
- **Preprocessing Required**: The model requires the same TF-IDF vectorizers (word and char) and the same name feature engineering pipeline that were used during training.
- **Domain Specificity**: The model is trained specifically on cookie names and may not generalize to other text classification tasks.
- **Cookie Name Format**: Performance is best on typical cookie naming patterns; unusual formats may reduce accuracy.

## Intended Use

This model is intended for **automated cookie classification** to help with:

- Privacy compliance (GDPR, CCPA)
- Cookie consent management platforms
- Website privacy audits
- Cookie banner categorization

**Requirements:**

1. Input must be cookie names (short text strings)
2. Preprocessing must use the same TF-IDF vectorizers and name feature extraction as in training
3. Classification is limited to the 4 predefined cookie privacy categories

## Citation

If you use this model, please cite appropriately and mention the training methodology.

## Model Card Authors

Created on October 30, 2025
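
## Example: Feature Pipeline Sketch

The Training Details section lists three feature groups (word TF-IDF, char TF-IDF, and engineered name features) but the card does not ship the preprocessing code. The following is a minimal sketch of how such a feature matrix could be assembled, not the author's actual pipeline. The n-gram ranges and `max_features` values come from this card; the `name_features` function, the `TRACKER_TOKENS` list, the choice of five name features, and the toy cookie names are all assumptions made for illustration.

```python
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tracker tokens; the list used in training is not given in this card.
TRACKER_TOKENS = ("ga", "gid", "utm", "fbp", "track")


def name_features(name: str) -> list[float]:
    """Illustrative hand-crafted features for one cookie name."""
    lowered = name.lower()
    return [
        float(len(name)),                                      # length
        float(sum(ch.isdigit() for ch in name)),               # digit count
        float(len(re.findall(r"[^A-Za-z0-9]", name))),         # special characters
        float(any(tok in lowered for tok in TRACKER_TOKENS)),  # tracker token present
        float(name.startswith("_")),                           # leading-underscore prefix
    ]


# n-gram ranges and max_features follow the Training Details section.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000)
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=200_000)


def build_features(names: list[str]):
    """Stack word TF-IDF, char TF-IDF, and name features column-wise."""
    extra = csr_matrix(np.array([name_features(n) for n in names]))
    blocks = [word_tfidf.transform(names), char_tfidf.transform(names), extra]
    return hstack(blocks).tocsr()


# Toy cookie names, used only to make the sketch runnable.
train_names = ["_ga", "_gid", "session_id", "csrftoken", "_fbp", "lang_pref"]
word_tfidf.fit(train_names)
char_tfidf.fit(train_names)

X = build_features(["_ga", "session_id"])
print(X.shape)
```

Vectorizers fitted this way produce a different vocabulary and column order from the ones used in training, so a matrix built by this sketch cannot be passed to the published model. For real predictions, the fitted vectorizers and the feature extractor from training are required.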
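
## Example: Reading the Weighted Averages

The weighted averages in the Model Performance section weight each class score by its support, so the large classes dominate the result. As a short worked check, the snippet below recomputes the weighted F1-score from the rounded per-class values in the table and compares it with an unweighted (macro) average. The macro figure is not reported in this card; it is derived here only to show how much the Functionality class is masked by the weighting. Because the inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class F1-scores and supports, copied from the Model Performance table.
f1 = [0.91, 0.62, 0.91, 0.92]
support = [7987, 1663, 8536, 10485]

total = sum(support)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1) / len(f1)

print(total, round(weighted_f1, 2), round(macro_f1, 2))
```

The supports sum to 28,671, the weighted F1 rounds to the reported 0.90, and the macro F1 comes out near 0.84, which reflects the weaker Class 1 result more directly.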