aqibtahir committed · Commit dc76d59 · verified · 1 Parent(s): d41ee40

Upload README.md with huggingface_hub

  1. README.md +123 -0
README.md ADDED
@@ -0,0 +1,123 @@
+ ---
+ library_name: scikit-learn
+ tags:
+ - sklearn
+ - linear-regression
+ - text-classification
+ - tfidf
+ - cookie-classification
+ - privacy
+ - web-cookies
+ license: mit
+ ---
+
+ # Cookie Classification Model: Linear Regression (TF-IDF + NAME Features)
+
+ ## Model Description
+
+ This is a Linear Regression model trained to classify web cookies by name. It combines TF-IDF (Term Frequency-Inverse Document Frequency) vectors built from word n-grams (1-2) and character n-grams (3-5) with engineered name features, and assigns each cookie to one of 4 privacy categories:
+
+ - **Class 0**: Strictly Necessary
+ - **Class 1**: Functionality
+ - **Class 2**: Analytics
+ - **Class 3**: Advertising/Tracking
+
+ ## Model Performance
+
+ The model achieves the following performance metrics on the test set:
+
+ | Class | Category | Precision | Recall | F1-Score | Support |
+ |-------|----------|-----------|--------|----------|---------|
+ | 0 | Strictly Necessary | 0.92 | 0.90 | 0.91 | 7987 |
+ | 1 | Functionality | 0.64 | 0.61 | 0.62 | 1663 |
+ | 2 | Analytics | 0.89 | 0.93 | 0.91 | 8536 |
+ | 3 | Advertising/Tracking | 0.92 | 0.91 | 0.92 | 10485 |
+
+ **Overall Accuracy:** 0.90 (90%)
+
+ **Weighted Average:**
+ - Precision: 0.90
+ - Recall: 0.90
+ - F1-Score: 0.90
+
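The weighted averages can be cross-checked against the per-class rows. The sketch below is not part of the original evaluation; it recomputes support-weighted and macro averages from the rounded table values, so the results may differ from the reported figures by about 0.01. The names `report`, `weighted_avg`, and `macro_avg` are illustrative.

```python
# Per-class metrics copied from the table above: (precision, recall, f1, support)
report = {
    "Strictly Necessary": (0.92, 0.90, 0.91, 7987),
    "Functionality": (0.64, 0.61, 0.62, 1663),
    "Analytics": (0.89, 0.93, 0.91, 8536),
    "Advertising/Tracking": (0.92, 0.91, 0.92, 10485),
}

total_support = sum(row[3] for row in report.values())


def weighted_avg(idx):
    """Average a metric column, weighting each class by its support."""
    return sum(row[idx] * row[3] for row in report.values()) / total_support


def macro_avg(idx):
    """Average a metric column, giving every class equal weight."""
    return sum(row[idx] for row in report.values()) / len(report)


metric_names = ("precision", "recall", "f1")
weighted = {name: weighted_avg(i) for i, name in enumerate(metric_names)}
macro = {name: macro_avg(i) for i, name in enumerate(metric_names)}

print(f"total support: {total_support}")
for name in metric_names:
    print(f"{name}: weighted={weighted[name]:.3f}, macro={macro[name]:.3f}")
```

The macro F1 derived this way is about 0.84, lower than the weighted 0.90, because the weaker Functionality class counts equally in a macro average but contributes little to a support-weighted one.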
+ ## Usage
+
+ ### Loading the Model
+
+ ```python
+ import joblib
+
+ # Load the serialized estimator
+ model = joblib.load('LR_TFIDF+NAME.joblib')
+
+ # X_test is a placeholder: a feature matrix built with the same TF-IDF
+ # vectorizers and name feature extractor that were used during training
+ predictions = model.predict(X_test)
+
+ # Get per-class scores, if the estimator exposes them.
+ # Note: scikit-learn's LinearRegression has neither predict_proba nor
+ # decision_function; classifiers such as LogisticRegression provide both.
+ if hasattr(model, "predict_proba"):
+     scores = model.predict_proba(X_test)
+ elif hasattr(model, "decision_function"):
+     scores = model.decision_function(X_test)
+ ```
+
+ ### Using with Hugging Face Hub
+
+ ```python
+ from huggingface_hub import hf_hub_download
+ import joblib
+
+ # Download the model file from the Hugging Face Hub
+ model_path = hf_hub_download(
+     repo_id="aqibtahir/cookie-classifier-lr-tfidf",
+     filename="LR_TFIDF+NAME.joblib"
+ )
+
+ # Load the model
+ model = joblib.load(model_path)
+
+ # your_features is a placeholder for preprocessed features.
+ # Note: you'll need the TF-IDF vectorizers and name feature extractor
+ # from training to build it.
+ predictions = model.predict(your_features)
+ ```
+
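Predictions can be turned back into readable category names. This is a minimal sketch, assuming `model.predict` returns the integer class ids 0-3 listed under Model Description; a regressor that returns continuous values would need an extra rounding or argmax step first. `CATEGORY_NAMES` and `to_category_names` are illustrative names, not part of the published model.

```python
# Class ids and names as listed in the Model Description section
CATEGORY_NAMES = {
    0: "Strictly Necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/Tracking",
}


def to_category_names(predictions):
    """Map an iterable of predicted class ids to category names."""
    return [CATEGORY_NAMES[int(p)] for p in predictions]


# Stand-in ids; in practice pass the array returned by model.predict(...)
print(to_category_names([0, 3, 2]))
```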
+ ## Training Details
+
+ - **Algorithm:** Linear Regression (used for multi-class classification)
+ - **Features:**
+   - TF-IDF word n-grams (1-2), max_features=200,000
+   - TF-IDF char n-grams (3-5), max_features=200,000
+   - Engineered name features (length, digits, special chars, tracker tokens, prefixes, suffixes, etc.)
+ - **Number of Classes:** 4 (Cookie privacy categories)
+ - **Training Samples:** 28,671 (80/10/10 train/val/test split)
+ - **Input:** Cookie names (short text strings)
+
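The feature setup listed above could be assembled roughly as follows. This is a sketch under stated assumptions, not the training code: the original vectorizers and name feature extractor are not published in this README, so `name_features` is a hypothetical stand-in covering only a few of the listed signals, and combining the three blocks with `FeatureUnion` is an assumption. Only the n-gram ranges and `max_features` values come from the list above.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def name_features(names):
    """Hypothetical engineered features; the original feature set is not published."""
    rows = [
        [
            len(name),                             # name length
            sum(c.isdigit() for c in name),        # digit count
            sum(not c.isalnum() for c in name),    # special-character count
            float(name.startswith("_")),           # example prefix flag
        ]
        for name in names
    ]
    return sparse.csr_matrix(np.asarray(rows, dtype=float))


# Word and char TF-IDF blocks use the settings listed under Training Details
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000)),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=200_000)),
    ("name", FunctionTransformer(name_features)),
])

# Tiny illustrative input; real training data would be labeled cookie names
cookie_names = ["_ga", "PHPSESSID", "_fbp", "csrftoken"]
X = features.fit_transform(cookie_names)
print(X.shape)
```

Whatever the real pipeline looks like, the fitted vectorizers have to be saved and reused at inference time, since refitting them on new data would produce a different vocabulary and column order than the model was trained on.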
+ ## Limitations and Bias
+
+ - **Class Imbalance**: The Functionality category (Class 1) performs worse (F1-score: 0.62) than the other classes (0.91-0.92), likely because it has far fewer samples (a support of 1,663 in the performance table, versus roughly 8,000-10,500 for each of the other classes).
+ - **Preprocessing Required**: The model requires the same TF-IDF vectorizers (word and char) and name feature engineering pipeline used during training.
+ - **Domain Specificity**: The model is trained specifically on cookie names and may not generalize to other text classification tasks.
+ - **Cookie Name Format**: The model performs best on typical cookie naming patterns; unusual formats may reduce accuracy.
+
+ ## Intended Use
+
+ This model is intended for **automated cookie classification** to help with:
+
+ - Privacy compliance (GDPR, CCPA)
+ - Cookie consent management platforms
+ - Website privacy audits
+ - Cookie banner categorization
+
+ **Requirements:**
+
+ 1. Input must be cookie names (short text strings)
+ 2. Preprocessing must use the same TF-IDF vectorizers and name feature extraction
+ 3. Classification is limited to the 4 predefined cookie privacy categories
+
+ ## Citation
+
+ If you use this model, please cite appropriately and mention the training methodology.
+
+ ## Model Card Authors
+
+ Created on October 30, 2025