Initial upload of phishing-email-detector-capstone

Files changed (9)

README.md +145 -0
config.json +35 -0
.gitattributes +35 -0
pytorch_model.bin +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +55 -0
training_args.bin +3 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,145 @@

+---
+license: apache-2.0
+base_model: bert-large-uncased
+tags:
+- generated_from_trainer
+- phishing
+- BERT
+- cybersecurity
+- text-classification
+metrics:
+- accuracy
+- precision
+- recall
+model-index:
+- name: phishing-email-detector-capstone
+  results: []
+widget:
+- text: https://www.verif22.com
+  example_title: Phishing URL
+- text: >
+    Dear colleague,
+    An important update about your email has exceeded your storage limit.
+    You will not be able to send or receive messages until you reactivate your account.
+    We will close all older versions of our Mailbox as of Friday, June 12, 2023.
+    To activate and complete the required information, click here (https://ec-ec.squarespace.com).
+    Your account must be reactivated today to regenerate new space.
+    — Management Team
+  example_title: Phishing Email
+- text: >
+    You have access to FREE Video Streaming in your plan.
+    REGISTER with your email and password, then select the monthly subscription option.
+    https://bit.ly/3vNrU5r
+  example_title: Phishing SMS
+- text: >
+    if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};
+    var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
+    var sprytextfield1 = new Spry.Widget.ValidationTextField("sprytextfield1", "email");
+  example_title: Phishing Script
+- text: Hi, this model is really accurate :)
+  example_title: Benign Message
+language:
+- en
+pipeline_tag: text-classification
+---
+# 🧠 Phishing Detection Model (BERT-Large-Uncased)
+A transformer-based model fine-tuned to detect **phishing content** across multiple formats, including **emails, URLs, SMS messages, and scripts**.
+Built on **BERT-Large-Uncased**, it classifies text as *phishing* or *benign* and reaches 97.17% accuracy on the evaluation set.
+
+---
+## 📌 Model Details
+- **Base model:** `bert-large-uncased`
+- **Architecture:** 24 layers • 1024 hidden size • 16 attention heads • ~336M parameters
+- **License:** Apache 2.0
+- **Language:** English
+- **Pipeline tag:** `text-classification`
+
+---
+## 🧩 Model Description
+This model was trained to identify phishing-related content by analyzing linguistic and structural patterns commonly found in malicious communications.
+Because BERT's bidirectional attention reads each token in the context of the whole message, the model can flag phishing attempts even when the message appears legitimate or is well written.
+### Key Features
+- Detects **phishing attempts** in text, emails, URLs, and scripts
+- Useful for **cybersecurity applications**, such as email gateways or web filtering systems
+- Capable of identifying **varied phishing tactics** (impersonation, link manipulation, credential harvesting, etc.)
+---
+## 🎯 Intended Uses
+**Recommended use cases:**
+- Classify messages, emails, and URLs as *phishing* or *benign*
+- Integrate into automated **security pipelines**, email filtering tools, or chat moderation systems
+- Aid in **phishing research** or awareness programs
+
+**Limitations:**
+- May trigger **false positives** on legitimate content with financial or urgent language
+- Optimized for **English text** only
+- Should be part of a **multi-layered defense strategy**, not a standalone cybersecurity control
+---
+## 📊 Evaluation Results
+Scores on the evaluation set after the final epoch (epoch 4 in the training results table):
+
+| Metric | Score |
+|--------|--------|
+| **Loss** | 0.1953 |
+| **Accuracy** | 0.9717 |
+| **Precision** | 0.9658 |
+| **Recall** | 0.9670 |
+| **False Positive Rate** | 0.0249 |
+---
+## ⚙️ Training Details
+### Hyperparameters
+| Parameter | Value |
+|------------|--------|
+| **Learning rate** | 2e-05 |
+| **Train batch size** | 16 |
+| **Eval batch size** | 16 |
+| **Seed** | 42 |
+| **Optimizer** | Adam (β₁=0.9, β₂=0.999, ε=1e-08) |
+| **LR scheduler** | Linear |
+| **Epochs** | 4 |
+### Training Results
+| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
+|:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------:|:------:|:-------------------:|
+| 0.1487        | 1.0   | 3866  | 0.1454          | 0.9596   | 0.9709    | 0.9320 | 0.0203              |
+| 0.0805        | 2.0   | 7732  | 0.1389          | 0.9691   | 0.9663    | 0.9601 | 0.0243              |
+| 0.0389        | 3.0   | 11598 | 0.1779          | 0.9683   | 0.9778    | 0.9461 | 0.0156              |
+| 0.0091        | 4.0   | 15464 | 0.1953          | 0.9717   | 0.9658    | 0.9670 | 0.0249              |
+---
+## 🧠 Example Inference
+Try the model in Python using the `transformers` library:
+```python
+from transformers import pipeline
+
+# Load the phishing detection model.
+# "your-username" is a placeholder: replace it with the actual Hub namespace.
+classifier = pipeline("text-classification", model="your-username/phishing-email-detector-capstone")
+
+# Example texts
+examples = [
+    "Dear colleague, your email storage is full. Click here to verify your account: https://secure-update-login.com",
+    "Hi team, the meeting starts at 2 PM today.",
+    "You have won a free gift card! Claim now at http://bit.ly/3xYzabc"
+]
+
+# Run inference; truncation keeps long inputs within the model's 512-token limit
+for text in examples:
+    result = classifier(text, truncation=True)[0]
+    print(f"Text: {text}\nPrediction: {result['label']} (score: {result['score']:.4f})\n")
+```
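The training results table in the README reports precision and recall per epoch but no F1 score. The sketch below derives F1 for each epoch in plain Python and finds the epoch with the lowest validation loss. The numbers are copied from the table; the `EpochResult` class and variable names are illustrative and not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpochResult:
    """One row of the README training results table (illustrative container)."""
    epoch: int
    train_loss: float
    val_loss: float
    precision: float
    recall: float

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall
        return 2 * self.precision * self.recall / (self.precision + self.recall)


# Values copied from the training results table
RESULTS = [
    EpochResult(1, 0.1487, 0.1454, 0.9709, 0.9320),
    EpochResult(2, 0.0805, 0.1389, 0.9663, 0.9601),
    EpochResult(3, 0.0389, 0.1779, 0.9778, 0.9461),
    EpochResult(4, 0.0091, 0.1953, 0.9658, 0.9670),
]

best_by_val_loss = min(RESULTS, key=lambda r: r.val_loss)
best_by_f1 = max(RESULTS, key=lambda r: r.f1)

for r in RESULTS:
    print(f"epoch {r.epoch}: val_loss={r.val_loss:.4f}  F1={r.f1:.4f}")
print(f"lowest validation loss: epoch {best_by_val_loss.epoch}")
print(f"highest F1: epoch {best_by_f1.epoch}")
```

Validation loss is lowest at epoch 2 and rises afterwards while training loss keeps falling, which may point to overfitting. F1 and accuracy are nonetheless highest at epoch 4, so which checkpoint is preferable depends on the metric that matters for the deployment.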

config.json ADDED

	@@ -0,0 +1,35 @@

+{
+  "_name_or_path": "bert-large-uncased",
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "benign",
+    "1": "phishing"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "label2id": {
+    "benign": 0,
+    "phishing": 1
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.34.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
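The `id2label` map and `problem_type` above determine how the two output logits become a label. As a rough sketch of that decoding step, the snippet below applies a softmax to a pair of logits and looks up the winning class. The logits are made-up values and the `decode` helper is hypothetical; the real pipeline is expected to do the equivalent internally for single-label classification.

```python
import json
import math

# Subset of config.json relevant to decoding (copied from the diff above)
CONFIG_SNIPPET = """
{
  "id2label": {"0": "benign", "1": "phishing"},
  "label2id": {"benign": 0, "phishing": 1},
  "problem_type": "single_label_classification"
}
"""


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # JSON object keys are strings, so the class index is converted before lookup
    return id2label[str(best)], probs[best]


config = json.loads(CONFIG_SNIPPET)
label, score = decode([-2.0, 3.0], config["id2label"])  # hypothetical logits
print(label, round(score, 4))
```

Note that `id2label` keys arrive as strings when the file is read with `json`, which is why the index is converted with `str()` before the lookup.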

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
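These patterns decide which files in the commit are stored through Git LFS. As a rough check, the sketch below matches the nine file names from this commit against a few of the patterns using Python's `fnmatch`. This is only an approximation of gitattributes pattern matching, and the pattern subset and helper name are chosen for illustration.

```python
from fnmatch import fnmatchcase

# A subset of the LFS patterns listed above
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.onnx", "*.pt", "*.h5"]

# Files changed in this commit
COMMIT_FILES = [
    "README.md",
    "config.json",
    ".gitattributes",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.txt",
]


def tracked_by_lfs(filename: str) -> bool:
    """True if the file name matches any of the LFS glob patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in LFS_PATTERNS)


lfs_files = [name for name in COMMIT_FILES if tracked_by_lfs(name)]
print(lfs_files)
```

This agrees with the diff: `pytorch_model.bin` and `training_args.bin` are the two files that appear as three-line LFS pointers rather than as full content.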

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7fc8fd8ff9eb431b5876bff2e94d0ba31987fc2301942b65d1306eba9d18646
+size 1340710638
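The pointer's `size` can be sanity-checked against the architecture in `config.json`. The sketch below counts the parameters of a BERT sequence classifier from the config values, assuming the standard layout (embeddings, 24 encoder layers, pooler, and a two-way linear classifier), and compares four bytes per float32 parameter with the reported file size. The helper names are illustrative.

```python
# Values taken from config.json
HIDDEN = 1024
LAYERS = 24
INTERMEDIATE = 4096
VOCAB = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2
NUM_LABELS = 2  # "benign" and "phishing"

LFS_SIZE_BYTES = 1340710638  # size from the LFS pointer above


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


embeddings = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)  # Q, K, V, output
feed_forward = (
    linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm(HIDDEN)
)
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total_params = embeddings + encoder + pooler + classifier
expected_bytes = total_params * 4  # float32, per "torch_dtype"

print(f"parameters: {total_params:,}")
print(f"expected weight bytes: {expected_bytes:,}")
print(f"bytes not explained by weights: {LFS_SIZE_BYTES - expected_bytes:,}")
```

Under these assumptions the count comes to about 335M parameters, close to the ~336M figure quoted in the README. The small remainder in the file size is plausibly serialization overhead and non-parameter buffers.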

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
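The special token IDs and `model_max_length` above fix the shape of every input the model sees. The sketch below illustrates that layout for a single sequence: `[CLS]`, the content tokens truncated to fit, `[SEP]`, then `[PAD]` up to 512. The `build_inputs` helper and the example token IDs are hypothetical; in practice the tokenizer produces these tensors itself.

```python
# IDs from "added_tokens_decoder" in tokenizer_config.json
PAD_ID = 0
CLS_ID = 101
SEP_ID = 102
MAX_LENGTH = 512  # "model_max_length"


def build_inputs(token_ids, max_length=MAX_LENGTH, pad=True):
    """Wrap content token IDs as [CLS] ... [SEP], truncating and padding to max_length."""
    body = list(token_ids)[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad:
        padding = max_length - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding
    return input_ids, attention_mask


# A short message (three made-up token IDs) and an over-long one (600 tokens)
short_ids, short_mask = build_inputs([2000, 2001, 2002])
long_ids, long_mask = build_inputs(range(1000, 1600))

print(short_ids[:6], sum(short_mask))
print(len(long_ids), long_ids[0], long_ids[-1], sum(long_mask))
```

Anything beyond the first 510 content tokens is dropped in this scheme, so a long email would be classified on its opening portion only unless it is split into chunks first.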

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7d104fd966c5439370d740371ebeae1a9b747a93c604762957f98ecfeec61108
+size 4536

vocab.txt ADDED

The diff for this file is too large to render. See raw diff