| # TEXT-AUTH: System Architecture Documentation | |
| > TEXT-AUTH is an evidence-first, domain-aware AI text detection system | |
| > designed around independent signals, calibrated aggregation, and | |
| > explainability rather than black-box classification. | |
| --- | |
| ## Table of Contents | |
| 1. [System Overview](#system-overview) | |
| 2. [High-Level Architecture](#high-level-architecture) | |
| 3. [Layer-by-Layer Architecture](#layer-by-layer-architecture) | |
| 4. [Data Flow](#data-flow) | |
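| 5. [Technology Stack](#technology-stack) | |
| 6. [Deployment Architecture](#deployment-architecture) | |
| 7. [Performance Characteristics](#performance-characteristics) | |
| 8. [Security & Privacy](#security--privacy) | |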
| --- | |
| ## System Overview | |
| **TEXT-AUTH** is an AI text detection system that combines six independent detection metrics with ensemble aggregation to estimate whether text is synthetically generated, authentically written, or hybrid content. | |
| ### Key Capabilities | |
| - **Multi-Metric Analysis**: 6 independent detection metrics (Structural, Perplexity, Entropy, Semantic, Linguistic, Multi-Perturbation Stability) | |
| - **Domain-Aware Calibration**: Adaptive thresholds for 16 text domains (Academic, Creative, Technical, etc.) | |
| - **Ensemble Aggregation**: Confidence-weighted combination with uncertainty quantification | |
| - **Sentence-Level Highlighting**: Visual feedback with probability scores | |
| - **Comprehensive Reporting**: JSON and PDF reports with detailed analysis | |
| ### Design Principles | |
| - **Modular Architecture**: Clean separation of concerns across layers | |
| - **Fail-Safe Design**: Graceful degradation with fallback strategies | |
| - **Parallel Processing**: Multi-threaded metric execution for performance | |
| - **Domain Expertise**: Specialized thresholds calibrated per content type | |
| ## Why Multi-Metric Instead of a Single Classifier? | |
| - A single classifier tends to overfit stylistic artifacts | |
| - LLMs adapt rapidly to known detectors | |
| - Independent statistical signals decay more slowly | |
| - Ensemble disagreement is itself evidence | |
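To make the last point concrete, the sketch below treats the spread of per-metric scores as a signal in its own right. The function name `disagreement` and the example scores are illustrative assumptions, not taken from the TEXT-AUTH codebase.

```python
from statistics import pstdev


def disagreement(scores: list[float]) -> float:
    """Population standard deviation of per-metric synthetic probabilities.

    A value near 0 means the metrics agree; a large value means they pull
    in different directions and the verdict deserves less confidence.
    """
    return pstdev(scores)


# Hypothetical per-metric synthetic probabilities for two documents.
agreeing_metrics = [0.82, 0.79, 0.85, 0.80]
split_metrics = [0.95, 0.20, 0.88, 0.15]

print(f"agreeing: {disagreement(agreeing_metrics):.3f}")
print(f"split:    {disagreement(split_metrics):.3f}")
```

Both example documents could average to a similar score, yet only the first supports a confident verdict.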
| --- | |
| ## High-Level Architecture | |
| ```mermaid | |
| graph TB | |
| subgraph "Presentation Layer" | |
| UI[Web Interface/API] | |
| end | |
| subgraph "Application Layer" | |
| ORCH[Detection Orchestrator] | |
| ORCH --> |coordinates| PIPE[Processing Pipeline] | |
| end | |
| subgraph "Service Layer" | |
| ENSEMBLE[Ensemble Classifier] | |
| HIGHLIGHT[Text Highlighter] | |
| REASON[Reasoning Generator] | |
| REPORT[Report Generator] | |
| end | |
| subgraph "Processing Layer" | |
| EXTRACT[Document Extractor] | |
| TEXTPROC[Text Processor] | |
| DOMAIN[Domain Classifier] | |
| LANG[Language Detector] | |
| end | |
| subgraph "Metrics Layer" | |
| STRUCT[Structural Metric] | |
| PERP[Perplexity Metric] | |
| ENT[Entropy Metric] | |
| SEM[Semantic Metric] | |
| LING[Linguistic Metric] | |
| MPS[Multi-Perturbation Stability] | |
| end | |
| subgraph "Model Layer" | |
| MANAGER[Model Manager] | |
| REGISTRY[Model Registry] | |
| CACHE[(Model Cache)] | |
| end | |
| subgraph "Configuration Layer" | |
| CONFIG[Settings] | |
| ENUMS[Enums] | |
| SCHEMAS[Data Schemas] | |
| CONSTANTS[Constants] | |
| THRESHOLDS[Domain Thresholds] | |
| end | |
| UI --> ORCH | |
| ORCH --> EXTRACT | |
| ORCH --> TEXTPROC | |
| ORCH --> DOMAIN | |
| ORCH --> LANG | |
| ORCH --> STRUCT | |
| ORCH --> PERP | |
| ORCH --> ENT | |
| ORCH --> SEM | |
| ORCH --> LING | |
| ORCH --> MPS | |
| ORCH --> ENSEMBLE | |
| ENSEMBLE --> HIGHLIGHT | |
| ENSEMBLE --> REASON | |
| ENSEMBLE --> REPORT | |
| STRUCT --> MANAGER | |
| PERP --> MANAGER | |
| ENT --> MANAGER | |
| SEM --> MANAGER | |
| LING --> MANAGER | |
| MPS --> MANAGER | |
| DOMAIN --> MANAGER | |
| LANG --> MANAGER | |
| MANAGER --> REGISTRY | |
| MANAGER --> CACHE | |
| ORCH --> CONFIG | |
| ENSEMBLE --> THRESHOLDS | |
| style UI fill:#e1f5ff | |
| style ORCH fill:#fff3e0 | |
| style ENSEMBLE fill:#f3e5f5 | |
| style MANAGER fill:#e8f5e9 | |
| style CONFIG fill:#fce4ec | |
| ``` | |
| --- | |
| ## Layer-by-Layer Architecture | |
| ### 1. Configuration Layer (`config/`) | |
| The foundation layer providing enums, schemas, constants, and domain-specific thresholds. | |
| ```mermaid | |
| graph LR | |
| subgraph "Configuration Layer" | |
| direction TB | |
| ENUMS["enums.py | |
| Domain, Language, Script, | |
| ModelType, ConfidenceLevel"] | |
| SCHEMAS["schemas.py | |
| ModelConfig, ProcessedText, MetricResult, EnsembleResult, | |
| DetectionResult"] | |
| CONSTANTS["constants.py | |
| TextProcessingParams, MetricParams, | |
| EnsembleParams"] | |
| THRESHOLDS["threshold_config.py | |
| DomainThresholds, | |
| 16 Domain Configs, MetricThresholds"] | |
| MODELCFG["model_config.py | |
| Model Registry, Model Groups, Default Weights"] | |
| SETTINGS["settings.py | |
| App Settings, Paths, Feature Flags"] | |
| end | |
| ENUMS -.->|used by| SCHEMAS | |
| ENUMS -.->|used by| THRESHOLDS | |
| SCHEMAS -.->|used by| CONSTANTS | |
| THRESHOLDS -.->|imports| ENUMS | |
| MODELCFG -.->|imports| ENUMS | |
| style ENUMS fill:#ffebee | |
| style SCHEMAS fill:#fff3e0 | |
| style CONSTANTS fill:#e8f5e9 | |
| style THRESHOLDS fill:#e1f5ff | |
| style MODELCFG fill:#f3e5f5 | |
| style SETTINGS fill:#fce4ec | |
| ``` | |
| **Key Components:** | |
| - **enums.py**: Core enumerations (Domain, Language, Script, ModelType, ConfidenceLevel) | |
| - **schemas.py**: Data classes for structured data exchange | |
| - **constants.py**: Frozen dataclasses with hyperparameters for each metric | |
| - **threshold_config.py**: Domain-specific thresholds for 16 domains | |
| - **model_config.py**: Model registry with download priorities and configurations | |
| - **settings.py**: Application settings with Pydantic validation | |
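A minimal sketch of how the enum, frozen-dataclass, and per-domain threshold pieces listed above might fit together. Only the names `Domain` and `MetricThresholds` appear in this document; the field names, the numeric values, and the `thresholds_for` helper are hypothetical and chosen purely for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum


class Domain(str, Enum):
    """A small illustrative subset of the 16 domains."""
    GENERAL = "general"
    ACADEMIC = "academic"
    CREATIVE = "creative"
    TECHNICAL = "technical"


@dataclass(frozen=True)
class MetricThresholds:
    """Immutable thresholds for one metric (field names are assumptions)."""
    synthetic_threshold: float
    authentic_threshold: float
    weight: float


# Hypothetical values; real thresholds are calibrated per domain.
DOMAIN_THRESHOLDS = {
    Domain.GENERAL: MetricThresholds(0.60, 0.40, 0.20),
    Domain.ACADEMIC: MetricThresholds(0.65, 0.35, 0.22),
}


def thresholds_for(domain: Domain) -> MetricThresholds:
    """Return the thresholds for a domain, falling back to GENERAL."""
    return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS[Domain.GENERAL])


print(thresholds_for(Domain.ACADEMIC))
print(thresholds_for(Domain.CREATIVE))  # no entry, so GENERAL is returned
```

Freezing the dataclass means a metric cannot mutate shared hyperparameters at runtime; any attempt raises `FrozenInstanceError`.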
| --- | |
| ### 2. Model Abstraction Layer (`models/`) | |
| Conceptual model abstraction layer that centralizes model loading, caching, and reuse, giving the metrics a single, unified access point. | |
| ```mermaid | |
| graph TB | |
| subgraph "Model Layer" | |
| direction TB | |
| MANAGER["Model Manager | |
| Singleton Pattern | |
| Lazy Loading"] | |
| REGISTRY["Model Registry | |
| 10 Model Configs | |
| Priority Groups"] | |
| subgraph "Model Cache" | |
| direction LR | |
| GPT2[GPT-2<br/>548MB<br/>Perplexity/MPS] | |
| MINILM[MiniLM-L6-v2<br/>80MB<br/>Semantic] | |
| SPACY[spaCy sm<br/>13MB<br/>Linguistic] | |
| ROBERTA[RoBERTa<br/>500MB<br/>Domain Classifier] | |
| DISTIL[DistilRoBERTa<br/>330MB<br/>MPS Mask] | |
| XLM[XLM-RoBERTa<br/>1100MB<br/>Language Detection] | |
| end | |
| STATS[Usage Statistics<br/>Tracking Performance Metrics] | |
| end | |
| MANAGER -->|loads from| REGISTRY | |
| MANAGER -->|manages| GPT2 | |
| MANAGER -->|manages| MINILM | |
| MANAGER -->|manages| SPACY | |
| MANAGER -->|manages| ROBERTA | |
| MANAGER -->|manages| DISTIL | |
| MANAGER -->|manages| XLM | |
| MANAGER -->|tracks| STATS | |
| REGISTRY -.->|defines| GPT2 | |
| REGISTRY -.->|defines| MINILM | |
| REGISTRY -.->|defines| SPACY | |
| style MANAGER fill:#e3f2fd | |
| style REGISTRY fill:#f3e5f5 | |
| style STATS fill:#fff3e0 | |
| ``` | |
| **Key Features:** | |
| - **Lazy Loading**: Models loaded on-demand | |
| - **Caching Strategy**: LRU cache with max 5 models | |
| - **Usage Tracking**: Statistics for optimization | |
| - **Priority Groups**: Essential, Extended, Optional | |
| - **Total Size**: ~2.8GB for all models | |
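The lazy-loading and LRU-caching behaviour described above can be sketched with the standard library alone. The class name `LazyModelCache`, its methods, and the stand-in loaders are assumptions for illustration; the actual Model Manager interface is not shown in this document.

```python
from collections import OrderedDict
from typing import Any, Callable


class LazyModelCache:
    """Loads models on first use and evicts the least recently used one."""

    def __init__(self, loaders: dict[str, Callable[[], Any]], max_models: int = 5):
        self._loaders = loaders            # name -> zero-argument loader
        self._max_models = max_models
        self._cache: OrderedDict[str, Any] = OrderedDict()
        self.load_counts: dict[str, int] = {}   # simple usage statistics

    def get(self, name: str) -> Any:
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        model = self._loaders[name]()      # lazy load on first request
        self.load_counts[name] = self.load_counts.get(name, 0) + 1
        self._cache[name] = model
        if len(self._cache) > self._max_models:
            self._cache.popitem(last=False)  # evict least recently used
        return model

    def cached_names(self) -> list[str]:
        return list(self._cache)


# Stand-in loaders; real ones would load tokenizer and model weights.
cache = LazyModelCache(
    {"gpt2": lambda: "gpt2-model", "minilm": lambda: "minilm-model",
     "spacy": lambda: "spacy-model"},
    max_models=2,
)
cache.get("gpt2")
cache.get("minilm")
cache.get("gpt2")    # cache hit, refreshes recency
cache.get("spacy")   # evicts "minilm", the least recently used entry
print(cache.cached_names())
```

A production version would also need a lock around `get`, since metrics run in parallel threads; that is omitted here for brevity.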
| --- | |
| ### 3. Processing Layer (`processors/`) | |
| Handles document extraction, text preprocessing, domain identification, and language detection. | |
| ```mermaid | |
| graph TB | |
| subgraph "Processing Layer" | |
| direction TB | |
| subgraph "Document Extraction" | |
| EXTRACT[Document Extractor] | |
| EXTRACT -->|PDF| PYPDF[PyMuPDF Primary] | |
| EXTRACT -->|PDF| PDFPLUMB[pdfplumber Fallback] | |
| EXTRACT -->|PDF| PYPDF2[PyPDF2 Fallback] | |
| EXTRACT -->|DOCX| DOCX[python-docx] | |
| EXTRACT -->|HTML| BS4[BeautifulSoup4] | |
| EXTRACT -->|RTF| RTF[Basic Parser] | |
| EXTRACT -->|TXT| TXT[Chardet Encoding] | |
| end | |
| subgraph "Text Processing" | |
| TEXTPROC[Text Processor] | |
| TEXTPROC --> CLEAN[Unicode Normalization<br/>URL/Email Removal<br/>Whitespace Cleaning] | |
| TEXTPROC --> SPLIT[Smart Sentence Splitting<br/>Abbreviation Handling<br/>Word Tokenization] | |
| TEXTPROC --> VALIDATE[Length Validation<br/>Quality Checks<br/>Statistics] | |
| end | |
| subgraph "Domain Classification" | |
| DOMAIN[Domain Classifier] | |
| DOMAIN --> ZERO[Heuristic + optional model-assisted domain inference<br/>RoBERTa/DeBERTa] | |
| DOMAIN --> LABELS[16 Domain Labels<br/>Multi-Label Candidates] | |
| DOMAIN --> THRESH[Domain-Specific<br/>Threshold Selection] | |
| end | |
| subgraph "Language Detection" | |
| LANG[Language Detector] | |
| LANG --> MODEL[XLM-RoBERTa<br/>Chunk-Based Analysis] | |
| LANG --> FALLBACK[langdetect Library] | |
| LANG --> HEURISTIC[Script Detection<br/>Character Analysis] | |
| end | |
| end | |
| EXTRACT -->|ProcessedText| TEXTPROC | |
| TEXTPROC -->|Cleaned Text| DOMAIN | |
| TEXTPROC -->|Cleaned Text| LANG | |
| style EXTRACT fill:#e8f5e9 | |
| style TEXTPROC fill:#fff3e0 | |
| style DOMAIN fill:#e1f5ff | |
| style LANG fill:#f3e5f5 | |
| ``` | |
| **Processing Pipeline:** | |
| 1. **Document Extraction**: Multi-format support with fallback strategies | |
| 2. **Text Cleaning**: Unicode normalization, noise removal, validation | |
| 3. **Domain Identification**: Zero-shot classification with confidence scores | |
| 4. **Language Detection**: Multi-strategy approach with script analysis | |
| --- | |
| ### 4. Metrics Layer (`metrics/`) | |
| Six independent detection metrics analyzing different text characteristics. | |
| ```mermaid | |
| graph TB | |
| subgraph "Metrics Layer" | |
| direction TB | |
| BASE[Base Metric<br/>Abstract Class<br/>Common Interface] | |
| subgraph "Statistical Metrics" | |
| STRUCT[Structural Metric<br/>No ML Model<br/>Statistical Features] | |
| STRUCT --> SF1[Sentence Length Distribution<br/>Burstiness Score<br/>Readability] | |
| STRUCT --> SF2[N-gram Diversity<br/>Type-Token Ratio<br/>Repetition Patterns] | |
| end | |
| subgraph "ML-Based Metrics" | |
| PERP[Perplexity Metric<br/>GPT-2 Model<br/>Text Predictability] | |
| PERP --> PF1[Overall Perplexity<br/>Sentence-Level Perplexity<br/>Cross-Entropy] | |
| PERP --> PF2[Chunk Analysis<br/>Variance Scoring<br/>Normalization] | |
| ENT[Entropy Metric<br/>GPT-2 Tokenizer<br/>Randomness Analysis] | |
| ENT --> EF1[Character Entropy<br/>Word Entropy<br/>Token Entropy] | |
| ENT --> EF2[Token Diversity<br/>Sequence Unpredictability<br/>Pattern Detection] | |
| SEM[Semantic Metric<br/>MiniLM Embeddings<br/>Coherence Analysis] | |
| SEM --> SF3[Sentence Similarity<br/>Topic Consistency<br/>Coherence Score] | |
| SEM --> SF4[Repetition Detection<br/>Topic Drift<br/>Contextual Consistency] | |
| LING[Linguistic Metric<br/>spaCy NLP<br/>Grammar Analysis] | |
| LING --> LF1[POS Diversity<br/>POS Entropy<br/>Syntactic Complexity] | |
| LING --> LF2[Grammatical Patterns<br/>Writing Style<br/>Pattern Detection] | |
| MPS[Multi-Perturbation<br/>GPT-2 + DistilRoBERTa<br/>Stability Analysis] | |
| MPS --> MF1[Text Perturbation<br/>Likelihood Calculation<br/>Stability Score] | |
| MPS --> MF2[Curvature Analysis<br/>Chunk Stability<br/>Variance Scoring] | |
| end | |
| end | |
| BASE -.->|inherited by| STRUCT | |
| BASE -.->|inherited by| PERP | |
| BASE -.->|inherited by| ENT | |
| BASE -.->|inherited by| SEM | |
| BASE -.->|inherited by| LING | |
| BASE -.->|inherited by| MPS | |
| style BASE fill:#ffebee | |
| style STRUCT fill:#e8f5e9 | |
| style PERP fill:#fff3e0 | |
| style ENT fill:#e1f5ff | |
| style SEM fill:#f3e5f5 | |
| style LING fill:#fce4ec | |
| style MPS fill:#fff9c4 | |
| ``` | |
| **Metric Characteristics:** | |
| | Metric | Model Required | Complexity | Typical Influence Range (Indicative) | | |
| |--------|---------------|------------|--------------| | |
| | Structural | ❌ | Low | 15-20% | | |
| | Perplexity | GPT-2 | Medium | 20-27% | | |
| | Entropy | GPT-2 Tokenizer | Medium | 13-17% | | |
| | Semantic | MiniLM | Medium | 18-20% | | |
| | Linguistic | spaCy | Medium | 12-16% | | |
| | MPS | GPT-2 + DistilRoBERTa | High | 8-10% | | |
| > *Actual weights are dynamically calibrated per domain and configuration.* | |
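Several of the features named in the diagram above reduce to short formulas. The sketch below shows one plausible reading of each, using only the standard library: burstiness as the coefficient of variation of sentence lengths, type-token ratio, Shannon word entropy, and perplexity as the exponential of the mean negative log-likelihood. These definitions are assumptions; the document does not specify the exact formulas TEXT-AUTH uses, and the real perplexity metric obtains token log-probabilities from GPT-2 rather than taking them as input.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, using a naive punctuation-based splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(lengths: list[int]) -> float:
    """Coefficient of variation of sentence lengths (0 means uniform)."""
    mu = mean(lengths)
    return pstdev(lengths) / mu if mu else 0.0


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words."""
    tokens = _tokens(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def word_entropy(text: str) -> float:
    """Shannon entropy (bits) of the word frequency distribution."""
    tokens = _tokens(text)
    if not tokens:
        return 0.0
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in Counter(tokens).values())


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-mean(token_log_probs))


sample = "Short one. This sentence is quite a bit longer than the first!"
print(sentence_lengths(sample), round(burstiness(sentence_lengths(sample)), 3))
print(type_token_ratio("the cat the dog"), word_entropy("a b a b"))
```

Low burstiness and low perplexity are commonly associated with machine-generated text, but each value is only one signal and is meant to be combined with the others.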
| --- | |
| ### 5. Service Layer (`services/`) | |
| Coordinates ensemble aggregation, highlighting, reasoning generation, and orchestration. | |
| ```mermaid | |
| graph TB | |
| subgraph "Service Layer" | |
| direction TB | |
| subgraph "Orchestrator" | |
| ORCH[Detection Orchestrator<br/>Pipeline Coordinator] | |
| ORCH --> PIPE[Processing Pipeline<br/>6-Step Execution] | |
| PIPE --> STEP1[1. Text Preprocessing] | |
| PIPE --> STEP2[2. Language Detection] | |
| PIPE --> STEP3[3. Domain Classification] | |
| PIPE --> STEP4[4. Metric Execution<br/>Parallel/Sequential] | |
| PIPE --> STEP5[5. Ensemble Aggregation] | |
| PIPE --> STEP6[6. Result Compilation] | |
| end | |
| subgraph "Ensemble Classifier" | |
| ENSEMBLE[Ensemble Classifier<br/>Multi-Strategy Aggregation] | |
| ENSEMBLE --> METHOD1[Confidence Calibrated<br/>Sigmoid Weighting] | |
| ENSEMBLE --> METHOD2[Consensus Based<br/>Agreement Rewards] | |
| ENSEMBLE --> METHOD3[Domain Weighted<br/>Static Weights] | |
| ENSEMBLE --> METHOD4[Simple Average<br/>Fallback] | |
| ENSEMBLE --> CALC[Uncertainty Quantification<br/>Consensus Analysis<br/>Confidence Scoring] | |
| end | |
| subgraph "Highlighter" | |
| HIGHLIGHT[Text Highlighter<br/>Sentence-Level Analysis] | |
| HIGHLIGHT --> COLORS[4-Color System<br/>Authentic/Uncertain<br/>Hybrid/Synthetic] | |
| HIGHLIGHT --> SENTENCE[Sentence Ensemble<br/>Domain Adjustments<br/>Tooltip Generation] | |
| end | |
| subgraph "Reasoning" | |
| REASON[Reasoning Generator<br/>Explainable AI] | |
| REASON --> SUMMARY[Executive Summary<br/>Verdict Explanation] | |
| REASON --> INDICATORS[Key Indicators<br/>Metric Breakdown] | |
| REASON --> EVIDENCE[Supporting Evidence<br/>Contradicting Evidence] | |
| REASON --> RECOM[Recommendations<br/>Uncertainty Analysis] | |
| end | |
| end | |
| ORCH -->|coordinates| ENSEMBLE | |
| ORCH -->|uses| HIGHLIGHT | |
| ORCH -->|uses| REASON | |
| ENSEMBLE -->|provides| HIGHLIGHT | |
| ENSEMBLE -->|provides| REASON | |
| style ORCH fill:#fff3e0 | |
| style ENSEMBLE fill:#e3f2fd | |
| style HIGHLIGHT fill:#f3e5f5 | |
| style REASON fill:#e8f5e9 | |
| ``` | |
| **Service Features:** | |
| - **Parallel Execution**: ThreadPoolExecutor for metric computation | |
| - **Ensemble Methods**: 4 aggregation strategies with fallbacks | |
| - **Sentence Highlighting**: 4-category color system (Authentic/Uncertain/Hybrid/Synthetic) | |
| - **Explainable AI**: Detailed reasoning with metric contributions | |
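The parallel execution feature can be sketched with `concurrent.futures.ThreadPoolExecutor`, which the list above names. The `run_metrics` function, its signature, and the toy metrics are hypothetical; the point is that one failing metric is recorded as an error instead of aborting the whole run, in line with the fail-safe design principle.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_metrics(
    metrics: dict[str, Callable[[str], float]],
    text: str,
    max_workers: int = 6,
) -> tuple[dict[str, float], dict[str, str]]:
    """Run every metric in its own thread; collect results and errors."""
    results: dict[str, float] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, text): name for name, fn in metrics.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # isolate the failure to this metric
                errors[name] = str(exc)
    return results, errors


def _unavailable(text: str) -> float:
    raise RuntimeError("model missing")


# Toy stand-ins for real metrics.
results, errors = run_metrics(
    {
        "length": lambda t: float(len(t)),
        "words": lambda t: float(len(t.split())),
        "broken": _unavailable,
    },
    "two words",
)
print(results, errors)
```

The ensemble can then aggregate whatever is in `results` and report `errors` alongside the verdict.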
| --- | |
| ### 6. Reporter Layer (`reporter/`) | |
| Generates comprehensive reports in multiple formats. | |
| ```mermaid | |
| graph TB | |
| subgraph "Reporter Layer" | |
| direction TB | |
| REPORT[Report Generator] | |
| subgraph "JSON Report" | |
| JSON[Structured JSON] | |
| JSON --> META[Report Metadata<br/>Timestamp<br/>Version] | |
| JSON --> RESULTS[Overall Results<br/>Probabilities<br/>Confidence] | |
| JSON --> METRICS[Detailed Metrics<br/>Sub-metrics<br/>Weights] | |
| JSON --> REASONING[Detection Reasoning<br/>Evidence<br/>Recommendations] | |
| JSON --> HIGHLIGHT[Highlighted Sentences<br/>Color Classes<br/>Probabilities] | |
| JSON --> PERF[Performance Metrics<br/>Execution Times<br/>Warnings/Errors] | |
| end | |
| subgraph "PDF Report" | |
| PDF[Professional PDF] | |
| PDF --> PAGE1[Page 1: Executive Summary<br/>Verdict, Stats, Reasoning] | |
| PDF --> PAGE2[Page 2: Content Analysis<br/>Domain, Metrics, Weights] | |
| PDF --> PAGE3[Page 3: Structural & Entropy] | |
| PDF --> PAGE4[Page 4: Perplexity & Semantic] | |
| PDF --> PAGE5[Page 5: Linguistic & MPS] | |
| PDF --> PAGE6[Page 6: Recommendations] | |
| STYLE[Premium Styling] | |
| STYLE --> COLORS[Color Scheme<br/>Blue/Green/Red/Purple] | |
| STYLE --> TABLES[Professional Tables<br/>Charts, Metrics] | |
| STYLE --> LAYOUT[Multi-Page Layout<br/>Headers, Footers] | |
| end | |
| end | |
| REPORT -->|generates| JSON | |
| REPORT -->|generates| PDF | |
| PDF -->|uses| STYLE | |
| style REPORT fill:#fff3e0 | |
| style JSON fill:#e8f5e9 | |
| style PDF fill:#e3f2fd | |
| style STYLE fill:#f3e5f5 | |
| ``` | |
| **Report Formats:** | |
| - **JSON**: Machine-readable with complete data | |
| - **PDF**: Human-readable with professional formatting | |
| - **Charts**: Pie charts for probability distribution | |
| - **Tables**: Metric contributions, detailed sub-metrics | |
| - **Styling**: Color-coded, multi-page layout with branding | |
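As a rough illustration of the JSON report sections shown in the diagram (metadata, overall results, metrics, reasoning, highlights, performance), the sketch below assembles and serializes such a structure. All key names, the `build_report` helper, and the placeholder version string are assumptions; the actual report schema is not given in this document.

```python
import json
from datetime import datetime, timezone


def build_report(verdict: str, probabilities: dict, metrics: dict,
                 reasoning: list, highlights: list, elapsed_s: float) -> dict:
    """Assemble a report dictionary mirroring the sections in the diagram."""
    return {
        "metadata": {
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "version": "0.0.0-example",  # placeholder, not a real version
        },
        "overall": {"verdict": verdict, "probabilities": probabilities},
        "metrics": metrics,
        "reasoning": reasoning,
        "highlights": highlights,
        "performance": {"total_seconds": elapsed_s, "warnings": []},
    }


# Made-up values, for illustration only.
report = build_report(
    verdict="Hybrid",
    probabilities={"synthetic": 0.41, "authentic": 0.31, "hybrid": 0.28},
    metrics={"structural": {"synthetic_probability": 0.44, "weight": 0.18}},
    reasoning=["Metrics disagree on the middle paragraphs."],
    highlights=[{"sentence": "Example sentence.", "color": "uncertain",
                 "synthetic_probability": 0.52}],
    elapsed_s=9.7,
)
payload = json.dumps(report, indent=2)
print(payload)
```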
| --- | |
| ## Data Flow | |
| ### Complete Detection Pipeline | |
| ```mermaid | |
| sequenceDiagram | |
| participant User | |
| participant Orchestrator | |
| participant Processors | |
| participant Metrics | |
| participant Ensemble | |
| participant Services | |
| participant Reporter | |
| User->>Orchestrator: analyze(text) | |
| Note over Orchestrator: Step 1: Preprocessing | |
| Orchestrator->>Processors: TextProcessor.process() | |
| Processors-->>Orchestrator: ProcessedText | |
| Note over Orchestrator: Step 2: Language Detection | |
| Orchestrator->>Processors: LanguageDetector.detect() | |
| Processors-->>Orchestrator: LanguageResult | |
| Note over Orchestrator: Step 3: Domain Classification | |
| Orchestrator->>Processors: DomainClassifier.classify() | |
| Processors-->>Orchestrator: DomainPrediction | |
| Note over Orchestrator: Step 4: Parallel Metric Execution | |
| par Structural | |
| Orchestrator->>Metrics: Structural.compute() | |
| Metrics-->>Orchestrator: MetricResult | |
| and Perplexity | |
| Orchestrator->>Metrics: Perplexity.compute() | |
| Metrics-->>Orchestrator: MetricResult | |
| and Entropy | |
| Orchestrator->>Metrics: Entropy.compute() | |
| Metrics-->>Orchestrator: MetricResult | |
| and Semantic | |
| Orchestrator->>Metrics: Semantic.compute() | |
| Metrics-->>Orchestrator: MetricResult | |
| and Linguistic | |
| Orchestrator->>Metrics: Linguistic.compute() | |
| Metrics-->>Orchestrator: MetricResult | |
| and MPS | |
| Orchestrator->>Metrics: MPS.compute() | |
| Metrics-->>Orchestrator: MetricResult | |
| end | |
| Note over Orchestrator: Step 5: Ensemble Aggregation | |
| Orchestrator->>Ensemble: predict(metric_results, domain) | |
| Ensemble-->>Orchestrator: EnsembleResult | |
| Note over Orchestrator: Step 6: Services | |
| Orchestrator->>Services: generate_highlights() | |
| Services-->>Orchestrator: HighlightedSentences | |
| Orchestrator->>Services: generate_reasoning() | |
| Services-->>Orchestrator: DetailedReasoning | |
| Orchestrator->>Reporter: generate_report() | |
| Reporter-->>Orchestrator: Report Files | |
| Orchestrator-->>User: DetectionResult | |
| ``` | |
| ### Ensemble Aggregation Flow | |
| ```mermaid | |
| graph TD | |
| START[Metric Results] --> FILTER[Filter Valid Metrics<br/>Remove Errors] | |
| FILTER --> WEIGHTS[Get Domain Weights<br/>Base Weights] | |
| WEIGHTS --> METHOD{Primary Method?} | |
| METHOD -->|Confidence Calibrated| CONF[Sigmoid Confidence<br/>Adjustment] | |
| METHOD -->|Consensus Based| CONS[Agreement<br/>Calculation] | |
| METHOD -->|Domain Weighted| DOMAIN[Static Domain<br/>Weights] | |
| CONF --> AGGREGATE[Weighted Aggregation] | |
| CONS --> AGGREGATE | |
| DOMAIN --> AGGREGATE | |
| AGGREGATE --> NORMALIZE[Normalize to 1.0] | |
| NORMALIZE --> CALC[Calculate Metrics] | |
| CALC --> CONFIDENCE[Overall Confidence<br/>Base + Agreement<br/>+ Certainty + Quality] | |
| CALC --> UNCERTAINTY[Uncertainty Score<br/>Variance + Confidence<br/>+ Decision] | |
| CALC --> CONSENSUS[Consensus Level<br/>Std Dev Analysis] | |
| CONFIDENCE --> THRESHOLD[Apply Adaptive<br/>Threshold] | |
| UNCERTAINTY --> THRESHOLD | |
| THRESHOLD --> VERDICT{Verdict} | |
| VERDICT -->|Synthetic >= 0.6| SYNTH[Synthetically-Generated] | |
| VERDICT -->|Authentic >= 0.6| AUTH[Authentically-Written] | |
| VERDICT -->|Hybrid > 0.25| HYBRID[Hybrid] | |
| VERDICT -->|Uncertain| UNC[Uncertain] | |
| SYNTH --> REASON[Generate Reasoning] | |
| AUTH --> REASON | |
| HYBRID --> REASON | |
| UNC --> REASON | |
| REASON --> RESULT[EnsembleResult] | |
| style START fill:#e8f5e9 | |
| style RESULT fill:#e3f2fd | |
| style SYNTH fill:#ffebee | |
| style AUTH fill:#e8f5e9 | |
| style HYBRID fill:#fff3e0 | |
| style UNC fill:#f5f5f5 | |
| ``` | |
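The confidence-calibrated path through the flow above can be sketched as follows: each metric's base weight is scaled by a sigmoid of its confidence, the weights are renormalized to sum to 1.0, and the verdict thresholds from the diagram (0.6 and 0.25) are applied. The sigmoid steepness and midpoint, the order in which the thresholds are checked, and every function name here are assumptions, not the project's actual code.

```python
import math


def sigmoid(x: float, steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Maps confidence in [0, 1] to a multiplier in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def aggregate(
    results: dict[str, tuple[float, float]],   # name -> (synthetic_prob, confidence)
    base_weights: dict[str, float],
) -> tuple[float, dict[str, float]]:
    """Confidence-adjusted weighted mean of per-metric synthetic probabilities."""
    if not results:
        raise ValueError("no valid metric results to aggregate")
    adjusted = {
        name: base_weights[name] * sigmoid(confidence)
        for name, (_, confidence) in results.items()
    }
    total = sum(adjusted.values())
    weights = {name: w / total for name, w in adjusted.items()}  # sums to 1.0
    synthetic = sum(weights[name] * results[name][0] for name in results)
    return synthetic, weights


def verdict(synthetic: float, authentic: float, hybrid: float) -> str:
    """Apply the diagram's thresholds; the check order is an assumption."""
    if synthetic >= 0.6:
        return "Synthetically-Generated"
    if authentic >= 0.6:
        return "Authentically-Written"
    if hybrid > 0.25:
        return "Hybrid"
    return "Uncertain"


score, weights = aggregate(
    {"perplexity": (0.9, 0.9), "linguistic": (0.1, 0.1)},
    {"perplexity": 0.5, "linguistic": 0.5},
)
print(round(score, 3), {k: round(v, 3) for k, v in weights.items()})
```

With equal base weights, the high-confidence metric dominates the result, which is the intended effect of confidence calibration.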
| --- | |
| ## Technology Stack | |
| ### Core Technologies | |
| ```mermaid | |
| graph LR | |
| subgraph "Language & Runtime" | |
| PYTHON[Python 3.10+] | |
| CONDA[Conda Environment] | |
| end | |
| subgraph "ML Frameworks" | |
| TORCH[PyTorch] | |
| HF[HuggingFace Transformers] | |
| SPACY[spaCy] | |
| SKLEARN[scikit-learn] | |
| end | |
| subgraph "NLP Models" | |
| GPT2[GPT-2<br/>Perplexity/MPS] | |
| MINILM[MiniLM-L6-v2<br/>Semantic] | |
| ROBERTA[RoBERTa<br/>Domain Classify] | |
| DISTIL[DistilRoBERTa<br/>MPS Mask] | |
| XLM[XLM-RoBERTa<br/>Language Detect] | |
| SPACYMODEL[en_core_web_sm<br/>Linguistic] | |
| end | |
| subgraph "Document Processing" | |
| PYMUPDF[PyMuPDF] | |
| PDFPLUMBER[pdfplumber] | |
| PYPDF2[PyPDF2] | |
| DOCX[python-docx] | |
| BS4[BeautifulSoup4] | |
| end | |
| subgraph "Utilities" | |
| NUMPY[NumPy] | |
| PYDANTIC[Pydantic] | |
| LOGURU[Loguru] | |
| REPORTLAB[ReportLab] | |
| end | |
| PYTHON --> TORCH | |
| TORCH --> HF | |
| HF --> GPT2 | |
| HF --> MINILM | |
| HF --> ROBERTA | |
| HF --> DISTIL | |
| HF --> XLM | |
| PYTHON --> SPACY | |
| SPACY --> SPACYMODEL | |
| style PYTHON fill:#306998 | |
| style TORCH fill:#ee4c2c | |
| style HF fill:#ff6f00 | |
| style SPACY fill:#09a3d5 | |
| ``` | |
| ### Dependencies Summary | |
| | Category | Libraries | Purpose | | |
| |----------|-----------|---------| | |
| | **ML Core** | PyTorch, Transformers, spaCy | Model execution, NLP | | |
| | **Document** | PyMuPDF, pdfplumber, python-docx | Multi-format extraction | | |
| | **Analysis** | NumPy, scikit-learn | Numerical computation | | |
| | **Validation** | Pydantic | Data validation | | |
| | **Logging** | Loguru | Structured logging | | |
| | **Reporting** | ReportLab | PDF generation | | |
| --- | |
| ## Deployment Architecture | |
| ```mermaid | |
| graph TB | |
| subgraph "Deployment Options" | |
| direction TB | |
| subgraph "Standalone Application" | |
| SCRIPT[Python Scripts] | |
| end | |
| subgraph "Web Application" | |
| FASTAPI[FastAPI Server] | |
| end | |
| subgraph "API Service" | |
| REST[REST API Endpoints] | |
| BATCH[Batch Processing] | |
| ASYNC[Async Workers] | |
| end | |
| subgraph "Infrastructure" | |
| DOCKER[Docker Container] | |
| GPU[GPU Support<br/>Optional] | |
| STORAGE[Model Cache<br/>2.8GB] | |
| end | |
| end | |
| FASTAPI --> DOCKER | |
| REST --> DOCKER | |
| DOCKER --> GPU | |
| DOCKER --> STORAGE | |
| style FASTAPI fill:#e3f2fd | |
| style DOCKER fill:#2496ed | |
| style GPU fill:#76b900 | |
| ``` | |
| ### System Requirements | |
| - **Python**: 3.10+ | |
| - **RAM**: 8GB minimum, 16GB recommended | |
| - **Storage**: 5GB (models + data) | |
| - **GPU**: Optional (CUDA/MPS for faster inference) | |
| - **CPU**: 4+ cores for parallel execution | |
| --- | |
| ## Performance Characteristics | |
| ### Execution Modes | |
| ```mermaid | |
| graph LR | |
| subgraph "Sequential Mode" | |
| S1[Metric 1] --> S2[Metric 2] | |
| S2 --> S3[Metric 3] | |
| S3 --> S4[Metric 4] | |
| S4 --> S5[Metric 5] | |
| S5 --> S6[Metric 6] | |
| S6 --> SRESULT[~15-25s] | |
| end | |
| subgraph "Parallel Mode" | |
| P1[Metric 1] | |
| P2[Metric 2] | |
| P3[Metric 3] | |
| P4[Metric 4] | |
| P5[Metric 5] | |
| P6[Metric 6] | |
| P1 --> PRESULT[~8-12s] | |
| P2 --> PRESULT | |
| P3 --> PRESULT | |
| P4 --> PRESULT | |
| P5 --> PRESULT | |
| P6 --> PRESULT | |
| end | |
| style SRESULT fill:#ffebee | |
| style PRESULT fill:#e8f5e9 | |
| ``` | |
| ### Metric Execution Times | |
| | Metric | Avg Time | Complexity | Model Size | | |
| |--------|----------|------------|------------| | |
| | Structural | 0.5-1s | Low | 0MB | | |
| | Perplexity | 2-4s | Medium | 548MB | | |
| | Entropy | 1-2s | Medium | ~50MB (shared) | | |
| | Semantic | 3-5s | Medium | 80MB | | |
| | Linguistic | 2-3s | Medium | 13MB | | |
| | MPS | 5-10s | High | 878MB (GPT-2 + DistilRoBERTa) | | |
| **Total Sequential**: ~15-25 seconds | |
| **Total Parallel**: ~8-12 seconds (limited by slowest metric) | |
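The totals can be cross-checked against the per-metric table with a few lines of arithmetic. Sequential time is the sum of the per-metric ranges; parallel time is bounded below by the slowest metric. The dictionary simply restates the table above; the variable names are illustrative.

```python
# (low, high) seconds per metric, copied from the table above.
METRIC_TIMES = {
    "Structural": (0.5, 1.0),
    "Perplexity": (2.0, 4.0),
    "Entropy": (1.0, 2.0),
    "Semantic": (3.0, 5.0),
    "Linguistic": (2.0, 3.0),
    "MPS": (5.0, 10.0),
}

lows = [low for low, _ in METRIC_TIMES.values()]
highs = [high for _, high in METRIC_TIMES.values()]

sequential = (sum(lows), sum(highs))      # every metric runs back to back
parallel_floor = (max(lows), max(highs))  # bounded by the slowest metric

print(f"sequential:     {sequential[0]}-{sequential[1]} s")
print(f"parallel floor: {parallel_floor[0]}-{parallel_floor[1]} s")
```

The sequential sum (13.5 to 25 s) is in line with the quoted total. The parallel floor (5 to 10 s) sits below the quoted 8 to 12 s; one plausible explanation, offered here as an interpretation and not a measured result, is overhead from thread scheduling and contention on the shared GPT-2 model.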
| --- | |
| ## Security & Privacy | |
| ### Data Handling | |
| ```mermaid | |
| graph TD | |
| INPUT[Text Input] --> PROCESS[Processing] | |
| PROCESS --> MEMORY[In-Memory Only] | |
| MEMORY --> ANALYSIS[Analysis] | |
| ANALYSIS --> CLEANUP[Auto Cleanup] | |
| MODELS[Model Cache] -.->|Read Only| ANALYSIS | |
| REPORTS[Optional Reports] --> STORAGE[Local Storage Only] | |
| CLEANUP --> DISCARD[Data Discarded] | |
| style INPUT fill:#e3f2fd | |
| style MEMORY fill:#fff3e0 | |
| style CLEANUP fill:#e8f5e9 | |
| style DISCARD fill:#ffebee | |
| ``` | |
| ### Security Features | |
| - ✅ **No External Data Transmission**: All processing local | |
| - ✅ **No Data Persistence**: Text data not stored by default | |
| - ✅ **Model Integrity**: Checksums for downloaded models | |
| - ✅ **Input Validation**: Pydantic schemas for all inputs | |
| - ✅ **Error Isolation**: Graceful degradation, no information leakage | |
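The model-integrity check can be illustrated with a streamed SHA-256 comparison from the standard library. This is a sketch under assumptions: the document does not say which hash algorithm is used or where expected checksums are stored, and the function names are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Iterable, Iterator


def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Hash a stream of byte chunks without holding the whole file in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path: Path, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the file's contents in chunks of up to chunk_size bytes."""
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def verify_model_file(path: Path, expected_hex: str) -> bool:
    """Return True only if the file's SHA-256 matches the expected digest."""
    actual = sha256_hex(read_chunks(path))
    return hmac.compare_digest(actual, expected_hex.lower())


# Chunked hashing gives the same digest as hashing the bytes in one call.
print(sha256_hex([b"model ", b"weights"]) == hashlib.sha256(b"model weights").hexdigest())
```

A loader would call `verify_model_file` after download and refuse to load a model whose digest does not match.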
| --- | |
| > This system does not claim ground truth authorship. It estimates probabilistic authenticity signals based on measurable text properties. |