
TEXT-AUTH API Documentation

Overview

The TEXT-AUTH API provides evidence-based text forensics and statistical consistency assessment through a RESTful interface. This document covers all endpoints, request/response formats, authentication, rate limiting, and integration examples.

API Version: 1.0.0


Table of Contents

  1. Authentication & Security
  2. Rate Limiting
  3. Common Response Format
  4. Error Handling
  5. Core Endpoints
  6. Report Endpoints
  7. Utility Endpoints
  8. Best Practices

Authentication & Security

API Key Authentication

Authentication is not enforced in the current deployment. API key authentication may be added in future versions.

Rate Limiting

Rate limiting is not enforced at the application level. Deployments should use an external gateway (NGINX, API Gateway, Cloudflare) to enforce rate limits if required.


Common Response Format

All successful responses follow this structure:

{
  "status": "success",
  "analysis_id": "...",
  "detection_result": {...},
  "highlighted_html": "...",
  "reasoning": {...},
  "processing_time": 2.34,
  "timestamp": "..."
}

HTTP Status Codes

| Code | Meaning | Description |
|------|---------|-------------|
| 200 | OK | Request succeeded |
| 201 | Created | Resource created successfully |
| 400 | Bad Request | Invalid request parameters |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server error |
| 503 | Service Unavailable | Service temporarily unavailable |

Error Handling

Error Response Format

{
  "status": "error",
  "error": "Invalid domain...",
  "timestamp": "..."
}

Common Error Codes

| Code | Description | Resolution |
|------|-------------|------------|
| TEXT_TOO_LONG | Text exceeds maximum length (50,000 chars) | Split into multiple requests |
| FILE_TOO_LARGE | File exceeds size limit | Compress or split the file |
| UNSUPPORTED_FORMAT | File format not supported | Use .txt, .pdf, .docx, .doc, or .md |
| EXTRACTION_FAILED | Document text extraction failed | Ensure the file is not corrupted or password-protected |
| MODEL_UNAVAILABLE | Required model temporarily unavailable | Retry after a few minutes |
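A client-side pre-check can avoid round trips for the size-related errors above. A minimal sketch (the `TEXT_TOO_SHORT` code is a hypothetical name for symmetry; the API reports short text with a plain error message, as shown in the batch example later):

```python
from typing import Optional

def validate_text(text: str, min_chars: int = 50, max_chars: int = 50_000) -> Optional[str]:
    """Return an error code if the text would be rejected, else None.

    Limits mirror the documented 50-50,000 character range.
    """
    if len(text) < min_chars:
        return "TEXT_TOO_SHORT"   # hypothetical code; server returns a plain message
    if len(text) > max_chars:
        return "TEXT_TOO_LONG"    # split into multiple requests instead
    return None
```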

Core Endpoints

Text Analysis

Endpoint: POST /api/analyze

Analyze raw text for statistical consistency patterns and forensic signals.

Request

Headers:

Content-Type: application/json

Body:

{
  "text": "Your text content here...",
  "domain": "academic",
  "enable_highlighting": true,
  "skip_expensive_metrics": false,
  "use_sentence_level": true,
  "include_metrics_summary": true,
  "generate_report": false
}

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| text | string | Yes | - | Text to analyze (50-50,000 chars) |
| domain | string | No | null (auto-detect) | Content domain (see Domains) |
| enable_highlighting | boolean | No | true | Generate sentence-level highlights |
| skip_expensive_metrics | boolean | No | false | Skip computationally expensive metrics for faster results |
| use_sentence_level | boolean | No | true | Use sentence-level granularity for highlighting |
| include_metrics_summary | boolean | No | true | Include metric summaries in highlights |
| generate_report | boolean | No | false | Generate a downloadable PDF/JSON report |
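The request above can be sketched in Python using only the standard library (the base URL is a placeholder; the request is constructed here but not sent):

```python
import json
import urllib.request

def build_analyze_request(text: str, domain: str = None,
                          base_url: str = "https://your-domain.com"):
    """Construct (but do not send) a POST /api/analyze request."""
    payload = {
        "text": text,
        "enable_highlighting": True,
        "skip_expensive_metrics": False,
    }
    if domain is not None:
        payload["domain"] = domain        # omit to let the API auto-detect
    return urllib.request.Request(
        url=f"{base_url}/api/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_analyze_request("Your text content here...", domain="academic")
# urllib.request.urlopen(req) would perform the call and return the JSON body.
```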

Response

{
  "status": "success",
  "analysis_id": "analysis_1735555800000",
  "detection_result": {
    "ensemble_result": {
      "final_verdict": "Synthetic",
      "overall_confidence": 0.89,
      "synthetic_probability": 0.92,
      "authentic_probability": 0.08,
      "uncertainty_score": 0.23,
      "decision_boundary_distance": 0.42
    },
    "metric_results": {
      "perplexity": {
        "synthetic_probability": 0.94,
        "confidence": 0.91,
        "raw_score": 15.23,
        "evidence_strength": "strong"
      },
      "entropy": {
        "synthetic_probability": 0.88,
        "confidence": 0.85,
        "raw_score": 4.67,
        "evidence_strength": "moderate"
      },
      "structural": {
        "synthetic_probability": 0.91,
        "confidence": 0.87,
        "burstiness": -0.12,
        "uniformity": 0.85,
        "evidence_strength": "strong"
      },
      "linguistic": {
        "synthetic_probability": 0.86,
        "confidence": 0.82,
        "pos_diversity": 0.42,
        "mean_tree_depth": 4.2,
        "evidence_strength": "moderate"
      },
      "semantic": {
        "synthetic_probability": 0.93,
        "confidence": 0.88,
        "coherence_mean": 0.91,
        "coherence_variance": 0.03,
        "evidence_strength": "strong"
      },
      "multi_perturbation_stability": {
        "synthetic_probability": 0.89,
        "confidence": 0.84,
        "stability_score": 0.12,
        "evidence_strength": "moderate"
      }
    },
    "domain_prediction": {
      "primary_domain": "academic",
      "confidence": 0.94,
      "alternative_domains": [
        {"domain": "technical_doc", "probability": 0.23},
        {"domain": "science", "probability": 0.18}
      ]
    },
    "processed_text": {
      "word_count": 487,
      "sentence_count": 23,
      "paragraph_count": 5,
      "avg_sentence_length": 21.2,
      "language": "en"
    }
  },
  "highlighted_html": "<div class=\"text-forensics-highlight\">...</div>",
  "reasoning": {
    "summary": "The text exhibits strong statistical consistency patterns typical of language model generation...",
    "key_indicators": [
      "Unusually uniform sentence structure (burstiness: -0.12)",
      "High semantic coherence across all sentences (mean: 0.91)",
      "Low perplexity variance indicating predictable token sequences"
    ],
    "confidence_factors": {
      "supporting_evidence": [
        "6/6 metrics indicate synthetic patterns",
        "Strong cross-metric agreement (correlation: 0.87)"
      ],
      "uncertainty_sources": [
        "Domain-specific terminology may affect baseline expectations"
      ]
    },
    "metric_contributions": {
      "perplexity": 0.28,
      "entropy": 0.19,
      "structural": 0.16,
      "semantic": 0.17,
      "linguistic": 0.12,
      "multi_perturbation_stability": 0.08
    }
  },
  "report_files": null,
  "processing_time": 2.34,
  "timestamp": "2025-12-30T10:30:00Z"
}

Verdict Interpretation

| Verdict | Probability Range | Interpretation |
|---------|-------------------|----------------|
| Synthetic | > 0.70 | High consistency with language model generation patterns |
| Likely Synthetic | 0.55 - 0.70 | Moderate consistency with synthetic patterns |
| Inconclusive | 0.45 - 0.55 | Insufficient evidence for confident assessment |
| Likely Authentic | 0.30 - 0.45 | Moderate consistency with human authorship patterns |
| Authentic | < 0.30 | High consistency with human authorship patterns |

Important: These verdicts represent statistical consistency assessments, not definitive authorship claims.
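The verdict bands can be expressed as a small helper. This is a sketch; how the exact boundary values (0.70, 0.55, 0.45, 0.30) are assigned to a band is an assumption, since the table does not specify which side the endpoints fall on:

```python
def interpret_verdict(synthetic_probability: float) -> str:
    """Map a synthetic probability to the documented verdict bands."""
    if synthetic_probability > 0.70:
        return "Synthetic"
    if synthetic_probability > 0.55:
        return "Likely Synthetic"
    if synthetic_probability >= 0.45:
        return "Inconclusive"
    if synthetic_probability >= 0.30:
        return "Likely Authentic"
    return "Authentic"
```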

Highlighting Color Key

| Color | Meaning | Probability Range |
|-------|---------|-------------------|
| 🔴 Red | Strong synthetic signals | > 0.80 |
| 🟠 Orange | Moderate synthetic signals | 0.60 - 0.80 |
| 🟡 Yellow | Weak signals | 0.40 - 0.60 |
| 🟢 Green | Authentic signals | < 0.40 |

File Analysis

Endpoint: POST /api/analyze/file

Analyze uploaded documents (PDF, DOCX, DOC, TXT, MD).

Request

Headers:

Content-Type: multipart/form-data

Body (form-data):

file: [binary file data]
domain: "academic"
skip_expensive_metrics: false
use_sentence_level: true
include_metrics_summary: true
generate_report: false

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file | file | Yes | - | Document file (max 25MB) |
| domain | string | No | null | Content domain override |
| skip_expensive_metrics | boolean | No | false | Skip expensive metrics |
| use_sentence_level | boolean | No | true | Sentence-level highlighting |
| include_metrics_summary | boolean | No | true | Include metric summaries |
| generate_report | boolean | No | false | Generate report |

Supported File Formats

| Format | Extensions | Max Size | Notes |
|--------|------------|----------|-------|
| Plain Text | .txt, .md | 25MB | UTF-8 encoding recommended |
| PDF | .pdf | 25MB | Text-based PDFs; OCR not supported |
| Word | .docx, .doc | 25MB | Modern and legacy formats |

Response

Same structure as Text Analysis with additional file_info:

{
  "status": "success",
  "analysis_id": "file_1735555800000",
  "file_info": {
    "filename": "research_paper.pdf",
    "file_type": ".pdf",
    "pages": 12,
    "extraction_method": "pdfplumber",
    "highlighted_html": true
  },
  "detection_result": { /* same as text analysis */ },
  "highlighted_html": "...",
  "reasoning": { /* same as text analysis */ },
  "processing_time": 4.12,
  "timestamp": "2025-12-30T10:30:00Z"
}

cURL Example

curl -X POST https://your-domain.com/api/analyze/file \
  -F "file=@/path/to/document.pdf" \
  -F "domain=academic" \
  -F "generate_report=true"

Batch Analysis

Endpoint: POST /api/analyze/batch

Analyze multiple texts in a single request for efficiency.

Request

{
  "texts": [
    "First text to analyze...",
    "Second text to analyze...",
    "Third text to analyze..."
  ],
  "domain": "academic",
  "skip_expensive_metrics": true,
  "generate_reports": false
}

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| texts | array[string] | Yes | - | 1-100 texts to analyze |
| domain | string | No | null | Apply the same domain to all texts |
| skip_expensive_metrics | boolean | No | true | Skip expensive metrics (recommended for batch) |
| generate_reports | boolean | No | false | Generate reports for each text |

Response

{
  "status": "success",
  "batch_id": "batch_1735555800000",
  "total": 3,
  "successful": 3,
  "failed": 0,
  "results": [
    {
      "index": 0,
      "status": "success",
      "detection": {
        "ensemble_result": { /* ... */ },
        "metric_results": { /* ... */ }
      },
      "reasoning": { /* ... */ },
      "report_files": null
    },
    {
      "index": 1,
      "status": "success",
      "detection": { /* ... */ }
    },
    {
      "index": 2,
      "status": "error",
      "error": "Text too short (minimum 50 characters)"
    }
  ],
  "processing_time": 8.92,
  "timestamp": "2025-12-30T10:30:00Z"
}

Performance Tips

  • Set skip_expensive_metrics: true for faster batch processing
  • Keep batch size under 50 texts for optimal performance
  • For more than 100 texts, split into multiple batch requests and send them in parallel (the API accepts at most 100 texts per batch)
  • Monitor processing_time to adjust batch sizes
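The tips above can be combined into a helper that splits a corpus into compliant batch payloads. A sketch; the field names match the batch request shown earlier:

```python
def make_batch_payloads(texts, batch_size=50, domain=None):
    """Split `texts` into POST /api/analyze/batch payloads of at most
    `batch_size` texts each (50 is the recommended batch size)."""
    payloads = []
    for start in range(0, len(texts), batch_size):
        payload = {
            "texts": texts[start:start + batch_size],
            "skip_expensive_metrics": True,   # recommended for batch
            "generate_reports": False,
        }
        if domain is not None:
            payload["domain"] = domain
        payloads.append(payload)
    return payloads
```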

Report Endpoints

Generate Report

Endpoint: POST /api/report/generate

Generate detailed PDF/JSON reports for cached analyses.

Request

Headers:

Content-Type: application/x-www-form-urlencoded

Body:

analysis_id=analysis_1735555800000
formats=json,pdf
include_highlights=true

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| analysis_id | string | Yes | - | Analysis ID from a previous request |
| formats | string | No | "json,pdf" | Comma-separated formats |
| include_highlights | boolean | No | true | Include sentence highlights in the report |

Response

{
  "status": "success",
  "analysis_id": "analysis_1735555800000",
  "reports": {
    "json": "analysis_1735555800000.json",
    "pdf": "analysis_1735555800000.pdf"
  },
  "timestamp": "2025-12-30T10:30:00Z"
}

Download Report

Endpoint: GET /api/report/download/{filename}

Download a generated report file.

Request

GET /api/report/download/analysis_1735555800000.pdf

Response

Binary file download with appropriate Content-Type header.

Headers:

Content-Type: application/pdf
Content-Disposition: attachment; filename="analysis_1735555800000.pdf"
Content-Length: 524288

Utility Endpoints

Health Check

Endpoint: GET /health

Check API health and model availability.

Response

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400.5,
  "models_loaded": {
    "orchestrator": true,
    "highlighter": true,
    "reporter": true,
    "reasoning_generator": true,
    "document_extractor": true,
    "analysis_cache": true,
    "parallel_executor": true
  }
}
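A client can gate traffic on this response with a readiness check like the following sketch over the JSON shape shown above:

```python
def is_ready(health: dict) -> bool:
    """True only if status is healthy and every listed component is loaded."""
    models = health.get("models_loaded") or {}
    return (health.get("status") == "healthy"
            and bool(models)
            and all(models.values()))

# Illustrative inputs, not live responses
healthy = {"status": "healthy",
           "models_loaded": {"orchestrator": True, "reporter": True}}
degraded = {"status": "healthy",
            "models_loaded": {"orchestrator": True, "reporter": False}}
```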

List Domains

Endpoint: GET /api/domains

Get all supported content domains with descriptions.

Response

{
  "domains": [
    {
      "value": "general",
      "name": "General",
      "description": "General-purpose text without domain-specific structure"
    },
    {
      "value": "academic",
      "name": "Academic",
      "description": "Academic papers, essays, research"
    },
    {
      "value": "creative",
      "name": "Creative",
      "description": "Creative writing, fiction, poetry"
    },
    {
      "value": "technical_doc",
      "name": "Technical Doc",
      "description": "Technical documentation, manuals, specs"
    }
    // ... 12 more domains
  ]
}

Supported Domains

| Domain | Use Cases | Threshold Adjustments |
|--------|-----------|-----------------------|
| general | Default fallback | Balanced weights |
| academic | Research papers, essays | Higher linguistic weight |
| creative | Fiction, poetry | Higher entropy/structural |
| ai_ml | ML papers, technical AI content | Semantic prioritized |
| software_dev | Code docs, READMEs | Structural relaxed |
| technical_doc | Manuals, specs | Higher semantic weight |
| engineering | Technical reports | Balanced technical focus |
| science | Scientific papers | Academic-like calibration |
| business | Reports, proposals | Formal structure emphasis |
| legal | Contracts, court filings | Strict structural patterns |
| medical | Clinical notes, research | Domain-specific terminology |
| journalism | News articles | Balanced, lower burstiness |
| marketing | Ad copy, campaigns | Creative elements |
| social_media | Posts, casual writing | Relaxed metrics, high perplexity weight |
| blog_personal | Personal blogs, diaries | Creative + casual mix |
| tutorial | How-to guides | Instructional patterns |

Cache Statistics

Endpoint: GET /api/cache/stats

Get analysis cache statistics (admin only).

Response

{
  "cache_size": 42,
  "max_size": 100,
  "ttl_seconds": 3600
}

Clear Cache

Endpoint: POST /api/cache/clear

Clear analysis cache (admin only).

Response

{
  "status": "success",
  "message": "Cache cleared"
}

Best Practices

Optimization Tips

  1. Domain Selection

    • Always specify domain when known for better accuracy
    • Use /api/domains to explore available options
    • Let system auto-detect only when domain is truly unknown
  2. Performance

    • Set skip_expensive_metrics: true for faster results when speed matters
    • Use batch API for multiple texts instead of sequential single requests
    • Cache analysis_id to regenerate reports without reanalysis
  3. Accuracy

    • Provide clean, well-formatted text (remove excessive whitespace)
    • Minimum 100 words recommended for reliable results
    • Avoid mixing languages in single analysis
  4. Rate Limiting

    • Rate limits, when present, come from the deployment's gateway (see Rate Limiting above), not the application itself
    • Implement exponential backoff on 429 responses
    • Honor any rate-limit headers the gateway sets (e.g. X-RateLimit-Remaining)
  5. Error Handling

    • Always check the status field in the response
    • Log the analysis_id so requests can be traced when reporting issues
    • Implement retry logic with jitter for transient errors
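The retry advice above can be sketched as exponential backoff with jitter. This is a generic pattern, not TEXT-AUTH-specific; `call` stands in for any function that performs one request and raises on a transient error:

```python
import random
import time

def retry_with_jitter(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on exceptions with exponential backoff plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # give up after the final attempt
            # 0.5s, 1s, 2s, ... plus up to 250ms of jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
```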

Security Recommendations

  1. API Key Management (applies once API keys are introduced; not enforced in the current deployment)

    • Rotate keys every 90 days
    • Use separate keys for dev/staging/production
    • Revoke compromised keys immediately
  2. Data Privacy

    • Never send PII unless absolutely necessary
    • Use client-side redaction before API calls
    • Define and enforce data retention policies for stored analyses and reports
  3. Input Validation

    • Sanitize user input before sending to API
    • Validate file types client-side
    • Implement size limits before upload

Version History:

  • 1.0.0 (2025-12-30): Initial release
    • 6 forensic metrics
    • Support for 16 content domains
    • PDF/JSON reporting
    • Batch processing

Appendix

Complete Domain List with Aliases

DOMAIN_ALIASES = {
    'general': ['default', 'generic'],
    'academic': ['education', 'research', 'scholarly', 'university'],
    'creative': ['fiction', 'literature', 'story', 'narrative'],
    'ai_ml': ['ai', 'ml', 'machinelearning', 'neural'],
    'software_dev': ['software', 'code', 'programming', 'dev'],
    'technical_doc': ['technical', 'tech', 'documentation', 'manual'],
    'engineering': ['engineer'],
    'science': ['scientific'],
    'business': ['corporate', 'commercial', 'enterprise'],
    'legal': ['law', 'contract', 'court'],
    'medical': ['healthcare', 'clinical', 'medicine', 'health'],
    'journalism': ['news', 'reporting', 'media', 'press'],
    'marketing': ['advertising', 'promotional', 'brand', 'sales'],
    'social_media': ['social', 'casual', 'informal', 'posts'],
    'blog_personal': ['blog', 'personal', 'diary', 'lifestyle'],
    'tutorial': ['guide', 'howto', 'instructional', 'walkthrough']
}
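Alias resolution can be sketched as a reverse lookup over this mapping (shown with a two-entry excerpt so the example is self-contained):

```python
# Excerpt of the full DOMAIN_ALIASES mapping above
DOMAIN_ALIASES = {
    'academic': ['education', 'research', 'scholarly', 'university'],
    'ai_ml': ['ai', 'ml', 'machinelearning', 'neural'],
}

def resolve_domain(name, aliases=DOMAIN_ALIASES):
    """Return the canonical domain for `name`, or None if unrecognized."""
    key = name.strip().lower()
    if key in aliases:                  # already canonical
        return key
    for canonical, alts in aliases.items():
        if key in alts:                 # matched an alias
            return canonical
    return None
```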

Metric Weight Defaults

DEFAULT_WEIGHTS = {
    'perplexity': 0.25,
    'entropy': 0.20,
    'structural': 0.15,
    'semantic': 0.15,
    'linguistic': 0.15,
    'multi_perturbation_stability': 0.10
}
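Assuming a simple weighted average (an illustration only; the actual ensemble may calibrate or renormalize differently), these weights combine per-metric synthetic probabilities like so:

```python
# Repeated from above so the example is self-contained
DEFAULT_WEIGHTS = {
    'perplexity': 0.25,
    'entropy': 0.20,
    'structural': 0.15,
    'semantic': 0.15,
    'linguistic': 0.15,
    'multi_perturbation_stability': 0.10,
}

def combine(metric_probs, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-metric synthetic probabilities.

    Normalizes over the metrics actually present, so skipping expensive
    metrics still yields a value in [0, 1].
    """
    total = sum(weights[m] for m in metric_probs)
    return sum(weights[m] * p for m, p in metric_probs.items()) / total
```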

Response Time Estimates

| Operation | Min | Avg | Max | P95 |
|-----------|-----|-----|-----|-----|
| Text Analysis (500 words) | 1.2s | 2.3s | 4.5s | 3.8s |
| File Analysis (PDF, 10 pages) | 2.5s | 4.1s | 8.2s | 6.9s |
| Batch (10 texts) | 5.8s | 9.2s | 15.3s | 13.1s |
| Report Generation | 0.3s | 0.8s | 2.1s | 1.5s |

Last Updated: December 30, 2025
API Version: 1.0.0
Documentation Version: 1.0.0