pinned: false
---
# Scholar Express
## AI-Powered Accessible Academic Research Platform

Scholar Express is an AI-powered platform that transforms inaccessible scientific research papers into interactive, screen-reader-compatible documents. The system addresses the accessibility barriers that students with disabilities face in academic research, using specialized AI models to make scientific literature truly inclusive.

## 🎯 Problem Statement

According to the U.S. National Center for Education Statistics, a significant share of undergraduate students have disabilities:
- 18% of male undergraduate students
- 22% of female undergraduate students
- 54% of nonbinary undergraduate students

These students face major barriers when conducting research: scientific PDFs are largely inaccessible to screen readers because complex mathematical equations, figures, and diagrams lack alt text descriptions.

## 🚀 Key Features

### Document Processing
- **OCR and layout analysis** optimized for scientific papers
- **Table and figure extraction** with proper formatting for research content
- **AI-generated alt text** specifically for scientific diagrams, charts, and equations
- **Structured markdown output** that preserves document hierarchy

### Interactive Features
- **RAG-powered chatbot** for scientific document Q&A
- **Real-time voice conversations** about research content
- **Multi-tab interface** optimized for research workflows

### Accessibility Focus
- **Screen reader compatible** output
- **Descriptive alt text** for all figures, following WCAG guidelines
- **Privacy-first design** with local processing

## 🏗️ System Architecture

### Core AI Models
The platform uses a specialized ensemble of AI models, each optimized for a specific task:

- **Gemma 3n 4B**: Primary engine for alt text generation and document chatbot functionality
- **Gemma 3n 2B**: Specialized for real-time voice chat interactions
- **DOLPHIN**: Handles PDF layout analysis and text extraction
- **SentenceTransformer**: Enables semantic search for Retrieval-Augmented Generation (RAG)

### Processing Pipeline

#### PDF Processing
```
PDF Upload → Image Conversion → Layout Analysis → Element Extraction → Alt Text Generation → Markdown Output
```

#### Chat System
```
User Question → Document Search → Context Retrieval → AI Response (Gemma 3n 4B)
```

#### Voice System
```
Audio Input → Speech Detection → Voice Processing → Text Response → Speech Output
```

## 📁 Project Structure

```
Scholar-Express/
├── 📄 Core Application Files
│   ├── app.py                        # Main Gradio application with multi-tab interface
│   ├── chat.py                       # Document chat functionality
│   ├── gradio_final_app.py           # Final integrated Gradio application
│   └── gradio_local_gemma.py         # Local Gemma model integration
│
├── 🔧 Configuration & Dependencies
│   ├── requirements.txt              # Main project dependencies
│   ├── requirements_gemma.txt        # Gemma-specific dependencies
│   ├── requirements_voice_gemma.txt  # Voice chat dependencies
│   ├── requirements_hf_spaces.txt    # HuggingFace Spaces deployment
│   ├── pyproject.toml                # Project configuration (Black formatting)
│   └── config/
│       └── Dolphin.yaml              # DOLPHIN model configuration
│
├── 🛠️ Utility Modules
│   └── utils/
│       ├── markdown_utils.py         # Markdown processing utilities
│       ├── model.py                  # AI model management
│       ├── processor.py              # Document processing utilities
│       └── utils.py                  # General utility functions
│
├── 🎤 Voice Chat System
│   └── voice_chat/
│       ├── app.py                    # Voice chat Gradio interface
│       ├── gemma3n_inference.py      # Gemma 3n voice inference
│       ├── inference.py              # General inference utilities
│       ├── server.py                 # Voice chat server
│       ├── requirements.txt          # Voice-specific dependencies
│       ├── litgpt/                   # LitGPT integration
│       │   ├── config.py             # Model configuration
│       │   ├── model.py              # Model architecture
│       │   ├── tokenizer.py          # Tokenization utilities
│       │   └── generate/             # Text generation utilities
│       ├── utils/
│       │   ├── vad.py                # Voice Activity Detection
│       │   ├── snac_utils.py         # Audio processing utilities
│       │   └── assets/
│       │       └── silero_vad.onnx   # Silero VAD model
│       └── data/samples/             # Audio sample outputs
│
├── 🤖 Pre-trained Models
│   └── hf_model/                     # HuggingFace model files
│       ├── config.json               # Model configuration
│       ├── model.safetensors         # Model weights
│       ├── tokenizer.json            # Tokenizer configuration
│       └── generation_config.json    # Generation parameters
│
├── 🧪 Development & Demo Files
│   ├── demo_element_hf.py            # Element extraction demo
│   ├── demo_page_hf.py               # Page processing demo
│   ├── gradio_pdf_app.py             # PDF processing demo
│   ├── gradio_image_app.py           # Image processing demo
│   ├── gradio_gemma.py               # Gemma integration demo
│   └── gradio_gemma_api.py           # Gemma API demo
│
└── 📚 Documentation
    ├── README.md                     # This comprehensive guide
    └── Scholar_Express_Technical_Write_Up.pdf # Detailed technical documentation
```

### 🔑 Essential Files Explained

#### Core Application
- **`app.py`**: Main entry point with the complete Gradio interface featuring PDF processing, document chat, and voice interaction tabs

#### Configuration & Dependencies
- **`requirements.txt`**: Complete dependency list including PyTorch, Transformers, Gradio, PDF processing, and voice libraries
- **`requirements_voice_gemma.txt`**: Specialized dependencies for voice chat (LitGPT, SNAC, Whisper)
- **`config/Dolphin.yaml`**: Configuration file for DOLPHIN model parameters and settings

#### Utility Modules (`utils/`)
- **`model.py`**: AI model loading, initialization, and management functions
- **`processor.py`**: PDF processing, image extraction, and document parsing utilities
- **`markdown_utils.py`**: Markdown generation and formatting for accessible output
- **`utils.py`**: General helper functions for file handling and data processing

#### Voice Chat System (`voice_chat/`)
- **`gemma3n_inference.py`**: Core Gemma 3n 2B inference engine for voice processing
- **`utils/vad.py`**: Voice Activity Detection using the Silero VAD model
- **`utils/snac_utils.py`**: Audio preprocessing and formatting utilities
- **`litgpt/`**: Lightweight GPT implementation for efficient voice processing

#### Model Files (`hf_model/`)
- **`model.safetensors`**: Pre-trained model weights in SafeTensors format
- **`config.json`**: Model architecture and parameter configuration
- **`tokenizer.json`**: Tokenization rules and vocabulary

### 📋 Dependency Categories

The project uses multiple requirements files for different deployment scenarios:

| File | Purpose | Key Dependencies |
|------|---------|------------------|
| `requirements.txt` | Main application | PyTorch, Transformers, Gradio, PyMuPDF |
| `requirements_voice_gemma.txt` | Voice features | LitGPT, SNAC, Whisper, Librosa |
| `requirements_hf_spaces.txt` | HuggingFace deployment | Streamlined for cloud deployment |
| `requirements_gemma.txt` | Gemma-specific | Optimized for Gemma model usage |

### Key Components

#### PDF Processing (`app.py:convert_pdf_to_images_gradio`)
- Converts PDFs to high-quality images (2x scaling)
- Uses PyMuPDF for reliable extraction
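
As a rough illustration of this step, the sketch below renders each page at a 2x zoom with PyMuPDF; the function name and return type are illustrative, not the exact `convert_pdf_to_images_gradio` signature.

```python
# Sketch of 2x-scale PDF-to-image conversion with PyMuPDF (illustrative, not the app's exact code).
import fitz  # PyMuPDF
from PIL import Image

def pdf_to_images(pdf_path: str, scale: float = 2.0) -> list[Image.Image]:
    """Render each PDF page to a PIL image at the given zoom factor."""
    doc = fitz.open(pdf_path)
    matrix = fitz.Matrix(scale, scale)  # 2x zoom gives higher-quality input for layout analysis
    images = []
    for page in doc:
        pix = page.get_pixmap(matrix=matrix)
        images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    doc.close()
    return images
```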

#### Layout Analysis (`app.py:process_elements_optimized`)
- DOLPHIN identifies text blocks, tables, figures, and headers
- Maintains proper reading order for accessibility
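
The sketch below only illustrates the reading-order idea; the element dicts and their `type`/`bbox`/`text` keys are hypothetical stand-ins, not DOLPHIN's actual output format.

```python
# Hypothetical layout records; DOLPHIN's real output format may differ.
elements = [
    {"type": "figure", "bbox": (50, 400, 500, 700), "text": ""},
    {"type": "title",  "bbox": (50, 40, 500, 80),   "text": "3. Methods"},
    {"type": "text",   "bbox": (50, 100, 500, 380), "text": "We propose..."},
]

# Sort top-to-bottom, then left-to-right, so screen readers receive a sensible order.
reading_order = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))

for element in reading_order:
    if element["type"] == "figure":
        print("![figure awaiting alt text]")  # placeholder until alt text is generated
    else:
        print(element["text"])
```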

#### Alt Text Generation
- Gemma 3n 4B processes images with accessibility-focused prompts
- Generates 1-2 sentence descriptions following WCAG guidelines
- Low temperature (0.1) for consistent, reliable output
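
A minimal sketch of this prompting pattern using the Hugging Face `transformers` image-text-to-text pipeline; the model id, file name, prompt wording, and loading code are assumptions and may differ from the app's implementation.

```python
# Illustrative alt text generation with an accessibility-focused prompt and low temperature.
from transformers import pipeline

# Model id is an assumption; any vision-capable Gemma 3n checkpoint would be used the same way.
captioner = pipeline("image-text-to-text", model="google/gemma-3n-E4B-it")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "figure_3.png"},  # placeholder path to an extracted figure
        {"type": "text", "text": (
            "Write 1-2 sentences of alt text for this scientific figure for a "
            "screen-reader user, following WCAG guidance. Describe axes, trends, "
            "and key values rather than colors or decoration."
        )},
    ],
}]

# Low temperature keeps descriptions consistent across runs.
out = captioner(text=messages, max_new_tokens=80, do_sample=True, temperature=0.1)
print(out[0]["generated_text"][-1]["content"])  # assistant turn of the chat-formatted output
```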

#### RAG System
- **Document chunking**: Overlap-based chunking (1024 tokens per chunk, 100-token overlap)
- **Semantic retrieval**: SentenceTransformer embeddings with cosine similarity
- **Context integration**: Top-3 relevant chunks for accurate responses
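
A compact sketch of the chunk, embed, and retrieve flow with `sentence-transformers`; the checkpoint name and the word-level chunker are simplifications of what the app likely does.

```python
# Sketch of chunk → embed → retrieve, mirroring the values above: overlapping chunks,
# cosine similarity, and the top-3 chunks kept as chat context.
from sentence_transformers import SentenceTransformer, util

def chunk_text(words: list[str], size: int = 1024, overlap: int = 100) -> list[str]:
    """Split a word list into overlapping chunks (word-level stand-in for token-level chunking)."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

# Checkpoint name is an assumption; any SentenceTransformer model works the same way.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

document_markdown = "..."  # markdown produced by the PDF pipeline
chunks = chunk_text(document_markdown.split())
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

question = "What dataset does the paper evaluate on?"
query_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity against every chunk; keep the top 3 as context for the chat model.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
top_k = scores.topk(k=min(3, len(chunks)))
context = "\n\n".join(chunks[i] for i in top_k.indices.tolist())
```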

#### Voice Chat System
- **Gemma 3n 2B**: Optimized for real-time voice processing
- **Silero VAD**: Voice Activity Detection to separate speech from silence
- **gTTS**: Google Text-to-Speech for audio responses
- **Audio preprocessing**: 16kHz mono, normalized amplitude
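
A small sketch of the audio handling around the voice model, assuming `librosa` and `gTTS` as listed in the dependencies; file names and helper functions are illustrative, not the code in `voice_chat/`.

```python
# Illustrative audio I/O around the voice model: 16 kHz mono input, gTTS output.
import librosa
import numpy as np
from gtts import gTTS

def preprocess_audio(path: str) -> np.ndarray:
    """Load speech as 16 kHz mono and normalize the amplitude, as described above."""
    audio, _ = librosa.load(path, sr=16000, mono=True)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def speak(text: str, out_path: str = "response.mp3") -> str:
    """Turn the model's text reply into speech with gTTS."""
    gTTS(text=text, lang="en").save(out_path)
    return out_path

audio = preprocess_audio("question.wav")  # placeholder input, fed to VAD and then to Gemma 3n 2B
reply_path = speak("Here is a summary of the figure on page three.")
```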

## 🛠️ Technology Stack

| Component | Technology |
|-----------|------------|
| Frontend | Gradio web interface with streaming capabilities |
| AI Models | Gemma 3n, DOLPHIN, SentenceTransformer |
| Document Processing | PyMuPDF, OpenCV, PIL |
| Voice Processing | Librosa, Silero VAD, gTTS |
| Search | SentenceTransformers for semantic retrieval |

## 🎨 Architecture Philosophy

### Right Tool for the Right Job
- **DOLPHIN** for PDF extraction and layout analysis
- **Gemma 3n 4B** for alt text generation and document chat
- **Gemma 3n 2B** for real-time voice interaction
- Each task is matched to the model best suited for it

### Privacy-First Design
- All processing happens locally to protect sensitive academic content
- Meets institutional privacy requirements for research documents

### Accessibility Focus
- AI-generated alt text makes academic papers accessible to visually impaired researchers
- Addresses a real gap in academic publishing accessibility

## 🚀 Getting Started

1. **Install dependencies**: The app uses Gradio, PyMuPDF, and various AI model libraries (see `requirements.txt`)
2. **Run the application**: `python app.py`
3. **Access the interface**: Open the Gradio web interface in your browser
4. **Upload a PDF**: Use the document processing tab to convert research papers
5. **Interact**: Chat with documents or use the voice features for hands-free research

## 💡 Design Challenges Solved

### Challenge 1: Narrowing Down Big Ideas
- Focused on three core applications: alt text generation, document chat, and voice interaction
- Chose accessibility as the primary value proposition
- Specialized each model variant (4B vs. 2B) for optimal performance

### Challenge 2: Storage Limitations
- Adopted a code-first approach with thorough review before testing
- Built comprehensive error handling upfront, since debugging was expensive
- Improved documentation and commenting discipline

## 📈 Impact

Scholar Express bridges the accessibility gap in scientific research, ensuring that students with disabilities (18% to 54% of undergraduates, depending on the group) can access the same research literature as their peers, while offering enhanced interaction capabilities to all users working with complex scientific content.