Document Processing
How George AI converts documents into searchable, AI-ready content
What is Document Processing?
Document processing is the automated pipeline that transforms raw files (PDFs, Word docs, images) into text content and vector embeddings that can be searched semantically and used by AI assistants.
Every file uploaded or crawled into a Library goes through this processing pipeline automatically, managed by a background queue system.
Extraction
Converts documents to markdown text using format-specific parsers and OCR for images
Embedding
Splits text into chunks and generates vector embeddings for semantic search
Processing Pipeline
When a file is uploaded or crawled, a processing task is automatically created and queued:
Task Creation
A content processing task is created and added to the queue with status pending
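To make the task lifecycle concrete, here is a minimal sketch of what a queued processing task might look like as a plain TypeScript record. The shape and field names are illustrative assumptions for this page, not George AI's actual data model.

```typescript
// Hypothetical shape of a queued processing task. Field names are
// illustrative assumptions, not George AI's actual data model.
interface ProcessingTask {
  id: string;
  fileId: string;      // the uploaded or crawled file this task belongs to
  libraryId: string;   // Library whose extraction/embedding settings apply
  status: string;      // see the full list under "Processing Task States" below
  createdAt: Date;
  startedAt?: Date;    // unset while the task is still waiting in the queue
}

// A freshly created task waits in the queue with status "pending".
const task: ProcessingTask = {
  id: 'task-001',
  fileId: 'file-042',
  libraryId: 'library-7',
  status: 'pending',
  createdAt: new Date(),
};
```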
Status: pending
Processing started: (waiting)
Extraction Phase
Text and images are extracted from the document based on its format
Extraction Methods:
- PDF: Text extraction + OCR for images
- Office (Word, Excel, PPT): Native parsers
- Images: Vision model OCR
- Archives: Extract and process contents
- HTML/Markdown: Direct conversion
Configuration Options:
- Enable/disable text extraction
- Enable/disable image processing
- OCR model selection
- OCR prompt customization
- OCR image scale
- OCR timeout
Settings configured at Library level
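These options map naturally onto a per-Library settings object. The sketch below is a hedged illustration of how such a configuration could be represented; the property names and default values are assumptions for the example, not the exact settings keys George AI uses.

```typescript
// Illustrative Library-level extraction settings (names and values are
// assumptions for this example, not George AI's actual configuration keys).
interface ExtractionSettings {
  extractText: boolean;     // enable/disable text extraction
  processImages: boolean;   // enable/disable image processing (OCR)
  ocrModel: string;         // vision model used for OCR
  ocrPrompt: string;        // prompt sent to the OCR model
  ocrImageScale: number;    // scale factor applied to images before OCR
  ocrTimeoutMs: number;     // maximum time allowed per OCR call
}

const exampleSettings: ExtractionSettings = {
  extractText: true,
  processImages: true,
  ocrModel: 'qwen2.5vl:latest',   // example model; see OCR troubleshooting below
  ocrPrompt: 'Transcribe all visible text in reading order.',
  ocrImageScale: 1.0,             // raise for quality, lower for speed
  ocrTimeoutMs: 120_000,
};
```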
Status: extracting
Extraction started: 2025-01-15 10:30:15
Output: document.md (extracted markdown)
Embedding Phase
Extracted text is chunked and converted to vector embeddings for semantic search
| Step | Description |
|---|---|
| 1. Chunking | Text is split into smaller chunks (paragraphs or sections) for efficient processing |
| 2. Embedding | Each chunk is converted to a vector embedding using the configured AI model |
| 3. Storage | Embeddings are stored in Typesense vector database for fast semantic search |
Status: embedding
Embedding started: 2025-01-15 10:30:45
Chunks created: 127
Embedding model: nomic-embed-text
Embedding Configuration (Library-level):
- Embedding Model: Which AI model to use (e.g., nomic-embed-text, mxbai-embed-large)
- Embedding Timeout: Maximum time allowed for embedding generation
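The three steps in the table above can be expressed as a short pipeline. The sketch below assumes an Ollama-style embeddings endpoint on localhost; George AI's AI Services interface and chunking strategy may differ, so treat it as an illustration of the flow rather than the implementation.

```typescript
// Minimal sketch of the chunk → embed flow. Assumes an Ollama-compatible
// embeddings endpoint; George AI's actual AI Services API may differ.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// 1. Chunking: split extracted markdown into paragraph-sized pieces.
function chunkMarkdown(markdown: string): string[] {
  return markdown
    .split(/\n{2,}/)          // paragraph boundaries
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// 2. Embedding: one vector per chunk, using the configured model.
async function embedDocument(fileId: string, markdown: string, model = 'nomic-embed-text'): Promise<Chunk[]> {
  const chunks: Chunk[] = [];
  for (const [i, text] of chunkMarkdown(markdown).entries()) {
    const res = await fetch('http://localhost:11434/api/embeddings', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, prompt: text }),
    });
    const { embedding } = (await res.json()) as { embedding: number[] };
    chunks.push({ id: `${fileId}-${i}`, text, embedding });
  }
  // 3. Storage: in George AI each chunk is then written to the Typesense
  //    vector database so it can be retrieved by semantic search.
  return chunks;
}
```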
Processing Complete
Task is marked as completed. File is now searchable and available for AI assistants.
Status: completed
Processing finished: 2025-01-15 10:31:02
Total processing time: 47,000 ms
File is now searchable!
Large File Handling
George AI uses streaming architecture and intelligent file size management to handle large documents efficiently without overwhelming system resources.
Streaming Architecture
Files are processed as streams rather than loading entire contents into memory. This enables handling of multi-GB files with constant memory usage.
Configurable Limits
Set maximum file sizes per crawler to prevent processing oversized files. Default limit is 100 MB, configurable up to several GB.
No Base64 Overhead
Binary files are transferred directly without Base64 encoding, avoiding the roughly 33% size overhead Base64 adds and enabling faster processing of large files.
Parallel Processing
Multiple files can be processed simultaneously with configurable concurrency limits to balance speed and resource usage.
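As a rough illustration of bounded parallelism, the sketch below processes files in fixed-size slices; the concurrency value and the processFile callback are placeholders, not George AI's actual scheduler.

```typescript
// Process files with a simple concurrency cap. processFile stands in for the
// real extraction + embedding pipeline; the cap keeps resource usage bounded.
async function processAll(
  files: string[],
  processFile: (file: string) => Promise<void>,
  concurrency = 4,
): Promise<void> {
  for (let i = 0; i < files.length; i += concurrency) {
    const batch = files.slice(i, i + concurrency);
    await Promise.all(batch.map(processFile)); // at most `concurrency` files in flight
  }
}
```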
| File Size | Processing Behavior | Recommendation |
|---|---|---|
| 0 - 50 MB | Processes normally without warnings | Optimal size range for fast processing |
| 50 - 100 MB | Processes with warning logged | Monitor processing time and quality |
| 100 MB+ | Rejected by crawler (configurable) | Increase maxFileSizeBytes or split file |
Configurable Limits
File size limits can be configured per crawler when creating or editing crawler settings.
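A hedged example of what a per-crawler size limit could look like: maxFileSizeBytes is the setting referenced on this page, while the surrounding structure is an assumption for illustration.

```typescript
// Illustrative crawler settings with a per-crawler size limit.
// maxFileSizeBytes is the setting referenced in this page; the rest of the
// shape is an assumption for the example.
interface CrawlerSettings {
  name: string;
  maxFileSizeBytes: number;
}

const crawler: CrawlerSettings = {
  name: 'shared-drive',
  maxFileSizeBytes: 500 * 1024 * 1024, // raise the default 100 MB limit to 500 MB
};

// Files above the limit are skipped before any processing starts.
function shouldProcess(sizeBytes: number, settings: CrawlerSettings): boolean {
  return sizeBytes <= settings.maxFileSizeBytes;
}
```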
1. Streaming Download
Files are downloaded in chunks rather than loading entire content into memory:
- SMB crawler uses direct protocol streaming (no Base64)
- HTTP crawler streams via standard HTTP
- Memory usage remains constant (~1 KB per file)
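The sketch below shows the general idea of chunked, constant-memory downloading using Node's standard stream utilities; it illustrates the technique, not the crawlers' actual implementation.

```typescript
import { createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import type { ReadableStream as WebReadableStream } from 'node:stream/web';

// Stream an HTTP download straight to disk. Only a small buffer is held in
// memory at any time, so memory usage stays flat regardless of file size.
async function downloadToFile(url: string, destination: string): Promise<void> {
  const response = await fetch(url);
  if (!response.ok || !response.body) {
    throw new Error(`Download failed with status ${response.status}`);
  }
  await pipeline(
    Readable.fromWeb(response.body as unknown as WebReadableStream),
    createWriteStream(destination),
  );
}
```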
2. Format-Specific Extraction
Different file types handle large sizes differently:
| Format | Large File Handling |
|---|---|
| PDF | Page-by-page extraction, OCR batching |
| CSV/Excel | Row-by-row streaming, automatic splitting |
| Word/PPT | Section-by-section extraction |
| Images | Direct OCR with optional scaling |
| Archives (ZIP) | Extract and process individual files |
3. Incremental Embedding
Text is chunked and embedded in batches:
- Large documents split into semantic chunks
- Embeddings generated in parallel batches
- Each chunk stored independently in vector database
- Enables processing of documents with 1000+ pages
Large CSV and Excel files are automatically split row-by-row for optimal processing:
How It Works:
- CSV/Excel file is detected during upload
- Each row is converted to a separate markdown file
- Files organized in buckets (100 files per directory)
- Each row embedded independently for precise search
- Summary file created with column headers and statistics
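A minimal sketch of the row-splitting idea, assuming a simple comma-separated file with a header row; real CSV parsing (quoting, embedded commas) and George AI's actual file layout will differ.

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

// Split a simple CSV (header + rows, no quoted commas) into one markdown
// file per row, grouped into buckets of 100 files per directory.
function splitCsvToMarkdown(csv: string, outputDir: string): number {
  const [headerLine, ...rows] = csv.trim().split('\n');
  const headers = headerLine.split(',');

  rows.forEach((row, index) => {
    const bucket = join(outputDir, `bucket-${Math.floor(index / 100)}`);
    mkdirSync(bucket, { recursive: true });

    // One markdown file per row: "column: value" pairs embed and search well.
    const values = row.split(',');
    const markdown = headers.map((h, i) => `**${h}**: ${values[i] ?? ''}`).join('\n');
    writeFileSync(join(bucket, `row-${index + 1}.md`), markdown);
  });

  return rows.length; // row count for the summary file
}
```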
✓ Use Appropriate File Size Limits
Set maxFileSizeBytes in crawler configuration to skip oversized files automatically. Prevents timeouts and resource exhaustion.
✓ Increase Timeouts for Large Files
Adjust extraction timeout and OCR timeout in Library settings when processing large PDFs with many images. Default timeouts may be too low for 500+ page documents.
✓ Optimize OCR Image Scale
Lower OCR image scale (e.g., 0.5) for large scanned PDFs to speed up processing. Higher scales (1.5-2.0) improve quality but increase processing time significantly.
✓ Split Very Large Files
For documents over 500 pages, consider splitting into smaller files before upload. This improves processing reliability and makes individual sections easier to find in search.
✓ Monitor System Resources
Use Admin Panel to monitor processing queue and AI Services status. If files consistently timeout, consider adding more AI Service instances or increasing system resources.
File Rejected: "Exceeds maximum file size"
Cause: File is larger than configured maximum (default 100 MB)
Solutions:
- Increase maxFileSizeBytes in crawler settings
- Split the file into smaller chunks before upload
- For CSV/Excel, use automatic row splitting (no size limit)
Processing Timeout on Large PDFs
Cause: Extraction or OCR timeout too low for document size
Solutions:
- Increase extraction timeout in Library settings
- Increase OCR timeout for image-heavy documents
- Reduce OCR image scale to speed up processing
- Disable image processing if text extraction is sufficient
Slow Processing for Multi-GB Files
Cause: Very large files take longer despite streaming
Solutions:
- This is normal behavior: larger files require more time
- Monitor processing queue to ensure task is progressing
- Consider splitting file if processing takes hours
- Ensure adequate AI Service resources (CPU, RAM)
Out of Memory Errors
Cause: System resources exhausted despite streaming
Solutions:
- Increase Docker container memory limits
- Reduce number of concurrent processing tasks
- Stop other resource-intensive services temporarily
- Check for memory leaks in AI Services logs
Processing Queue System
George AI uses a background queue system to manage processing tasks efficiently:
Content Processing Queue
Handles text extraction and embedding generation for all files
Enrichment Queue
Handles AI-powered data extraction for List enrichment fields
Queue Worker Behavior
Running: Worker continuously picks up pending tasks and processes them
Paused: No new tasks are processed, but tasks already in progress continue
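The worker behavior can be pictured as a simple polling loop. The sketch below uses a toy in-memory queue to mirror that behavior; George AI's real queue is persistent and shared between services.

```typescript
// Toy in-memory queue illustrating the worker loop; George AI's real queue
// is persistent and shared across services, this only mirrors the behavior.
type TaskStatus = 'pending' | 'processing' | 'completed' | 'failed';
interface QueueTask { id: string; status: TaskStatus }

const queue: QueueTask[] = [];
let paused = false;

async function workerLoop(processTask: (task: QueueTask) => Promise<void>): Promise<void> {
  while (true) {
    // While paused, no new tasks are picked up; work already started continues elsewhere.
    const next = paused ? undefined : queue.find((t) => t.status === 'pending');
    if (!next) {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // idle poll interval
      continue;
    }
    next.status = 'processing';
    try {
      await processTask(next);
      next.status = 'completed';
    } catch {
      next.status = 'failed'; // failed tasks can later be retried from the Admin Panel
    }
  }
}
```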
Automatic Processing
Files are processed automatically when added to a Library. This behavior can be configured in Library settings using the "Auto-process crawled files" option.
Processing Task States
| State | Description | Next Step |
|---|---|---|
| none | No processing has been initiated | Wait for task creation or trigger manually |
| pending | Task is queued and waiting for a worker | Worker will pick it up automatically |
| validating | File format and integrity are being checked | Moves to extracting or validationFailed |
| extracting | Text and images are being extracted | Moves to embedding or extractionFailed |
| embedding | Vector embeddings are being generated | Moves to completed or embeddingFailed |
| completed | Processing finished successfully | File is searchable and ready |
| failed | Processing failed at some stage | Retry via file menu or queue management |
| timedOut | Processing exceeded configured timeout | Retry with adjusted timeout or check file |
| cancelled | Task was manually cancelled | Create new task if needed |
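For reference, the states in the table map onto a simple union type. This is a hedged TypeScript sketch that assumes the state names are used verbatim as identifiers; it is not George AI's source code.

```typescript
// The task states from the table above as a TypeScript union.
type ProcessingState =
  | 'none'
  | 'pending'
  | 'validating'
  | 'extracting'
  | 'embedding'
  | 'completed'
  | 'failed'
  | 'timedOut'
  | 'cancelled';

// Failed, timed-out, and cancelled tasks can be retried or re-created.
function canRetry(state: ProcessingState): boolean {
  return state === 'failed' || state === 'timedOut' || state === 'cancelled';
}
```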
Monitoring Processing
You can monitor and manage processing tasks through the Admin Panel:
Processing Queue Dashboard
Admin Panel → Processing Queue
View Task Statistics:
- Pending tasks count
- Currently processing tasks
- Failed tasks
- Completed tasks
- Last processed timestamp
Management Actions:
- Start/Stop queue workers
- Retry failed tasks
- Clear failed tasks
- Clear pending tasks
- Cancel specific tasks
Troubleshooting Processing Issues
Tasks Stuck in Pending Status
Possible Causes:
- Queue worker is stopped
- Too many tasks overwhelming the queue
- System resources exhausted
Solutions:
- Check Admin Panel → Processing Queue to verify worker is running
- Start the worker if stopped
- Monitor system CPU/memory usage
- Consider adding more AI Service servers for parallel processing
Extraction Fails or Times Out
Possible Causes:
- File is corrupted or in an unsupported format
- File is very large (100+ pages)
- OCR timeout too low for complex images
- AI model not available
Solutions:
- Verify file can be opened in its native application
- Increase extraction timeout in Library settings
- Increase OCR timeout for image-heavy documents
- Check AI Services status
- Try re-processing the file via file menu
Embedding Fails or Times Out
Possible Causes:
- Embedding model not loaded on AI Services
- Document extracted to extremely large text
- Embedding timeout too low
- Typesense vector database connectivity issues
Solutions:
- Verify embedding model is available (check Library settings)
- Increase embedding timeout in Library settings
- Check Typesense service status
- Verify AI Services can connect to Typesense
- Retry embedding via file menu
Poor OCR Quality on Scanned Documents
Possible Causes:
- Low-resolution images
- Poor scan quality
- OCR prompt not optimized for document type
- OCR image scale too low
Solutions:
- Increase OCR image scale in Library settings (try 1.5 or 2.0)
- Customize OCR prompt to describe document structure
- Use higher-resolution source images if possible
- Try a different OCR model (e.g., qwen2.5vl:latest)
- Re-process file after adjusting settings