Document Processing

How George AI converts documents into searchable, AI-ready content

What is Document Processing?

Document processing is the automated pipeline that transforms raw files (PDFs, Word docs, images) into text content and vector embeddings that can be searched semantically and used by AI assistants.

Every file uploaded or crawled into a Library goes through this processing pipeline automatically, managed by a background queue system.

Extraction

Converts documents to markdown text using format-specific parsers and OCR for images

Embedding

Splits text into chunks and generates vector embeddings for semantic search

Processing Pipeline

When a file is uploaded or crawled, a processing task is automatically created and queued:

Step 1: Task Creation

A content processing task is created and added to the queue with status pending.

Status: pending
Processing started: (waiting)
Step 2: Extraction Phase

Text and images are extracted from the document based on its format.

Extraction Methods:

  • PDF: Text extraction + OCR for images
  • Office (Word, Excel, PPT): Native parsers
  • Images: Vision model OCR
  • Archives: Extract and process contents
  • HTML/Markdown: Direct conversion

Configuration Options:

  • Enable/disable text extraction
  • Enable/disable image processing
  • OCR model selection
  • OCR prompt customization
  • OCR image scale
  • OCR timeout

Settings configured at Library level

Status: extracting
Extraction started: 2025-01-15 10:30:15
Output: document.md (extracted markdown)
Extraction can time out if the file is very large or complex. The default timeout is configurable per Library.
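
The extraction code itself is internal to George AI, but the format dispatch can be pictured roughly as follows. This is a minimal TypeScript sketch under assumed names; the strategy labels simply mirror the "Extraction Methods" list above and are not part of any George AI API.

```typescript
import path from "node:path";

// Hypothetical sketch only: maps a file extension to one of the
// extraction strategies described in the "Extraction Methods" list above.
type ExtractionMethod =
  | "pdf-text-plus-ocr"   // PDF: text layer plus OCR for embedded images
  | "office-parser"       // Word / Excel / PowerPoint native parsers
  | "vision-ocr"          // standalone images
  | "archive-expand"      // archives: extract and process contents
  | "markup-convert";     // HTML / Markdown: direct conversion

const methodByExtension: Record<string, ExtractionMethod> = {
  ".pdf": "pdf-text-plus-ocr",
  ".docx": "office-parser",
  ".xlsx": "office-parser",
  ".pptx": "office-parser",
  ".png": "vision-ocr",
  ".jpg": "vision-ocr",
  ".zip": "archive-expand",
  ".html": "markup-convert",
  ".md": "markup-convert",
};

export function pickExtractionMethod(filePath: string): ExtractionMethod {
  const method = methodByExtension[path.extname(filePath).toLowerCase()];
  if (!method) throw new Error(`Unsupported format: ${filePath}`);
  return method;
}
```
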
Step 3: Embedding Phase

Extracted text is chunked and converted to vector embeddings for semantic search.

| Step | Description |
|------|-------------|
| 1. Chunking | Text is split into smaller chunks (paragraphs or sections) for efficient processing |
| 2. Embedding | Each chunk is converted to a vector embedding using the configured AI model |
| 3. Storage | Embeddings are stored in the Typesense vector database for fast semantic search |
Status: embedding
Embedding started: 2025-01-15 10:30:45
Chunks created: 127
Embedding model: nomic-embed-text

Embedding Configuration (Library-level):

  • Embedding Model: Which AI model to use (e.g., nomic-embed-text, mxbai-embed-large)
  • Embedding Timeout: Maximum time allowed for embedding generation
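
To make the chunk → embed → store flow concrete, here is a rough TypeScript sketch. It assumes an Ollama-style /api/embeddings endpoint serving nomic-embed-text, a pre-created Typesense collection named chunks with a float[] embedding field, and a naive paragraph-based chunker; none of these details are confirmed parts of George AI's implementation.

```typescript
import Typesense from "typesense";

// Rough sketch only: endpoint, collection name, and chunk size are assumptions.
const typesense = new Typesense.Client({
  nodes: [{ host: "localhost", port: 8108, protocol: "http" }],
  apiKey: "xyz",
});

// 1. Chunking: naive split into ~1000-character pieces on paragraph boundaries.
function chunkText(markdown: string, maxLen = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of markdown.split(/\n{2,}/)) {
    if (current && current.length + para.length > maxLen) {
      chunks.push(current);
      current = "";
    }
    current += (current ? "\n\n" : "") + para;
  }
  if (current) chunks.push(current);
  return chunks;
}

// 2. Embedding: Ollama-style endpoint returning { embedding: number[] }.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

// 3. Storage: assumes a "chunks" collection whose `embedding` field is
//    float[] with num_dim 768 (the dimensionality of nomic-embed-text).
export async function embedExtractedMarkdown(fileId: string, markdown: string) {
  for (const [i, text] of chunkText(markdown).entries()) {
    await typesense
      .collections("chunks")
      .documents()
      .create({ id: `${fileId}-${i}`, fileId, text, embedding: await embed(text) });
  }
}
```
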
Step 4: Processing Complete

Task is marked as completed. File is now searchable and available for AI assistants.

Status: completed
Processing finished: 2025-01-15 10:31:02
Total processing time: 47,000 ms
File is now searchable!

Large File Handling

George AI uses streaming architecture and intelligent file size management to handle large documents efficiently without overwhelming system resources.

Streaming Architecture

Files are processed as streams rather than loading entire contents into memory. This enables handling of multi-GB files with constant memory usage.

Configurable Limits

Set maximum file sizes per crawler to prevent processing oversized files. Default limit is 100 MB, configurable up to several GB.

No Base64 Overhead

Binary files are transferred directly without Base64 encoding, avoiding the ~33% size overhead that Base64 adds and enabling faster processing of large files.

Parallel Processing

Multiple files can be processed simultaneously with configurable concurrency limits to balance speed and resource usage.

File Size Limits & Thresholds

  • Default Maximum: 100 MB (files larger than this are rejected)
  • Warning Threshold: 50 MB (larger files trigger a warning alert)
  • Memory Usage: constant (streaming prevents memory spikes)

| File Size | Processing Behavior | Recommendation |
|-----------|---------------------|----------------|
| 0 - 50 MB | Processes normally without warnings | Optimal size range for fast processing |
| 50 - 100 MB | Processes with a warning logged | Monitor processing time and quality |
| 100 MB+ | Rejected by the crawler (configurable) | Increase maxFileSizeBytes or split the file |

Configurable Limits

File size limits can be configured per crawler when creating or editing crawler settings.
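
For example, a crawler that needs to accept larger files might be configured like this. Only maxFileSizeBytes is named in this documentation; the surrounding fields are illustrative.

```typescript
// Illustrative values only; maxFileSizeBytes is the crawler option
// referenced throughout this page (default: 100 MB).
const crawlerSettings = {
  name: "engineering-share",            // hypothetical crawler name
  maxFileSizeBytes: 500 * 1024 * 1024,  // raise the limit to 500 MB
};
```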

How Large Files Are Processed

1. Streaming Download

Files are downloaded in chunks rather than loading the entire content into memory (see the sketch after this list):

  • SMB crawler uses direct protocol streaming (no Base64)
  • HTTP crawler streams via standard HTTP
  • Memory usage remains constant (~1 KB per file)
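
A minimal TypeScript sketch of the idea, using Node streams and the global fetch API; the function and its size guard are illustrative, not George AI's crawler code.

```typescript
import { createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { Readable, Transform } from "node:stream";

// Conceptual sketch: stream a download to disk while enforcing a size limit,
// so memory usage stays constant regardless of file size.
export async function downloadWithLimit(url: string, dest: string, maxFileSizeBytes: number) {
  const res = await fetch(url);
  if (!res.ok || !res.body) throw new Error(`Download failed: ${res.status}`);

  let received = 0;
  const sizeGuard = new Transform({
    transform(chunk, _enc, callback) {
      received += chunk.length;
      if (received > maxFileSizeBytes) {
        callback(new Error("Exceeds maximum file size"));
      } else {
        callback(null, chunk); // pass each chunk through without buffering the whole file
      }
    },
  });

  // Readable.fromWeb bridges the WHATWG stream returned by fetch into a Node stream.
  await pipeline(Readable.fromWeb(res.body as any), sizeGuard, createWriteStream(dest));
}
```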

2. Format-Specific Extraction

Different file types handle large sizes differently:

| Format | Large File Handling |
|--------|---------------------|
| PDF | Page-by-page extraction, OCR batching |
| CSV/Excel | Row-by-row streaming, automatic splitting |
| Word/PPT | Section-by-section extraction |
| Images | Direct OCR with optional scaling |
| Archives (ZIP) | Extract and process individual files |

3. Incremental Embedding

Text is chunked and embedded in batches:

  • Large documents split into semantic chunks
  • Embeddings generated in parallel batches
  • Each chunk stored independently in vector database
  • Enables processing of documents with 1000+ pages
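
The "parallel batches" step can be sketched as follows; the batch size and the injected embed function are illustrative assumptions.

```typescript
// Illustrative batching sketch: batches run sequentially, chunks within a
// batch are embedded in parallel, so a 1000+ page document never has more
// than `batchSize` embedding requests in flight at once.
export async function embedInBatches(
  chunks: string[],
  embed: (text: string) => Promise<number[]>,
  batchSize = 8,
): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    embeddings.push(...(await Promise.all(batch.map(embed))));
  }
  return embeddings;
}
```
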
CSV/Excel Automatic Splitting

Large CSV and Excel files are automatically split row-by-row for optimal processing:

  • Largest Tested: 732K rows (successfully processed without issues)
  • Memory Per Row: ~1 KB (constant regardless of file size)
  • Search Precision: 1 row (each search result = one exact record)

How It Works:

  1. CSV/Excel file is detected during upload
  2. Each row is converted to a separate markdown file
  3. Files organized in buckets (100 files per directory)
  4. Each row embedded independently for precise search
  5. Summary file created with column headers and statistics
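
Conceptually, the splitting step looks something like the sketch below. The 100-files-per-bucket layout comes from the list above; the file naming, markdown layout, and the assumption of a simple unquoted CSV are illustrative.

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import path from "node:path";

// Conceptual sketch of row-by-row splitting; not George AI's actual code.
// Assumes a comma-separated file with a header row and no quoted commas.
export function splitCsvToMarkdown(csv: string, outDir: string) {
  const [headerLine, ...rows] = csv.trim().split("\n");
  const headers = headerLine.split(",");

  rows.forEach((row, i) => {
    const bucket = path.join(outDir, `bucket-${Math.floor(i / 100)}`); // 100 files per directory
    mkdirSync(bucket, { recursive: true });

    // One markdown file per row, so each later search hit maps to exactly one record.
    const values = row.split(",");
    const markdown = headers.map((h, col) => `**${h}**: ${values[col] ?? ""}`).join("\n");
    writeFileSync(path.join(bucket, `row-${i + 1}.md`), markdown);
  });

  // Summary file with the column headers and basic statistics.
  writeFileSync(
    path.join(outDir, "summary.md"),
    `Columns: ${headers.join(", ")}\nRows: ${rows.length}\n`,
  );
}
```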

Learn more about CSV File Splitting →

Automatic: No configuration needed - CSV splitting is enabled by default for all libraries.

Performance & Optimization Tips

✓ Use Appropriate File Size Limits

Set maxFileSizeBytes in crawler configuration to skip oversized files automatically. Prevents timeouts and resource exhaustion.

✓ Increase Timeouts for Large Files

Adjust extraction timeout and OCR timeout in Library settings when processing large PDFs with many images. Default timeouts may be too low for 500+ page documents.

✓ Optimize OCR Image Scale

Lower OCR image scale (e.g., 0.5) for large scanned PDFs to speed up processing. Higher scales (1.5-2.0) improve quality but increase processing time significantly.
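
As a rough illustration, the knobs from the two tips above might be tuned like this; the field names are placeholders for the corresponding options in the Library settings UI, not an actual configuration file.

```typescript
// Illustrative values only; adjust the equivalent options in Library settings.
const libraryProcessingSettings = {
  extractionTimeoutMs: 30 * 60 * 1000, // headroom for 500+ page PDFs
  ocrTimeoutMs: 10 * 60 * 1000,        // more time for image-heavy documents
  ocrImageScale: 0.5,                  // lower scale = faster OCR; 1.5-2.0 = better quality, slower
};
```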

✓ Split Very Large Files

For documents over 500 pages, consider splitting into smaller files before upload. This improves processing reliability and makes individual sections easier to find in search.

✓ Monitor System Resources

Use Admin Panel to monitor processing queue and AI Services status. If files consistently timeout, consider adding more AI Service instances or increasing system resources.

Troubleshooting Large File Issues

File Rejected: "Exceeds maximum file size"

Cause: File is larger than configured maximum (default 100 MB)

Solutions:

  • Increase maxFileSizeBytes in crawler settings
  • Split the file into smaller chunks before upload
  • For CSV/Excel, use automatic row splitting (no size limit)

Processing Timeout on Large PDFs

Cause: Extraction or OCR timeout too low for document size

Solutions:

  • Increase extraction timeout in Library settings
  • Increase OCR timeout for image-heavy documents
  • Reduce OCR image scale to speed up processing
  • Disable image processing if text extraction is sufficient

Slow Processing for Multi-GB Files

Cause: Very large files take longer despite streaming

Solutions:

  • This is normal behavior - larger files require more time
  • Monitor processing queue to ensure task is progressing
  • Consider splitting file if processing takes hours
  • Ensure adequate AI Service resources (CPU, RAM)

Out of Memory Errors

Cause: System resources exhausted despite streaming

Solutions:

  • Increase Docker container memory limits
  • Reduce number of concurrent processing tasks
  • Stop other resource-intensive services temporarily
  • Check for memory leaks in AI Services logs

Processing Queue System

George AI uses a background queue system to manage processing tasks efficiently:

Content Processing Queue

Handles text extraction and embedding generation for all files

Enrichment Queue

Handles AI-powered data extraction for List enrichment fields

Queue Worker Behavior

Running

Worker continuously picks up pending tasks and processes them

Stopped

Worker is paused. No new tasks are processed, but tasks already in progress continue
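
A simplified picture of this behavior as a polling loop; the helper functions are hypothetical and only illustrate the running/stopped semantics described above.

```typescript
// Hypothetical worker loop sketch; function names are not part of George AI.
let running = true;

export function stopWorker() {
  // "Stopped": no new tasks are picked up, but a task already inside
  // processTask() is allowed to finish.
  running = false;
}

export async function runWorker(
  claimNextPendingTask: () => Promise<{ id: string } | null>,
  processTask: (task: { id: string }) => Promise<void>,
) {
  while (running) {
    const task = await claimNextPendingTask(); // oldest task with status "pending"
    if (!task) {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // idle poll
      continue;
    }
    await processTask(task); // validating → extracting → embedding → completed
  }
}
```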

Automatic Processing

Files are processed automatically when added to a Library. This behavior can be configured in Library settings using the "Auto-process crawled files" option.

Processing Task States

| State | Description | Next Step |
|-------|-------------|-----------|
| none | No processing has been initiated | Wait for task creation or trigger manually |
| pending | Task is queued and waiting for a worker | Worker will pick it up automatically |
| validating | File format and integrity are being checked | Moves to extracting or validationFailed |
| extracting | Text and images are being extracted | Moves to embedding or extractionFailed |
| embedding | Vector embeddings are being generated | Moves to completed or embeddingFailed |
| completed | Processing finished successfully | File is searchable and ready |
| failed | Processing failed at some stage | Retry via file menu or queue management |
| timedOut | Processing exceeded configured timeout | Retry with adjusted timeout or check file |
| cancelled | Task was manually cancelled | Create a new task if needed |
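
For reference, the states in the table map naturally onto a union type like the sketch below; the type is illustrative, and the per-phase failure states (validationFailed, extractionFailed, embeddingFailed) mentioned in the "Next Step" column would extend it.

```typescript
// Illustrative only: state names are taken from the table above.
type ProcessingTaskState =
  | "none"
  | "pending"
  | "validating"
  | "extracting"
  | "embedding"
  | "completed"
  | "failed"
  | "timedOut"
  | "cancelled";

// States that are typically retried via the file menu or queue management.
const retryableStates: ProcessingTaskState[] = ["failed", "timedOut"];
```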

Monitoring Processing

You can monitor and manage processing tasks through the Admin Panel:

Processing Queue Dashboard

Admin Panel → Processing Queue

View Task Statistics:

  • Pending tasks count
  • Currently processing tasks
  • Failed tasks
  • Completed tasks
  • Last processed timestamp

Management Actions:

  • Start/Stop queue workers
  • Retry failed tasks
  • Clear failed tasks
  • Clear pending tasks
  • Cancel specific tasks

Troubleshooting Processing Issues

Files stuck in "pending" state

Possible Causes:

  • Queue worker is stopped
  • Too many tasks overwhelming the queue
  • System resources exhausted

Solutions:

  • Check Admin Panel → Processing Queue to verify worker is running
  • Start the worker if stopped
  • Monitor system CPU/memory usage
  • Consider adding more AI Service servers for parallel processing

Extraction fails or times out

Possible Causes:

  • File is corrupted or unsupported format
  • File is extremely large (100+ pages)
  • OCR timeout too low for complex images
  • AI model not available

Solutions:

  • Verify file can be opened in its native application
  • Increase extraction timeout in Library settings
  • Increase OCR timeout for image-heavy documents
  • Check AI Services status
  • Try re-processing the file via file menu

Embedding fails or times out

Possible Causes:

  • Embedding model not loaded on AI Services
  • Document extracted to extremely large text
  • Embedding timeout too low
  • Typesense vector database connectivity issues

Solutions:

  • Verify embedding model is available (check Library settings)
  • Increase embedding timeout in Library settings
  • Check Typesense service status
  • Verify AI Services can connect to Typesense
  • Retry embedding via file menu

Poor OCR quality for images/scans

Possible Causes:

  • Low-resolution images
  • Poor scan quality
  • OCR prompt not optimized for document type
  • OCR image scale too low

Solutions:

  • Increase OCR image scale in Library settings (try 1.5 or 2.0)
  • Customize OCR prompt to describe document structure
  • Use higher-resolution source images if possible
  • Try a different OCR model (e.g., qwen2.5vl:latest)
  • Re-process file after adjusting settings

Related Topics

Learn more about related features:

George-Cloud