Document Processing

How George AI converts documents into searchable, AI-ready content

What is Document Processing?

Document processing is the automated pipeline that transforms raw files (PDFs, Word docs, images) into text content and vector embeddings that can be searched semantically and used by AI assistants.

Every file uploaded or crawled into a Library goes through this processing pipeline automatically, managed by a background queue system.

Extraction

Converts documents to markdown text using format-specific parsers and OCR for images

Embedding

Splits text into chunks and generates vector embeddings for semantic search

Processing Pipeline

When a file is uploaded or crawled, a processing task is automatically created and queued:

Step 1: Task Creation

A content processing task is created and added to the queue with status pending.

Status: pending
Processing started: (waiting)
Step 2: Extraction Phase

Text and images are extracted from the document based on its format.

Extraction Methods:

  • PDF: Text extraction + OCR for images
  • Office (Word, Excel, PPT): Native parsers
  • Images: Vision model OCR
  • Archives: Extract and process contents
  • HTML/Markdown: Direct conversion

Configuration Options:

  • Enable/disable text extraction
  • Enable/disable image processing
  • OCR model selection
  • OCR prompt customization
  • OCR image scale
  • OCR timeout

Settings configured at Library level

Status: extracting
Extraction started: 2025-01-15 10:30:15
Output: document.md (extracted markdown)
Extraction can time out if the file is very large or complex. The default timeout is configurable per Library.
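
The extraction code itself is internal to George AI, but the format dispatch can be pictured roughly as follows. This is a minimal TypeScript sketch under assumed names; the strategy labels simply mirror the "Extraction Methods" list above and are not part of any George AI API.

```typescript
import path from "node:path";

// Hypothetical sketch only: maps a file extension to one of the
// extraction strategies described in the "Extraction Methods" list above.
type ExtractionMethod =
  | "pdf-text-plus-ocr"   // PDF: text layer plus OCR for embedded images
  | "office-parser"       // Word / Excel / PowerPoint native parsers
  | "vision-ocr"          // standalone images
  | "archive-expand"      // archives: extract and process contents
  | "markup-convert";     // HTML / Markdown: direct conversion

const methodByExtension: Record<string, ExtractionMethod> = {
  ".pdf": "pdf-text-plus-ocr",
  ".docx": "office-parser",
  ".xlsx": "office-parser",
  ".pptx": "office-parser",
  ".png": "vision-ocr",
  ".jpg": "vision-ocr",
  ".zip": "archive-expand",
  ".html": "markup-convert",
  ".md": "markup-convert",
};

export function pickExtractionMethod(filePath: string): ExtractionMethod {
  const method = methodByExtension[path.extname(filePath).toLowerCase()];
  if (!method) throw new Error(`Unsupported format: ${filePath}`);
  return method;
}
```
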
Step 3: Embedding Phase

Extracted text is chunked and converted to vector embeddings for semantic search.

| Step | Description |
|------|-------------|
| 1. Chunking | Text is split into smaller chunks (paragraphs or sections) for efficient processing |
| 2. Embedding | Each chunk is converted to a vector embedding using the configured AI model |
| 3. Storage | Embeddings are stored in the Typesense vector database for fast semantic search |
Status: embedding
Embedding started: 2025-01-15 10:30:45
Chunks created: 127
Embedding model: nomic-embed-text

Embedding Configuration (Library-level):

  • Embedding Model: Which AI model to use (e.g., nomic-embed-text, mxbai-embed-large)
  • Embedding Timeout: Maximum time allowed for embedding generation
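
To make the chunk → embed → store flow concrete, here is a rough TypeScript sketch. It assumes an Ollama-style /api/embeddings endpoint serving nomic-embed-text, a pre-created Typesense collection named chunks with a float[] embedding field, and a naive paragraph-based chunker; none of these details are confirmed parts of George AI's implementation.

```typescript
import Typesense from "typesense";

// Rough sketch only: endpoint, collection name, and chunk size are assumptions.
const typesense = new Typesense.Client({
  nodes: [{ host: "localhost", port: 8108, protocol: "http" }],
  apiKey: "xyz",
});

// 1. Chunking: naive split into ~1000-character pieces on paragraph boundaries.
function chunkText(markdown: string, maxLen = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of markdown.split(/\n{2,}/)) {
    if (current && current.length + para.length > maxLen) {
      chunks.push(current);
      current = "";
    }
    current += (current ? "\n\n" : "") + para;
  }
  if (current) chunks.push(current);
  return chunks;
}

// 2. Embedding: Ollama-style endpoint returning { embedding: number[] }.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

// 3. Storage: assumes a "chunks" collection whose `embedding` field is
//    float[] with num_dim 768 (the dimensionality of nomic-embed-text).
export async function embedExtractedMarkdown(fileId: string, markdown: string) {
  for (const [i, text] of chunkText(markdown).entries()) {
    await typesense
      .collections("chunks")
      .documents()
      .create({ id: `${fileId}-${i}`, fileId, text, embedding: await embed(text) });
  }
}
```
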
Step 4: Processing Complete

Task is marked as completed. File is now searchable and available for AI assistants.

Status: completed
Processing finished: 2025-01-15 10:31:02
Total processing time: 47,000 ms
File is now searchable!

Large File Handling

George AI uses streaming architecture and intelligent file size management to handle large documents efficiently without overwhelming system resources.

Streaming Architecture

Files are processed as streams rather than loading entire contents into memory. This enables handling of multi-GB files with constant memory usage.

Configurable Limits

Set maximum file sizes per crawler to prevent processing oversized files. Default limit is 100 MB, configurable up to several GB.

No Base64 Overhead

Binary files are transferred directly without Base64 encoding, avoiding the ~33% size overhead that Base64 adds and enabling faster processing of large files.

Parallel Processing

Multiple files can be processed simultaneously with configurable concurrency limits to balance speed and resource usage.

File Size Limits & Thresholds

  • Default Maximum: 100 MB (files larger than this are rejected)
  • Warning Threshold: 50 MB (larger files trigger a warning alert)
  • Memory Usage: constant (streaming prevents memory spikes)

| File Size | Processing Behavior | Recommendation |
|-----------|---------------------|----------------|
| 0 - 50 MB | Processes normally without warnings | Optimal size range for fast processing |
| 50 - 100 MB | Processes with a warning logged | Monitor processing time and quality |
| 100 MB+ | Rejected by the crawler (configurable) | Increase maxFileSizeBytes or split the file |

Configurable Limits

File size limits can be configured per crawler when creating or editing crawler settings.
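
For example, a crawler that needs to accept larger files might be configured like this. Only maxFileSizeBytes is named in this documentation; the surrounding fields are illustrative.

```typescript
// Illustrative values only; maxFileSizeBytes is the crawler option
// referenced throughout this page (default: 100 MB).
const crawlerSettings = {
  name: "engineering-share",            // hypothetical crawler name
  maxFileSizeBytes: 500 * 1024 * 1024,  // raise the limit to 500 MB
};
```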

How Large Files Are Processed

1. Streaming Download

Files are downloaded in chunks rather than loading the entire content into memory (see the sketch after this list):

  • SMB crawler uses direct protocol streaming (no Base64)
  • HTTP crawler streams via standard HTTP
  • Memory usage remains constant (~1 KB per file)
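
A minimal TypeScript sketch of the idea, using Node streams and the global fetch API; the function and its size guard are illustrative, not George AI's crawler code.

```typescript
import { createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { Readable, Transform } from "node:stream";

// Conceptual sketch: stream a download to disk while enforcing a size limit,
// so memory usage stays constant regardless of file size.
export async function downloadWithLimit(url: string, dest: string, maxFileSizeBytes: number) {
  const res = await fetch(url);
  if (!res.ok || !res.body) throw new Error(`Download failed: ${res.status}`);

  let received = 0;
  const sizeGuard = new Transform({
    transform(chunk, _enc, callback) {
      received += chunk.length;
      if (received > maxFileSizeBytes) {
        callback(new Error("Exceeds maximum file size"));
      } else {
        callback(null, chunk); // pass each chunk through without buffering the whole file
      }
    },
  });

  // Readable.fromWeb bridges the WHATWG stream returned by fetch into a Node stream.
  await pipeline(Readable.fromWeb(res.body as any), sizeGuard, createWriteStream(dest));
}
```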

2. Format-Specific Extraction

Different file types handle large sizes differently:

| Format | Large File Handling |
|--------|---------------------|
| PDF | Page-by-page extraction, OCR batching |
| CSV/Excel | Row-by-row streaming, automatic splitting |
| Word/PPT | Section-by-section extraction |
| Images | Direct OCR with optional scaling |
| Archives (ZIP) | Extract and process individual files |

3. Incremental Embedding

Text is chunked and embedded in batches:

  • Large documents split into semantic chunks
  • Embeddings generated in parallel batches
  • Each chunk stored independently in vector database
  • Enables processing of documents with 1000+ pages
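
The "parallel batches" step can be sketched as follows; the batch size and the injected embed function are illustrative assumptions.

```typescript
// Illustrative batching sketch: batches run sequentially, chunks within a
// batch are embedded in parallel, so a 1000+ page document never has more
// than `batchSize` embedding requests in flight at once.
export async function embedInBatches(
  chunks: string[],
  embed: (text: string) => Promise<number[]>,
  batchSize = 8,
): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    embeddings.push(...(await Promise.all(batch.map(embed))));
  }
  return embeddings;
}
```
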
CSV/Excel Automatic Splitting

Large CSV and Excel files are automatically split row-by-row for optimal processing:

  • Largest Tested: 732K rows (successfully processed without issues)
  • Memory Per Row: ~1 KB (constant regardless of file size)
  • Search Precision: 1 row (each search result = one exact record)

How It Works:

  1. CSV/Excel file is detected during upload
  2. Each row is converted to a separate markdown file
  3. Files organized in buckets (100 files per directory)
  4. Each row embedded independently for precise search
  5. Summary file created with column headers and statistics
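
Conceptually, the splitting step looks something like the sketch below. The 100-files-per-bucket layout comes from the list above; the file naming, markdown layout, and the assumption of a simple unquoted CSV are illustrative.

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import path from "node:path";

// Conceptual sketch of row-by-row splitting; not George AI's actual code.
// Assumes a comma-separated file with a header row and no quoted commas.
export function splitCsvToMarkdown(csv: string, outDir: string) {
  const [headerLine, ...rows] = csv.trim().split("\n");
  const headers = headerLine.split(",");

  rows.forEach((row, i) => {
    const bucket = path.join(outDir, `bucket-${Math.floor(i / 100)}`); // 100 files per directory
    mkdirSync(bucket, { recursive: true });

    // One markdown file per row, so each later search hit maps to exactly one record.
    const values = row.split(",");
    const markdown = headers.map((h, col) => `**${h}**: ${values[col] ?? ""}`).join("\n");
    writeFileSync(path.join(bucket, `row-${i + 1}.md`), markdown);
  });

  // Summary file with the column headers and basic statistics.
  writeFileSync(
    path.join(outDir, "summary.md"),
    `Columns: ${headers.join(", ")}\nRows: ${rows.length}\n`,
  );
}
```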

Learn more about CSV File Splitting →

Automatic: No configuration needed - CSV splitting is enabled by default for all libraries.

Performance & Optimization Tips

✓ Use Appropriate File Size Limits

Set maxFileSizeBytes in crawler configuration to skip oversized files automatically. Prevents timeouts and resource exhaustion.

✓ Increase Timeouts for Large Files

Adjust extraction timeout and OCR timeout in Library settings when processing large PDFs with many images. Default timeouts may be too low for 500+ page documents.

✓ Optimize OCR Image Scale

Lower OCR image scale (e.g., 0.5) for large scanned PDFs to speed up processing. Higher scales (1.5-2.0) improve quality but increase processing time significantly.
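
As a rough illustration, the knobs from the two tips above might be tuned like this; the field names are placeholders for the corresponding options in the Library settings UI, not an actual configuration file.

```typescript
// Illustrative values only; adjust the equivalent options in Library settings.
const libraryProcessingSettings = {
  extractionTimeoutMs: 30 * 60 * 1000, // headroom for 500+ page PDFs
  ocrTimeoutMs: 10 * 60 * 1000,        // more time for image-heavy documents
  ocrImageScale: 0.5,                  // lower scale = faster OCR; 1.5-2.0 = better quality, slower
};
```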

✓ Split Very Large Files

For documents over 500 pages, consider splitting into smaller files before upload. This improves processing reliability and makes individual sections easier to find in search.

✓ Monitor System Resources

Use Admin Panel to monitor processing queue and AI Services status. If files consistently timeout, consider adding more AI Service instances or increasing system resources.

Troubleshooting Large File Issues

File Rejected: "Exceeds maximum file size"

Cause: File is larger than configured maximum (default 100 MB)

Solutions:

  • Increase maxFileSizeBytes in crawler settings
  • Split the file into smaller chunks before upload
  • For CSV/Excel, use automatic row splitting (no size limit)

Processing Timeout on Large PDFs

Cause: Extraction or OCR timeout too low for document size

Solutions:

  • Increase extraction timeout in Library settings
  • Increase OCR timeout for image-heavy documents
  • Reduce OCR image scale to speed up processing
  • Disable image processing if text extraction is sufficient

Slow Processing for Multi-GB Files

Cause: Very large files take longer despite streaming

Solutions:

  • This is normal behavior - larger files require more time
  • Monitor processing queue to ensure task is progressing
  • Consider splitting file if processing takes hours
  • Ensure adequate AI Service resources (CPU, RAM)

Out of Memory Errors

Cause: System resources exhausted despite streaming

Solutions:

  • Increase Docker container memory limits
  • Reduce number of concurrent processing tasks
  • Stop other resource-intensive services temporarily
  • Check for memory leaks in AI Services logs

Processing Queue System

George AI uses a background queue system to manage processing tasks efficiently:

Content Processing Queue

Handles text extraction and embedding generation for all files

Enrichment Queue

Handles AI-powered data extraction for List enrichment fields

Queue Worker Behavior

Running

Worker continuously picks up pending tasks and processes them

Stopped

Worker is paused. No new tasks are processed, but tasks already in progress continue
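
A simplified picture of this behavior as a polling loop; the helper functions are hypothetical and only illustrate the running/stopped semantics described above.

```typescript
// Hypothetical worker loop sketch; function names are not part of George AI.
let running = true;

export function stopWorker() {
  // "Stopped": no new tasks are picked up, but a task already inside
  // processTask() is allowed to finish.
  running = false;
}

export async function runWorker(
  claimNextPendingTask: () => Promise<{ id: string } | null>,
  processTask: (task: { id: string }) => Promise<void>,
) {
  while (running) {
    const task = await claimNextPendingTask(); // oldest task with status "pending"
    if (!task) {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // idle poll
      continue;
    }
    await processTask(task); // validating → extracting → embedding → completed
  }
}
```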

Automatic Processing

Files are processed automatically when added to a Library. This behavior can be configured in Library settings using the "Auto-process crawled files" option.

Processing Task States

| State | Description | Next Step |
|-------|-------------|-----------|
| none | No processing has been initiated | Wait for task creation or trigger manually |
| pending | Task is queued and waiting for a worker | Worker will pick it up automatically |
| validating | File format and integrity are being checked | Moves to extracting or validationFailed |
| extracting | Text and images are being extracted | Moves to embedding or extractionFailed |
| embedding | Vector embeddings are being generated | Moves to completed or embeddingFailed |
| completed | Processing finished successfully | File is searchable and ready |
| failed | Processing failed at some stage | Retry via file menu or queue management |
| timedOut | Processing exceeded configured timeout | Retry with adjusted timeout or check file |
| cancelled | Task was manually cancelled | Create a new task if needed |
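
For reference, the states in the table map naturally onto a union type like the sketch below; the type is illustrative, and the per-phase failure states (validationFailed, extractionFailed, embeddingFailed) mentioned in the "Next Step" column would extend it.

```typescript
// Illustrative only: state names are taken from the table above.
type ProcessingTaskState =
  | "none"
  | "pending"
  | "validating"
  | "extracting"
  | "embedding"
  | "completed"
  | "failed"
  | "timedOut"
  | "cancelled";

// States that are typically retried via the file menu or queue management.
const retryableStates: ProcessingTaskState[] = ["failed", "timedOut"];
```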

Monitoring Processing

You can monitor and manage processing tasks through the Admin Panel:

Processing Queue Dashboard

Admin Panel → Processing Queue

View Task Statistics:

  • Pending tasks count
  • Currently processing tasks
  • Failed tasks
  • Completed tasks
  • Last processed timestamp

Management Actions:

  • Start/Stop queue workers
  • Retry failed tasks
  • Clear failed tasks
  • Clear pending tasks
  • Cancel specific tasks

Troubleshooting Processing Issues

Files stuck in "pending" state

Possible Causes:

  • Queue worker is stopped
  • Too many tasks overwhelming the queue
  • System resources exhausted

Solutions:

  • Check Admin Panel → Processing Queue to verify worker is running
  • Start the worker if stopped
  • Monitor system CPU/memory usage
  • Consider adding more AI Service servers for parallel processing

Extraction fails or times out

Possible Causes:

  • File is corrupted or unsupported format
  • File is extremely large (100+ pages)
  • OCR timeout too low for complex images
  • AI model not available

Solutions:

  • Verify file can be opened in its native application
  • Increase extraction timeout in Library settings
  • Increase OCR timeout for image-heavy documents
  • Check AI Services status
  • Try re-processing the file via file menu

Embedding fails or times out

Possible Causes:

  • Embedding model not loaded on AI Services
  • Document extracted to extremely large text
  • Embedding timeout too low
  • Typesense vector database connectivity issues

Solutions:

  • Verify embedding model is available (check Library settings)
  • Increase embedding timeout in Library settings
  • Check Typesense service status
  • Verify AI Services can connect to Typesense
  • Retry embedding via file menu

Poor OCR quality for images/scans

Possible Causes:

  • Low-resolution images
  • Poor scan quality
  • OCR prompt not optimized for document type
  • OCR image scale too low

Solutions:

  • Increase OCR image scale in Library settings (try 1.5 or 2.0)
  • Customize OCR prompt to describe document structure
  • Use higher-resolution source images if possible
  • Try a different OCR model (e.g., qwen2.5vl:latest)
  • Re-process file after adjusting settings

Related Topics

Learn more about related features:

George-Cloud