Crawlers

Automatically collect files from external sources into your Libraries

What are Crawlers?

Crawlers are automated collectors that continuously gather files from external sources like SharePoint, file shares, websites, and cloud storage, bringing them into your George AI Libraries.

Instead of manually uploading files, you configure a crawler once, and it automatically discovers and imports files—keeping your Library synchronized with the source.

One-Time Setup

Configure crawler once with source URL and credentials

Automatic Updates

Schedule crawlers to run daily or weekly, or trigger runs manually

Supported Sources

SharePoint Online

Crawl SharePoint document libraries and OneDrive folders

URI Format: https://yourcompany.sharepoint.com/sites/SiteName

Authentication: Browser cookies (FedAuth, rtFa)

SMB / Windows File Share

Access network file shares and Windows servers

URI Format: smb://server/share/folder

Authentication: Username + Password

HTTP/HTTPS Websites

Crawl public or internal websites and download files

URI Format: https://docs.example.com/files

Authentication: None (public sites only)

Box.com

Access Box.com enterprise cloud storage folders

URI Format: https://app.box.com/folder/123456789

Authentication: Customer ID + Developer Token

API / REST

Crawl REST APIs and e-commerce platforms with pre-configured templates

Templates: Shopware 6, Weclapp, Generic REST

Authentication: OAuth2, API Key, Bearer Token, Basic Auth

API crawlers will soon be replaced with Connectors for more powerful automation.

Creating a Crawler

  • Select Source Type
    SharePoint, SMB, HTTP, Box, or API
  • Configure URI & Limits
    URL, depth, max pages
  • Set File Filters
    Optional: patterns, size, MIME types
  • Add Credentials
    Authentication details
  • Schedule Runs
    Optional: daily/weekly automation
1. Basic Configuration
URI

Required

The source location to crawl (URL, network path, etc.)

https://company.sharepoint.com/sites/Docs
smb://fileserver/shared/documents
https://docs.example.com/files
https://app.box.com/folder/123456789
Max Depth

Required • Default: 2

How many folder levels deep to crawl (0 = only root folder)

Example: Depth 2 crawls /docs, /docs/2024, and /docs/2024/Q1
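
To make the depth limit concrete, here is a minimal TypeScript sketch that walks a local folder tree and stops at the configured depth. It uses the local filesystem as a stand-in for whatever source is being crawled; the function name and logic are illustrative, not George AI's implementation.

import { readdirSync } from "node:fs";
import { join } from "node:path";

// Collect file paths up to maxDepth levels below the root (depth 0 = root folder only).
function discoverFiles(root: string, maxDepth: number, depth = 0): string[] {
  const found: string[] = [];
  for (const entry of readdirSync(root, { withFileTypes: true })) {
    const full = join(root, entry.name);
    if (entry.isFile()) {
      found.push(full);
    } else if (entry.isDirectory() && depth < maxDepth) {
      // Recurse one level deeper; recursion stops once the configured limit is reached.
      found.push(...discoverFiles(full, maxDepth, depth + 1));
    }
  }
  return found;
}

// With maxDepth 2, files in /docs, /docs/2024, and /docs/2024/Q1 are discovered.
console.log(discoverFiles("/docs", 2));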

Max Pages

Required • Default: 10

Maximum number of files to collect per crawler run

Prevents overwhelming the system with too many files at once

2. File Filters (Optional)
Include Patterns

Regex patterns to include specific files

\.pdf$, \.docx?$, \.txt$
Only collect PDF, DOC, DOCX, and TXT files
Exclude Patterns

Regex patterns to exclude files/folders

archive, _old, backup, temp, \.tmp$
Skip folders named "archive", "_old", etc.
Min File Size

Minimum file size in MB (e.g., 0.1 = 100 KB)

Use to skip tiny files that likely contain no useful content

Max File Size

Maximum file size in MB (e.g., 50 = 50 MB)

Use to skip extremely large files that may time out during processing

Allowed MIME Types

Comma-separated list of allowed file types

application/pdf, text/plain, application/msword
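
The following TypeScript sketch shows how these filters could be combined when deciding whether to collect a discovered file. The field names mirror the options above, but the configuration shape and the shouldCollect function are assumptions for illustration, not the actual crawler code.

interface FileInfo { name: string; sizeMb: number; mimeType: string; }

interface FilterConfig {
  includePatterns: RegExp[];    // e.g. [/\.pdf$/i, /\.docx?$/i, /\.txt$/i]
  excludePatterns: RegExp[];    // e.g. [/archive/i, /_old/i, /\.tmp$/i]
  minFileSizeMb?: number;       // e.g. 0.1 (100 KB)
  maxFileSizeMb?: number;       // e.g. 50
  allowedMimeTypes?: string[];  // e.g. ["application/pdf", "text/plain"]
}

// Returns true only if the file passes every configured filter.
function shouldCollect(file: FileInfo, cfg: FilterConfig): boolean {
  if (cfg.includePatterns.length > 0 && !cfg.includePatterns.some(p => p.test(file.name))) return false;
  if (cfg.excludePatterns.some(p => p.test(file.name))) return false;
  if (cfg.minFileSizeMb !== undefined && file.sizeMb < cfg.minFileSizeMb) return false;
  if (cfg.maxFileSizeMb !== undefined && file.sizeMb > cfg.maxFileSizeMb) return false;
  if (cfg.allowedMimeTypes && !cfg.allowedMimeTypes.includes(file.mimeType)) return false;
  return true;
}
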
3. Authentication

SharePoint Online

Requires browser authentication cookies:

  1. Open your SharePoint site in a browser and log in
  2. Open Developer Tools (F12) → Network tab
  3. Refresh the page or navigate to a document library
  4. Find any request to your SharePoint site
  5. Copy the complete "Cookie" header value (must include FedAuth and rtFa cookies)
  6. Paste into the "SharePoint Authentication Cookies" field
Cookies are session-based and expire. You may need to refresh them periodically if crawler runs start failing.
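
Conceptually, the crawler sends the pasted cookie string as the Cookie header on every request to SharePoint's REST API. The sketch below is a hedged illustration of that idea: the GetFolderByServerRelativeUrl endpoint is SharePoint's standard REST path, but the library name "Shared Documents" and the surrounding code are assumptions, not George AI's implementation.

// Paste the full cookie string copied from the browser (must contain FedAuth and rtFa).
const cookieHeader = "FedAuth=...; rtFa=...";

async function listLibraryFiles(siteUrl: string): Promise<void> {
  // SharePoint REST API: list files in the "Shared Documents" library.
  const res = await fetch(
    `${siteUrl}/_api/web/GetFolderByServerRelativeUrl('Shared Documents')/Files`,
    { headers: { Cookie: cookieHeader, Accept: "application/json;odata=verbose" } },
  );
  if (res.status === 401 || res.status === 403) {
    throw new Error("SharePoint rejected the cookies - they have probably expired.");
  }
  const data = await res.json();
  console.log(data.d.results.map((f: { Name: string }) => f.Name));
}

listLibraryFiles("https://yourcompany.sharepoint.com/sites/SiteName");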

SMB / Windows File Share

Requires network credentials:

  • Username: Domain username (e.g., DOMAIN\username or username@domain.com)
  • Password: User's password

Box.com

Requires Box API credentials:

  • Customer ID: Your Box enterprise customer ID (10+ characters)
  • Developer Token: Box API developer token (20+ characters)

Contact your Box administrator to obtain these credentials.

HTTP/HTTPS Websites

No authentication required. Only public websites are supported.

4. Scheduling (Optional)

Schedule crawlers to run automatically on a recurring basis:

Cron Schedule Configuration

  • Active: Enable/disable scheduled runs
  • Time: Hour (0-23) and Minute (0-59) in 24-hour format
  • Days: Select which days of the week to run (Monday - Sunday)

Examples:

  • Every weekday at 3:00 AM: Hour 3, Minute 0, Days Mon-Fri
  • Sundays at 11:30 PM: Hour 23, Minute 30, Days Sunday
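
For reference, these fields correspond to a standard five-field cron expression (minute, hour, day of month, month, day of week). A small TypeScript sketch of that mapping, purely for illustration:

type Weekday = "Sun" | "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat";

const CRON_DAY: Record<Weekday, number> = { Sun: 0, Mon: 1, Tue: 2, Wed: 3, Thu: 4, Fri: 5, Sat: 6 };

// Build "minute hour * * days" from the schedule fields.
function toCron(hour: number, minute: number, days: Weekday[]): string {
  const dayField = days.length === 7 ? "*" : days.map(d => CRON_DAY[d]).join(",");
  return `${minute} ${hour} * * ${dayField}`;
}

console.log(toCron(3, 0, ["Mon", "Tue", "Wed", "Thu", "Fri"])); // "0 3 * * 1,2,3,4,5"
console.log(toCron(23, 30, ["Sun"]));                           // "30 23 * * 0"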

Manual Runs

You can always trigger a crawler run manually from the Library crawler list, regardless of schedule settings.

Crawler Runs & Monitoring

Each time a crawler executes, it creates a "crawler run" with detailed statistics:

Run Information

Run Metadata:

  • Start time and end time
  • Duration
  • Triggered by user (manual) or scheduler
  • Success or failure status
  • Error message (if failed)

Discovered Updates:

  • New files found
  • Modified files detected
  • Deleted files identified
  • Skipped files (hash unchanged)
  • Error files (couldn't access)

The crawler list also shows monitoring cards for each crawler: the current run status (e.g., "Crawler is running... started 5 minutes ago"), the result of the last run (e.g., "Completed, 127 files discovered"), and the total number of runs since the crawler was created (e.g., 48).

How Crawling Works

1. Discovery

Crawler navigates the source (folders, links, etc.) up to the configured max depth, discovering files that match filters

2. Change Detection

For each discovered file, crawler checks if it already exists in the Library by comparing content hash and modification date
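
A minimal sketch of hash-based change detection follows. The in-memory map stands in for whatever the crawler persists between runs; the store shape and function name are assumptions, and the principle is simply comparing a content hash against the one recorded for the previously imported version.

import { createHash } from "node:crypto";

// Hash of the previously imported version, keyed by source path (in practice loaded from the database).
const knownHashes = new Map<string, string>();

function detectChange(path: string, content: Buffer): "new" | "modified" | "unchanged" {
  const hash = createHash("md5").update(content).digest("hex");
  const previous = knownHashes.get(path);
  if (previous === undefined) { knownHashes.set(path, hash); return "new"; }
  if (previous !== hash)      { knownHashes.set(path, hash); return "modified"; }
  return "unchanged"; // skipped: no download or re-processing needed
}

console.log(detectChange("/docs/report.pdf", Buffer.from("v1"))); // "new"
console.log(detectChange("/docs/report.pdf", Buffer.from("v1"))); // "unchanged"
console.log(detectChange("/docs/report.pdf", Buffer.from("v2"))); // "modified"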

3. File Import

New or modified files are downloaded and added to the Library. Unchanged files are skipped to save processing time.

4. Automatic Processing

If "Auto-process crawled files" is enabled in Library settings, new files are automatically queued for extraction and embedding

SMB / Windows File Share Details

George AI provides direct SMB2 protocol support for accessing Windows file shares and network storage. No mounting required—works seamlessly in Docker and cloud environments.

Direct SMB2 Protocol

Native SMB2/3 protocol implementation—no CIFS kernel modules or system mounts needed. Works in Docker, Kubernetes, and serverless environments.

Large File Support

Stream files directly without Base64 encoding overhead. Efficiently handle multi-GB files with constant memory usage.

Real-Time Progress

Server-Sent Events (SSE) provide live file discovery updates. See files as they're found during the crawl.

Incremental Updates

MD5 hash comparison detects file changes. Only new and modified files are downloaded and processed.

URI Format & Examples

SMB crawler URIs follow the standard UNC path format:

//server/share/optional/path
//fileserver.company.com/departments/marketing
//10.0.1.50/shared/documents
//nas-001/public/reports/2024
Component   Description                     Required
//          UNC path prefix                 Yes
server      Hostname, FQDN, or IP address   Yes
share       SMB share name                  Yes
path        Subdirectory within share       No

Backslashes Not Supported

Use forward slashes only. Windows UNC format \\server\share should be written as //server/share
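
A small TypeScript sketch of how such a URI breaks into its components, including normalizing Windows-style backslashes; the parse function is illustrative only, not the crawler's actual parser.

interface SmbTarget { server: string; share: string; path: string; }

// Accepts //server/share/optional/path and tolerates \\server\share input.
function parseSmbUri(uri: string): SmbTarget {
  const normalized = uri.replace(/\\/g, "/");  // \\server\share -> //server/share
  const match = normalized.match(/^\/\/([^/]+)\/([^/]+)(?:\/(.*))?$/);
  if (!match) throw new Error(`Invalid SMB URI: ${uri}`);
  return { server: match[1], share: match[2], path: match[3] ?? "" };
}

console.log(parseSmbUri("//fileserver.company.com/departments/marketing"));
// { server: "fileserver.company.com", share: "departments", path: "marketing" }
console.log(parseSmbUri("\\\\nas-001\\public\\reports\\2024"));
// { server: "nas-001", share: "public", path: "reports/2024" }
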
Authentication

SMB crawlers support username and password authentication:

Format                Example            Use Case
Username only         john               Local account on file server
Domain\Username       COMPANY\john       Active Directory domain account
username@domain.com   john@company.com   UPN format (modern AD)

Credentials Storage

Credentials are securely encrypted in the database. Only the crawler service can decrypt and use them for SMB connections.

File Filtering with Glob Patterns

Use glob patterns to control which files are crawled:

Include Patterns

Only files matching these patterns are crawled:

*.pdf                 # All PDF files
*.docx                # Word files
report-*.pdf          # PDFs starting with "report-"

Exclude Patterns

Files matching these patterns are skipped:

**/~$*            # Office temporary files
**/.git/**        # Git directories
**/temp/**        # Temp folders
**/archive/**     # Archived content
Pattern          Matches
*                Any characters except /
**               Any characters including / (recursive)
?                A single character
[abc]            Any character in the set
*.pdf or *.doc   Multiple extensions (use separate patterns)
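
To make the matching rules concrete, here is a simplified TypeScript translation of these glob tokens into a regular expression. Real glob libraries handle more edge cases; this is a sketch, not the crawler's actual matcher.

// Simplified glob-to-RegExp: supports **, *, ?, and [abc] character sets.
function globToRegExp(glob: string): RegExp {
  let re = "";
  for (let i = 0; i < glob.length; i++) {
    const c = glob[i];
    if (c === "*" && glob[i + 1] === "*") { re += ".*"; i++; }  // ** matches across /
    else if (c === "*") re += "[^/]*";                          // * stops at /
    else if (c === "?") re += "[^/]";                           // single character
    else if (c === "[") {
      const end = glob.indexOf("]", i);
      if (end === -1) { re += "\\["; }                          // unterminated set: treat as literal
      else { re += glob.slice(i, end + 1); i = end; }
    }
    else re += c.replace(/[.+^${}()|\\]/g, "\\$&");             // escape regex metacharacters
  }
  return new RegExp(`^${re}$`);
}

console.log(globToRegExp("*.pdf").test("report.pdf"));          // true
console.log(globToRegExp("*.pdf").test("2024/report.pdf"));     // false (* does not cross /)
console.log(globToRegExp("**/temp/**").test("a/temp/b.docx"));  // true
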
Performance & File Size Limits

SMB crawlers handle large files efficiently with streaming architecture:

  • Memory Usage: ~1 KB per file, constant regardless of file size
  • Max File Size: 100 MB default limit (configurable)
  • Download Speed: limited by the SMB connection speed

Streaming Architecture: Files are processed as they're downloaded. No need to buffer entire files in memory.
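
The constant-memory behavior is the usual stream-pipeline pattern: bytes are written to disk as they arrive instead of being buffered whole. The TypeScript sketch below illustrates the idea with a plain HTTPS source, since showing the SMB client internals is out of scope here; it is not the crawler's actual code.

import { createWriteStream } from "node:fs";
import { get } from "node:https";
import { pipeline } from "node:stream/promises";

// Stream a file to disk without buffering it in memory.
function downloadToFile(url: string, destination: string): Promise<void> {
  return new Promise((resolve, reject) => {
    get(url, (res) => {
      if (res.statusCode !== 200) { reject(new Error(`Download failed: ${res.statusCode}`)); return; }
      // Bytes flow from the socket into the file as they arrive; memory use stays
      // roughly constant whether the file is 1 MB or several GB.
      pipeline(res, createWriteStream(destination)).then(resolve, reject);
    }).on("error", reject);
  });
}

downloadToFile("https://docs.example.com/files/report.pdf", "/tmp/report.pdf");
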
Common SMB Issues

Error: "Authentication failed"

Causes:

  • Incorrect username or password
  • Account locked or disabled
  • Domain name incorrect

Solutions:

  • Verify credentials can access the share from another computer
  • Use correct domain format (DOMAIN\user or user@domain.com)
  • Check with IT if account has SMB access permissions

Error: "Share not found" or "Connection refused"

Causes:

  • Server hostname/IP incorrect
  • Share name misspelled
  • Firewall blocking SMB port 445
  • Server not accessible from George AI network

Solutions:

  • Verify URI format: //server/share (forward slashes)
  • Test network connectivity (ping server from George AI container)
  • Ensure port 445 is open in firewall rules
  • Use IP address instead of hostname if DNS resolution fails

Issue: No files found

Causes:

  • Include patterns too restrictive
  • Path doesn't exist within share
  • User doesn't have read permissions

Solutions:

  • Try without include patterns first to see all files
  • Verify path exists by accessing share manually
  • Check user has at least read permissions on the folder

Issue: Slow crawl performance

Causes:

  • Crawling thousands of files
  • Slow network connection to file server
  • Large files being downloaded

Solutions:

  • Use more specific paths to reduce file count
  • Add exclude patterns for unnecessary folders
  • Set maxFileSizeBytes to skip very large files
  • Improve network bandwidth between George AI and file server

API / REST Crawler Details

George AI supports crawling REST APIs with pre-configured templates for popular e-commerce platforms and a generic REST template for custom APIs. API crawlers handle pagination, authentication, and data mapping automatically.

Migration to Connectors

API crawlers will soon be replaced with Connectors, which provide more powerful automation capabilities including bidirectional sync, field mapping UIs, and write operations. For new integrations, consider using Connectors instead.

Pre-Configured Templates

Ready-to-use templates for Shopware 6 and Weclapp e-commerce platforms, plus a generic REST template for custom APIs.

Multiple Auth Methods

Supports OAuth2, API keys, Bearer tokens, and Basic authentication. Credentials securely encrypted in database.

Automatic Pagination

Handles API pagination automatically, collecting all records across multiple API calls with configurable delays and concurrency.

Data Transformation

Converts API responses to markdown format automatically. Each product/record becomes a searchable document.
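
A sketch of that transformation for a single record; the field names are invented for illustration, since the actual mapping depends on the template and the API's response shape.

interface ProductRecord { name: string; sku: string; price: number; description: string; }

// Turn one API record into a small markdown document that can be indexed and searched.
function toMarkdown(product: ProductRecord): string {
  return [
    `# ${product.name}`,
    "",
    `SKU: ${product.sku}`,
    `Price: ${product.price.toFixed(2)}`,
    "",
    product.description,
  ].join("\n");
}

console.log(toMarkdown({ name: "Desk Lamp", sku: "DL-100", price: 39.9, description: "Adjustable LED desk lamp." }));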

Available Templates

Shopware 6

E-commerce platform integration for product catalog crawling.

Provider: shopware6
Endpoint: /api/product
Authentication: OAuth2 (Client Credentials)
Configuration:
  • Client ID: OAuth2 integration client ID
  • Client Secret: OAuth2 integration client secret
  • Token URL: OAuth2 token endpoint
  • Associations: Related data to include (manufacturer, categories, cover.media, properties, etc.)
Rate Limiting: 100 ms delay between requests, max 3 concurrent requests
Example: https://shop.example.com/api/product
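
A hedged sketch of the client-credentials flow behind this template: /api/oauth/token is Shopware 6's usual Admin API token endpoint, but the configured Token URL takes precedence, and the surrounding code is illustrative rather than George AI's implementation.

// Obtain an access token via the client-credentials grant, then call the product endpoint.
async function fetchShopwareProducts(baseUrl: string, clientId: string, clientSecret: string) {
  const tokenRes = await fetch(`${baseUrl}/api/oauth/token`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ grant_type: "client_credentials", client_id: clientId, client_secret: clientSecret }),
  });
  const { access_token } = await tokenRes.json();

  const productRes = await fetch(`${baseUrl}/api/product?limit=50`, {
    headers: { Authorization: `Bearer ${access_token}`, Accept: "application/json" },
  });
  return (await productRes.json()).data;
}

fetchShopwareProducts("https://shop.example.com", "your-client-id", "your-client-secret")
  .then(products => console.log(`${products.length} products fetched`));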

Weclapp

ERP platform integration for product and inventory data.

Provider: weclapp
Endpoint: Configured per template
Authentication: API Token
Configuration:
  • API Token: Weclapp API authentication token
  • Tenant: Weclapp tenant/instance name

Generic REST

Flexible template for any REST API with custom configuration.

Provider: custom
Endpoint: Fully customizable API URL
Authentication: API Key, Bearer Token, Basic Auth, or None
Configuration:
  • Base URL: API base URL (e.g., https://api.example.com/products)
  • Endpoint: API endpoint path
  • Headers: Custom HTTP headers in JSON format (e.g., Authorization: Bearer your-token)
  • Pagination: Define how the API handles pagination
  • Response Path: JSON path to the data array in the response (e.g., data.products)
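
A small sketch of how a Response Path such as data.products selects the record array from the parsed JSON response; illustrative only.

// Walk a dot-separated path (e.g. "data.products") into a parsed JSON response.
function extractByPath(response: unknown, path: string): unknown[] {
  const value = path.split(".").reduce<any>((current, key) => current?.[key], response);
  return Array.isArray(value) ? value : [];
}

const response = { data: { products: [{ id: 1 }, { id: 2 }], total: 2 } };
console.log(extractByPath(response, "data.products")); // [{ id: 1 }, { id: 2 }]
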
Authentication Types

OAuth2 (Client Credentials)

Used by Shopware 6 template for secure API access.

  • Client ID: OAuth2 application/integration ID
  • Client Secret: OAuth2 application secret
  • Token URL: OAuth2 token endpoint
  • Scope: Optional - OAuth2 scopes to request

Access tokens are automatically refreshed when expired.

API Key

Simple key-based authentication via header or query parameter.

X-API-Key: your-api-key-here
?api_key=your-api-key-here

Bearer Token

Token-based authentication in Authorization header.

Authorization: Bearer your-token-here

Basic Authentication

Username and password encoded in Authorization header.

Authorization: Basic base64(username:password)
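
A compact TypeScript sketch showing how each of these methods ends up as an HTTP header on outgoing requests; the configuration shape is an assumption for illustration.

type Auth =
  | { type: "apiKey"; headerName: string; key: string }
  | { type: "bearer"; token: string }
  | { type: "basic"; username: string; password: string }
  | { type: "none" };

// Build the HTTP headers for one API request based on the configured auth method.
function authHeaders(auth: Auth): Record<string, string> {
  switch (auth.type) {
    case "apiKey": return { [auth.headerName]: auth.key };  // e.g. X-API-Key
    case "bearer": return { Authorization: `Bearer ${auth.token}` };
    case "basic":  return { Authorization: `Basic ${Buffer.from(`${auth.username}:${auth.password}`).toString("base64")}` };
    case "none":   return {};
  }
}

console.log(authHeaders({ type: "basic", username: "api", password: "secret" }));
// { Authorization: "Basic YXBpOnNlY3JldA==" }
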
Configuration Options
Option          Description                                       Default
requestDelay    Milliseconds to wait between API requests         100 ms
maxConcurrency  Maximum number of concurrent API requests         3
pageSize        Number of records per API page/request            50
maxPages        Maximum number of pages to crawl                  10
associations    Related data to include (Shopware 6 only)         Empty
responsePath    JSON path to the data array in the API response   Varies

Rate Limiting

Use requestDelay and maxConcurrency to respect API rate limits and avoid being blocked.
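
A sketch of a paginated fetch loop that honors pageSize, maxPages and requestDelay; the page and limit query parameters are assumptions for a generic REST API, not fixed names.

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

// Fetch up to maxPages pages, pausing requestDelay milliseconds between requests.
async function crawlPaginated(endpoint: string, pageSize = 50, maxPages = 10, requestDelay = 100) {
  const records: unknown[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const res = await fetch(`${endpoint}?page=${page}&limit=${pageSize}`);
    const items: unknown[] = (await res.json()).data ?? [];
    records.push(...items);
    if (items.length < pageSize) break;  // last page reached
    await sleep(requestDelay);           // respect API rate limits
  }
  return records;
}

crawlPaginated("https://api.example.com/products").then(r => console.log(`${r.length} records collected`));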

Migration to Connectors

API crawlers are being phased out in favor of Connectors, which provide:

Connectors Advantages

  • ✓ Bidirectional sync (read and write)
  • ✓ Field mapping UI (no code required)
  • ✓ Write operations (create/update records)
  • ✓ Automation workflows
  • ✓ Error handling and retry logic
  • ✓ Real-time sync triggers

API Crawler Limitations

  • ✗ Read-only (no write operations)
  • ✗ Manual configuration (JSON/code)
  • ✗ Limited error handling
  • ✗ No field mapping UI
  • ✗ Manual scheduling only

Recommendation

For new integrations, we recommend using Connectors instead of API crawlers. Connectors provide a better user experience, more features, and easier maintenance.

See the Shopware 6 Connector documentation for an example of the new approach.

Best Practices

Start with Small Max Pages

Begin with maxPages=10 or 50 to test your crawler configuration. Once confirmed working, increase to larger numbers.

Use File Filters Wisely

Exclude unnecessary folders (archive, temp, backup) and limit file types to what you actually need. This speeds up crawling and reduces noise.

Schedule During Off-Hours

Run scheduled crawlers during low-traffic times (e.g., 3 AM) to avoid impacting system performance during business hours.

Monitor SharePoint Cookie Expiration

SharePoint authentication cookies typically expire after a few hours or days. If crawler runs start failing, refresh the cookies.

Troubleshooting

  • Crawler run fails immediately: invalid credentials or expired cookies. Verify the credentials and refresh the SharePoint cookies if applicable.
  • No files discovered: filters too restrictive or maxDepth too low. Review the include/exclude patterns and increase maxDepth.
  • Crawler times out: too many files or a very slow network. Reduce maxPages or increase the timeout in the crawler configuration.
  • Files not processing after the crawl: "Auto-process crawled files" is disabled. Enable it in Library settings or trigger processing manually.
  • Scheduled runs not executing: cron job inactive or the system scheduler stopped. Verify that cronJob.active is true and check the system status.

Related Topics

Learn more about file management and processing:

George-Cloud