Crawlers

Automatically collect files from external sources into your Libraries

What are Crawlers?

Crawlers are automated collectors that continuously gather files from external sources like SharePoint, file shares, websites, and cloud storage, bringing them into your George AI Libraries.

Instead of manually uploading files, you configure a crawler once, and it automatically discovers and imports files—keeping your Library synchronized with the source.

One-Time Setup

Configure crawler once with source URL and credentials

Automatic Updates

Schedule crawlers to run daily or weekly, or trigger runs manually

Supported Sources

SharePoint Online

Crawl SharePoint document libraries and OneDrive folders

URI Format: https://yourcompany.sharepoint.com/sites/SiteName

Authentication: Browser cookies (FedAuth, rtFa)

SMB / Windows File Share

Access network file shares and Windows servers

URI Format: smb://server/share/folder

Authentication: Username + Password

HTTP/HTTPS Websites

Crawl public or internal websites and download files

URI Format: https://docs.example.com/files

Authentication: None (public sites only)

Box.com

Access Box.com enterprise cloud storage folders

URI Format: https://app.box.com/folder/123456789

Authentication: Customer ID + Developer Token

API / REST

Crawl REST APIs and e-commerce platforms with pre-configured templates

Templates: Shopware 6, Weclapp, Generic REST

Authentication: OAuth2, API Key, Bearer Token, Basic Auth

API crawlers will soon be replaced with Connectors for more powerful automation.

Creating a Crawler

  • Select Source Type
    SharePoint, SMB, HTTP, Box, or API
  • Configure URI & Limits
    URL, depth, max pages
  • Set File Filters
    Optional: patterns, size, MIME types
  • Add Credentials
    Authentication details
  • Schedule Runs
    Optional: daily/weekly automation
1. Basic Configuration
URI

Required

The source location to crawl (URL, network path, etc.)

https://company.sharepoint.com/sites/Docs
smb://fileserver/shared/documents
https://docs.example.com/files
https://app.box.com/folder/123456789
Max Depth

Required • Default: 2

How many folder levels deep to crawl (0 = only root folder)

Example: Depth 2 crawls /docs, /docs/2024, and /docs/2024/Q1
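
To make the depth limit concrete, here is a minimal TypeScript sketch that walks a local folder tree and stops at the configured depth. It uses the local filesystem as a stand-in for whatever source is being crawled; the function name and logic are illustrative, not George AI's implementation.

import { readdirSync } from "node:fs";
import { join } from "node:path";

// Collect file paths up to maxDepth levels below the root (depth 0 = root folder only).
function discoverFiles(root: string, maxDepth: number, depth = 0): string[] {
  const found: string[] = [];
  for (const entry of readdirSync(root, { withFileTypes: true })) {
    const full = join(root, entry.name);
    if (entry.isFile()) {
      found.push(full);
    } else if (entry.isDirectory() && depth < maxDepth) {
      // Recurse one level deeper; recursion stops once the configured limit is reached.
      found.push(...discoverFiles(full, maxDepth, depth + 1));
    }
  }
  return found;
}

// With maxDepth 2, files in /docs, /docs/2024, and /docs/2024/Q1 are discovered.
console.log(discoverFiles("/docs", 2));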

Max Pages

Required • Default: 10

Maximum number of files to collect per crawler run

Prevents overwhelming the system with too many files at once

2. File Filters (Optional)
Include Patterns

Regex patterns to include specific files

\.pdf$, \.docx?$, \.txt$
Only collect PDF, DOC, DOCX, and TXT files
Exclude Patterns

Regex patterns to exclude files/folders

archive, _old, backup, temp, \.tmp$
Skip folders named "archive", "_old", etc.
Min File Size

Minimum file size in MB (e.g., 0.1 = 100 KB)

Use to skip tiny files that likely contain no useful content

Max File Size

Maximum file size in MB (e.g., 50 = 50 MB)

Use to skip extremely large files that may time out during processing

Allowed MIME Types

Comma-separated list of allowed file types

application/pdf, text/plain, application/msword
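
The following TypeScript sketch shows how these filters could be combined when deciding whether to collect a discovered file. The field names mirror the options above, but the configuration shape and the shouldCollect function are assumptions for illustration, not the actual crawler code.

interface FileInfo { name: string; sizeMb: number; mimeType: string; }

interface FilterConfig {
  includePatterns: RegExp[];    // e.g. [/\.pdf$/i, /\.docx?$/i, /\.txt$/i]
  excludePatterns: RegExp[];    // e.g. [/archive/i, /_old/i, /\.tmp$/i]
  minFileSizeMb?: number;       // e.g. 0.1 (100 KB)
  maxFileSizeMb?: number;       // e.g. 50
  allowedMimeTypes?: string[];  // e.g. ["application/pdf", "text/plain"]
}

// Returns true only if the file passes every configured filter.
function shouldCollect(file: FileInfo, cfg: FilterConfig): boolean {
  if (cfg.includePatterns.length > 0 && !cfg.includePatterns.some(p => p.test(file.name))) return false;
  if (cfg.excludePatterns.some(p => p.test(file.name))) return false;
  if (cfg.minFileSizeMb !== undefined && file.sizeMb < cfg.minFileSizeMb) return false;
  if (cfg.maxFileSizeMb !== undefined && file.sizeMb > cfg.maxFileSizeMb) return false;
  if (cfg.allowedMimeTypes && !cfg.allowedMimeTypes.includes(file.mimeType)) return false;
  return true;
}
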
3. Authentication

SharePoint Online

Requires browser authentication cookies:

  1. Open your SharePoint site in a browser and log in
  2. Open Developer Tools (F12) → Network tab
  3. Refresh the page or navigate to a document library
  4. Find any request to your SharePoint site
  5. Copy the complete "Cookie" header value (must include FedAuth and rtFa cookies)
  6. Paste into the "SharePoint Authentication Cookies" field
Cookies are session-based and expire. You may need to refresh them periodically if crawler runs start failing.
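
Conceptually, the crawler sends the pasted cookie string as the Cookie header on every request to SharePoint's REST API. The sketch below is a hedged illustration of that idea: the GetFolderByServerRelativeUrl endpoint is SharePoint's standard REST path, but the library name "Shared Documents" and the surrounding code are assumptions, not George AI's implementation.

// Paste the full cookie string copied from the browser (must contain FedAuth and rtFa).
const cookieHeader = "FedAuth=...; rtFa=...";

async function listLibraryFiles(siteUrl: string): Promise<void> {
  // SharePoint REST API: list files in the "Shared Documents" library.
  const res = await fetch(
    `${siteUrl}/_api/web/GetFolderByServerRelativeUrl('Shared Documents')/Files`,
    { headers: { Cookie: cookieHeader, Accept: "application/json;odata=verbose" } },
  );
  if (res.status === 401 || res.status === 403) {
    throw new Error("SharePoint rejected the cookies - they have probably expired.");
  }
  const data = await res.json();
  console.log(data.d.results.map((f: { Name: string }) => f.Name));
}

listLibraryFiles("https://yourcompany.sharepoint.com/sites/SiteName");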

SMB / Windows File Share

Requires network credentials:

  • Username: Domain username (e.g., DOMAIN\username or username@domain.com)
  • Password: User's password

Box.com

Requires Box API credentials:

  • Customer ID: Your Box enterprise customer ID (10+ characters)
  • Developer Token: Box API developer token (20+ characters)

Contact your Box administrator to obtain these credentials.

HTTP/HTTPS Websites

No authentication required. Only public websites are supported.

4. Scheduling (Optional)

Schedule crawlers to run automatically on a recurring basis:

Cron Schedule Configuration

  • Active: Enable/disable scheduled runs
  • Time: Hour (0-23) and Minute (0-59) in 24-hour format
  • Days: Select which days of the week to run (Monday - Sunday)

Examples:

  • Every weekday at 3:00 AM: Hour 3, Minute 0, Days Mon-Fri
  • Sundays at 11:30 PM: Hour 23, Minute 30, Days Sunday
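
For reference, these fields correspond to a standard five-field cron expression (minute, hour, day of month, month, day of week). A small TypeScript sketch of that mapping, purely for illustration:

type Weekday = "Sun" | "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat";

const CRON_DAY: Record<Weekday, number> = { Sun: 0, Mon: 1, Tue: 2, Wed: 3, Thu: 4, Fri: 5, Sat: 6 };

// Build "minute hour * * days" from the schedule fields.
function toCron(hour: number, minute: number, days: Weekday[]): string {
  const dayField = days.length === 7 ? "*" : days.map(d => CRON_DAY[d]).join(",");
  return `${minute} ${hour} * * ${dayField}`;
}

console.log(toCron(3, 0, ["Mon", "Tue", "Wed", "Thu", "Fri"])); // "0 3 * * 1,2,3,4,5"
console.log(toCron(23, 30, ["Sun"]));                           // "30 23 * * 0"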

Manual Runs

You can always trigger a crawler run manually from the Library crawler list, regardless of schedule settings.

Crawler Runs & Monitoring

Each time a crawler executes, it creates a "crawler run" with detailed statistics:

Run Information

Run Metadata:

  • Start time and end time
  • Duration
  • Triggered by user (manual) or scheduler
  • Success or failure status
  • Error message (if failed)

Discovered Updates:

  • New files found
  • Modified files detected
  • Deleted files identified
  • Skipped files (hash unchanged)
  • Error files (couldn't access)

The crawler list also shows monitoring cards for each crawler: the current run status (e.g., "Crawler is running... started 5 minutes ago"), the result of the last run (e.g., "Completed, 127 files discovered"), and the total number of runs since the crawler was created (e.g., 48).

How Crawling Works

1. Discovery

Crawler navigates the source (folders, links, etc.) up to the configured max depth, discovering files that match filters

2. Change Detection

For each discovered file, crawler checks if it already exists in the Library by comparing content hash and modification date
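
A minimal sketch of hash-based change detection follows. The in-memory map stands in for whatever the crawler persists between runs; the store shape and function name are assumptions, and the principle is simply comparing a content hash against the one recorded for the previously imported version.

import { createHash } from "node:crypto";

// Hash of the previously imported version, keyed by source path (in practice loaded from the database).
const knownHashes = new Map<string, string>();

function detectChange(path: string, content: Buffer): "new" | "modified" | "unchanged" {
  const hash = createHash("md5").update(content).digest("hex");
  const previous = knownHashes.get(path);
  if (previous === undefined) { knownHashes.set(path, hash); return "new"; }
  if (previous !== hash)      { knownHashes.set(path, hash); return "modified"; }
  return "unchanged"; // skipped: no download or re-processing needed
}

console.log(detectChange("/docs/report.pdf", Buffer.from("v1"))); // "new"
console.log(detectChange("/docs/report.pdf", Buffer.from("v1"))); // "unchanged"
console.log(detectChange("/docs/report.pdf", Buffer.from("v2"))); // "modified"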

3. File Import

New or modified files are downloaded and added to the Library. Unchanged files are skipped to save processing time.

4. Automatic Processing

If "Auto-process crawled files" is enabled in Library settings, new files are automatically queued for extraction and embedding

SMB / Windows File Share Details

George AI provides direct SMB2 protocol support for accessing Windows file shares and network storage. No mounting required—works seamlessly in Docker and cloud environments.

Direct SMB2 Protocol

Native SMB2/3 protocol implementation—no CIFS kernel modules or system mounts needed. Works in Docker, Kubernetes, and serverless environments.

Large File Support

Stream files directly without Base64 encoding overhead. Efficiently handle multi-GB files with constant memory usage.

Real-Time Progress

Server-Sent Events (SSE) provide live file discovery updates. See files as they're found during the crawl.

Incremental Updates

MD5 hash comparison detects file changes. Only new and modified files are downloaded and processed.

URI Format & Examples

SMB crawler URIs follow the standard UNC path format:

//server/share/optional/path
//fileserver.company.com/departments/marketing
//10.0.1.50/shared/documents
//nas-001/public/reports/2024
Component   Description                     Required
//          UNC path prefix                 Yes
server      Hostname, FQDN, or IP address   Yes
share       SMB share name                  Yes
path        Subdirectory within share       No

Backslashes Not Supported

Use forward slashes only. Windows UNC format \\server\share should be written as //server/share
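
A small TypeScript sketch of how such a URI breaks into its components, including normalizing Windows-style backslashes; the parse function is illustrative only, not the crawler's actual parser.

interface SmbTarget { server: string; share: string; path: string; }

// Accepts //server/share/optional/path and tolerates \\server\share input.
function parseSmbUri(uri: string): SmbTarget {
  const normalized = uri.replace(/\\/g, "/");  // \\server\share -> //server/share
  const match = normalized.match(/^\/\/([^/]+)\/([^/]+)(?:\/(.*))?$/);
  if (!match) throw new Error(`Invalid SMB URI: ${uri}`);
  return { server: match[1], share: match[2], path: match[3] ?? "" };
}

console.log(parseSmbUri("//fileserver.company.com/departments/marketing"));
// { server: "fileserver.company.com", share: "departments", path: "marketing" }
console.log(parseSmbUri("\\\\nas-001\\public\\reports\\2024"));
// { server: "nas-001", share: "public", path: "reports/2024" }
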
Authentication

SMB crawlers support username and password authentication:

Format                Example            Use Case
Username only         john               Local account on file server
Domain\Username       COMPANY\john       Active Directory domain account
username@domain.com   john@company.com   UPN format (modern AD)

Credentials Storage

Credentials are securely encrypted in the database. Only the crawler service can decrypt and use them for SMB connections.

File Filtering with Glob Patterns

Use glob patterns to control which files are crawled:

Include Patterns

Only files matching these patterns are crawled:

*.pdf                 # All PDF files
*.docx                # Word files
report-*.pdf          # PDFs starting with "report-"

Exclude Patterns

Files matching these patterns are skipped:

**/~$*            # Office temporary files
**/.git/**        # Git directories
**/temp/**        # Temp folders
**/archive/**     # Archived content
Pattern          Matches
*                Any characters except /
**               Any characters including / (recursive)
?                A single character
[abc]            Any character in the set
*.pdf or *.doc   Multiple extensions (use separate patterns)
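
To make the matching rules concrete, here is a simplified TypeScript translation of these glob tokens into a regular expression. Real glob libraries handle more edge cases; this is a sketch, not the crawler's actual matcher.

// Simplified glob-to-RegExp: supports **, *, ?, and [abc] character sets.
function globToRegExp(glob: string): RegExp {
  let re = "";
  for (let i = 0; i < glob.length; i++) {
    const c = glob[i];
    if (c === "*" && glob[i + 1] === "*") { re += ".*"; i++; }  // ** matches across /
    else if (c === "*") re += "[^/]*";                          // * stops at /
    else if (c === "?") re += "[^/]";                           // single character
    else if (c === "[") {
      const end = glob.indexOf("]", i);
      if (end === -1) { re += "\\["; }                          // unterminated set: treat as literal
      else { re += glob.slice(i, end + 1); i = end; }
    }
    else re += c.replace(/[.+^${}()|\\]/g, "\\$&");             // escape regex metacharacters
  }
  return new RegExp(`^${re}$`);
}

console.log(globToRegExp("*.pdf").test("report.pdf"));          // true
console.log(globToRegExp("*.pdf").test("2024/report.pdf"));     // false (* does not cross /)
console.log(globToRegExp("**/temp/**").test("a/temp/b.docx"));  // true
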
Performance & File Size Limits

SMB crawlers handle large files efficiently with streaming architecture:

  • Memory Usage: ~1 KB per file, constant regardless of file size
  • Max File Size: 100 MB default limit (configurable)
  • Download Speed: limited by the SMB connection speed

Streaming Architecture: Files are processed as they're downloaded. No need to buffer entire files in memory.
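
The constant-memory behavior is the usual stream-pipeline pattern: bytes are written to disk as they arrive instead of being buffered whole. The TypeScript sketch below illustrates the idea with a plain HTTPS source, since showing the SMB client internals is out of scope here; it is not the crawler's actual code.

import { createWriteStream } from "node:fs";
import { get } from "node:https";
import { pipeline } from "node:stream/promises";

// Stream a file to disk without buffering it in memory.
function downloadToFile(url: string, destination: string): Promise<void> {
  return new Promise((resolve, reject) => {
    get(url, (res) => {
      if (res.statusCode !== 200) { reject(new Error(`Download failed: ${res.statusCode}`)); return; }
      // Bytes flow from the socket into the file as they arrive; memory use stays
      // roughly constant whether the file is 1 MB or several GB.
      pipeline(res, createWriteStream(destination)).then(resolve, reject);
    }).on("error", reject);
  });
}

downloadToFile("https://docs.example.com/files/report.pdf", "/tmp/report.pdf");
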
Common SMB Issues

Error: "Authentication failed"

Causes:

  • Incorrect username or password
  • Account locked or disabled
  • Domain name incorrect

Solutions:

  • Verify credentials can access the share from another computer
  • Use correct domain format (DOMAIN\user or user@domain.com)
  • Check with IT if account has SMB access permissions

Error: "Share not found" or "Connection refused"

Causes:

  • Server hostname/IP incorrect
  • Share name misspelled
  • Firewall blocking SMB port 445
  • Server not accessible from George AI network

Solutions:

  • Verify URI format: //server/share (forward slashes)
  • Test network connectivity (ping server from George AI container)
  • Ensure port 445 is open in firewall rules
  • Use IP address instead of hostname if DNS resolution fails

Issue: No files found

Causes:

  • Include patterns too restrictive
  • Path doesn't exist within share
  • User doesn't have read permissions

Solutions:

  • Try without include patterns first to see all files
  • Verify path exists by accessing share manually
  • Check user has at least read permissions on the folder

Issue: Slow crawl performance

Causes:

  • Crawling thousands of files
  • Slow network connection to file server
  • Large files being downloaded

Solutions:

  • Use more specific paths to reduce file count
  • Add exclude patterns for unnecessary folders
  • Set maxFileSizeBytes to skip very large files
  • Improve network bandwidth between George AI and file server

API / REST Crawler Details

George AI supports crawling REST APIs with pre-configured templates for popular e-commerce platforms and a generic REST template for custom APIs. API crawlers handle pagination, authentication, and data mapping automatically.

Migration to Connectors

API crawlers will soon be replaced with Connectors, which provide more powerful automation capabilities including bidirectional sync, field mapping UIs, and write operations. For new integrations, consider using Connectors instead.

Pre-Configured Templates

Ready-to-use templates for Shopware 6 and Weclapp e-commerce platforms, plus a generic REST template for custom APIs.

Multiple Auth Methods

Supports OAuth2, API keys, Bearer tokens, and Basic authentication. Credentials securely encrypted in database.

Automatic Pagination

Handles API pagination automatically, collecting all records across multiple API calls with configurable delays and concurrency.

Data Transformation

Converts API responses to markdown format automatically. Each product/record becomes a searchable document.
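
A sketch of that transformation for a single record; the field names are invented for illustration, since the actual mapping depends on the template and the API's response shape.

interface ProductRecord { name: string; sku: string; price: number; description: string; }

// Turn one API record into a small markdown document that can be indexed and searched.
function toMarkdown(product: ProductRecord): string {
  return [
    `# ${product.name}`,
    "",
    `SKU: ${product.sku}`,
    `Price: ${product.price.toFixed(2)}`,
    "",
    product.description,
  ].join("\n");
}

console.log(toMarkdown({ name: "Desk Lamp", sku: "DL-100", price: 39.9, description: "Adjustable LED desk lamp." }));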

Available Templates

Shopware 6

E-commerce platform integration for product catalog crawling.

Provider: shopware6
Endpoint: /api/product
Authentication: OAuth2 (Client Credentials)
Configuration:
  • Client ID: OAuth2 integration client ID
  • Client Secret: OAuth2 integration client secret
  • Token URL: OAuth2 token endpoint
  • Associations: Related data to include (manufacturer, categories, cover.media, properties, etc.)
Rate Limiting: 100 ms delay between requests, max 3 concurrent requests
Example: https://shop.example.com/api/product
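
A hedged sketch of the client-credentials flow behind this template: /api/oauth/token is Shopware 6's usual Admin API token endpoint, but the configured Token URL takes precedence, and the surrounding code is illustrative rather than George AI's implementation.

// Obtain an access token via the client-credentials grant, then call the product endpoint.
async function fetchShopwareProducts(baseUrl: string, clientId: string, clientSecret: string) {
  const tokenRes = await fetch(`${baseUrl}/api/oauth/token`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ grant_type: "client_credentials", client_id: clientId, client_secret: clientSecret }),
  });
  const { access_token } = await tokenRes.json();

  const productRes = await fetch(`${baseUrl}/api/product?limit=50`, {
    headers: { Authorization: `Bearer ${access_token}`, Accept: "application/json" },
  });
  return (await productRes.json()).data;
}

fetchShopwareProducts("https://shop.example.com", "your-client-id", "your-client-secret")
  .then(products => console.log(`${products.length} products fetched`));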

Weclapp

ERP platform integration for product and inventory data.

Provider: weclapp
Endpoint: Configured per template
Authentication: API Token
Configuration:
  • API Token: Weclapp API authentication token
  • Tenant: Weclapp tenant/instance name

Generic REST

Flexible template for any REST API with custom configuration.

Provider: custom
Endpoint: Fully customizable API URL
Authentication: API Key, Bearer Token, Basic Auth, or None
Configuration:
  • Base URL: API base URL (e.g., https://api.example.com/products)
  • Endpoint: API endpoint path
  • Headers: Custom HTTP headers in JSON format (e.g., Authorization: Bearer your-token)
  • Pagination: Define how the API handles pagination
  • Response Path: JSON path to the data array in the response (e.g., data.products)
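
A small sketch of how a Response Path such as data.products selects the record array from the parsed JSON response; illustrative only.

// Walk a dot-separated path (e.g. "data.products") into a parsed JSON response.
function extractByPath(response: unknown, path: string): unknown[] {
  const value = path.split(".").reduce<any>((current, key) => current?.[key], response);
  return Array.isArray(value) ? value : [];
}

const response = { data: { products: [{ id: 1 }, { id: 2 }], total: 2 } };
console.log(extractByPath(response, "data.products")); // [{ id: 1 }, { id: 2 }]
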
Authentication Types

OAuth2 (Client Credentials)

Used by Shopware 6 template for secure API access.

  • Client ID: OAuth2 application/integration ID
  • Client Secret: OAuth2 application secret
  • Token URL: OAuth2 token endpoint
  • Scope: Optional - OAuth2 scopes to request

Access tokens are automatically refreshed when expired.

API Key

Simple key-based authentication via header or query parameter.

X-API-Key: your-api-key-here
?api_key=your-api-key-here

Bearer Token

Token-based authentication in Authorization header.

Authorization: Bearer your-token-here

Basic Authentication

Username and password encoded in Authorization header.

Authorization: Basic base64(username:password)
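
A compact TypeScript sketch showing how each of these methods ends up as an HTTP header on outgoing requests; the configuration shape is an assumption for illustration.

type Auth =
  | { type: "apiKey"; headerName: string; key: string }
  | { type: "bearer"; token: string }
  | { type: "basic"; username: string; password: string }
  | { type: "none" };

// Build the HTTP headers for one API request based on the configured auth method.
function authHeaders(auth: Auth): Record<string, string> {
  switch (auth.type) {
    case "apiKey": return { [auth.headerName]: auth.key };  // e.g. X-API-Key
    case "bearer": return { Authorization: `Bearer ${auth.token}` };
    case "basic":  return { Authorization: `Basic ${Buffer.from(`${auth.username}:${auth.password}`).toString("base64")}` };
    case "none":   return {};
  }
}

console.log(authHeaders({ type: "basic", username: "api", password: "secret" }));
// { Authorization: "Basic YXBpOnNlY3JldA==" }
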
Configuration Options
Option          Description                                       Default
requestDelay    Milliseconds to wait between API requests         100 ms
maxConcurrency  Maximum number of concurrent API requests         3
pageSize        Number of records per API page/request            50
maxPages        Maximum number of pages to crawl                  10
associations    Related data to include (Shopware 6 only)         Empty
responsePath    JSON path to the data array in the API response   Varies

Rate Limiting

Use requestDelay and maxConcurrency to respect API rate limits and avoid being blocked.
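
A sketch of a paginated fetch loop that honors pageSize, maxPages and requestDelay; the page and limit query parameters are assumptions for a generic REST API, not fixed names.

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

// Fetch up to maxPages pages, pausing requestDelay milliseconds between requests.
async function crawlPaginated(endpoint: string, pageSize = 50, maxPages = 10, requestDelay = 100) {
  const records: unknown[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const res = await fetch(`${endpoint}?page=${page}&limit=${pageSize}`);
    const items: unknown[] = (await res.json()).data ?? [];
    records.push(...items);
    if (items.length < pageSize) break;  // last page reached
    await sleep(requestDelay);           // respect API rate limits
  }
  return records;
}

crawlPaginated("https://api.example.com/products").then(r => console.log(`${r.length} records collected`));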

Migration to Connectors

API crawlers are being phased out in favor of Connectors, which provide:

Connectors Advantages

  • ✓ Bidirectional sync (read and write)
  • ✓ Field mapping UI (no code required)
  • ✓ Write operations (create/update records)
  • ✓ Automation workflows
  • ✓ Error handling and retry logic
  • ✓ Real-time sync triggers

API Crawler Limitations

  • ✗ Read-only (no write operations)
  • ✗ Manual configuration (JSON/code)
  • ✗ Limited error handling
  • ✗ No field mapping UI
  • ✗ Manual scheduling only

Recommendation

For new integrations, we recommend using Connectors instead of API crawlers. Connectors provide a better user experience, more features, and easier maintenance.

See the Shopware 6 Connector documentation for an example of the new approach.

Best Practices

Start with Small Max Pages

Begin with maxPages=10 or 50 to test your crawler configuration. Once confirmed working, increase to larger numbers.

Use File Filters Wisely

Exclude unnecessary folders (archive, temp, backup) and limit file types to what you actually need. This speeds up crawling and reduces noise.

Schedule During Off-Hours

Run scheduled crawlers during low-traffic times (e.g., 3 AM) to avoid impacting system performance during business hours.

Monitor SharePoint Cookie Expiration

SharePoint authentication cookies typically expire after a few hours or days. If crawler runs start failing, refresh the cookies.

Troubleshooting

  • Crawler run fails immediately: invalid credentials or expired cookies. Verify the credentials and refresh the SharePoint cookies if applicable.
  • No files discovered: filters too restrictive or maxDepth too low. Review the include/exclude patterns and increase maxDepth.
  • Crawler times out: too many files or a very slow network. Reduce maxPages or increase the timeout in the crawler configuration.
  • Files not processing after the crawl: "Auto-process crawled files" is disabled. Enable it in Library settings or trigger processing manually.
  • Scheduled runs not executing: cron job inactive or the system scheduler stopped. Verify that cronJob.active is true and check the system status.

Related Topics

Learn more about file management and processing:

George-Cloud