Configuration Guide

Complete reference for all Duckling configuration options.

Environment Variables

Create a .env file in the backend directory:

# Flask Configuration
FLASK_ENV=development          # development | production | testing
SECRET_KEY=your-secret-key     # Required for production
DEBUG=True                     # Enable debug mode

# File Handling
MAX_CONTENT_LENGTH=104857600   # Max upload size in bytes (100MB default)

# Database (optional - defaults to SQLite)
DATABASE_URL=sqlite:///history.db

Production Environment

FLASK_ENV=production
SECRET_KEY=your-very-secure-random-key-here
DEBUG=False
MAX_CONTENT_LENGTH=209715200   # 200MB for production

Security Warning

Never use the default SECRET_KEY in production. Generate a secure random key.

OCR Settings

OCR (Optical Character Recognition) extracts text from images and scanned documents.

Configuration Options

Setting	Type	Default	Description
`enabled`	boolean	`true`	Enable/disable OCR processing
`backend`	string	`"easyocr"`	OCR engine to use
`language`	string	`"en"`	Primary language for recognition
`force_full_page_ocr`	boolean	`false`	OCR entire page vs detected regions
`use_gpu`	boolean	`false`	Enable GPU acceleration (EasyOCR only)
`confidence_threshold`	float	`0.5`	Minimum confidence for results (0-1)
`bitmap_area_threshold`	float	`0.05`	Min area ratio for bitmap OCR (0-1)

OCR Backends

EasyOCRTesseractmacOS VisionRapidOCR

Best for multi-language documents with accuracy requirements.

{
  "ocr": {
    "backend": "easyocr",
    "use_gpu": true,
    "language": "en"
  }
}

GPU Support: Yes (CUDA)
Languages: 80+
Note: May have initialization issues on some systems

Classic, reliable OCR engine for simple documents.

{
  "ocr": {
    "backend": "tesseract",
    "language": "eng"
  }
}

GPU Support: No
Languages: 100+
Requires: Tesseract installed on system

Native macOS OCR using Apple's Vision framework.

{
  "ocr": {
    "backend": "ocrmac",
    "language": "en"
  }
}

GPU Support: Uses Apple Neural Engine
Requires: macOS 10.15+
Language codes: Duckling accepts short codes like en, de, fr and will normalize them to Vision locale tags (for example en-US) during conversion.

Fast, lightweight OCR using ONNX runtime.

{
  "ocr": {
    "backend": "rapidocr",
    "language": "en"
  }
}

GPU Support: No
Languages: Limited

Supported Languages

Code	Language	Code	Language
`en`	English	`ja`	Japanese
`de`	German	`zh`	Chinese (Simplified)
`fr`	French	`zh-tw`	Chinese (Traditional)
`es`	Spanish	`ko`	Korean
`it`	Italian	`ar`	Arabic
`pt`	Portuguese	`hi`	Hindi
`nl`	Dutch	`th`	Thai
`pl`	Polish	`vi`	Vietnamese
`ru`	Russian	`tr`	Turkish

Table Settings

Configure how tables are detected and extracted from documents.

Configuration Options

Setting	Type	Default	Description
`enabled`	boolean	`true`	Enable table detection
`structure_extraction`	boolean	`true`	Preserve table structure
`mode`	string	`"accurate"`	Detection mode
`do_cell_matching`	boolean	`true`	Match cell content to structure

Detection Modes

Accurate ModeFast Mode

{
  "tables": {
    "enabled": true,
    "mode": "accurate",
    "do_cell_matching": true
  }
}

Higher precision table detection
Better cell boundary recognition
Slower processing
Recommended for complex tables

{
  "tables": {
    "enabled": true,
    "mode": "fast",
    "do_cell_matching": false
  }
}

Faster processing
Good for simple tables
May miss complex structures

Image Settings

Configure image extraction and processing.

Configuration Options

Setting	Type	Default	Description
`extract`	boolean	`true`	Extract embedded images
`classify`	boolean	`true`	Classify and tag images
`generate_page_images`	boolean	`false`	Create images of each page
`generate_picture_images`	boolean	`true`	Extract pictures as files
`generate_table_images`	boolean	`true`	Extract tables as images
`images_scale`	float	`1.0`	Scale factor for images (0.1-4.0)

Example Configurations

High QualityMinimal (Text Only)

{
  "images": {
    "extract": true,
    "classify": true,
    "generate_page_images": true,
    "generate_picture_images": true,
    "generate_table_images": true,
    "images_scale": 2.0
  }
}

{
  "images": {
    "extract": false,
    "classify": false,
    "generate_page_images": false,
    "generate_picture_images": false,
    "generate_table_images": false
  }
}

Performance Settings

Optimize processing speed and resource usage.

Configuration Options

Setting	Type	Default	Description
`device`	string	`"auto"`	Processing device
`num_threads`	int	`4`	CPU threads (1-32)
`document_timeout`	int/null	`null`	Max processing time in seconds

Device Options

Device	Description	Best For
`auto`	Automatically select best device	General use
`cpu`	Force CPU processing	Servers without GPU
`cuda`	NVIDIA GPU acceleration	Linux/Windows with NVIDIA GPU
`mps`	Apple Metal Performance Shaders	macOS with Apple Silicon

Example Configurations

High Performance (GPU)Resource-ConstrainedApple Silicon

{
  "performance": {
    "device": "cuda",
    "num_threads": 8,
    "document_timeout": null
  }
}

{
  "performance": {
    "device": "cpu",
    "num_threads": 2,
    "document_timeout": 60
  }
}

{
  "performance": {
    "device": "mps",
    "num_threads": 4,
    "document_timeout": null
  }
}

Chunking Settings

Configure document chunking for RAG applications.

Configuration Options

Setting	Type	Default	Description
`enabled`	boolean	`false`	Enable document chunking
`max_tokens`	int	`512`	Maximum tokens per chunk
`merge_peers`	boolean	`true`	Merge undersized chunks

Example Configurations

RAG-OptimizedLarge Context Windows

{
  "chunking": {
    "enabled": true,
    "max_tokens": 512,
    "merge_peers": true
  }
}

{
  "chunking": {
    "enabled": true,
    "max_tokens": 2048,
    "merge_peers": false
  }
}

Output Settings

Configure default output format.

Setting	Type	Default	Description
`default_format`	string	`"markdown"`	Default export format

Complete Configuration Example

{
  "ocr": {
    "enabled": true,
    "backend": "easyocr",
    "language": "en",
    "force_full_page_ocr": false,
    "use_gpu": false,
    "confidence_threshold": 0.5,
    "bitmap_area_threshold": 0.05
  },
  "tables": {
    "enabled": true,
    "structure_extraction": true,
    "mode": "accurate",
    "do_cell_matching": true
  },
  "images": {
    "extract": true,
    "classify": true,
    "generate_page_images": false,
    "generate_picture_images": true,
    "generate_table_images": true,
    "images_scale": 1.0
  },
  "performance": {
    "device": "auto",
    "num_threads": 4,
    "document_timeout": null
  },
  "chunking": {
    "enabled": false,
    "max_tokens": 512,
    "merge_peers": true
  },
  "output": {
    "default_format": "markdown"
  }
}

Configuration via API

Get Current Settings

curl http://localhost:5001/api/settings

Update Settings

curl -X PUT http://localhost:5001/api/settings \
  -H "Content-Type: application/json" \
  -d '{
    "ocr": {"backend": "tesseract"},
    "performance": {"num_threads": 8}
  }'

Reset to Defaults

curl -X POST http://localhost:5001/api/settings/reset

Troubleshooting

OCR Not Working

EasyOCR initialization error: Switch to ocrmac (macOS) or tesseract
GPU errors: Set use_gpu: false
Low confidence results: Lower confidence_threshold

Slow Processing

Reduce images_scale to 0.5
Use mode: "fast" for tables
Disable generate_page_images
Increase num_threads

Memory Issues

Enable document_timeout (e.g., 120 seconds)
Process fewer files in batch
Reduce images_scale
Disable chunking if not needed