Supported Formats

Complete reference for input and output formats supported by Duckling.

Input Formats

Documents

Format	Extensions	Description	Notes
PDF	`.pdf`	Portable Document Format	Full support, including scanned PDFs via OCR
Word	`.docx`	Microsoft Word	Modern format only (not `.doc`)
PowerPoint	`.pptx`	Microsoft PowerPoint	Extracts text and images from slides
Excel	`.xlsx`	Microsoft Excel	Extracts tables and data
HTML	`.html`, `.htm`	Web pages	Preserves structure and formatting
Markdown	`.md`, `.markdown`	Markdown files	Full CommonMark support

Images

Format	Extensions	Description	Notes
PNG	`.png`	Portable Network Graphics	Best for screenshots and diagrams
JPEG	`.jpg`, `.jpeg`	Joint Photographic Experts Group	Best for photos
TIFF	`.tiff`, `.tif`	Tagged Image File Format	Multi-page support
GIF	`.gif`	Graphics Interchange Format	First frame only
WebP	`.webp`	Web Picture format	Modern web format
BMP	`.bmp`	Bitmap	Uncompressed images

Technical Documents

Format	Extensions	Description	Notes
AsciiDoc	`.asciidoc`, `.adoc`	Technical documentation	Full AsciiDoc syntax
PubMed XML	`.xml`	Scientific articles	PubMed Central format
USPTO XML	`.xml`	Patent documents	US Patent format
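
The tables above can be turned into a simple lookup for checking whether a file is likely to be accepted before submitting it. This is a minimal sketch, not part of Duckling itself: the `INPUT_FORMATS` table and the `candidate_formats` helper are hypothetical names, and the mapping is only a transcription of the extensions listed above. Because `.xml` is listed for both PubMed and USPTO documents, the helper returns a list of candidates rather than a single format.

```python
from pathlib import Path

# Extension-to-format table transcribed from the input format tables.
# `.xml` appears for two formats, so every value is a list of candidates.
INPUT_FORMATS = {
    ".pdf": ["PDF"],
    ".docx": ["Word"],
    ".pptx": ["PowerPoint"],
    ".xlsx": ["Excel"],
    ".html": ["HTML"],
    ".htm": ["HTML"],
    ".md": ["Markdown"],
    ".markdown": ["Markdown"],
    ".png": ["PNG"],
    ".jpg": ["JPEG"],
    ".jpeg": ["JPEG"],
    ".tiff": ["TIFF"],
    ".tif": ["TIFF"],
    ".gif": ["GIF"],
    ".webp": ["WebP"],
    ".bmp": ["BMP"],
    ".asciidoc": ["AsciiDoc"],
    ".adoc": ["AsciiDoc"],
    ".xml": ["PubMed XML", "USPTO XML"],
}


def candidate_formats(filename: str) -> list[str]:
    """Return the listed input formats whose extension matches `filename`.

    Hypothetical helper: the comparison is case-insensitive and an
    unlisted extension (for example `.doc`) yields an empty list.
    """
    suffix = Path(filename).suffix.lower()
    return INPUT_FORMATS.get(suffix, [])
```

An empty result means the extension is not in the tables; a result with more than one entry means the extension alone cannot tell the formats apart.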

Output Formats

Text Formats

Markdown (`.md`)

Best for documentation and content that needs formatting.

# Document Title

## Section 1

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

HTML (`.html`)

Web-ready format with styling preserved.

<h1>Document Title</h1>
<h2>Section 1</h2>
<p>This is a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>

Plain Text (`.txt`)

Simple text without any formatting.

Document Title

Section 1

This is a paragraph with bold and italic text.

Structured Formats

JSON (`.json`)

The full document structure as JSON. This is a lossless representation of the document.

{
  "title": "Document Title",
  "sections": [
    {
      "heading": "Section 1",
      "level": 2,
      "content": [
        {
          "type": "paragraph",
          "text": "This is a paragraph..."
        }
      ]
    }
  ]
}
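
As a rough illustration of consuming this output, the sketch below groups paragraph text under each section heading. It assumes the exported JSON has exactly the shape of the sample above (`sections`, `heading`, `content`, `type`, `text`); real exports may carry more fields. `paragraphs_by_heading` is a hypothetical helper name.

```python
import json

# Sample copied from the JSON example; real exports may contain more fields.
JSON_SAMPLE = """
{
  "title": "Document Title",
  "sections": [
    {
      "heading": "Section 1",
      "level": 2,
      "content": [
        {"type": "paragraph", "text": "This is a paragraph..."}
      ]
    }
  ]
}
"""


def paragraphs_by_heading(doc: dict) -> dict[str, list[str]]:
    """Map each section heading to the text of its paragraph items."""
    result: dict[str, list[str]] = {}
    for section in doc.get("sections", []):
        texts = [
            item["text"]
            for item in section.get("content", [])
            if item.get("type") == "paragraph"
        ]
        result[section["heading"]] = texts
    return result


if __name__ == "__main__":
    print(paragraphs_by_heading(json.loads(JSON_SAMPLE)))
```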

DocTags (`.doctags`)

Tagged document format for semantic analysis.

<document>
  <title>Document Title</title>
  <section level="2">
    <heading>Section 1</heading>
    <paragraph>This is a paragraph...</paragraph>
  </section>
</document>
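
If the DocTags output is well-formed XML like the sample above (an assumption; only the sample is known here), it can be read with Python's standard `xml.etree.ElementTree`. The sketch below builds a small outline of section levels and headings. `outline` is a hypothetical helper name, and the default level of `1` for a section without a `level` attribute is a choice made for this example.

```python
import xml.etree.ElementTree as ET

# Sample copied from the DocTags example.
DOCTAGS_SAMPLE = """<document>
  <title>Document Title</title>
  <section level="2">
    <heading>Section 1</heading>
    <paragraph>This is a paragraph...</paragraph>
  </section>
</document>"""


def outline(doctags: str) -> list[tuple[int, str]]:
    """Return (level, heading) pairs for every <section> element."""
    root = ET.fromstring(doctags)
    return [
        # A missing `level` attribute falls back to 1 (assumption for this sketch).
        (int(section.get("level", "1")), section.findtext("heading", default=""))
        for section in root.iter("section")
    ]


if __name__ == "__main__":
    print(outline(DOCTAGS_SAMPLE))
```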

Document Tokens (`.tokens.json`)

Token-level representation for NLP applications.

{
  "tokens": [
    {"text": "Document", "type": "word", "position": 0},
    {"text": "Title", "type": "word", "position": 1}
  ]
}
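
A downstream NLP step would typically read the tokens back in order. The sketch below assumes the field names from the sample above (`text`, `type`, `position`) and that `position` gives the token order; other token types may exist in real exports and are simply skipped here. `words_in_order` is a hypothetical helper name.

```python
import json

# Sample copied from the Document Tokens example.
TOKENS_SAMPLE = """
{
  "tokens": [
    {"text": "Document", "type": "word", "position": 0},
    {"text": "Title", "type": "word", "position": 1}
  ]
}
"""


def words_in_order(payload: dict) -> list[str]:
    """Return the text of all `word` tokens, sorted by `position`."""
    tokens = sorted(payload["tokens"], key=lambda token: token["position"])
    return [token["text"] for token in tokens if token["type"] == "word"]


if __name__ == "__main__":
    print(" ".join(words_in_order(json.loads(TOKENS_SAMPLE))))
```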

RAG Formats

RAG Chunks (`.chunks.json`)

Document chunks optimized for retrieval-augmented generation.

{
  "chunks": [
    {
      "id": 1,
      "text": "This is the first chunk of text...",
      "meta": {
        "headings": ["Section 1"],
        "page": 1,
        "token_count": 128
      }
    }
  ]
}
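
One way to use the `meta` block is to pack chunks into a prompt without exceeding a token budget. The sketch below is an illustration under the assumption that exports follow the sample above (`text`, `meta.headings`, `meta.page`, `meta.token_count`); `chunks_within_budget` and `to_prompt_context` are hypothetical helper names, and the bracketed context prefix is just one possible layout.

```python
import json

# Sample copied from the RAG Chunks example.
CHUNKS_SAMPLE = """
{
  "chunks": [
    {
      "id": 1,
      "text": "This is the first chunk of text...",
      "meta": {"headings": ["Section 1"], "page": 1, "token_count": 128}
    }
  ]
}
"""


def chunks_within_budget(payload: dict, max_tokens: int) -> list[dict]:
    """Take chunks in order until adding the next one would exceed `max_tokens`."""
    selected: list[dict] = []
    used = 0
    for chunk in payload["chunks"]:
        count = chunk["meta"]["token_count"]
        if used + count > max_tokens:
            break
        selected.append(chunk)
        used += count
    return selected


def to_prompt_context(chunk: dict) -> str:
    """Prefix the chunk text with its heading path and page number."""
    headings = " > ".join(chunk["meta"]["headings"])
    return f"[{headings}, page {chunk['meta']['page']}] {chunk['text']}"


if __name__ == "__main__":
    payload = json.loads(CHUNKS_SAMPLE)
    for chunk in chunks_within_budget(payload, max_tokens=512):
        print(to_prompt_context(chunk))
```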

Format Selection Guide

Use Case	Recommended Format
Documentation	Markdown
Web publishing	HTML
Data processing	JSON
Search indexing	Plain Text
NLP/ML pipelines	Document Tokens
RAG applications	RAG Chunks
Semantic analysis	DocTags

API Format Parameter

When using the API, select the output format with the final path segment of the export endpoint. In the examples, `{job_id}` is a placeholder for the job's ID:

# Download as Markdown
curl http://localhost:5001/api/export/{job_id}/markdown

# Download as JSON
curl http://localhost:5001/api/export/{job_id}/json

# Download as HTML
curl http://localhost:5001/api/export/{job_id}/html
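
The same requests can be made from Python with only the standard library. This is a sketch under assumptions: the host, port, and path layout are taken from the curl examples, `export_url` and `download_export` are hypothetical helper names, and only the `markdown`, `json`, and `html` segments are shown in this document, so path segments for other formats are not assumed here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Host and port as used in the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:5001"


def export_url(job_id: str, fmt: str, base_url: str = BASE_URL) -> str:
    """Build the export URL for a job, e.g. fmt="markdown", "json", or "html"."""
    # Percent-encode the job ID so unusual characters cannot alter the path.
    return f"{base_url}/api/export/{quote(job_id, safe='')}/{fmt}"


def download_export(job_id: str, fmt: str, base_url: str = BASE_URL,
                    timeout: float = 30.0) -> bytes:
    """Fetch the exported document and return the raw response body."""
    with urlopen(export_url(job_id, fmt, base_url), timeout=timeout) as response:
        return response.read()
```

For example, `download_export("<your job id>", "markdown")` would issue the same request as the first curl command, given a running server.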

MIME Types

Format	MIME Type
Markdown	`text/markdown`
HTML	`text/html`
JSON	`application/json`
Plain Text	`text/plain`
DocTags	`application/xml`
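
A client can use this table to check that a response carries the expected media type. The sketch below is illustrative only: `MIME_TYPES` transcribes the table above, `content_type_matches` is a hypothetical helper, and formats that the table does not list (such as Document Tokens and RAG Chunks) are deliberately left out rather than guessed.

```python
# Transcribed from the MIME type table; unlisted formats are not guessed.
MIME_TYPES = {
    "Markdown": "text/markdown",
    "HTML": "text/html",
    "JSON": "application/json",
    "Plain Text": "text/plain",
    "DocTags": "application/xml",
}


def content_type_matches(header_value: str, format_name: str) -> bool:
    """Compare a Content-Type header against the table entry for `format_name`.

    Parameters such as `; charset=utf-8` are ignored, and an unknown
    format name never matches.
    """
    expected = MIME_TYPES.get(format_name)
    media_type = header_value.split(";", 1)[0].strip().lower()
    return expected is not None and media_type == expected
```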