Conversion API
Endpoints for uploading and converting documents.
Upload and Convert Single Document
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | Document to convert |
| settings | JSON string | No | Conversion settings override |
Example Request
curl -X POST http://localhost:5001/api/convert \
-F "file=@document.pdf" \
-F 'settings={"ocr":{"enabled":true,"language":"en"}}'
Response (202 Accepted)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "document.pdf",
"input_format": "pdf",
"status": "processing",
"message": "Conversion started"
}
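The curl call above can also be reproduced from Python. The following is a minimal sketch using only the standard library; the helper names `build_multipart` and `build_convert_request` are hypothetical, and the `application/octet-stream` part type is an assumption, since the expected per-file content type is not stated here.

```python
import json
import uuid
import urllib.request


def build_multipart(fields, files):
    """Encode text fields and files as a multipart/form-data body.

    fields: dict of name -> str
    files:  dict of name -> (filename, bytes)
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f'--{boundary}\r\n'
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f'{value}\r\n'
            ).encode("utf-8")
        )
    for name, (filename, content) in files.items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            # Assumption: a generic binary content type is acceptable to the server.
            f'Content-Type: application/octet-stream\r\n\r\n'
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_convert_request(file_bytes, filename, settings=None,
                          base_url="http://localhost:5001"):
    """Build (but do not send) a POST /api/convert request."""
    fields = {}
    if settings is not None:
        # The settings field is a JSON string, per the parameter table.
        fields["settings"] = json.dumps(settings)
    body, content_type = build_multipart(fields, {"file": (filename, file_bytes)})
    return urllib.request.Request(
        f"{base_url}/api/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it; building and sending are kept separate so the request can be inspected without a running server.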
Batch Convert Multiple Documents
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| files | File[] | Yes | Documents to convert |
| settings | JSON string | No | Conversion settings override |
Example Request
curl -X POST http://localhost:5001/api/convert/batch \
-F "files=@doc1.pdf" \
-F "files=@doc2.pdf" \
-F "files=@image.png"
Response (202 Accepted)
{
"jobs": [
{
"job_id": "550e8400-e29b-41d4-a716-446655440001",
"filename": "doc1.pdf",
"input_format": "pdf",
"status": "processing"
},
{
"job_id": "550e8400-e29b-41d4-a716-446655440002",
"filename": "doc2.pdf",
"input_format": "pdf",
"status": "processing"
},
{
"job_id": "550e8400-e29b-41d4-a716-446655440003",
"filename": "image.png",
"input_format": "image",
"status": "processing"
}
],
"total": 3,
"message": "Started 3 conversions"
}
Convert Document from URL
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL of the document to convert |
| settings | object | No | Conversion settings override |
Example Request
curl -X POST http://localhost:5001/api/convert/url \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/document.pdf",
"settings": {"ocr": {"enabled": true}}
}'
Response (202 Accepted)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "document.pdf",
"source_url": "https://example.com/document.pdf",
"input_format": "pdf",
"status": "processing",
"message": "Conversion started"
}
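Unlike the upload endpoints, this endpoint takes a JSON body, and `settings` is a nested object rather than a JSON string. A minimal Python sketch of the same request follows; the helper names `build_url_convert_request` and `send` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_url_convert_request(url, settings=None, base_url="http://localhost:5001"):
    """Build (but do not send) a POST /api/convert/url request."""
    payload = {"url": url}
    if settings is not None:
        # Sent as a JSON object here, not as a JSON-encoded string.
        payload["settings"] = settings
    return urllib.request.Request(
        f"{base_url}/api/convert/url",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `send(build_url_convert_request("https://example.com/document.pdf"))` would return the parsed response, from which `job_id` can be read.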
Batch Convert Documents from URLs
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| urls | string[] | Yes | Array of URLs to convert |
| settings | object | No | Conversion settings override |
Example Request
curl -X POST http://localhost:5001/api/convert/url/batch \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/doc1.pdf",
"https://example.com/doc2.docx",
"https://example.com/invalid"
]
}'
Response (202 Accepted)
{
"jobs": [
{
"job_id": "550e8400-e29b-41d4-a716-446655440001",
"url": "https://example.com/doc1.pdf",
"filename": "doc1.pdf",
"input_format": "pdf",
"status": "processing"
},
{
"job_id": "550e8400-e29b-41d4-a716-446655440002",
"url": "https://example.com/doc2.docx",
"filename": "doc2.docx",
"input_format": "docx",
"status": "processing"
},
{
"url": "https://example.com/invalid",
"status": "rejected",
"error": "File type not allowed"
}
],
"total": 3,
"message": "Started 2 conversions"
}
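Because a batch response can mix started jobs with rejected entries, and rejected entries carry no `job_id`, a client will usually want to separate the two before polling. A small sketch, assuming the response shape shown above (the function name `split_batch_jobs` is hypothetical):

```python
def split_batch_jobs(batch_response):
    """Separate accepted jobs from rejected entries in a batch response."""
    accepted, rejected = [], []
    for entry in batch_response.get("jobs", []):
        # Rejected entries have status "rejected" and no job_id.
        if entry.get("status") == "rejected" or "job_id" not in entry:
            rejected.append(entry)
        else:
            accepted.append(entry)
    return accepted, rejected


# Abbreviated copy of the example response above.
response = {
    "jobs": [
        {"job_id": "550e8400-e29b-41d4-a716-446655440001",
         "url": "https://example.com/doc1.pdf", "status": "processing"},
        {"job_id": "550e8400-e29b-41d4-a716-446655440002",
         "url": "https://example.com/doc2.docx", "status": "processing"},
        {"url": "https://example.com/invalid", "status": "rejected",
         "error": "File type not allowed"},
    ],
    "total": 3,
    "message": "Started 2 conversions",
}

accepted, rejected = split_batch_jobs(response)
print([job["job_id"] for job in accepted])
for entry in rejected:
    print(f'{entry["url"]}: {entry["error"]}')
```

Note that `total` counts every submitted URL, while the number in `message` counts only the conversions that actually started.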
Get Conversion Status
Response (Processing)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "processing",
"progress": 45,
"message": "Analyzing document with OCR (easyocr, en)..."
}
Response (Completed)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"progress": 100,
"message": "Conversion completed successfully",
"confidence": 0.92,
"formats_available": ["markdown", "html", "json", "text", "doctags"],
"images_count": 3,
"tables_count": 2,
"chunks_count": 0,
"preview": "# Document Title\n\nFirst paragraph..."
}
Response (Failed)
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "failed",
"progress": 0,
"message": "Conversion failed: Invalid PDF format",
"error": "Invalid PDF format"
}
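The three status shapes above suggest a simple polling loop: keep checking while the job is `processing`, return on `completed`, and raise on `failed`. The sketch below is one way to write that loop. Since the status endpoint's path is not shown in this section, the HTTP call is left to a caller-supplied `fetch_status` function; `wait_for_job` and its default interval and timeout are hypothetical choices.

```python
import time


def wait_for_job(job_id, fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job completes, fails, or times out.

    fetch_status is a caller-supplied callable that returns the status
    response as a dict (for example, by calling the status endpoint).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            # Prefer the short "error" field, fall back to "message".
            raise RuntimeError(status.get("error") or status.get("message", "conversion failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status['status']} after {timeout}s")
        sleep(interval)
```

Injecting `sleep` keeps the loop easy to exercise in tests with canned responses and no real delay.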
Get Conversion Result
Response
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"confidence": 0.92,
"formats_available": ["markdown", "html", "json", "text", "doctags", "document_tokens"],
"result": {
"markdown_preview": "# Document Title\n\nContent preview...",
"formats_available": ["markdown", "html", "json", "text", "doctags"],
"page_count": 5,
"images_count": 3,
"tables_count": 2,
"chunks_count": 0,
"warnings": []
},
"images_count": 3,
"tables_count": 2,
"chunks_count": 0,
"completed_at": "2024-01-15T10:30:00Z"
}
Get Extracted Images
Response
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"images": [
{
"id": 1,
"filename": "image_1.png",
"path": "/outputs/job_id/images/image_1.png",
"caption": "Figure 1: Architecture diagram",
"label": "figure"
},
{
"id": 2,
"filename": "image_2.png",
"path": "/outputs/job_id/images/image_2.png",
"caption": "",
"label": "picture"
}
],
"count": 2
}
Download Extracted Image
Response: Binary image file (PNG)
Get Extracted Tables
Response
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"tables": [
{
"id": 1,
"label": "table",
"caption": "Table 1: Sales Data",
"rows": [
["Product", "Q1", "Q2", "Q3", "Q4"],
["Widget A", "100", "150", "200", "175"]
],
"csv_path": "/outputs/job_id/tables/table_1.csv",
"image_path": "/outputs/job_id/tables/table_1.png"
}
],
"count": 1
}
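The `rows` field is a list of string lists, so it can be reshaped on the client without downloading the CSV file. The sketch below assumes, as the example suggests but the response does not guarantee, that the first row is the header; `table_to_records` and `table_to_csv` are hypothetical helper names.

```python
import csv
import io


def table_to_records(table):
    """Return the data rows as dicts keyed by the first (header) row."""
    rows = table.get("rows", [])
    if not rows:
        return []
    # Assumption: rows[0] holds the column names.
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def table_to_csv(table):
    """Serialize the rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table.get("rows", []))
    return buffer.getvalue()


# The table from the example response above.
table = {
    "id": 1,
    "caption": "Table 1: Sales Data",
    "rows": [
        ["Product", "Q1", "Q2", "Q3", "Q4"],
        ["Widget A", "100", "150", "200", "175"],
    ],
}
records = table_to_records(table)
csv_text = table_to_csv(table)
```

All cell values arrive as strings, so numeric columns need an explicit conversion before any arithmetic.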
Download Table as CSV
Response: CSV file
Download Table as Image
Response: Binary image file (PNG)
Get Document Chunks
Response
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"chunks": [
{
"id": 1,
"text": "This is the first chunk of text from the document...",
"meta": {
"headings": ["Introduction"],
"page": 1
}
},
{
"id": 2,
"text": "Second chunk continues the content...",
"meta": {
"headings": ["Introduction", "Background"],
"page": 1
}
}
],
"count": 2
}
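Each chunk carries its heading trail in `meta.headings`, which makes it straightforward to regroup chunks by section, for instance before indexing them. A minimal sketch, assuming the response shape shown above; the function name and the `" > "` separator are arbitrary illustrative choices.

```python
from collections import defaultdict


def group_chunks_by_heading(chunks_response):
    """Map each heading path (e.g. "Introduction > Background") to its chunk texts."""
    grouped = defaultdict(list)
    for chunk in chunks_response.get("chunks", []):
        # Chunks without headings are grouped under the empty string.
        headings = chunk.get("meta", {}).get("headings") or []
        grouped[" > ".join(headings)].append(chunk["text"])
    return dict(grouped)


# Abbreviated copy of the example response above.
chunks_response = {
    "chunks": [
        {"id": 1, "text": "This is the first chunk of text from the document...",
         "meta": {"headings": ["Introduction"], "page": 1}},
        {"id": 2, "text": "Second chunk continues the content...",
         "meta": {"headings": ["Introduction", "Background"], "page": 1}},
    ],
    "count": 2,
}
sections = group_chunks_by_heading(chunks_response)
```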
Export Document
Supported Formats
markdown, html, json, text, doctags, document_tokens, chunks
Response: File download with appropriate MIME type
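A client may want to check a requested format against the supported list before calling the export endpoint. The sketch below does only that client-side check. The trimming and lowercasing are a convenience assumption, not documented server behavior, and `normalize_export_format` is a hypothetical name.

```python
# Format names as listed under Supported Formats.
SUPPORTED_FORMATS = (
    "markdown", "html", "json", "text", "doctags", "document_tokens", "chunks",
)


def normalize_export_format(name):
    """Return the canonical format name, or raise ValueError if unsupported."""
    candidate = name.strip().lower()
    if candidate not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported export format {name!r}; "
            f"expected one of {', '.join(SUPPORTED_FORMATS)}"
        )
    return candidate
```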