|
| 1 | +# Docling Serve backend (IBM Docling Workbench) |
| 2 | + |
| 3 | +## What this is for |
| 4 | + |
| 5 | +This backend connects Label Studio to **IBM Docling SaaS** using the Python **`DoclingServiceClient`** from the **`docling`** package (`from docling.service_client import DoclingServiceClient`). **Conversion runs on Docling’s servers**, not inside this container. For each task it resolves the file (usually via Label Studio–hosted storage), calls **`client.convert(source=…)`** with a local **`Path`** or an **`https://` URL string**, then maps **`result.document`** into **reactcode** predictions for the annotator. |
| 6 | + |
| 7 | +Use the **exact service URL** your tenant gives you (Integrate / Python snippet), including the path segment ending in **`/v1`**—for example |
| 8 | +`https://api.aws-c1.dcls.saas.ibm.com/<instance>/v1`. |
| 9 | + |
| 10 | +The **`docling`** `DoclingServiceClient` builds paths like **`/v1/convert/...`** on top of its `url=` argument. IBM’s URL already ends with **`/v1`**, which would otherwise produce **`…/v1/v1/…`** requests (404/400). This example **strips one trailing `/v1`** from **`DOCLING_SERVICE_URL`** before creating the client—keep pasting the Workbench value unchanged. |
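
The normalization described above can be sketched as follows. This is a minimal illustration, not the code in `model.py`; the helper name `strip_trailing_v1` is hypothetical, and it assumes that a trailing slash after `/v1` should also be tolerated.

```python
def strip_trailing_v1(url: str) -> str:
    """Remove at most one trailing '/v1' path segment (hypothetical helper)."""
    trimmed = url.rstrip("/")  # assumption: tolerate a trailing slash
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]
    return trimmed


# The Workbench value is pasted unchanged; the client then appends /v1/... itself.
base_url = strip_trailing_v1("https://api.aws-c1.dcls.saas.ibm.com/your-instance/v1")
```

Only one segment is removed, so a URL that legitimately contains `/v1` earlier in its path is left intact.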
| 11 | + |
| 12 | +Typical workflow: |
| 13 | + |
| 14 | +1. Tasks include a **file URL** (PDF, image, etc.)—often an upload or storage URL managed by Label Studio. |
| 15 | +2. Annotators run predictions (or batch predict); this ML backend fetches the file (unless you use remote-URL-only mode), calls **`DoclingServiceClient.convert`**, and returns layout as reactcode regions. |
| 16 | +3. Reviewers adjust regions or labels on top of Docling’s structure. |
| 17 | + |
| 18 | +You need the **full SaaS service URL** and API key from Workbench. Separately, the backend must often **download task files** through Label Studio when URLs point at your instance—see **Label Studio URL and API key** below. |
| 19 | + |
| 20 | +## Label Studio URL and API key |
| 21 | + |
| 22 | +Set **`LABEL_STUDIO_URL`** and **`LABEL_STUDIO_API_KEY`** in `docker-compose.yml` (or your shell) whenever tasks reference **files hosted by Label Studio**—uploads, cloud storage integrations, or other URLs that Label Studio resolves for the ML backend. |
| 23 | + |
| 24 | +By default, the backend downloads the file to a cache path and passes a **`Path`** to **`convert`**. Set **`DOCLING_CONVERT_REMOTE_URL_ONLY=true`** to pass the task’s **`https://` URL** directly to SaaS instead (this works only for URLs the Docling service can fetch without Label Studio auth). |
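
The choice between the two modes might look like the sketch below. The function names `env_flag` and `choose_convert_source` are hypothetical, and the set of accepted truthy strings is an assumption (the text above only documents `true`).

```python
from pathlib import Path
from typing import Optional, Union


def env_flag(value: Optional[str]) -> bool:
    """Interpret an environment string as a boolean (accepted values are an assumption)."""
    return (value or "").strip().lower() in {"1", "true", "yes"}


def choose_convert_source(task_url: str, cached_file: Path, remote_url_only: bool) -> Union[Path, str]:
    """Return what would be passed as convert(source=...)."""
    if remote_url_only and task_url.startswith("https://"):
        return task_url  # SaaS fetches the URL itself, so it must be publicly readable
    return cached_file  # default: file already downloaded through Label Studio
```

Falling back to the cached file for non-`https://` URLs is a design choice of this sketch; the real backend may instead reject such tasks.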
| 25 | + |
| 26 | +Practical notes: |
| 27 | + |
| 28 | +- **`LABEL_STUDIO_URL`** must be reachable **from where the ML backend runs**. From Docker on your laptop, **`http://localhost:8080`** usually does **not** work inside the container; use your machine’s hostname/IP, **`http://host.docker.internal:8080`** (Docker Desktop), or another URL the container can route to. This compose file includes `extra_hosts` for `host.docker.internal` on macOS/Linux-friendly setups. |
| 29 | +- **`LABEL_STUDIO_API_KEY`** should be a **Personal Access Token** (or equivalent) for a user that can read the project’s tasks and attachments. |
| 30 | + |
| 31 | +Always include **`http://` or `https://`** in `LABEL_STUDIO_URL`. More background is in the repository [README](../../../README.md) under allowing the ML backend to access Label Studio data. |
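
A quick pre-flight check for the two most common mistakes above (a missing scheme, or a loopback host used from inside a container) could be sketched like this. The helper name `check_label_studio_url` is hypothetical and is not part of the backend.

```python
from typing import List
from urllib.parse import urlparse


def check_label_studio_url(url: str) -> List[str]:
    """Return human-readable problems found in a LABEL_STUDIO_URL value."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// scheme")
    elif parsed.hostname in ("localhost", "127.0.0.1"):
        # Inside a container, loopback points at the container itself.
        problems.append("loopback host is usually unreachable from inside a container")
    return problems
```

A loopback address is still fine when both Label Studio and the backend run directly on the same machine, so treat that message as a warning rather than an error.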
| 32 | + |
| 33 | +## Prerequisites |
| 34 | + |
| 35 | +1. **`DOCLING_SERVICE_URL`** — full URL ending in **`/v1`** from IBM Docling Workbench (same as `DoclingServiceClient(url=…)`). |
| 36 | +2. **`DOCLING_SERVE_API_KEY`** — API key for `X-Api-Key` (name kept for backward compatibility). |
| 37 | +3. **`LABEL_STUDIO_URL`** / **`LABEL_STUDIO_API_KEY`** when tasks use Label Studio–hosted files (typical for uploads). |
| 38 | + |
| 39 | +## Quick start (Docker) |
| 40 | + |
| 41 | +```bash |
| 42 | +cd label_studio_ml/examples/docling |
| 43 | +# Set DOCLING_SERVICE_URL, DOCLING_SERVE_API_KEY, LABEL_STUDIO_URL, LABEL_STUDIO_API_KEY in docker-compose.yml |
| 44 | +docker compose up --build |
| 45 | +``` |
| 46 | + |
| 47 | +The ML backend listens on **`http://localhost:9090`**. Register that URL in your Label Studio project’s machine learning settings. |
| 48 | + |
| 49 | +## Docling SaaS configuration |
| 50 | + |
| 51 | +| Variable | Required | Description | |
| 52 | +|----------|----------|-------------| |
| 53 | +| `DOCLING_SERVICE_URL` | Yes | Full **`DoclingServiceClient`** URL including path to **`/v1`** (fallback env name: `DOCLING_SERVE_URL`). | |
| 54 | +| `DOCLING_SERVE_API_KEY` | Often | API key (`X-Api-Key`). Alias: `DOCLING_API_KEY`. | |
| 55 | +| `DOCLING_CONVERT_REMOTE_URL_ONLY` | No | If `true`, pass the task **`https://` URL** as `convert(source=url)` instead of downloading via Label Studio first. | |
| 56 | +| `DOCLING_CONVERT_SOURCE_HEADERS_JSON` | No | Extra HTTP headers (JSON object) merged into **`convert`** when using remote URLs / headers the client supports. | |
| 57 | +| `DOCLING_SERVE_TIMEOUT` | No | Job / read timeout in seconds (default `600`). | |
| 58 | +| `DOCLING_HTTP_CONNECT_TIMEOUT` | No | Connect timeout in seconds (default `30`). | |
| 59 | + |
| 60 | +Optional tuning: `DOCLING_PAGE_NO`, `DOCLING_PREDICT_READING_ORDER`, `DOCLING_READING_ORDER_LEVEL`, `DOCLING_CONTENT_LAYERS`, `DOCLING_REACTCODE_FROM_NAME`, `DOCLING_REACTCODE_TO_NAME`, `DOCLING_TASK_DATA_KEY`. |
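
To make the fallback names and defaults in the table concrete, here is a minimal sketch of reading them from the environment. The function names and the returned dictionary keys are hypothetical, and giving the primary variable precedence over its fallback or alias is an assumption.

```python
import json
import os
from typing import Mapping, Optional


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def load_docling_settings(env: Mapping[str, str] = os.environ) -> dict:
    url = first_set(env, "DOCLING_SERVICE_URL", "DOCLING_SERVE_URL")
    if url is None:
        raise ValueError("DOCLING_SERVICE_URL is required")
    headers = json.loads(env.get("DOCLING_CONVERT_SOURCE_HEADERS_JSON") or "{}")
    if not isinstance(headers, dict):
        raise ValueError("DOCLING_CONVERT_SOURCE_HEADERS_JSON must be a JSON object")
    return {
        "url": url,
        "api_key": first_set(env, "DOCLING_SERVE_API_KEY", "DOCLING_API_KEY"),
        "timeout": float(env.get("DOCLING_SERVE_TIMEOUT") or 600),
        "connect_timeout": float(env.get("DOCLING_HTTP_CONNECT_TIMEOUT") or 30),
        "source_headers": headers,
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests.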
| 61 | + |
| 62 | +The **`docling`** PyPI package (**≥2.90**) provides **`DoclingServiceClient`**; behavior follows **your SaaS tenant**, not necessarily open-source Docling docs. |
| 63 | + |
| 64 | +## Label Studio configuration |
| 65 | + |
| 66 | +| Variable | Description | |
| 67 | +|----------|-------------| |
| 68 | +| `LABEL_STUDIO_URL` | Base URL of Label Studio, reachable from this backend (see above). | |
| 69 | +| `LABEL_STUDIO_API_KEY` | Token so the backend can download task attachments when needed. | |
| 70 | + |
| 71 | +Predictions are **`reactcode`** regions (rectangle / polyline payloads with percent coordinates), aligned with the Label Studio Enterprise ReactCode UI—see **`docling_labeling_config.xml`** in this folder. |
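
Percent coordinates are relative to the page size. The sketch below shows only that arithmetic, assuming a pixel box with a top-left origin; the function name and the returned keys are hypothetical and do not describe the actual reactcode payload, which is defined by `model.py` and the labeling config.

```python
def box_to_percent(left: float, top: float, width: float, height: float,
                   page_width: float, page_height: float) -> dict:
    """Convert a pixel box (top-left origin, an assumption) to percent of page size."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "x": left / page_width * 100.0,
        "y": top / page_height * 100.0,
        "width": width / page_width * 100.0,
        "height": height / page_height * 100.0,
    }


region = box_to_percent(50, 100, 100, 200, page_width=200, page_height=400)
```

If the source boxes use a bottom-left origin instead, the `top` value has to be flipped against the page height before this conversion.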
| 72 | + |
| 73 | +## Running locally (without Docker) |
| 74 | + |
| 75 | +```bash |
| 76 | +pip install -r requirements-base.txt -r requirements.txt |
| 77 | +export DOCLING_SERVICE_URL=https://api.aws-c1.dcls.saas.ibm.com/your-instance/v1 |
| 78 | +export DOCLING_SERVE_API_KEY=your-api-key |
| 79 | +export LABEL_STUDIO_URL=http://host.docker.internal:8080 |
| 80 | +export LABEL_STUDIO_API_KEY=your-label-studio-token |
| 81 | +python _wsgi.py -p 9090 |
| 82 | +``` |
| 83 | + |
| 84 | +Adjust `LABEL_STUDIO_URL` if Label Studio runs on the same machine without Docker (for example `http://127.0.0.1:8080`). |
| 85 | + |
| 86 | +## Validate |
| 87 | + |
| 88 | +```bash |
| 89 | +curl http://localhost:9090/ |
| 90 | +``` |
| 91 | + |
| 92 | +Expected: `{"status":"UP"}`. |
| 93 | + |
| 94 | +## Troubleshooting |
| 95 | + |
| 96 | +### Wrong SaaS URL |
| 97 | + |
| 98 | +**`DOCLING_SERVICE_URL`** must match the URL Workbench gives you (through **`/v1`**). The backend normalizes it so routes are not doubled—see the note above if you see **`/v1/v1/`** in logs. |
| 99 | + |
| 100 | +### No predictions / “nothing happens” (no errors in the UI) |
| 101 | + |
| 102 | +Label Studio often shows **no message** when the ML backend returns **empty `results`** (HTTP 200 with an empty list). Check **Docker logs** for this container: |
| 103 | + |
| 104 | +```bash |
| 105 | +docker compose logs -f docling |
| 106 | +``` |
| 107 | + |
| 108 | +You should see a line like **`Docling predict: N task(s)`** whenever you run predictions. If you see **`Docling produced zero predictions`**, scroll up in the same log for **`No file URL found`** or Docling **`API error`** lines. |
| 109 | + |
| 110 | +Common fixes: |
| 111 | + |
| 112 | +1. **Placeholder URL**: Replace **`YOUR_INSTANCE_SEGMENT`** in **`DOCLING_SERVICE_URL`** with the real path from Workbench. |
| 113 | +2. **Wrong task field**: Tasks must expose a **file URL** under the key your labeling config expects. Override with **`DOCLING_TASK_DATA_KEY`** if needed. |
| 114 | +3. **`LOG_LEVEL`**: Defaults to **`INFO`** in `_wsgi.py` when unset; check that it has not been overridden to a quieter level. |
| 115 | +4. **Upload / `/storage-data/` URLs**: `model.py` downloads these via **`label_studio_sdk`**, which needs a correct **`LABEL_STUDIO_URL`** (same **scheme + host + port** as in your browser; a wrong host breaks auth headers), a valid **`LABEL_STUDIO_API_KEY`**, and network reachability from this container (`host.docker.internal` instead of `localhost` on Docker Desktop). For self-signed HTTPS, set **`VERIFY_SSL=false`** on the ML backend. Logs include the **HTTP status** and a **snippet** of the response when the download fails. |
| 116 | + |
| 117 | +Sanity checks: |
| 118 | + |
| 119 | +```bash |
| 120 | +curl -s http://localhost:9090/health |
| 121 | +curl -s http://localhost:9090/ |
| 122 | +``` |
| 123 | + |
| 124 | +Both should return JSON including **`"status":"UP"`**. |
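
The same sanity checks can be scripted. This is a minimal sketch using only the standard library; the function names are hypothetical, and the only assumption about the response is the `"status":"UP"` field mentioned above.

```python
import json
from urllib.request import urlopen


def is_up(body: str) -> bool:
    """True when the body is a JSON object whose status field is 'UP'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def check_backend(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> dict:
    """Query /health and / and report whether each endpoint looks healthy."""
    results = {}
    for path in ("/health", "/"):
        try:
            with urlopen(base_url + path, timeout=timeout) as response:
                results[path] = is_up(response.read().decode("utf-8", errors="replace"))
        except OSError:  # covers connection errors and HTTP error statuses
            results[path] = False
    return results
```

Calling `check_backend()` while the container is running should report both paths as healthy; any `False` entry points at the endpoint to inspect in the logs.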
| 125 | + |
| 126 | +### Empty or tiny downloaded files |
| 127 | + |
| 128 | +Check **`LABEL_STUDIO_URL`** / **`LABEL_STUDIO_API_KEY`** and logs for `Docling task … local_path=… size=…`. A size of **0** or failed stat (`-1` in logs) usually means the file did not download correctly before conversion. |
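
The size check behind that log line can be approximated as below. This is a sketch, not the code in `model.py`; the helper name `safe_size` is hypothetical, and the `-1` convention for a failed stat is taken from the description above.

```python
import os
import tempfile


def safe_size(path: str) -> int:
    """Return the file size in bytes, or -1 when the file cannot be stat'ed."""
    try:
        return os.stat(path).st_size
    except OSError:
        return -1


# Demonstration with a throwaway file and a path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    sample_size = safe_size(sample)
    missing_size = safe_size(os.path.join(tmp, "missing.pdf"))
```

A result of `0` or `-1` for a cached task file suggests checking the Label Studio credentials before looking at the Docling side.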
| 129 | + |
| 130 | +## Layout of this example |
| 131 | + |
| 132 | +Like other backends under `label_studio_ml/examples/` (for example `easyocr/`), this directory includes `_wsgi.py`, `model.py`, `requirements-base.txt`, `requirements.txt`, `Dockerfile`, `docker-compose.yml`, and tests. **`docker-compose.yml`** bind-mounts `./data/server` and `./data/.file-cache` for runtime caches; Docker creates those paths on the host when you first run Compose—they are not checked into git (see `.gitignore`). |