Commit d45dfe8

feat: Docling SaaS ML backend example (#889)
1 parent 55c5a07 commit d45dfe8

14 files changed

Lines changed: 1129 additions & 0 deletions

.github/workflows/tests.yml

Lines changed: 3 additions & 0 deletions
@@ -44,6 +44,9 @@ jobs:
       LOG_DIR: pytest_logs
       collect_analytics: false
       TEST_WITH_CPU: ${{ matrix.backend_dir_name == 'segment_anything_model' }}
+      # Keep the repo on pytest 6.x while preventing newer transitive anyio from auto-loading
+      # a pytest plugin that imports _pytest.scope (pytest 8+ only).
+      PYTEST_ADDOPTS: "-p no:anyio"
     steps:
       - uses: hmarr/debug-action@v3.0.0

README.md

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ Check the **Required parameters** column to see if you need to set any additiona
 | MODEL_NAME | Description | Pre-annotation | Interactive mode | Training | Required parameters | Arbitrary or Set Labels? |
 |--------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|------------------|----------|----------------------|----------------------------------------------------------------------------|
 | [bert_classifier](/label_studio_ml/examples/bert_classifier) | Text classification with [Huggingface](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) |||| None | Arbitrary|
+| [docling](/label_studio_ml/examples/docling) | Layout via [IBM Docling SaaS](https://www.ibm.com/products/docling) (`DoclingServiceClient`) → ReactCode regions |||| DOCLING_SERVICE_URL, DOCLING_SERVE_API_KEY, LABEL_STUDIO_URL (uploads) | Set (layout categories) |
 | [easyocr](/label_studio_ml/examples/easyocr) | Automated OCR. [EasyOCR](https://github.com/JaidedAI/EasyOCR) |||| None | Set (characters) |
 | [flair](/label_studio_ml/examples/flair) | NER by [flair](https://flairnlp.github.io/) |||| None | Arbitrary|
 | [gliner](/label_studio_ml/examples/gliner) | NER by [GLiNER](https://huggingface.co/spaces/tomaarsen/gliner_medium-v2.1) |||| None | Arbitrary|

label_studio_ml/examples/docling/.dockerignore

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Exclude everything
**

# Build / compose
!Dockerfile
!docker-compose.yml

# Application
!*.py

# Requirements
!requirements*.txt

# Tests (optional in image when TEST_ENV=true)
!test_api.py

label_studio_ml/examples/docling/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Docker Compose bind mounts (created on first `docker compose up`)
data/

label_studio_ml/examples/docling/Dockerfile

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# BuildKit required for RUN --mount below. Omitting `# syntax=docker/dockerfile:1` avoids extra frontend pulls that can trigger grpc errors on some hosts.
ARG PYTHON_VERSION=3.11

FROM python:${PYTHON_VERSION}-slim-bookworm AS python-base
ARG TEST_ENV

WORKDIR /app

ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PORT=${PORT:-9090} \
    PIP_CACHE_DIR=/.cache \
    WORKERS=1 \
    THREADS=8 \
    PIP_ROOT_USER_ACTION=ignore \
    DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       libgl1 libglib2.0-0 curl wget git procps \
    && rm -rf /var/lib/apt/lists/*

COPY requirements-base.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
    pip install -r requirements-base.txt

COPY requirements.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
    pip install -r requirements.txt

COPY requirements-test.txt .
RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
    if [ "$TEST_ENV" = "true" ]; then \
      pip install -r requirements-test.txt; \
    fi

COPY . .

EXPOSE 9090

CMD gunicorn --preload --bind :$PORT --workers $WORKERS --threads $THREADS --timeout 0 _wsgi:app

label_studio_ml/examples/docling/README.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
# Docling Serve backend (IBM Docling Workbench)

## What this is for

This backend connects Label Studio to **IBM Docling SaaS** using the Python **`DoclingServiceClient`** from the **`docling`** package (`from docling.service_client import DoclingServiceClient`). **Conversion runs on Docling’s servers**, not inside this container. For each task the backend resolves the file (usually via Label Studio–hosted storage), calls **`client.convert(source=…)`** with a local **`Path`** or an **`https://` URL string**, then maps **`result.document`** into **reactcode** predictions for the annotator.

Use the **exact service URL** your tenant gives you (Integrate / Python snippet), including the path segment ending in **`/v1`**, for example
`https://api.aws-c1.dcls.saas.ibm.com/<instance>/v1`.

The **`docling`** `DoclingServiceClient` builds paths like **`/v1/convert/...`** on top of its `url=` argument. IBM’s URL already ends with **`/v1`**, which would otherwise produce **`…/v1/v1/…`** requests (404/400). This example **strips one trailing `/v1`** from **`DOCLING_SERVICE_URL`** before creating the client, so paste the Workbench value unchanged.
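
The normalization described above can be sketched roughly as follows. This is a minimal illustration, not the code in `model.py`: the helper name `strip_trailing_v1` is hypothetical, and the choice to ignore a trailing slash before checking for `/v1` is an assumption.

```python
def strip_trailing_v1(url: str) -> str:
    """Remove at most one trailing '/v1' segment (hypothetical helper)."""
    # Assumption: a trailing slash is ignored before checking the suffix.
    trimmed = url.rstrip("/")
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]
    return trimmed


# The Workbench-style URL loses its final '/v1'; the client can then
# append its own '/v1/...' routes without doubling the segment.
print(strip_trailing_v1("https://api.example.com/my-instance/v1"))
# https://api.example.com/my-instance
```

Only one segment is removed, so a URL that legitimately contains `/v1` earlier in its path is left alone.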

Typical workflow:

1. Tasks include a **file URL** (PDF, image, etc.), often an upload or storage URL managed by Label Studio.
2. Annotators run predictions (or batch predict); this ML backend fetches the file (unless you use remote-URL-only mode), calls **`DoclingServiceClient.convert`**, and returns layout as reactcode regions.
3. Reviewers adjust regions or labels on top of Docling’s structure.

You need the **full SaaS service URL** and API key from Workbench. Separately, the backend must often **download task files** through Label Studio when URLs point at your instance; see **Label Studio URL and API key** below.

## Label Studio URL and API key

Set **`LABEL_STUDIO_URL`** and **`LABEL_STUDIO_API_KEY`** in `docker-compose.yml` (or your shell) whenever tasks reference **files hosted by Label Studio**: uploads, cloud storage integrations, or other URLs that Label Studio resolves for the ML backend.

By default the backend downloads each file to a cache path and passes a **`Path`** into **`convert`**. Set **`DOCLING_CONVERT_REMOTE_URL_ONLY=true`** to pass the task’s **`https://` URL** directly to SaaS (works only for URLs the Docling service can fetch without Label Studio auth).
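
The choice between a downloaded **`Path`** and a remote URL can be sketched like this. It is an illustration under assumptions, not the logic in `model.py`: `pick_convert_source` and its `download` callback are hypothetical names, and falling back to a download for non-`https://` URLs is a guess at sensible behavior.

```python
from pathlib import Path
from typing import Callable, Union


def pick_convert_source(
    task_url: str,
    remote_url_only: bool,
    download: Callable[[str], Path],
) -> Union[str, Path]:
    """Return what would be passed as convert(source=...) (hypothetical helper)."""
    # Remote-URL-only mode: hand the https:// URL straight to the service.
    if remote_url_only and task_url.startswith("https://"):
        return task_url
    # Default: fetch the file first (for example through Label Studio)
    # and pass the cached local path instead.
    return download(task_url)
```

In a real backend the `download` callback would be whatever resolves Label Studio–hosted files to the local cache.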

Practical notes:

- **`LABEL_STUDIO_URL`** must be reachable **from where the ML backend runs**. From Docker on your laptop, **`http://localhost:8080`** usually does **not** work inside the container; use your machine’s hostname/IP, **`http://host.docker.internal:8080`** (Docker Desktop), or another URL the container can route to. This compose file includes `extra_hosts` so that `host.docker.internal` also resolves on macOS and Linux setups.
- **`LABEL_STUDIO_API_KEY`** should be a **Personal Access Token** (or equivalent) for a user that can read the project’s tasks and attachments.

Always include **`http://` or `https://`** in `LABEL_STUDIO_URL`. More background is in the repository [README](../../../README.md), in the section on allowing the ML backend to access Label Studio data.
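
A quick pre-flight check for the two pitfalls above (missing scheme, `localhost` inside a container) might look like the sketch below. The function name `check_label_studio_url` and the warning texts are hypothetical; this example does not ship such a check.

```python
from urllib.parse import urlparse


def check_label_studio_url(url: str) -> list[str]:
    """Return human-readable warnings for a LABEL_STUDIO_URL value (hypothetical)."""
    problems = []
    parsed = urlparse(url)
    # The URL must carry an explicit http:// or https:// scheme.
    if parsed.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// scheme")
    # Inside a container, localhost refers to the container itself.
    if parsed.hostname in ("localhost", "127.0.0.1"):
        problems.append("localhost is usually unreachable from inside a container")
    return problems


print(check_label_studio_url("http://host.docker.internal:8080"))
# []
```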

## Prerequisites

1. **`DOCLING_SERVICE_URL`**: full URL ending in **`/v1`** from IBM Docling Workbench (same as `DoclingServiceClient(url=…)`).
2. **`DOCLING_SERVE_API_KEY`**: API key for `X-Api-Key` (name kept for backward compatibility).
3. **`LABEL_STUDIO_URL`** / **`LABEL_STUDIO_API_KEY`** when tasks use Label Studio–hosted files (typical for uploads).

## Quick start (Docker)

```bash
cd label_studio_ml/examples/docling
# Set DOCLING_SERVICE_URL, DOCLING_SERVE_API_KEY, LABEL_STUDIO_URL, LABEL_STUDIO_API_KEY in docker-compose.yml
docker compose up --build
```

The ML backend listens on **`http://localhost:9090`**. Register that URL in your Label Studio project’s machine learning settings.

## Docling SaaS configuration

| Variable | Required | Description |
|----------|----------|-------------|
| `DOCLING_SERVICE_URL` | Yes | Full **`DoclingServiceClient`** URL including path to **`/v1`** (fallback env name: `DOCLING_SERVE_URL`). |
| `DOCLING_SERVE_API_KEY` | Often | API key (`X-Api-Key`). Alias: `DOCLING_API_KEY`. |
| `DOCLING_CONVERT_REMOTE_URL_ONLY` | No | If `true`, pass the task **`https://` URL** as `convert(source=url)` instead of downloading via Label Studio first. |
| `DOCLING_CONVERT_SOURCE_HEADERS_JSON` | No | Extra HTTP headers (JSON object) merged into **`convert`** when using remote URLs / headers the client supports. |
| `DOCLING_SERVE_TIMEOUT` | No | Job / read timeout in seconds (default `600`). |
| `DOCLING_HTTP_CONNECT_TIMEOUT` | No | Connect timeout (default `30`). |

Optional tuning: `DOCLING_PAGE_NO`, `DOCLING_PREDICT_READING_ORDER`, `DOCLING_READING_ORDER_LEVEL`, `DOCLING_CONTENT_LAYERS`, `DOCLING_REACTCODE_FROM_NAME`, `DOCLING_REACTCODE_TO_NAME`, `DOCLING_TASK_DATA_KEY`.

The **`docling`** PyPI package (**≥2.90**) provides **`DoclingServiceClient`**; behavior follows **your SaaS tenant**, not necessarily open-source Docling docs.
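
Reading the variables in the table, including the fallback names and defaults, can be sketched as below. The helpers `first_env` and `load_docling_settings` and the returned dictionary keys are hypothetical; only the variable names and default values come from the table, and treating an empty value as unset is an assumption.

```python
import os
from typing import Mapping, Optional, Sequence


def first_env(
    names: Sequence[str],
    env: Mapping[str, str] = os.environ,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return default


def load_docling_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the Docling settings, primary names first, then fallbacks."""
    return {
        "service_url": first_env(["DOCLING_SERVICE_URL", "DOCLING_SERVE_URL"], env),
        "api_key": first_env(["DOCLING_SERVE_API_KEY", "DOCLING_API_KEY"], env),
        "timeout": float(first_env(["DOCLING_SERVE_TIMEOUT"], env, "600")),
        "connect_timeout": float(first_env(["DOCLING_HTTP_CONNECT_TIMEOUT"], env, "30")),
    }
```

Passing a plain dictionary instead of `os.environ` makes the lookup order easy to test.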

## Label Studio configuration

| Variable | Description |
|----------|-------------|
| `LABEL_STUDIO_URL` | Base URL of Label Studio, reachable from this backend (see above). |
| `LABEL_STUDIO_API_KEY` | Token so the backend can download task attachments when needed. |

Predictions are **`reactcode`** regions (rectangle / polyline payloads with percent coordinates), aligned with the Label Studio Enterprise ReactCode UI; see **`docling_labeling_config.xml`** in this folder.
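
To illustrate what "percent coordinates" means, the sketch below converts a pixel bounding box into percentages of the page size. It is not the reactcode payload schema: the function name and the `x` / `y` / `width` / `height` keys are hypothetical, and a top-left origin is assumed for the input box.

```python
def bbox_to_percent(left, top, right, bottom, page_width, page_height):
    """Convert a pixel box (top-left origin assumed) to percent of page size."""
    return {
        "x": 100.0 * left / page_width,
        "y": 100.0 * top / page_height,
        "width": 100.0 * (right - left) / page_width,
        "height": 100.0 * (bottom - top) / page_height,
    }


# A 100x200 px box at (50, 100) on a 200x400 px page.
print(bbox_to_percent(50, 100, 150, 300, 200, 400))
# {'x': 25.0, 'y': 25.0, 'width': 50.0, 'height': 50.0}
```

Percent values keep regions aligned when the page is rendered at a different size in the annotation UI.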

## Running locally (without Docker)

```bash
pip install -r requirements-base.txt -r requirements.txt
export DOCLING_SERVICE_URL=https://api.aws-c1.dcls.saas.ibm.com/your-instance/v1
export DOCLING_SERVE_API_KEY=your-api-key
export LABEL_STUDIO_URL=http://host.docker.internal:8080
export LABEL_STUDIO_API_KEY=your-label-studio-token
python _wsgi.py -p 9090
```

Adjust `LABEL_STUDIO_URL` if Label Studio runs on the same machine without Docker (for example `http://127.0.0.1:8080`).

## Validate

```bash
curl http://localhost:9090/
```

Expected: `{"status":"UP"}`.

## Troubleshooting

### Wrong SaaS URL

**`DOCLING_SERVICE_URL`** must match the URL Workbench gives you (through **`/v1`**). The backend normalizes it so routes are not doubled; see the note above if you see **`/v1/v1/`** in logs.

### No predictions / “nothing happens” (no errors in the UI)

Label Studio often shows **no message** when the ML backend returns **empty `results`** (HTTP 200 with an empty list). Check **Docker logs** for this container:

```bash
docker compose logs -f docling
```

You should see a line like **`Docling predict: N task(s)`** whenever you run predictions. If you see **`Docling produced zero predictions`**, scroll up in the same log for **`No file URL found`** or Docling **`API error`** lines.

Common fixes:

1. **Placeholder URL**: Replace **`YOUR_INSTANCE_SEGMENT`** in **`DOCLING_SERVICE_URL`** with the real path from Workbench.
2. **Wrong task field**: Tasks must expose a **file URL** under the key your labeling config expects (often **`undefined`**). Override with **`DOCLING_TASK_DATA_KEY`** if needed.
3. **`LOG_LEVEL`**: Defaults to **`INFO`** in `_wsgi.py` when unset.
4. **Upload / `/storage-data/` URLs**: `model.py` downloads via **`label_studio_sdk`** using **`LABEL_STUDIO_URL`** (same **scheme + host + port** as in your browser; wrong host breaks auth headers), **`LABEL_STUDIO_API_KEY`**, and network reachability from this container (`host.docker.internal` instead of `localhost` on Docker Desktop). Self-signed HTTPS: set **`VERIFY_SSL=false`** on the ML backend. Logs include the **HTTP status / snippet** when the download fails.

Sanity checks:

```bash
curl -s http://localhost:9090/health
curl -s http://localhost:9090/
```

Both should return JSON including **`"status":"UP"`**.
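
The same sanity checks can be scripted with the Python standard library. This is a sketch: `is_up` and `check_backend` are hypothetical helper names, and the default base URL simply mirrors the `curl` commands above.

```python
import json
import urllib.request


def is_up(body: str) -> bool:
    """True if the response body is a JSON object with status == 'UP'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def check_backend(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> dict:
    """Query /health and / on the ML backend and report whether each is UP."""
    results = {}
    for path in ("/health", "/"):
        # urlopen raises URLError if the backend is not reachable.
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            results[path] = is_up(resp.read().decode("utf-8"))
    return results
```

Call `check_backend()` once the container is running; a connection error at this point usually means the port mapping or the container itself is the problem, not Docling.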

### Empty or tiny downloaded files

Check **`LABEL_STUDIO_URL`** / **`LABEL_STUDIO_API_KEY`** and logs for `Docling task … local_path=… size=…`. A size of **0** or failed stat (`-1` in logs) usually means the file did not download correctly before conversion.
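
The size convention in that log line (byte count, or `-1` when the file cannot be stat-ed) can be reproduced with a small helper. This is an assumption about how such a value could be computed, not the code in `model.py`; `safe_size` and `looks_downloaded` are hypothetical names.

```python
import os


def safe_size(path) -> int:
    """Return the file size in bytes, or -1 if the path cannot be stat-ed."""
    try:
        return os.stat(path).st_size
    except OSError:
        return -1


def looks_downloaded(path, min_bytes: int = 1) -> bool:
    """Treat missing files (-1) and empty files (0) as failed downloads."""
    return safe_size(path) >= min_bytes
```

Checking the size before calling `convert` turns a confusing conversion error into a clear download failure.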

## Layout of this example

Like other backends under `label_studio_ml/examples/` (for example `easyocr/`), this directory includes `_wsgi.py`, `model.py`, `requirements-base.txt`, `requirements.txt`, `Dockerfile`, `docker-compose.yml`, and tests. **`docker-compose.yml`** bind-mounts `./data/server` and `./data/.file-cache` for runtime caches; Docker creates those paths on the host when you first run Compose, and they are not checked into git (see `.gitignore`).

label_studio_ml/examples/docling/_wsgi.py

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
import os
import argparse
import json
import logging
import logging.config

_LOG_LEVEL = (os.getenv("LOG_LEVEL") or "INFO").upper()

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "standard": {
            "format": "[%(asctime)s] [%(levelname)s] [%(name)s::%(funcName)s::%(lineno)d] %(message)s"
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": _LOG_LEVEL,
            "stream": "ext://sys.stdout",
            "formatter": "standard"
        }
    },
    "root": {
        "level": _LOG_LEVEL,
        "handlers": [
            "console"
        ],
        "propagate": True
    }
})

from label_studio_ml.api import init_app
from model import Docling


_DEFAULT_CONFIG_PATH = os.path.join(os.path.dirname(__file__), 'config.json')


def get_kwargs_from_config(config_path=_DEFAULT_CONFIG_PATH):
    if not os.path.exists(config_path):
        return dict()
    with open(config_path) as f:
        config = json.load(f)
    assert isinstance(config, dict)
    return config


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Label studio')
    parser.add_argument(
        '-p', '--port', dest='port', type=int, default=9090,
        help='Server port')
    parser.add_argument(
        '--host', dest='host', type=str, default='0.0.0.0',
        help='Server host')
    parser.add_argument(
        # Split on the first '=' only, so values that contain '=' stay intact.
        '--kwargs', '--with', dest='kwargs', metavar='KEY=VAL', nargs='+', type=lambda kv: kv.split('=', 1),
        help='Additional LabelStudioMLBase model initialization kwargs')
    parser.add_argument(
        '-d', '--debug', dest='debug', action='store_true',
        help='Switch debug mode')
    parser.add_argument(
        '--log-level', dest='log_level', choices=['DEBUG', 'INFO', 'WARNING', 'ERROR'], default=None,
        help='Logging level')
    parser.add_argument(
        '--model-dir', dest='model_dir', default=os.path.dirname(__file__),
        help='Directory where models are stored (relative to the project directory)')
    parser.add_argument(
        '--check', dest='check', action='store_true',
        help='Validate model instance before launching server')
    parser.add_argument('--basic-auth-user',
                        default=os.environ.get('ML_SERVER_BASIC_AUTH_USER', None),
                        help='Basic auth user')

    parser.add_argument('--basic-auth-pass',
                        default=os.environ.get('ML_SERVER_BASIC_AUTH_PASS', None),
                        help='Basic auth pass')

    args = parser.parse_args()

    # setup logging level
    if args.log_level:
        logging.root.setLevel(args.log_level)

    def isfloat(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    def parse_kwargs():
        param = dict()
        for k, v in args.kwargs:
            if v.isdigit():
                param[k] = int(v)
            elif v == 'True' or v == 'true':
                param[k] = True
            elif v == 'False' or v == 'false':
                param[k] = False
            elif isfloat(v):
                param[k] = float(v)
            else:
                param[k] = v
        return param

    kwargs = get_kwargs_from_config()

    if args.kwargs:
        kwargs.update(parse_kwargs())

    if args.check:
        print('Check "' + Docling.__name__ + '" instance creation..')
        model = Docling(**kwargs)

    app = init_app(model_class=Docling, basic_auth_user=args.basic_auth_user, basic_auth_pass=args.basic_auth_pass)

    app.run(host=args.host, port=args.port, debug=args.debug)

else:
    # for uWSGI use
    app = init_app(model_class=Docling)
