Skip to content

feat: Add presigned artifact targets and internal result model cleanup#155

Draft
cau-git wants to merge 5 commits into
mainfrom
cau/presigned-url-target-result-models
Draft

feat: Add presigned artifact targets and internal result model cleanup#155
cau-git wants to merge 5 commits into
mainfrom
cau/presigned-url-target-result-models

Conversation

@cau-git
Copy link
Copy Markdown
Member

@cau-git cau-git commented May 21, 2026

This change prepares docling-jobkit for server-managed artifact delivery while cleaning up the internal result model boundary. The main goal is to let service/orchestrated execution paths write converted outputs to managed storage and return artifact references without exposing storage credentials, while keeping existing response shapes stable for current consumers.

  • Add support for PresignedUrlTarget in docling-jobkit, including server-side target configuration, an S3-backed presigned artifact processor, and orchestrator wiring for local, RQ, and Ray execution paths.
  • Materialize converted outputs into per-document artifacts, upload them to managed storage, and return PresignedArtifactResult with typed artifact metadata, presigned URLs, and server-controlled expiration.
  • Reuse the shared service datamodel from docling, including PresignedUrlTarget, ArtifactRef, DocumentArtifactItem, DocumentResultItem, and PresignedArtifactResult, so jobkit stays aligned with the pinned upstream service models.
  • Clean up the internal domain/wire boundary by building DocumentResultItem internally and mapping to legacy wire models via shared helpers, instead of constructing ExportResult directly in production conversion and chunking code.
  • Refactor export materialization so zip/put flows and presigned-target flows share one artifact creation path, reducing drift risk between file export behaviors.
  • Explicitly reject presigned_url targets in the legacy local and multiprocess CLI paths, since those paths do not have the task and server target configuration required to generate managed artifact URLs.
  • Add focused tests covering the new presigned result flow, shared-model shims, CLI validation, and the touched conversion/orchestrator code paths.

see also: docling-project/docling#3486

- support  in jobkit with server-managed S3 uploads and presigned artifact results
- wire target configuration through local, RQ, and Ray orchestrators and reject  in legacy CLI paths
- switch conversion and chunking flows to build  internally and map to shared wire models
- unify export materialization across zip/put and presigned flows and add focused regression tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git marked this pull request as draft May 21, 2026 19:47
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 21, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

cau-git added 2 commits May 21, 2026 22:22
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git force-pushed the cau/presigned-url-target-result-models branch from bae9255 to 4eaf1ac Compare May 22, 2026 11:31
Comment thread docling_jobkit/cli/validation.py Outdated
from docling.datamodel.service.targets import PresignedUrlTarget


def ensure_legacy_target_supported(target: object) -> None:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be handled simpler than requiring this check.

Comment thread docling_jobkit/cli/local.py Outdated
config = JobConfig(**raw_data)
except FileNotFoundError:
typer.echo(f"❌ File not found: {config_file}")
raise typer.Exit(1)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was typer.Exit(1) added, isnt' that out of scope?

Comment thread docling_jobkit/config/target_config.py Outdated
url_expiration: int = Field(default=3600, ge=60, le=604800)


class TargetConfig(BaseModel):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a stub class, why needed?

Comment thread docling_jobkit/config/target_config.py Outdated
Comment on lines +9 to +13
date_partition_format: str = "%Y%m%d"
include_tenant_in_path: bool = True
include_task_id_in_path: bool = True
attach_metadata: bool = True
metadata_fields: list[str] = ["tenant_id", "user_id", "project_id"]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like overengineered configurability.

ArtifactType = Literal["json", "html", "markdown", "text", "doctags", "resource_bundle"]


class S3PresignedTargetProcessor(BaseTargetProcessor):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an entire implementation for this S3PresignedTargetProcessor repeating all logic. There is already an S3TargetProcessor which one could factor things from to share.

Comment on lines +27 to +34
if isinstance(target, PresignedUrlTarget):
if target_config is None or target_config.s3_presigned is None:
raise ValueError(
"PresignedUrlTarget requires TargetConfig.s3_presigned in orchestrator config"
)
if task is None:
raise ValueError("PresignedUrlTarget requires the current task context")
return S3PresignedTargetProcessor(target_config.s3_presigned, task)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain why this logic is needed and what the extra arguments do.

Comment thread docling_jobkit/convert/chunking.py Outdated
status=exportable_document.status,
timings=exportable_document.timings,
errors=errors,
doc_result = _to_export_result(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_to_export_result was removed as there should be no need after the data models are symmetric now.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Comment thread docling_jobkit/cli/local.py
Comment on lines 62 to 65
JobTaskTarget = Annotated[
ZipTarget | LocalPathTarget | S3Target | GoogleDriveTarget,
ZipTarget | LocalPathTarget | S3Target | GoogleDriveTarget | PresignedUrlTarget,
Field(discriminator="kind"),
]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no need to add PresignedUrlTarget to the JobTaskTarget and then later catch it in validation with if isinstance(config.target, PresignedUrlTarget):

Why is this done?

Comment thread docling_jobkit/connectors/s3_presigned_target_processor.py
Comment thread docling_jobkit/convert/results.py Outdated
Comment on lines +550 to +554
with get_target_processor(
task.target,
task=task,
s3_presigned_config=s3_presigned_config,
) as target_processor:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the call site that needs the extra arguments for get_target_processor. It is very questionable why this is needed when sitting in a elif isinstance(task.target, PresignedUrlTarget): - constructing the target processor directly would be more straight forward. get_target_processor is not a real factory, rather a utility.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant