feat: Add presigned artifact targets and internal result model cleanup#155
feat: Add presigned artifact targets and internal result model cleanup#155cau-git wants to merge 5 commits into
Conversation
- support in jobkit with server-managed S3 uploads and presigned artifact results - wire target configuration through local, RQ, and Ray orchestrators and reject in legacy CLI paths - switch conversion and chunking flows to build internally and map to shared wire models - unify export materialization across zip/put and presigned flows and add focused regression tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
|
✅ DCO Check Passed Thanks @cau-git, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
bae9255 to
4eaf1ac
Compare
| from docling.datamodel.service.targets import PresignedUrlTarget | ||
|
|
||
|
|
||
| def ensure_legacy_target_supported(target: object) -> None: |
There was a problem hiding this comment.
could be handled simpler than requiring this check.
| config = JobConfig(**raw_data) | ||
| except FileNotFoundError: | ||
| typer.echo(f"❌ File not found: {config_file}") | ||
| raise typer.Exit(1) |
There was a problem hiding this comment.
why was typer.Exit(1) added, isnt' that out of scope?
| url_expiration: int = Field(default=3600, ge=60, le=604800) | ||
|
|
||
|
|
||
| class TargetConfig(BaseModel): |
There was a problem hiding this comment.
Looks like a stub class, why needed?
| date_partition_format: str = "%Y%m%d" | ||
| include_tenant_in_path: bool = True | ||
| include_task_id_in_path: bool = True | ||
| attach_metadata: bool = True | ||
| metadata_fields: list[str] = ["tenant_id", "user_id", "project_id"] |
There was a problem hiding this comment.
Looks like overengineered configurability.
| ArtifactType = Literal["json", "html", "markdown", "text", "doctags", "resource_bundle"] | ||
|
|
||
|
|
||
| class S3PresignedTargetProcessor(BaseTargetProcessor): |
There was a problem hiding this comment.
Why do we need an entire implementation for this S3PresignedTargetProcessor repeating all logic. There is already an S3TargetProcessor which one could factor things from to share.
| if isinstance(target, PresignedUrlTarget): | ||
| if target_config is None or target_config.s3_presigned is None: | ||
| raise ValueError( | ||
| "PresignedUrlTarget requires TargetConfig.s3_presigned in orchestrator config" | ||
| ) | ||
| if task is None: | ||
| raise ValueError("PresignedUrlTarget requires the current task context") | ||
| return S3PresignedTargetProcessor(target_config.s3_presigned, task) |
There was a problem hiding this comment.
Explain why this logic is needed and what the extra arguments do.
| status=exportable_document.status, | ||
| timings=exportable_document.timings, | ||
| errors=errors, | ||
| doc_result = _to_export_result( |
There was a problem hiding this comment.
_to_export_result was removed as there should be no need after the data models are symmetric now.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
| JobTaskTarget = Annotated[ | ||
| ZipTarget | LocalPathTarget | S3Target | GoogleDriveTarget, | ||
| ZipTarget | LocalPathTarget | S3Target | GoogleDriveTarget | PresignedUrlTarget, | ||
| Field(discriminator="kind"), | ||
| ] |
There was a problem hiding this comment.
There was no need to add PresignedUrlTarget to the JobTaskTarget and then later catch it in validation with if isinstance(config.target, PresignedUrlTarget):
Why is this done?
| with get_target_processor( | ||
| task.target, | ||
| task=task, | ||
| s3_presigned_config=s3_presigned_config, | ||
| ) as target_processor: |
There was a problem hiding this comment.
This is the call site that needs the extra arguments for get_target_processor. It is very questionable why this is needed when sitting in a elif isinstance(task.target, PresignedUrlTarget): - constructing the target processor directly would be more straight forward. get_target_processor is not a real factory, rather a utility.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
This change prepares
docling-jobkitfor server-managed artifact delivery while cleaning up the internal result model boundary. The main goal is to let service/orchestrated execution paths write converted outputs to managed storage and return artifact references without exposing storage credentials, while keeping existing response shapes stable for current consumers.PresignedUrlTargetindocling-jobkit, including server-side target configuration, an S3-backed presigned artifact processor, and orchestrator wiring for local, RQ, and Ray execution paths.PresignedArtifactResultwith typed artifact metadata, presigned URLs, and server-controlled expiration.docling, includingPresignedUrlTarget,ArtifactRef,DocumentArtifactItem,DocumentResultItem, andPresignedArtifactResult, so jobkit stays aligned with the pinned upstream service models.DocumentResultIteminternally and mapping to legacy wire models via shared helpers, instead of constructingExportResultdirectly in production conversion and chunking code.presigned_urltargets in the legacy local and multiprocess CLI paths, since those paths do not have the task and server target configuration required to generate managed artifact URLs.see also: docling-project/docling#3486