Passing custom metadata per document with docling #3101

@julian-risch

Transferred from docling-project/docling-haystack#8 by @r-gg

Issue Description

When converting multiple documents, I want to pass several metadata fields that differ for each document. This functionality is available in several of the default Haystack converters (e.g. MarkdownToDocument). Just as with those converters, one should be able to pass either:

  1. a single dictionary whose fields will be added to the metadata of all chunks or
  2. a list of dictionaries having the same length as the list of passed documents (mapping fields of each dictionary to the metadata fields of the chunks of the respective document).
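
To make the two accepted shapes concrete, here is a minimal sketch of how a meta argument could be expanded into one dictionary per document. The helper name normalize_meta and its error messages are hypothetical and only illustrate the behaviour described above; they are not taken from the docling-haystack code.

```python
from typing import Any, Dict, List, Optional, Union

MetaArg = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: MetaArg, sources_count: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: return exactly one metadata dict per source."""
    if meta is None:
        # No metadata given: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # Case 1: a single dict is copied to every source.
        return [dict(meta) for _ in range(sources_count)]
    if isinstance(meta, list):
        # Case 2: one dict per source, so the lengths must match.
        if len(meta) != sources_count:
            raise ValueError("meta must have the same length as the list of sources")
        return [dict(m) for m in meta]
    raise TypeError("meta must be None, a dict, or a list of dicts")


# Usage: one shared dict versus one dict per document.
shared = normalize_meta({"project": "demo"}, 2)
per_doc = normalize_meta([{"doc_id": "a"}, {"doc_id": "b"}], 2)
```

Copying each dictionary means that later changes to one document's metadata cannot leak into another's.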

However, this is not present in the current implementation. A workaround that sets the metadata after conversion (with export type DOC_CHUNKS) is not possible, for the following reason: when working with multiple documents (i.e. len(paths) > 1), it is difficult to track which chunks belong to which document. Some documents can have the same filename and binary_hash, so for chunks from these documents it is impossible to tell which original document a chunk belongs to.
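
The ambiguity can be illustrated with a small sketch. The chunk layout below (an "origin" dict holding filename and binary_hash) and the example paths and values are assumptions made for illustration, not the exact structure the converter emits.

```python
from collections import defaultdict

# Two identical files stored in different directories, e.g.
# /data/2023/report.pdf and /data/2024/report.pdf (hypothetical paths).
# Their chunks carry the same filename and the same binary_hash.
chunks = [
    {"content": "chunk from the 2023 copy", "origin": {"filename": "report.pdf", "binary_hash": 123}},
    {"content": "chunk from the 2024 copy", "origin": {"filename": "report.pdf", "binary_hash": 123}},
]

# Try to recover the source document by grouping on (filename, binary_hash).
groups = defaultdict(list)
for chunk in chunks:
    key = (chunk["origin"]["filename"], chunk["origin"]["binary_hash"])
    groups[key].append(chunk["content"])

# Both chunks land in one group, so per-document metadata
# cannot be assigned after conversion.
number_of_groups = len(groups)
```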

Possible Solution

Add an optional meta parameter to the component's DoclingConverter.run() method and extend the existing meta dictionaries (returned by the _meta_extractor) with the dictionary or dictionaries passed in the new meta parameter.
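
A rough sketch of the merge step, under stated assumptions: the extracted metadata is modelled as a list of per-chunk dicts for each source, and merge_user_meta is a hypothetical function name. Letting user-supplied keys win on a collision is one possible choice, not something this issue prescribes.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def merge_user_meta(
    extracted: List[List[Meta]],
    meta: Optional[Union[Meta, List[Meta]]] = None,
) -> List[Meta]:
    """Hypothetical sketch: extend each chunk's extracted meta with user meta.

    `extracted` holds, for every source, the list of meta dicts of its chunks.
    """
    if meta is None:
        per_source: List[Meta] = [{} for _ in extracted]
    elif isinstance(meta, dict):
        # A single dict applies to the chunks of every source.
        per_source = [meta for _ in extracted]
    else:
        if len(meta) != len(extracted):
            raise ValueError("meta must have the same length as the list of sources")
        per_source = list(meta)

    merged: List[Meta] = []
    for chunk_metas, user_meta in zip(extracted, per_source):
        for chunk_meta in chunk_metas:
            # Assumption: user-supplied keys override extracted keys.
            merged.append({**chunk_meta, **user_meta})
    return merged


# Usage: two sources, the first with two chunks, the second with one.
extracted = [[{"page": 1}, {"page": 2}], [{"page": 1}]]
out = merge_user_meta(extracted, [{"doc_id": "a"}, {"doc_id": "b"}])
```

Because the merge happens while the chunks are still grouped by source, no filename or hash is needed to match chunks to documents.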

Comment by @lambda-science

Without the possibility to pass metadata, the converter is not really useful, to be honest.
Also, to bring it more in line with other Haystack converters, the main argument shouldn't be called paths but sources, like:

sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,

I made a fork that can solve your problem: docling-project/docling-haystack#9
