feat: include hyperlink addresses in DOCXToDocument output #9109

Merged
julian-risch merged 6 commits into main from docx-links on Mar 25, 2025

@julian-risch julian-risch commented Mar 25, 2025

Related Issues

Proposed Changes:

  • Add DOCXLinkFormat with the three options "markdown", "plain", and "none"
  • Add link_format init parameter to DOCXToDocument with "none" as the default, so existing behavior stays unchanged unless a format is chosen
  • Detect hyperlinks in text paragraphs and convert them in _extract_elements if link_format is not "none"
  • Add new tests
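
To make the three options concrete, here is a minimal standalone sketch of how a detected hyperlink could be rendered under each one. It is not the code from this PR: `LinkFormat` and `render_link` are hypothetical names used for illustration, and the exact output shapes for "markdown" and "plain" are assumptions rather than something stated in the description above.

```python
from enum import Enum
from typing import Union


class LinkFormat(str, Enum):
    """Hypothetical stand-in for the three DOCXLinkFormat options."""

    MARKDOWN = "markdown"
    PLAIN = "plain"
    NONE = "none"


def render_link(
    text: str,
    address: str,
    link_format: Union[str, LinkFormat] = LinkFormat.NONE,
) -> str:
    """Render one hyperlink's visible text and address in the chosen format.

    Assumed output shapes (not confirmed by the PR description):
      markdown -> [text](address)
      plain    -> text (address)
      none     -> text only, the address is dropped
    """
    # Accept either the enum member or its string value; an unknown
    # string raises ValueError here.
    link_format = LinkFormat(link_format)
    if link_format is LinkFormat.MARKDOWN:
        return f"[{text}]({address})"
    if link_format is LinkFormat.PLAIN:
        return f"{text} ({address})"
    return text


if __name__ == "__main__":
    for fmt in LinkFormat:
        print(fmt.value, "->", render_link("Haystack", "https://example.com", fmt))
```

Going by the description above, the converter itself would presumably be configured with something like `DOCXToDocument(link_format="markdown")`, with "none" leaving the extracted text as it was before this change.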

How did you test it?

  • newly added unit tests

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions Bot added topic:tests type:documentation Improvements on the docs labels Mar 25, 2025
@julian-risch julian-risch marked this pull request as ready for review March 25, 2025 11:17
@julian-risch julian-risch requested review from a team as code owners March 25, 2025 11:17
@julian-risch julian-risch requested review from anakin87 and dfokina and removed request for a team March 25, 2025 11:17

@anakin87 anakin87 left a comment

The implementation looks good.

I left some minor comments.
Feel free to address those of them which make sense.

Comment thread haystack/components/converters/docx.py
Comment thread haystack/components/converters/docx.py Outdated

coveralls commented Mar 25, 2025

Pull Request Test Coverage Report for Build 14060583426

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 42 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+1.0%) to 91.131%

Files with Coverage Reduction      New Missed Lines   %
components/converters/docx.py      1                  99.43%
core/pipeline/pipeline.py          3                  94.83%
core/pipeline/async_pipeline.py    38                 62.42%

Totals
Change from base Build 14043167436: 1.0%
Covered Lines: 411
Relevant Lines: 451

💛 - Coveralls

@julian-risch julian-risch enabled auto-merge (squash) March 25, 2025 13:18
@julian-risch julian-risch merged commit e64db61 into main Mar 25, 2025
18 checks passed
@julian-risch julian-risch deleted the docx-links branch March 25, 2025 13:33

Development

Successfully merging this pull request may close these issues.

DOCX Converter does not resolve link information

3 participants