fixed bug where MarkdownHeaderSplitter's split result missed the first direct parent's header in the metadata and added lark to pyproject.toml#11042
Conversation
…t direct parent's header in the metadata and added lark to pyproject.toml
|
@MechaCritter is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
Outdated
Show resolved
Hide resolved
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
Outdated
Show resolved
Hide resolved
|
@MechaCritter thanks for opening a PR! I see you are missing tests for the change. Could you add a test based on your example showing how your changes fix the issue. |
…eaders_for_first_child" to prove my concept.
…der_splitter' into bugfix/markdown_header_splitter
I added the corresponding test. Here are the corresponding log files as proof that my fix works: Success run on bugfix branch: Failed run on current main branch: I ran all other tests again for the markdown_header_splitter (test_markdown_header_splitter.py). Everything passed, so no break changes were introduced. |
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
Outdated
Show resolved
Hide resolved
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
Outdated
Show resolved
Hide resolved
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
Outdated
Show resolved
Hide resolved
…derSplitter-b5db96e19011b6b9.yaml written description in past tense Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…derSplitter-b5db96e19011b6b9.yaml written description in past tense Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…derSplitter-b5db96e19011b6b9.yaml written description in past tense Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
Outdated
Show resolved
Hide resolved
|
@MechaCritter a few more minor comments but otherwise looks good! The logic is much easier to understand the previous as well! |
…derSplitter-b5db96e19011b6b9.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
removed the "secondary_split="word"" argument as error happens regardless of if secondary split is present Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
broke down text in unit test for readability Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
added sanity checking (reconstructed text is equal original text) Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
|
I'm actually a bit confused: the CI pipeline failed a mypy test, which has nothing to do with my changes. @sjrl could you help me out? |
|
Hey @MechaCritter yeah it was related to a new release of sentence transformers. Fix incoming here #11066 once merged we can update your branch |
Coverage Report for CI Build 24196334786Coverage decreased (-0.002%) to 92.841%Details
Uncovered ChangesNo uncovered changes found. Coverage Regressions6 previously-covered lines in 1 file lost coverage.
Coverage Stats
💛 - Coveralls |
Related Issues
s header in theparent_headers` key in the metadata of the chunk #11041Proposed Changes:
The error that occured in the issue #11041 happens when a parent header has its own content chunk before the first
child. In that path the code clears
active_parents, so the first child loses itsancestor while later siblings inherit it correctly.
The fix is to derive
parent_headersfrom the currentheader_stackinstead ofcarrying a separate mutable “active parents” list that gets out of sync after contentful
headers.
How did you test it?
Tested locally using:
Notes for the reviewer
You can copy the
haystack/components/preprocessors/markdown_header_splitter.pyfile and rerun the code snippet provided in issue #11041. The bug does not occur anymore afterward.Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.