Skip to content

Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names#3959

Merged
badGarnet merged 7 commits intoUnstructured-IO:mainfrom
srisudarsan:main
Mar 21, 2025
Merged

Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names#3959
badGarnet merged 7 commits intoUnstructured-IO:mainfrom
srisudarsan:main

Conversation

@srisudarsan
Copy link
Copy Markdown
Contributor

Instead of looking for presence of word/document.xml , ppt/presentation.xml and xl/workbook.xml to identify DOCX,PPTX and XLSX files, we look for prefix word/document*.xml, ppt/presentation*.xml and xl/workbook*.xml as certain files generated from office365 has files with different names.
Fixes #3937

@srisudarsan
Copy link
Copy Markdown
Contributor Author

Potentially looked at another fix, to read the contents of [Content_Types].xml in the ZIP and identify the content type present in it. But this would involve reading the XML and parsing it to identify the necessary info.

image

Let me know your thought on the same

@badGarnet badGarnet requested a review from scanny March 17, 2025 13:56
@badGarnet
Copy link
Copy Markdown
Collaborator

Hi @srisudarsan, thanks for the contribution. The fix looks reasonable. Could you please update the changelog and bump the version (convention is to bump from devn to dev(n+1)). Then I can let the ci run (ci would stop if detect changelog workflow fails).

Copy link
Copy Markdown
Contributor

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@badGarnet looks good to me :)

filenames = zip.namelist()

if "word/document.xml" in filenames:
if any(re.match(r"word/document.*\.xml$", filename) for filename in filenames):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final word on is_docx will be a package part with the ContentType you highlighted in the .rels file, but I'm fine with this as a step toward that :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligned too. Will raise another PR with changes incorporating the content type identified from [Content_Types].xml

@srisudarsan
Copy link
Copy Markdown
Contributor Author

Hi @srisudarsan, thanks for the contribution. The fix looks reasonable. Could you please update the changelog and bump the version (convention is to bump from devn to dev(n+1)). Then I can let the ci run (ci would stop if detect changelog workflow fails).

Done

@srisudarsan
Copy link
Copy Markdown
Contributor Author

@badGarnet , seems like the previous run failed, I have updated the CHANGELOG with the recent number, can you approve this flow now ?

@badGarnet
Copy link
Copy Markdown
Collaborator

@badGarnet , seems like the previous run failed, I have updated the CHANGELOG with the recent number, can you approve this flow now ?

seem you forgot to bump the version in __version__.py: https://github.com/Unstructured-IO/unstructured/actions/runs/13950355943/job/39072924215?pr=3959#step:5:1012

@srisudarsan
Copy link
Copy Markdown
Contributor Author

@badGarnet , seems like the previous run failed, I have updated the CHANGELOG with the recent number, can you approve this flow now ?

seem you forgot to bump the version in __version__.py: https://github.com/Unstructured-IO/unstructured/actions/runs/13950355943/job/39072924215?pr=3959#step:5:1012

Thanks for highlighting, wasn't aware, done now

@badGarnet
Copy link
Copy Markdown
Collaborator

@badGarnet , seems like the previous run failed, I have updated the CHANGELOG with the recent number, can you approve this flow now ?

seem you forgot to bump the version in __version__.py: https://github.com/Unstructured-IO/unstructured/actions/runs/13950355943/job/39072924215?pr=3959#step:5:1012

Thanks for highlighting, wasn't aware, done now

Linting is now complaining of missing empty line. I would suggest you install pre-commit tool in your env via pip install pre-commit && pre-commit install. The repo has a pre-commit configuration to automatically fix most of the linting issues on changed files and it is a real life saver. Once you've done the install just edit and save the version.py file by adding a new line at the end of the file and commit. The pre-commit tool will fix any remaining issues.

@srisudarsan
Copy link
Copy Markdown
Contributor Author

@badGarnet , seems like the previous run failed, I have updated the CHANGELOG with the recent number, can you approve this flow now ?

seem you forgot to bump the version in __version__.py: https://github.com/Unstructured-IO/unstructured/actions/runs/13950355943/job/39072924215?pr=3959#step:5:1012

Thanks for highlighting, wasn't aware, done now

Linting is now complaining of missing empty line. I would suggest you install pre-commit tool in your env via pip install pre-commit && pre-commit install. The repo has a pre-commit configuration to automatically fix most of the linting issues on changed files and it is a real life saver. Once you've done the install just edit and save the version.py file by adding a new line at the end of the file and commit. The pre-commit tool will fix any remaining issues.

I don't think I am getting it right. pre-commit runs and removes the line if added, can you please help making this change and getting this through ?
image

Comment thread unstructured/__version__.py Outdated
Comment thread unstructured/__version__.py Outdated
@badGarnet badGarnet added this pull request to the merge queue Mar 21, 2025
Merged via the queue into Unstructured-IO:main with commit 3497281 Mar 21, 2025
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug/unable to process .docx file from office 365

3 participants