Skip to content

fix: handle Document with content=None in DocumentLanguageClassifier#11419

Merged
julian-risch merged 4 commits into
deepset-ai:mainfrom
rautaditya2606:fix/document-language-classifier-none-content
May 29, 2026
Merged

fix: handle Document with content=None in DocumentLanguageClassifier#11419
julian-risch merged 4 commits into
deepset-ai:mainfrom
rautaditya2606:fix/document-language-classifier-none-content

Conversation

@rautaditya2606
Copy link
Copy Markdown
Contributor

Related Issues

Proposed Changes

  • Guard Document entries with content=None in DocumentLanguageClassifier._detect_language() to log a warning and return unmatched instead of crashing.
  • Added regression test for the None content case.
  • Added release note.

How did you test it?

  • hatch run test:unit test/components/classifiers/test_document_language_classifier.py -k none_content

Notes for the reviewer

  • Keeps behavior consistent with existing unmatched handling when LangDetectException is raised.
  • Document.content is explicitly allowed to be None (blob-only documents), so this is a valid edge case the classifier should handle gracefully.

Checklist

Copilot AI review requested due to automatic review settings May 27, 2026 10:51
@rautaditya2606 rautaditya2606 requested a review from a team as a code owner May 27, 2026 10:51
@rautaditya2606 rautaditya2606 requested review from julian-risch and removed request for a team May 27, 2026 10:51
@vercel
Copy link
Copy Markdown

vercel Bot commented May 27, 2026

@rautaditya2606 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR prevents DocumentLanguageClassifier from crashing when a Document has content=None, treating it as “unmatched” and logging a warning.

Changes:

  • Add an early-return in language detection when Document.content is None (with a warning log).
  • Add a regression test to ensure None content is marked as unmatched and emits a warning.
  • Add release notes describing the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
test/components/classifiers/test_document_language_classifier.py Adds a regression test for content=None behavior.
haystack/components/classifiers/document_language_classifier.py Handles Document.content is None without crashing and logs a warning.
releasenotes/notes/document-language-classifier-none-content-a245d044b02a19b0.yaml Documents the fix in the release notes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +114 to 121
if document.content is None:
logger.warning(
"Langdetect cannot detect the language of Document with id: {document_id}", document_id=document.id
)
return language
try:
language = langdetect.detect(document.content)
except langdetect.LangDetectException:
@vercel
Copy link
Copy Markdown

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
haystack-docs Ignored Ignored Preview May 29, 2026 10:12am

Request Review

Copy link
Copy Markdown
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thank you for your contribution @rautaditya2606 !
I made two small edits. I thought about another edit to address that we are now emitting the same warning from two branches and we could reduce code duplication with something like the following. However, I decided against that change because it doesn't really reduce overall complexity of the code.

def _detect_language(self, document: Document) -> str | None:
    language = None
    if document.content is not None:
        try:
            language = langdetect.detect(document.content)
        except langdetect.LangDetectException:
            pass
    if language is None:
        logger.warning(
            "Langdetect cannot detect the language of Document with id: {document_id}",
            document_id=document.id,
        )
    return language

@julian-risch julian-risch enabled auto-merge (squash) May 29, 2026 10:12
@julian-risch julian-risch merged commit cdeec75 into deepset-ai:main May 29, 2026
23 checks passed
@github-actions
Copy link
Copy Markdown
Contributor

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  haystack/components/classifiers
  document_language_classifier.py
Project Total  

This report was generated by python-coverage-comment-action

@rautaditya2606
Copy link
Copy Markdown
Contributor Author

Thanks for the review and the edits!

I also considered consolidating the warning logic similarly, but I agree that the current structure keeps the control flow a bit clearer without adding extra branching complexity. Appreciate the feedback and the merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: DocumentLanguageClassifier crashes with TypeError when Document has content=None

3 participants