fix(classifiers): guard against content=None in DocumentLanguageClassifier (fixes #11418) #11425

Closed
devteamaegis wants to merge 2 commits into deepset-ai:main from devteamaegis:fix/document-language-classifier-none-content

@devteamaegis

Summary

Fixes #11418: DocumentLanguageClassifier crashes with a TypeError when a Document has content=None.

Before:

def _detect_language(self, document: Document) -> str | None:
    language = None
    try:
        language = langdetect.detect(document.content)  # TypeError if content is None
    except langdetect.LangDetectException:
        ...

langdetect.detect(None) raises TypeError, which is not caught by the LangDetectException handler. The exception propagates to the caller, crashing the pipeline. This affects any blob-only Document (e.g. images, PDFs loaded without text extraction) since Document.content is explicitly allowed to be None.

After: an explicit None guard is added before calling langdetect.detect(). Documents with content=None log a warning and return None (which causes run() to route them to "unmatched"), consistent with existing behaviour for text that langdetect fails to detect.
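The guard described above can be sketched in isolation. This is a minimal illustration, not the PR's actual diff: `Doc`, `stub_detect`, `DetectionError`, and `detect_language` are hypothetical stand-ins for `Document`, `langdetect.detect`, `langdetect.LangDetectException`, and `_detect_language`, so the snippet runs without Haystack or langdetect installed. The warning messages are also illustrative.

```python
import logging
import uuid
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("language_classifier_sketch")


class DetectionError(Exception):
    """Hypothetical stand-in for langdetect.LangDetectException."""


def stub_detect(text):
    """Hypothetical stand-in for langdetect.detect.

    Mimics the behaviour reported in the PR: a non-string input raises
    TypeError, which is a different exception type from DetectionError.
    """
    if not isinstance(text, str):
        raise TypeError("expected a string")
    if not text.strip():
        raise DetectionError("no features in text")
    # Toy heuristic, for illustration only.
    return "de" if "und" in text.split() else "en"


@dataclass
class Doc:
    """Hypothetical minimal document; content may be None for blob-only docs."""
    content: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def detect_language(document: Doc) -> Optional[str]:
    # Guard first: never pass None to the detector.
    if document.content is None:
        logger.warning("Document %s has no text content; skipping detection.", document.id)
        return None
    try:
        return stub_detect(document.content)
    except DetectionError:
        # Same outcome as the guard: warn and return None.
        logger.warning("Could not detect the language of document %s.", document.id)
        return None
```

Without the guard, `stub_detect(None)` would raise `TypeError`, and the `except DetectionError` clause would not catch it, which mirrors the crash reported in the issue.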

Changes

haystack/components/classifiers/document_language_classifier.py

  • Added if document.content is None: guard at the top of _detect_language
  • Logs a warning including the document ID (same pattern as the LangDetectException branch)
  • Returns None so the caller routes the document to "unmatched"
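How a `None` result ends up as "unmatched" can be sketched as follows. This assumes the caller stores the label under `meta["language"]` and falls back to "unmatched" for anything outside the configured languages; `Doc`, `label_documents`, and `toy_detect` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical minimal document with optional text and a metadata dict."""
    content: Optional[str] = None
    meta: Dict[str, str] = field(default_factory=dict)


def label_documents(
    documents: List[Doc],
    detect: Callable[[Doc], Optional[str]],
    languages: Sequence[str] = ("en",),
) -> List[Doc]:
    """Label each document, falling back to 'unmatched'."""
    for doc in documents:
        detected = detect(doc)
        # None (no content, or detection failed) is never in `languages`,
        # so it takes the same fallback path as an unsupported language.
        doc.meta["language"] = detected if detected in languages else "unmatched"
    return documents


def toy_detect(doc: Doc) -> Optional[str]:
    """Hypothetical detector: None for missing content, 'en' otherwise."""
    if doc.content is None:
        return None
    return "en"


batch = label_documents(
    [Doc(content=None), Doc(content="This is English text.")],
    toy_detect,
)
labels = [doc.meta["language"] for doc in batch]
```

Because the fallback is handled in one place, the detection helper only has to return `None`; it does not need to know about the "unmatched" label.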

test/components/classifiers/test_document_language_classifier.py
Three new tests:

| Test | Assertion |
| --- | --- |
| test_content_none_does_not_raise | run([Document(content=None)]) must not raise; the document gets language="unmatched" |
| test_content_none_emits_warning | A warning containing the document ID is logged |
| test_mixed_none_and_text_content | In a batch with one None-content doc and one text doc, both are classified correctly |

Test plan

  • All 10 tests in test_document_language_classifier.py pass: uv run --with langdetect --with pytest python -m pytest test/components/classifiers/test_document_language_classifier.py -v (10 passed)

@devteamaegis devteamaegis requested a review from a team as a code owner May 28, 2026 08:19
@devteamaegis devteamaegis requested review from sjrl and removed request for a team May 28, 2026 08:19
@vercel

vercel Bot commented May 28, 2026

@devteamaegis is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions Bot added the topic:tests and type:documentation labels May 28, 2026
@sjrl
Contributor

sjrl commented May 28, 2026

Closing as duplicate of #11419

@sjrl sjrl closed this May 28, 2026

Development

Successfully merging this pull request may close these issues.

bug: DocumentLanguageClassifier crashes with TypeError when Document has content=None

3 participants