fix: handle Document with content=None in DocumentLanguageClassifier#11419
Conversation
|
@rautaditya2606 is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR prevents DocumentLanguageClassifier from crashing when a Document has content=None, treating it as “unmatched” and logging a warning.
Changes:
- Add an early-return in language detection when
Document.contentisNone(with a warning log). - Add a regression test to ensure
Nonecontent is marked asunmatchedand emits a warning. - Add release notes describing the fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/components/classifiers/test_document_language_classifier.py | Adds a regression test for content=None behavior. |
| haystack/components/classifiers/document_language_classifier.py | Handles Document.content is None without crashing and logs a warning. |
| releasenotes/notes/document-language-classifier-none-content-a245d044b02a19b0.yaml | Documents the fix in the release notes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if document.content is None: | ||
| logger.warning( | ||
| "Langdetect cannot detect the language of Document with id: {document_id}", document_id=document.id | ||
| ) | ||
| return language | ||
| try: | ||
| language = langdetect.detect(document.content) | ||
| except langdetect.LangDetectException: |
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
julian-risch
left a comment
There was a problem hiding this comment.
Looks good to me! Thank you for your contribution @rautaditya2606 !
I made two small edits. I thought about another edit to address that we are now emitting the same warning from two branches and we could reduce code duplication with something like the following. However, I decided against that change because it doesn't really reduce overall complexity of the code.
def _detect_language(self, document: Document) -> str | None:
language = None
if document.content is not None:
try:
language = langdetect.detect(document.content)
except langdetect.LangDetectException:
pass
if language is None:
logger.warning(
"Langdetect cannot detect the language of Document with id: {document_id}",
document_id=document.id,
)
return language
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
|
Thanks for the review and the edits! I also considered consolidating the warning logic similarly, but I agree that the current structure keeps the control flow a bit clearer without adding extra branching complexity. Appreciate the feedback and the merge. |
Related Issues
Proposed Changes
Documententries withcontent=NoneinDocumentLanguageClassifier._detect_language()to log a warning and returnunmatchedinstead of crashing.Nonecontent case.How did you test it?
hatch run test:unit test/components/classifiers/test_document_language_classifier.py -k none_contentNotes for the reviewer
unmatchedhandling whenLangDetectExceptionis raised.Document.contentis explicitly allowed to beNone(blob-only documents), so this is a valid edge case the classifier should handle gracefully.Checklist