Commit cefa3e2

Authored by Seth-Peters, committed by davidsbatista
fix: LLMMetadataExtractor bug in handling Document objects with no content
* test(extractors): Add unit test for LLMMetadataExtractor with no content

  Adds a new unit test `test_run_with_document_content_none` to `TestLLMMetadataExtractor`. This test verifies that `LLMMetadataExtractor` correctly handles documents where `document.content` is None or an empty string. It ensures that:
  - Such documents are added to the `failed_documents` list.
  - The correct error message ("Document has no content, skipping LLM call.") is present in their metadata.
  - No actual LLM call is attempted for these documents.

  This test provides coverage for the fix that prevents an AttributeError when processing documents with no content.

* chore: update comment to reflect new behavior in _run_on_thread method
* docs: Add release note for LLMMetadataExtractor no content fix
* Update releasenotes/notes/fix-llm-metadata-extractor-no-content-910067ea72094f18.yaml
* Update fix-llm-metadata-extractor-no-content-910067ea72094f18.yaml

Co-authored-by: David S. Batista <dsbatista@gmail.com>
1 parent 9982c0e · commit cefa3e2

3 files changed: 44 additions and 2 deletions

haystack/components/extractors/llm_metadata_extractor.py

Lines changed: 2 additions and 2 deletions

```diff
@@ -256,9 +256,9 @@ def _prepare_prompts(
         return all_prompts

     def _run_on_thread(self, prompt: Optional[ChatMessage]) -> Dict[str, Any]:
-        # If prompt is None, return an empty dictionary
+        # If prompt is None, return an error dictionary
         if prompt is None:
-            return {"replies": ["{}"]}
+            return {"error": "Document has no content, skipping LLM call."}

         try:
             result = self._chat_generator.run(messages=[prompt])
```
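The error dictionary returned above is presumably consumed by the component's run loop, which routes such documents into `failed_documents` and records the message under `metadata_extraction_error`. A minimal self-contained sketch of that pattern, not the actual Haystack implementation (the `Doc`, `run_on_thread`, and `extract` names here are hypothetical stand-ins):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    # Hypothetical stand-in for haystack's Document
    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


def run_on_thread(prompt: Optional[str]) -> Dict[str, Any]:
    # Mirrors the patched behavior: no prompt -> error dict, no LLM call
    if prompt is None:
        return {"error": "Document has no content, skipping LLM call."}
    return {"replies": ["{}"]}  # placeholder for a real LLM response


def extract(docs: List[Doc]) -> Dict[str, List[Doc]]:
    ok, failed = [], []
    for doc in docs:
        # None and "" both produce no prompt, so both skip the LLM
        prompt = doc.content if doc.content else None
        result = run_on_thread(prompt)
        if "error" in result:
            doc.meta["metadata_extraction_error"] = result["error"]
            failed.append(doc)
        else:
            ok.append(doc)
    return {"documents": ok, "failed_documents": failed}
```

Before the fix, returning `{"replies": ["{}"]}` for a missing prompt made an empty document look like a successful (but empty) extraction; the error dict makes the failure explicit and routable.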
releasenotes/notes/fix-llm-metadata-extractor-no-content-910067ea72094f18.yaml

Lines changed: 8 additions and 0 deletions

```diff
@@ -0,0 +1,8 @@
+---
+fixes:
+  - |
+    Fixed a bug in the `LLMMetadataExtractor` that occurred when
+    processing `Document` objects with `None` or empty string content. The
+    component now gracefully handles these cases by marking such documents as
+    failed and providing an appropriate error message in their metadata, without
+    attempting an LLM call.
```

test/components/extractors/test_llm_metadata_extractor.py

Lines changed: 34 additions and 0 deletions

```diff
@@ -219,6 +219,40 @@ def test_run_no_documents(self, monkeypatch):
         assert result["documents"] == []
         assert result["failed_documents"] == []

+    def test_run_with_document_content_none(self, monkeypatch):
+        monkeypatch.setenv("OPENAI_API_KEY", "test-api-key")
+        # Mock the chat generator to prevent actual LLM calls
+        mock_chat_generator = Mock(spec=OpenAIChatGenerator)
+
+        extractor = LLMMetadataExtractor(
+            prompt="prompt {{document.content}}", chat_generator=mock_chat_generator, expected_keys=["some_key"]
+        )
+
+        # Document with None content
+        doc_with_none_content = Document(content=None)
+        # also test with empty string content
+        doc_with_empty_content = Document(content="")
+        docs = [doc_with_none_content, doc_with_empty_content]
+
+        result = extractor.run(documents=docs)
+
+        # Assert that the documents are in failed_documents
+        assert len(result["documents"]) == 0
+        assert len(result["failed_documents"]) == 2
+
+        failed_doc_none = result["failed_documents"][0]
+        assert failed_doc_none.id == doc_with_none_content.id
+        assert "metadata_extraction_error" in failed_doc_none.meta
+        assert failed_doc_none.meta["metadata_extraction_error"] == "Document has no content, skipping LLM call."
+
+        failed_doc_empty = result["failed_documents"][1]
+        assert failed_doc_empty.id == doc_with_empty_content.id
+        assert "metadata_extraction_error" in failed_doc_empty.meta
+        assert failed_doc_empty.meta["metadata_extraction_error"] == "Document has no content, skipping LLM call."
+
+        # Ensure no attempt was made to call the LLM
+        mock_chat_generator.run.assert_not_called()
+
     @pytest.mark.integration
     @pytest.mark.skipif(
         not os.environ.get("OPENAI_API_KEY", None),
```
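The key mocking technique in this test is `Mock(spec=...)` combined with `assert_not_called()`: the spec'd mock stands in for the chat generator, and the assertion proves the code under test never reached it. A stripped-down illustration of that pattern outside Haystack (the `FakeGenerator` class and `guarded_call` function are hypothetical, written only to demonstrate the technique):

```python
from unittest.mock import Mock


class FakeGenerator:
    # Hypothetical interface mirroring a chat generator's run(messages=...)
    def run(self, messages):
        return {"replies": ["{}"]}


def guarded_call(generator, content):
    # Skip the generator entirely when there is no content to prompt with
    if not content:
        return {"error": "Document has no content, skipping LLM call."}
    return generator.run(messages=[content])


# spec=FakeGenerator makes the mock reject attributes FakeGenerator lacks,
# so a typo like mock_generator.runn would raise instead of silently passing
mock_generator = Mock(spec=FakeGenerator)
result = guarded_call(mock_generator, "")
mock_generator.run.assert_not_called()  # the guard short-circuited the call
```

Using `spec=` is what makes `assert_not_called()` trustworthy here: without it, any misspelled attribute access on the mock would also "succeed", and the test could pass for the wrong reason.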
