
Commit 4dc0c7d

chore: deprecate Spacy NamedEntityExtractor and add docs (#11613)
1 parent ba27d46 commit 4dc0c7d

10 files changed

Lines changed: 264 additions & 35 deletions


docs-website/docs/pipeline-components/extractors.mdx

Lines changed: 2 additions & 0 deletions
@@ -13,3 +13,5 @@ slug: "/extractors"
1313
| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
1414
| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. |
1515
| [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
16+
| [SpacyNamedEntityExtractor](extractors/spacynamedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. Uses a spaCy model. |
17+
| [TransformersNamedEntityExtractor](extractors/transformersnamedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. Uses a Hugging Face model. |

docs-website/docs/pipeline-components/extractors/namedentityextractor.mdx

Lines changed: 13 additions & 8 deletions
@@ -11,7 +11,10 @@ This component extracts predefined entities out of a piece of text and writes th
1111

1212
:::warning[Deprecated]
1313

14-
`NamedEntityExtractor` is deprecated and will be removed in Haystack 3.0. It has moved to the `transformers-haystack` package and was renamed to `TransformersNamedEntityExtractor`. See [TransformersNamedEntityExtractor](transformersnamedentityextractor.mdx) for the updated documentation.
14+
`NamedEntityExtractor` is deprecated and will be removed in Haystack 3.0. It has moved to dedicated Core Integrations packages depending on the backend:
15+
16+
- Hugging Face backend: `transformers-haystack` package, renamed to `TransformersNamedEntityExtractor`. See [TransformersNamedEntityExtractor](transformersnamedentityextractor.mdx) for the updated documentation.
17+
- spaCy backend: `spacy-haystack` package, renamed to `SpacyNamedEntityExtractor`. See [SpacyNamedEntityExtractor](spacynamedentityextractor.mdx) for the updated documentation.
1518

1619
:::
1720

@@ -65,16 +68,16 @@ documents = [
6568
Document(content="New York State is home to the Empire State Building."),
6669
]
6770

68-
extractor.run(documents)
69-
print(documents)
71+
result = extractor.run(documents)
72+
print(result["documents"])
7073
```
7174

7275
Here is the example result:
7376

7477
```python
75-
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=0.99641764), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=0.996198), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=0.9990196)]}),
76-
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=0.99054915)]}),
77-
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=0.9989541), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=0.95746297)]})]
78+
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=np.float32(0.99641764)), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=np.float32(0.996198)), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=np.float32(0.9990196))]}),
79+
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=np.float32(0.99054915))]}),
80+
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=np.float32(0.9989541)), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=np.float32(0.9574631))]})]
7881
```
7982

8083
### Get stored annotations
@@ -93,9 +96,11 @@ documents = [
9396
Document(content="New York State is home to the Empire State Building."),
9497
]
9598

96-
extractor.run(documents)
99+
result = extractor.run(documents)
97100

98-
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in documents]
101+
annotations = [
102+
NamedEntityExtractor.get_stored_annotations(doc) for doc in result["documents"]
103+
]
99104
print(annotations)
100105

101106
# If a Document doesn't contain any annotations, this returns None.
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
---
2+
title: "SpacyNamedEntityExtractor"
3+
id: spacynamedentityextractor
4+
slug: "/spacynamedentityextractor"
5+
description: "This component extracts predefined entities out of a piece of text and writes them into documents’ meta field."
6+
---
7+
8+
# SpacyNamedEntityExtractor
9+
10+
This component extracts predefined entities out of a piece of text and writes them into documents’ meta field.
11+
12+
<div className="key-value-table">
13+
14+
| | |
15+
| --- | --- |
16+
| **Most common position in a pipeline** | After the [PreProcessor](../preprocessors.mdx) in an indexing pipeline or after a [Retriever](../retrievers.mdx) in a query pipeline |
17+
| **Mandatory init variables** | `model`: Name or path of the spaCy model to use |
18+
| **Mandatory run variables** | `documents`: A list of documents |
19+
| **Output variables** | `documents`: A list of documents |
20+
| **API reference** | [Spacy](/reference/integrations-spacy) |
21+
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/spacy |
22+
| **Package name** | `spacy-haystack` |
23+
24+
</div>
25+
26+
## Overview
27+
28+
`SpacyNamedEntityExtractor` looks for entities, which are spans in the text. The extractor automatically recognizes and groups them depending on their class, such as people's names, organizations, locations, and other types. The exact classes are determined by the model that you initialize the component with.
29+
30+
`SpacyNamedEntityExtractor` takes a list of documents as input and returns a list of the same documents with their `meta` data enriched with `NamedEntityAnnotations`. A `NamedEntityAnnotation` consists of the type of the entity and the start and end of the span, for example: `NamedEntityAnnotation(entity='PERSON', start=11, end=16, score=None)`.
31+
32+
When the `SpacyNamedEntityExtractor` is initialized, you need to set a `model`. Optionally, you can set `pipeline_kwargs`, which are then passed on to the spaCy pipeline. You can additionally set the `device` that is used to run the component.
33+
34+
## Usage
35+
36+
Install the `spacy-haystack` package to use the `SpacyNamedEntityExtractor`:
37+
38+
```shell
39+
pip install spacy-haystack
40+
```
41+
42+
The component works with any [spaCy model](https://spacy.io/models) that contains an NER component.
43+
44+
`SpacyNamedEntityExtractor` accepts a list of `Documents` as its input. The extractor annotates the raw text in the documents and stores the annotations in the document's `meta` dictionary under the `named_entities` key.
45+
46+
```python
47+
from haystack.dataclasses import Document
48+
from haystack_integrations.components.extractors.spacy import (
49+
SpacyNamedEntityExtractor,
50+
)
51+
52+
extractor = SpacyNamedEntityExtractor(model="en_core_web_sm")
53+
54+
documents = [
55+
Document(content="My name is Clara and I live in Berkeley, California."),
56+
Document(content="I'm Merlin, the happy pig!"),
57+
Document(content="New York State is home to the Empire State Building."),
58+
]
59+
60+
result = extractor.run(documents)
61+
print(result["documents"])
62+
```
63+
64+
Here is the example result:
65+
66+
```python
67+
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PERSON', start=11, end=16, score=None), NamedEntityAnnotation(entity='GPE', start=31, end=39, score=None), NamedEntityAnnotation(entity='GPE', start=41, end=51, score=None)]}),
68+
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PERSON', start=4, end=10, score=None)]}),
69+
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='GPE', start=0, end=14, score=None), NamedEntityAnnotation(entity='ORG', start=26, end=51, score=None)]})]
70+
```
71+
72+
### Get stored annotations
73+
74+
This component includes the `get_stored_annotations` helper class method that allows you to retrieve the annotations stored in a `Document` transparently:
75+
76+
```python
77+
from haystack.dataclasses import Document
78+
from haystack_integrations.components.extractors.spacy import (
79+
SpacyNamedEntityExtractor,
80+
)
81+
82+
extractor = SpacyNamedEntityExtractor(model="en_core_web_sm")
83+
84+
documents = [
85+
Document(content="My name is Clara and I live in Berkeley, California."),
86+
Document(content="I'm Merlin, the happy pig!"),
87+
Document(content="New York State is home to the Empire State Building."),
88+
]
89+
90+
result = extractor.run(documents)
91+
92+
annotations = [
93+
SpacyNamedEntityExtractor.get_stored_annotations(doc) for doc in result["documents"]
94+
]
95+
print(annotations)
96+
97+
# If a Document doesn't contain any annotations, this returns None.
98+
new_doc = Document(content="In one of many possible worlds...")
99+
assert SpacyNamedEntityExtractor.get_stored_annotations(new_doc) is None
100+
```
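
The example result in the new page suggests that `start` and `end` are character offsets into the document's `content`, with `end` exclusive (for instance, `start=11, end=16` covers `Clara`). Under that assumption, a minimal sketch of recovering the surface text of each entity could look like the following. The `Annotation` dataclass and the `entity_texts` helper are hypothetical stand-ins introduced only for illustration; they are not part of `spacy-haystack`, and the sketch runs without the package installed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    """Hypothetical stand-in that mirrors the fields shown for NamedEntityAnnotation."""

    entity: str
    start: int
    end: int
    score: Optional[float] = None


def entity_texts(content: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Pair each entity label with the text it covers.

    Assumes start/end are character offsets and that end is exclusive,
    which matches the offsets in the example result above.
    """
    return [(a.entity, content[a.start : a.end]) for a in annotations]


content = "My name is Clara and I live in Berkeley, California."
# Offsets copied from the first document in the example result.
annotations = [
    Annotation("PERSON", 11, 16),
    Annotation("GPE", 31, 39),
    Annotation("GPE", 41, 51),
]

pairs = entity_texts(content, annotations)
print(pairs)
```

With the real component, the same slicing would presumably apply to the list returned by `get_stored_annotations(doc)` together with `doc.content`.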

docs-website/docs/pipeline-components/extractors/transformersnamedentityextractor.mdx

Lines changed: 8 additions & 7 deletions
@@ -59,16 +59,16 @@ documents = [
5959
Document(content="New York State is home to the Empire State Building."),
6060
]
6161

62-
extractor.run(documents)
63-
print(documents)
62+
result = extractor.run(documents)
63+
print(result["documents"])
6464
```
6565

6666
Here is the example result:
6767

6868
```python
69-
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=0.99641764), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=0.996198), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=0.9990196)]}),
70-
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=0.99054915)]}),
71-
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=0.9989541), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=0.95746297)]})]
69+
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=np.float32(0.99641764)), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=np.float32(0.996198)), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=np.float32(0.9990196))]}),
70+
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=np.float32(0.99054915))]}),
71+
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=np.float32(0.9989541)), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=np.float32(0.9574631))]})]
7272
```
7373

7474
### Get stored annotations
@@ -89,10 +89,11 @@ documents = [
8989
Document(content="New York State is home to the Empire State Building."),
9090
]
9191

92-
extractor.run(documents)
92+
result = extractor.run(documents)
9393

9494
annotations = [
95-
TransformersNamedEntityExtractor.get_stored_annotations(doc) for doc in documents
95+
TransformersNamedEntityExtractor.get_stored_annotations(doc)
96+
for doc in result["documents"]
9697
]
9798
print(annotations)
9899

docs-website/versioned_docs/version-2.30/pipeline-components/extractors.mdx

Lines changed: 2 additions & 0 deletions
@@ -13,3 +13,5 @@ slug: "/extractors"
1313
| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
1414
| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. |
1515
| [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
16+
| [SpacyNamedEntityExtractor](extractors/spacynamedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. Uses a spaCy model. |
17+
| [TransformersNamedEntityExtractor](extractors/transformersnamedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. Uses a Hugging Face model. |

docs-website/versioned_docs/version-2.30/pipeline-components/extractors/namedentityextractor.mdx

Lines changed: 13 additions & 8 deletions
@@ -11,7 +11,10 @@ This component extracts predefined entities out of a piece of text and writes th
1111

1212
:::warning[Deprecated]
1313

14-
`NamedEntityExtractor` is deprecated and will be removed in Haystack 3.0. It has moved to the `transformers-haystack` package and was renamed to `TransformersNamedEntityExtractor`. See [TransformersNamedEntityExtractor](transformersnamedentityextractor.mdx) for the updated documentation.
14+
`NamedEntityExtractor` is deprecated and will be removed in Haystack 3.0. It has moved to dedicated Core Integrations packages depending on the backend:
15+
16+
- Hugging Face backend: `transformers-haystack` package, renamed to `TransformersNamedEntityExtractor`. See [TransformersNamedEntityExtractor](transformersnamedentityextractor.mdx) for the updated documentation.
17+
- spaCy backend: `spacy-haystack` package, renamed to `SpacyNamedEntityExtractor`. See [SpacyNamedEntityExtractor](spacynamedentityextractor.mdx) for the updated documentation.
1518

1619
:::
1720

@@ -65,16 +68,16 @@ documents = [
6568
Document(content="New York State is home to the Empire State Building."),
6669
]
6770

68-
extractor.run(documents)
69-
print(documents)
71+
result = extractor.run(documents)
72+
print(result["documents"])
7073
```
7174

7275
Here is the example result:
7376

7477
```python
75-
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=0.99641764), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=0.996198), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=0.9990196)]}),
76-
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=0.99054915)]}),
77-
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=0.9989541), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=0.95746297)]})]
78+
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=11, end=16, score=np.float32(0.99641764)), NamedEntityAnnotation(entity='LOC', start=31, end=39, score=np.float32(0.996198)), NamedEntityAnnotation(entity='LOC', start=41, end=51, score=np.float32(0.9990196))]}),
79+
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PER', start=4, end=10, score=np.float32(0.99054915))]}),
80+
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='LOC', start=0, end=14, score=np.float32(0.9989541)), NamedEntityAnnotation(entity='LOC', start=30, end=51, score=np.float32(0.9574631))]})]
7881
```
7982

8083
### Get stored annotations
@@ -93,9 +96,11 @@ documents = [
9396
Document(content="New York State is home to the Empire State Building."),
9497
]
9598

96-
extractor.run(documents)
99+
result = extractor.run(documents)
97100

98-
annotations = [NamedEntityExtractor.get_stored_annotations(doc) for doc in documents]
101+
annotations = [
102+
NamedEntityExtractor.get_stored_annotations(doc) for doc in result["documents"]
103+
]
99104
print(annotations)
100105

101106
# If a Document doesn't contain any annotations, this returns None.
