docs: add Presidio component docs pages #11165

Merged
sjrl merged 7 commits into deepset-ai:main from SyedShahmeerAli12:docs/presidio-components
Apr 24, 2026

@SyedShahmeerAli12 SyedShahmeerAli12 commented Apr 21, 2026

Related Issues

Proposed Changes

  • Adds separate docs pages per component:
    • preprocessors/presidiodocumentcleaner.mdx — PresidioDocumentCleaner
    • preprocessors/presidiotextcleaner.mdx — PresidioTextCleaner
    • extractors/presidioentityextractor.mdx — PresidioEntityExtractor (import path: haystack_integrations.components.extractors.presidio)
  • Adds PresidioDocumentCleaner and PresidioTextCleaner rows to preprocessors.mdx
  • Adds PresidioEntityExtractor row to extractors.mdx
  • Updates sidebars.js and version-2.28-sidebars.json
  • Mirrors all changes in versioned_docs/version-2.28

Checklist

Adds documentation for PresidioDocumentCleaner, PresidioTextCleaner,
and PresidioEntityExtractor under the Preprocessors section.

Related: deepset-ai/haystack-core-integrations#3063
@SyedShahmeerAli12 SyedShahmeerAli12 requested a review from a team as a code owner April 21, 2026 14:53
@SyedShahmeerAli12 SyedShahmeerAli12 requested review from bogdankostic and removed request for a team April 21, 2026 14:53

sjrl commented Apr 22, 2026

@bogdankostic I can take this one since I reviewed the integration PR

@sjrl sjrl requested review from sjrl and removed request for bogdankostic April 22, 2026 06:59
@sjrl sjrl self-assigned this Apr 22, 2026
Comment thread docs-website/docs/pipeline-components/preprocessors/presidio.mdx Outdated
Comment thread docs-website/docs/pipeline-components/preprocessors.mdx Outdated
Comment thread docs-website/docs/pipeline-components/preprocessors/presidio.mdx Outdated
@SyedShahmeerAli12 SyedShahmeerAli12 left a comment

Hi @sjrl — all three points are addressed in the latest commit:

  • Split the single combined presidio.mdx into three separate per-component files (presidiodocumentcleaner.mdx, presidiotextcleaner.mdx, presidioentityextractor.mdx)
  • Moved PresidioEntityExtractor to extractors/ and updated the import path to haystack_integrations.components.extractors.presidio
  • Removed PresidioEntityExtractor from preprocessors.mdx and added it to extractors.mdx

Both the current docs and versioned docs (version-2.28) are updated. Ready for re-review!

SyedShahmeerAli12

This comment was marked as resolved.

@SyedShahmeerAli12 SyedShahmeerAli12 changed the title from "docs: add Presidio preprocessors docs page" to "docs: add Presidio component docs pages" Apr 22, 2026
Comment thread docs-website/docs/pipeline-components/preprocessors.mdx
Comment thread docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx Outdated
Comment thread docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx Outdated
Comment thread docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx Outdated
Comment thread docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx Outdated
sjrl commented Apr 22, 2026

Please also update the docs pages for the two cleaner components with the same feedback I provided for the entity extractor.

@SyedShahmeerAli12 SyedShahmeerAli12 left a comment

@sjrl all comments from the latest review are addressed in the latest commit:

  • Added an ## Overview section explaining what Presidio is, what the extractor does (it is non-destructive: it stores PII as metadata rather than modifying the text), and when you'd want to use it
  • Updated the spaCy comment to clarify it's for English and that other languages need a different model
  • Moved ## Configuration before ## Usage
  • Added a link to Microsoft's list of supported entities to the entities row in the config table and removed the standalone sentence at the bottom
  • Added an intro sentence under ## Usage
  • Moved the Python config code block into a ### Using Custom Parameters subsection under Usage

Both current and versioned (version-2.28) docs are updated. Ready for re-review!
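
To illustrate the distinction drawn in the Overview item above (an extractor that records PII as metadata versus a cleaner that rewrites the text), here is a minimal sketch. It does not use Presidio or the Haystack components: a single email regex stands in for Presidio's analyzer, and the names `Doc`, `extract_entities`, and `clean_text`, as well as the metadata layout, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field

# Stand-in detector: one regex for email addresses. Presidio's analyzer
# covers many more entity types; this pattern is only for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Doc:
    """Hypothetical minimal document: text plus a metadata dict."""
    content: str
    meta: dict = field(default_factory=dict)


def extract_entities(doc: Doc) -> Doc:
    """Non-destructive: leave the text untouched, record matches in metadata."""
    entities = [
        {"entity_type": "EMAIL_ADDRESS", "start": m.start(), "end": m.end()}
        for m in EMAIL_RE.finditer(doc.content)
    ]
    return Doc(content=doc.content, meta={**doc.meta, "entities": entities})


def clean_text(doc: Doc) -> Doc:
    """Destructive: replace each match with a placeholder in the text."""
    redacted = EMAIL_RE.sub("<EMAIL_ADDRESS>", doc.content)
    return Doc(content=redacted, meta=dict(doc.meta))


doc = Doc("Contact jane@example.com for access.")
extracted = extract_entities(doc)  # text unchanged, spans stored in meta
cleaned = clean_text(doc)          # text rewritten, no spans stored
```

The extractor-style function suits cases where downstream steps still need the original text, for example to filter or route documents by the PII they contain. The cleaner-style function suits cases where the text itself must not carry the PII any further.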

Comment thread docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx Outdated
… Installation heading

Per sjrl review: removes the separate ## Installation section from all three
Presidio component pages and moves the pip install + spaCy download block into
the Usage section, right after the intro sentence. Also removes the "Unlike the
cleaner components" phrasing from PresidioEntityExtractor's Overview since it's
not clear in context on a standalone page.

Applied to both current docs and versioned docs (version-2.28).
SyedShahmeerAli12 commented Apr 23, 2026

@sjrl addressed both comments from the latest review:

  • Removed "Unlike the cleaner components," from the PresidioEntityExtractor Overview.
    It now reads: "The extractor does not modify the document text..." without the unclear reference.

  • Removed the standalone ## Installation section from all three component pages
    (PresidioEntityExtractor, PresidioDocumentCleaner, PresidioTextCleaner) and moved the install + spaCy download block into the ## Usage section, right after the intro sentence.

Both current docs and versioned docs (version-2.28) have been updated.

Review comment on the line "## Configuration":

Let's follow the same structure as the entity extractor page and put this Configuration section right after the Overview section.

Review comment on the line "See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types.":

You forgot to remove this line and to put the link in the configuration table. Make sure to do this for the text cleaner as well.

Review comment on the line "## Configuration":

Same comment here: put this right after the Overview section.

Review comment on the line "See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types.":

Same comment here: make sure to remove this line and add the link to the configuration table.

…pported entities link, add Using Custom Parameters subsection

Per sjrl review: adds ## Overview section to PresidioDocumentCleaner and
PresidioTextCleaner pages explaining what Presidio is and when to use the
component. Moves ## Configuration to right after Overview (before Usage),
adds supported entities link into the entities config table row (removing
standalone sentence at bottom), and moves the custom parameters code block
into a ### Using Custom Parameters subsection under Usage.

Applied to both current docs and versioned docs (version-2.28).
SyedShahmeerAli12 commented Apr 23, 2026

@sjrl addressed all 8 comments: added an Overview section, moved Configuration before Usage, added the supported entities link to the config table, and moved the custom parameters into Using Custom Parameters on both cleaner pages.

@sjrl sjrl left a comment

Thanks for the contribution!

@sjrl sjrl merged commit 08a029a into deepset-ai:main Apr 24, 2026
20 checks passed