Add FERPAMetadataRetriever component for regulated higher-education RAG #11079
Closed
ashutoshrana wants to merge 1 commit into deepset-ai:main
Adds a new retriever component that enforces FERPA (Family Educational Rights and Privacy Act) identity boundaries at the document store query layer, before any ranking or scoring occurs.

## What this adds

`haystack/components/retrievers/ferpa_metadata_retriever.py`

- `FERPAMetadataRetriever`: wraps any `DocumentStore` and builds a compound AND filter from `student_id` + `institution_id` metadata fields
- Identity values can be set at init time (session-scoped) or at run time (per-query override), with run-time values taking precedence
- Configurable metadata field names via `student_id_field` and `institution_id_field` for deployments using non-standard field naming
- `run()` and `run_async()` both implemented
- Full `to_dict()` / `from_dict()` serialization
- Works in Haystack pipelines with run-time identity injection

`test/components/retrievers/test_ferpa_metadata_retriever.py`

- 16 tests covering: init, serialization, filter structure, cross-student isolation, cross-institution isolation, runtime identity override, missing-identity validation errors, and pipeline integration

## Design rationale

The identity filter is applied at the document store query (pre-filter), not as a post-retrieval pass. This means unauthorized documents are never surfaced in the candidate set regardless of their semantic similarity to the query, which is the correct security posture for a regulated record-access environment.

The compound AND filter matches both `student_id` AND `institution_id`. A student from institution A cannot retrieve records from institution B even if their `student_id` happens to match a record there.

## FERPA context

FERPA (20 U.S.C. § 1232g; 34 CFR Part 99) governs access to education records in US higher-education systems. This component addresses the retrieval-layer identity isolation requirement. Callers are responsible for session authentication, document category authorization, and 34 CFR § 99.32 disclosure logging.
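The filter construction described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the helper name `build_identity_filter` and its parameter names are hypothetical, and the dictionary layout assumes Haystack 2.x filter syntax with `meta.`-prefixed field names.

```python
from typing import Any, Optional


def build_identity_filter(
    init_student_id: Optional[str] = None,
    init_institution_id: Optional[str] = None,
    run_student_id: Optional[str] = None,
    run_institution_id: Optional[str] = None,
    student_id_field: str = "student_id",
    institution_id_field: str = "institution_id",
) -> dict[str, Any]:
    """Hypothetical helper: resolve identity values and build the AND filter."""
    # Run-time values take precedence over init-time values.
    student_id = run_student_id if run_student_id is not None else init_student_id
    institution_id = (
        run_institution_id if run_institution_id is not None else init_institution_id
    )

    # Refuse to query without both identity values: a filter missing either
    # condition would no longer isolate one student at one institution.
    if not student_id:
        raise ValueError("student_id must be set at init or run time")
    if not institution_id:
        raise ValueError("institution_id must be set at init or run time")

    # Both conditions must hold for a document to be a retrieval candidate.
    return {
        "operator": "AND",
        "conditions": [
            {
                "field": f"meta.{student_id_field}",
                "operator": "==",
                "value": student_id,
            },
            {
                "field": f"meta.{institution_id_field}",
                "operator": "==",
                "value": institution_id,
            },
        ],
    }
```

A dictionary of this shape would be passed as the `filters` argument when querying the document store, so the identity constraint is evaluated by the store itself.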
@ashutoshrana is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it.

ashutoshrana (Author): Closing for now, will reopen after internal review.
## Summary

This PR adds a `FERPAMetadataRetriever` component to `haystack/components/retrievers/` that enforces FERPA (Family Educational Rights and Privacy Act) identity boundaries in retrieval-augmented generation pipelines for higher-education deployments.

## Problem

Higher-education teams building RAG systems on Haystack need to ensure that student records do not cross identity or institutional boundaries during retrieval. The naive approach — retrieval followed by post-filtering — is insufficient for regulated environments: unauthorized documents enter the candidate pool and reach the scoring layer before being discarded, leaving a gap if the post-filter has a defect.

The correct approach is to apply the identity constraint at the document store query (pre-filter), so unauthorized documents never enter the retrieval pipeline regardless of their semantic similarity to the query. Haystack's `FilterRetriever` provides the mechanism; this component wraps it with FERPA-specific semantics, validation, and documentation.

## What this adds

`haystack/components/retrievers/ferpa_metadata_retriever.py`

- `FERPAMetadataRetriever`: a `@component`-decorated retriever that accepts `student_id` and `institution_id` (at init or run time) and builds a compound AND filter over those metadata fields before querying the document store
- Configurable metadata field names (`student_id_field`, `institution_id_field`) for deployments using non-standard naming
- `run()` and `run_async()` both implemented
- `to_dict()` / `from_dict()` serialization following existing patterns

`test/components/retrievers/test_ferpa_metadata_retriever.py`

16 tests covering:

- … `meta.` field prefix)
- … `ValueError` for both fields

## Design notes

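The component enforces identity at query time rather than post-retrieval because unauthorized documents then never enter the candidate set, regardless of their semantic similarity to the query, and isolation does not depend on a post-filter being free of defects.
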
The component addresses the retrieval-layer isolation requirement only. Callers remain responsible for session authentication, document category authorization (e.g., excluding counseling notes), and 34 CFR § 99.32 disclosure logging.
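
The difference between the two orderings can be illustrated with a toy example. This is a sketch under stated assumptions: the `Record` class, the hard-coded scores, and both function names are hypothetical stand-ins and do not use Haystack or the PR's code.

```python
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    student_id: str
    institution_id: str
    score: float  # stand-in for a similarity score against the query


RECORDS = [
    Record("transcript A", "s1", "uni-a", 0.70),
    Record("transcript B", "s2", "uni-a", 0.95),
    Record("aid letter", "s1", "uni-b", 0.90),
    Record("advising note", "s1", "uni-a", 0.60),
]


def is_authorized(record: Record, student_id: str, institution_id: str) -> bool:
    # Compound AND: both identity fields must match.
    return record.student_id == student_id and record.institution_id == institution_id


def post_filter_retrieve(student_id: str, institution_id: str, top_k: int):
    # Rank everything first: unauthorized records enter the candidate pool.
    candidates = sorted(RECORDS, key=lambda r: r.score, reverse=True)[:top_k]
    kept = [r for r in candidates if is_authorized(r, student_id, institution_id)]
    return candidates, kept


def pre_filter_retrieve(student_id: str, institution_id: str, top_k: int):
    # Constrain the candidate pool first, then rank only authorized records.
    candidates = [r for r in RECORDS if is_authorized(r, student_id, institution_id)]
    return sorted(candidates, key=lambda r: r.score, reverse=True)[:top_k]
```

With `top_k=2` for student `s1` at `uni-a`, the post-filter variant's candidate pool consists of another student's transcript and another institution's aid letter, and nothing survives the filter. The pre-filter variant ranks only the two records that belong to that student at that institution.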

## FERPA context

FERPA (20 U.S.C. § 1232g; 34 CFR Part 99) governs access to education records in US higher-education institutions. Its requirements are directly analogous to HIPAA's minimum-necessary standard for healthcare records and GLBA's safeguards rule for financial records — the same retrieval-layer pattern applies to all three.

## Checklist

- `@component` decorator pattern used throughout `haystack/components/retrievers/`
- `run()` and `run_async()` both implemented with `@component.output_types`
- `to_dict()` / `from_dict()` serialization implemented
- … `haystack/components/retrievers/__init__.py` via `_import_structure`
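
The serialization round trip named in the checklist might be sketched as follows. This is an illustration only: `IdentityConfig` is a hypothetical stand-in for the component's init parameters, it omits the wrapped document store, and the `type` / `init_parameters` layout is assumed from the dictionary shape Haystack components commonly serialize to.

```python
from typing import Any, Optional


class IdentityConfig:
    """Hypothetical stand-in for the retriever's serializable init parameters."""

    def __init__(
        self,
        student_id: Optional[str] = None,
        institution_id: Optional[str] = None,
        student_id_field: str = "student_id",
        institution_id_field: str = "institution_id",
    ) -> None:
        self.student_id = student_id
        self.institution_id = institution_id
        self.student_id_field = student_id_field
        self.institution_id_field = institution_id_field

    def to_dict(self) -> dict[str, Any]:
        # Record the fully qualified class name plus everything needed to
        # rebuild the object through __init__.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "student_id": self.student_id,
                "institution_id": self.institution_id,
                "student_id_field": self.student_id_field,
                "institution_id_field": self.institution_id_field,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "IdentityConfig":
        # Rebuild the object from the stored init parameters.
        return cls(**data["init_parameters"])
```

A round trip such as `IdentityConfig.from_dict(config.to_dict())` should yield an object whose `to_dict()` output equals the original's, which is the property a serialization test would typically check.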