Add FERPAMetadataRetriever component for regulated higher-education RAG #11079

Closed
ashutoshrana wants to merge 1 commit into deepset-ai:main from ashutoshrana:add-ferpa-metadata-retriever

Conversation

@ashutoshrana

Summary

This PR adds a FERPAMetadataRetriever component to haystack/components/retrievers/ that enforces FERPA (Family Educational Rights and Privacy Act) identity boundaries in retrieval-augmented generation pipelines for higher-education deployments.

Problem

Higher-education teams building RAG systems on Haystack need to ensure that student records do not cross identity or institutional boundaries during retrieval. The naive approach — retrieval followed by post-filtering — is insufficient for regulated environments: unauthorized documents enter the candidate pool and reach the scoring layer before being discarded, leaving a gap if the post-filter has a defect.

The correct approach is to apply the identity constraint at the document store query (pre-filter), so unauthorized documents never enter the retrieval pipeline regardless of their semantic similarity to the query. Haystack's FilterRetriever provides the mechanism; this component wraps it with FERPA-specific semantics, validation, and documentation.

What this adds

haystack/components/retrievers/ferpa_metadata_retriever.py

  • FERPAMetadataRetriever — a @component-decorated retriever that accepts student_id and institution_id (at init or run time) and builds a compound AND filter over those metadata fields before querying the document store
  • Configurable field names (student_id_field, institution_id_field) for deployments using non-standard naming
  • run() and run_async() both implemented
  • to_dict() / from_dict() serialization following existing patterns
  • Pipeline-compatible: identity can be injected at run time from an upstream authentication component
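
A minimal sketch of the filter construction described above, written as plain Python so it runs without Haystack installed. The dictionary layout follows the Haystack 2.x filter syntax (`operator` / `conditions`, with metadata fields prefixed by `meta.`). The helper name `build_identity_filter` and the example identity values are hypothetical and are not taken from the PR's source.

```python
from typing import Any


def build_identity_filter(
    student_id: str,
    institution_id: str,
    student_id_field: str = "student_id",
    institution_id_field: str = "institution_id",
) -> dict[str, Any]:
    """Build a compound AND filter over the two identity metadata fields."""
    return {
        "operator": "AND",
        "conditions": [
            # Both conditions must hold for a document to be returned.
            {"field": f"meta.{student_id_field}", "operator": "==", "value": student_id},
            {"field": f"meta.{institution_id_field}", "operator": "==", "value": institution_id},
        ],
    }


# Default field names.
identity_filter = build_identity_filter("alice-001", "univ-east")

# Non-standard field names, as a deployment might configure them (assumed names).
custom_filter = build_identity_filter(
    "alice-001", "univ-east", student_id_field="sid", institution_id_field="campus"
)
```

A dictionary of this shape is what would be handed to the document store's `filter_documents(filters=...)` call, so the store itself does the exclusion.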

test/components/retrievers/test_ferpa_metadata_retriever.py

16 tests covering:

  • Initialization with and without identity values
  • Custom metadata field names
  • Serialization round-trip
  • Filter structure (AND operator, meta. field prefix)
  • Cross-student isolation (Alice's query cannot return Bob's records)
  • Cross-institution isolation (univ-east query cannot return univ-west records)
  • Runtime identity override (run-time values take precedence over init-time)
  • Missing-identity ValueError for both fields
  • Pipeline integration with init-time and run-time identity
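
The run-time override and missing-identity behaviour exercised by these tests could look roughly like the sketch below. The function name `resolve_identity`, the error message, and the choice to treat an empty string as missing are assumptions made for illustration; the PR text only states that run-time values take precedence and that a missing identity raises `ValueError`.

```python
from typing import Optional


def resolve_identity(init_value: Optional[str], run_value: Optional[str], name: str) -> str:
    """Prefer the run-time value, fall back to the init-time value, fail if neither is set."""
    value = run_value if run_value is not None else init_value
    if not value:
        # Refuse to query without an identity rather than fall back to an unfiltered search.
        raise ValueError(f"{name} must be provided at init time or at run time")
    return value


# Run-time value wins over the init-time value.
resolved = resolve_identity("alice-001", "bob-002", "student_id")

# Init-time value is used when no run-time value is supplied.
fallback = resolve_identity("univ-east", None, "institution_id")

# Neither value set: the call fails instead of returning unscoped results.
try:
    resolve_identity(None, None, "institution_id")
    missing_raises = False
except ValueError:
    missing_raises = True
```

Failing closed here matters: a retriever that silently dropped the missing condition would return records for every student.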

Design notes

The component enforces identity at query time rather than post-retrieval because:

  1. Unauthorized documents never enter the candidate set, so no semantic ranking of unauthorized content occurs
  2. The behavior is deterministic — a cross-identity result cannot appear regardless of embedding similarity
  3. The approach is consistent with how student information systems already implement field-level access control
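
The difference between the two orderings can be illustrated with a toy in-memory example. Everything here (the `Record` class, the word-overlap `score` function, the sample records) is hypothetical and stands in for a real document store and ranker; it only shows which records reach the scoring step in each approach.

```python
from dataclasses import dataclass


@dataclass
class Record:
    content: str
    student_id: str
    institution_id: str


RECORDS = [
    Record("Alice transcript: fall grades", "alice-001", "univ-east"),
    Record("Bob transcript: fall grades", "bob-002", "univ-east"),
    Record("Alice financial aid letter", "alice-001", "univ-east"),
]


def score(query: str, record: Record) -> int:
    """Toy relevance score: number of query words present in the content."""
    words = record.content.lower().replace(":", "").split()
    return sum(word in words for word in query.lower().split())


def post_filter_retrieve(query: str, student_id: str, institution_id: str):
    # Every record is scored first, so unauthorized content reaches the ranking step.
    scored = [r for r in RECORDS if score(query, r) > 0]
    allowed = [r for r in scored if (r.student_id, r.institution_id) == (student_id, institution_id)]
    return scored, allowed


def pre_filter_retrieve(query: str, student_id: str, institution_id: str):
    # The identity constraint defines the candidate set before any scoring happens.
    candidates = [r for r in RECORDS if (r.student_id, r.institution_id) == (student_id, institution_id)]
    return candidates, [r for r in candidates if score(query, r) > 0]


scored_post, results_post = post_filter_retrieve("fall grades", "alice-001", "univ-east")
candidates_pre, results_pre = pre_filter_retrieve("fall grades", "alice-001", "univ-east")
```

Both functions return the same final results for Alice, but in the post-filter version Bob's transcript is present in `scored_post`, which is exactly the exposure a defect in the later filter would leak.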

The component addresses the retrieval-layer isolation requirement only. Callers remain responsible for session authentication, document category authorization (e.g., excluding counseling notes), and 34 CFR § 99.32 disclosure logging.

FERPA context

FERPA (20 U.S.C. § 1232g; 34 CFR Part 99) governs access to education records in US higher-education institutions. Comparable record-scoping concerns arise under HIPAA's minimum-necessary standard for healthcare records and GLBA's Safeguards Rule for financial records, so the same retrieval-layer pattern can be applied in those settings.

Checklist

  • Component follows @component decorator pattern used throughout haystack/components/retrievers/
  • run() and run_async() both implemented with @component.output_types
  • to_dict() / from_dict() serialization implemented
  • Exported from haystack/components/retrievers/__init__.py via _import_structure
  • 16 tests including cross-student/institution isolation and pipeline integration
  • SPDX license headers present

Adds a new retriever component that enforces FERPA (Family Educational
Rights and Privacy Act) identity boundaries at the document store query
layer, before any ranking or scoring occurs.

## What this adds

`haystack/components/retrievers/ferpa_metadata_retriever.py`
- `FERPAMetadataRetriever` — wraps any `DocumentStore` and builds a
  compound AND filter from `student_id` + `institution_id` metadata fields
- Identity values can be set at init time (session-scoped) or at run
  time (per-query override), with run-time values taking precedence
- Configurable metadata field names via `student_id_field` and
  `institution_id_field` for deployments using non-standard field naming
- `run()` and `run_async()` both implemented
- Full `to_dict()` / `from_dict()` serialization
- Works in Haystack pipelines with run-time identity injection

`test/components/retrievers/test_ferpa_metadata_retriever.py`
- 16 tests covering: init, serialization, filter structure, cross-student
  isolation, cross-institution isolation, runtime identity override,
  missing-identity validation errors, and pipeline integration

## Design rationale

The identity filter is applied at the document store query (pre-filter),
not as a post-retrieval pass. This means unauthorized documents are never
surfaced in the candidate set regardless of their semantic similarity to
the query — which is the correct security posture for a regulated
record-access environment.

The compound AND filter matches both student_id AND institution_id. A
student from institution A cannot retrieve records from institution B
even if their student_id happens to match a record there.
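
A small sketch of the colliding-ID case, assuming a flat AND filter of equality conditions in the Haystack 2.x filter syntax. The `matches` helper is a hypothetical stand-in for the evaluation a document store performs internally, and the sample metadata is invented for illustration.

```python
def matches(meta: dict, identity_filter: dict) -> bool:
    """Evaluate a flat AND filter of equality conditions against document metadata."""
    if identity_filter["operator"] != "AND":
        raise ValueError("this sketch only handles a top-level AND filter")
    return all(
        # Strip the "meta." prefix to look the field up in the metadata dict.
        meta.get(cond["field"].removeprefix("meta.")) == cond["value"]
        for cond in identity_filter["conditions"]
    )


# The same student_id exists at two institutions.
documents = [
    {"student_id": "12345", "institution_id": "univ-east", "doc": "east transcript"},
    {"student_id": "12345", "institution_id": "univ-west", "doc": "west transcript"},
]

east_filter = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.student_id", "operator": "==", "value": "12345"},
        {"field": "meta.institution_id", "operator": "==", "value": "univ-east"},
    ],
}

visible = [d["doc"] for d in documents if matches(d, east_filter)]
```

Filtering on `student_id` alone would return both transcripts here; the second condition is what keeps the univ-west record out.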

## FERPA context

FERPA (20 U.S.C. § 1232g; 34 CFR Part 99) governs access to education
records in US higher-education systems. This component addresses the
retrieval-layer identity isolation requirement. Callers are responsible
for session authentication, document category authorization, and
34 CFR § 99.32 disclosure logging.
@ashutoshrana ashutoshrana requested a review from a team as a code owner April 11, 2026 16:38
@ashutoshrana ashutoshrana requested review from bogdankostic and removed request for a team April 11, 2026 16:38
@vercel

vercel Bot commented Apr 11, 2026

@ashutoshrana is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Apr 11, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions Bot added the topic:tests and type:documentation (Improvements on the docs) labels Apr 11, 2026
@ashutoshrana
Author

Closing for now — will reopen after internal review.
