Facet counts and aggregations for Vector search by shanbady · Pull Request #3210 · mitodl/mit-learn

shanbady · 2026-04-15T20:04:13Z

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/10641

Description (What does it do?)

This PR adds aggregations and facet counts to the learning resources and contentfiles vector search endpoints. The request and response formats match the existing OpenSearch endpoint one-to-one, so the UI can simply switch endpoints and work with either.

How can this be tested?

Check out this branch.
Make sure you have resources and embeddings loaded (at least for learning resources). If not, run `python manage.py generate_embeddings --courses --skip-contentfiles --recreate-collections`
Log in as a root/admin user locally.
Visit your local search page. On the left panel you should see an option to enable "vector hybrid search". Once it is enabled, try a few searches; facets and counts should work as they do with the existing search endpoints.

Additional Fixes

A bug specific to the contentfiles endpoint was caused by vector_search.constants.CONTENT_FILES_RETRIEVE_PAYLOAD being set to only ["key", "run_readable_id"]. Unlike the learning resources endpoint, where we expect a 1-1 mapping between items in the qdrant collection and learning resources in the database, the vector search collection contains many items that do not map to contentfiles in the database (like the course metadata document). In addition, the vector endpoint returns "content_chunk" instead of "content" in the payload response.

The fix was to revert to the previous payload behavior for just the contentfile collection by setting CONTENT_FILES_RETRIEVE_PAYLOAD to True, so that the full payload is returned from qdrant and is available as a fallback during serialization.
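The fallback behavior can be sketched roughly as follows. This is an illustration only, not the code in this PR: `serialize_hits`, `db_records`, and the dict shapes are hypothetical. The only details taken from the description are the `key` and `run_readable_id` lookup fields and the idea that unmatched points fall back to the raw qdrant payload.

```python
from typing import Any


def serialize_hits(
    payloads: list[dict[str, Any]],
    db_records: dict[tuple[str, str], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Prefer the database record; otherwise fall back to the raw payload."""
    results = []
    for payload in payloads:
        lookup = (payload.get("key"), payload.get("run_readable_id"))
        record = db_records.get(lookup)
        # Points such as the course metadata document have no database row,
        # so the full payload is the only data available for them.
        results.append(record if record is not None else payload)
    return results


# Hypothetical sample data: one point with a database match, one without.
payloads = [
    {"key": "a.pdf", "run_readable_id": "run-1", "content_chunk": "chunk text"},
    {"key": "test123", "run_readable_id": "test123", "content_chunk": "orphan"},
]
db_records = {("a.pdf", "run-1"): {"key": "a.pdf", "content": "full text"}}
results = serialize_hits(payloads, db_records)
```

With a restricted payload such as ["key", "run_readable_id"], the second entry would carry only those two fields, which matches the symptom described above.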

The fix is in this one commit

Testing the new bug fix

The easiest way to test this issue+fix is:

Check out the commit just before the bug fix: git checkout 9b381d3d528136dd5aee7f16606b4b210c608298 (this is a commit on main that has the issue, just prior to my reverting the changes that caused it).
Make sure you have some contentfiles embedded locally.
Visit your local qdrant contentfiles collection in the dashboard.
Find some random point in the collection.
Click the "edit payload" button.
Change the "resource_readable_id", "key", and "run_readable_id" to "test123" (so they don't map to any real contentfile in the database) and save.
Visit the contentfiles endpoint (include some query q= param and key=test123) link
Note that it only returns 1 record with the "key" attribute.
Check out this branch, restart your web container, visit the same link, and note that the full payload from qdrant is returned.
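For convenience, the test URL from the steps above could be assembled like this. The host and path are borrowed from the reviewer's URL later in this thread and may differ on your machine; the `q` value is an arbitrary placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "http://open.odl.local:8065"  # assumed local host; adjust to your setup
ENDPOINT = "/api/v0/vector_content_files_search/"

# "testing" is a placeholder query; key=test123 matches the edited payload.
params = {"q": "testing", "key": "test123"}
url = f"{BASE_URL}{ENDPOINT}?{urlencode(params)}"
print(url)
```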

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions · 2026-04-15T20:04:29Z

OpenAPI Changes

3 changes: 0 error, 0 warning, 3 info

View full changelog

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

gitguardian · 2026-04-15T23:13:53Z

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

GitGuardian id	GitGuardian status	Secret	Commit	Filename
29838531	Triggered	reCAPTCHA Key	`803baf0`	env/shared.local.example.env	View secret

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secret safely. Learn here the best practices.
Revoke and rotate this secret.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider

following these best practices for managing and storing secrets including API keys and other credentials
install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Copilot

Pull request overview

Adds facet aggregations (facet counts) to the Qdrant/vector search endpoints for learning resources and content files, matching the response shape used by the existing OpenSearch endpoints so the UI can switch between them.

Changes:

Add Qdrant facet aggregation support and return metadata.aggregations from vector search endpoints.
Adjust vector search payload retrieval behavior (notably for content files) and fix scroll mocking/offset handling.
Update frontend vector-search query params (including aggregations) and regenerate OpenAPI + TS client.
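As a rough illustration of the first change, per-field facet counts can be gathered concurrently and reshaped into buckets. Everything named here is an assumption for the sketch: `FakeFacetClient`, `fetch_aggregations`, the collection name, and the `key`/`doc_count` bucket shape (the form OpenSearch terms aggregations use) are not taken from the PR's code.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FacetHit:
    value: str
    count: int


class FakeFacetClient:
    """Stand-in for a vector store client; returns canned facet hits per field."""

    def __init__(self, data: dict[str, list[FacetHit]]):
        self._data = data

    async def facet(self, collection_name: str, key: str) -> list[FacetHit]:
        return self._data.get(key, [])


async def fetch_aggregations(client, collection_name: str, keys: list[str]) -> dict:
    """Request one facet per key concurrently and reshape into buckets."""
    responses = await asyncio.gather(
        *(client.facet(collection_name, key) for key in keys)
    )
    return {
        key: [{"key": hit.value, "doc_count": hit.count} for hit in hits]
        for key, hits in zip(keys, responses)
    }


client = FakeFacetClient({"offered_by": [FacetHit("ocw", 12), FacetHit("mitx", 3)]})
aggregations = asyncio.run(
    fetch_aggregations(client, "resources", ["offered_by", "topic"])
)
```

Issuing the per-field requests with `asyncio.gather` keeps the added latency close to that of the slowest single facet rather than the sum of all of them.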

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
vector_search/views_test.py	Update Qdrant scroll mock return shape to `(points, next_offset)`
vector_search/views.py	Add aggregation fetching and adjust payload/search params + offset/scroll behavior
vector_search/utils_test.py	Add unit tests for `async_qdrant_aggregations`
vector_search/utils.py	Implement `async_qdrant_aggregations`, introduce `COLLECTION_PARAM_MAP`, tweak hits extraction
vector_search/serializers.py	Add `aggregations` request param, return `metadata.aggregations`, add `published` param
vector_search/constants.py	Add new param-map keys and retrieval payload constants + `COLLECTION_PARAM_MAP`
openapi/specs/v0.yaml	Document new query params (`aggregations`, `published`) in vector endpoints
main/settings.py	Change default hybrid prefetch tuning values
frontends/main/src/page-components/SearchDisplay/SearchDisplay.tsx	Pass `aggregations` through to vector search requests and unify query option selection
frontends/main/src/app-pages/SearchPage/SearchPage.test.tsx	Remove a vector search behavior test (needs replacement/update)
frontends/api/src/generated/v0/api.ts	Regenerated client types/params for new vector endpoint query params
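The `(points, next_offset)` return shape mentioned for `vector_search/views_test.py` can be mocked along these lines. This is a generic `unittest.mock` sketch; the point contents and collection name are made up, and it is not the test code from this PR.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the Qdrant client used by the views.
mock_client = MagicMock()
points = [{"id": 1, "payload": {"readable_id": "course-1"}}]
# scroll yields a (points, next_offset) pair; None means no further pages.
mock_client.scroll.return_value = (points, None)

results, next_offset = mock_client.scroll(collection_name="resources", limit=10)
```

A mock that returns a bare list instead of the pair would fail at the tuple unpacking, which is presumably why the mock shape had to change.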

Copilot · 2026-04-15T23:24:42Z

  test("Toggling facets", async () => {
    setMockApiResponses({


This PR removes the test covering vector search request construction/behavior. Since vector search params now include aggregations (and the count UI behavior changed), this should be updated rather than deleted to preserve coverage that the correct endpoint/params are used and the UI renders expected count/facets.

Copilot · 2026-04-15T23:24:42Z

+    aggregation_choices = [
+        (key, key.replace("_", " ").title()) for key in QDRANT_RESOURCE_PARAM_MAP
+    ]
+    aggregations = serializers.ListField(
+        required=False,
+        child=serializers.ChoiceField(choices=aggregation_choices),
+        help_text=(
+            f"aggregations for facet counts \
+            \n\n{build_choice_description_list(aggregation_choices)}"
+        ),
+    )


The aggregations ChoiceField is currently built from all keys in QDRANT_RESOURCE_PARAM_MAP, which diverges from the OpenSearch API’s allowed aggregation set (LEARNING_RESOURCE_AGGREGATIONS in learning_resources_search/serializers.py). If the goal is 1:1 request compatibility, consider restricting this list to the same aggregation keys as OpenSearch (and/or explicitly documenting any intentional differences), since allowing high-cardinality fields like readable_id, title, or url can be expensive to facet on.
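One way to apply this suggestion is to intersect the param map with the allowlist when building the choices. The constant values below are invented for illustration; the real `QDRANT_RESOURCE_PARAM_MAP` and `LEARNING_RESOURCE_AGGREGATIONS` live in the project and will differ.

```python
# Hypothetical values; the real constants live in the project's modules.
QDRANT_RESOURCE_PARAM_MAP = {
    "readable_id": "readable_id",
    "title": "title",
    "offered_by": "offered_by.code",
    "resource_type": "resource_type",
}
LEARNING_RESOURCE_AGGREGATIONS = ["resource_type", "offered_by", "topic"]

# Keep only keys that are both allowlisted and present in the param map,
# preserving the allowlist order.
aggregation_choices = [
    (key, key.replace("_", " ").title())
    for key in LEARNING_RESOURCE_AGGREGATIONS
    if key in QDRANT_RESOURCE_PARAM_MAP
]
```

With these sample values, `readable_id` and `title` drop out of the choices, and `topic` is skipped because the sample param map has no entry for it.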

Copilot · 2026-04-15T23:24:42Z

+    aggregation_choices = [
+        (key, key.replace("_", " ").title()) for key in QDRANT_CONTENT_FILE_PARAM_MAP
+    ]
+    aggregations = serializers.ListField(
+        required=False,
+        child=serializers.ChoiceField(choices=aggregation_choices),
+        help_text=(
+            f"aggregations for facet counts \
+            \n\n{build_choice_description_list(aggregation_choices)}"
+        ),
+    )


The aggregations ChoiceField is currently built from all keys in QDRANT_CONTENT_FILE_PARAM_MAP, which diverges from the OpenSearch content-file search API’s aggregation allowlist (CONTENT_FILE_AGGREGATIONS in learning_resources_search/serializers.py). If compatibility is the goal, consider restricting these choices to the same keys to avoid exposing expensive/high-cardinality facet fields (e.g., checksum/description) and to keep client expectations aligned.

Copilot · 2026-04-15T23:24:42Z

 VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER = get_int(
-    name="VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER", default=20
+    name="VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER", default=5
 )
 VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT = get_int(
-    name="VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT", default=10000
+    name="VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT", default=500
 )


The defaults for VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER (20 → 5) and VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT (10000 → 500) change hybrid vector search behavior globally and may affect result quality/recall. If this is an intentional tuning change, it would help to document the rationale (or keep the previous defaults and rely on env overrides) to avoid surprising behavior changes across environments.
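To get a feel for the size of this change, assume the two settings combine as "multiply the requested page size, then cap". That rule is a guess for illustration only; this thread does not show how the views actually use the settings, and `prefetch_limit` is a hypothetical helper.

```python
def prefetch_limit(page_limit: int, multiplier: int, max_limit: int) -> int:
    """Assumed rule: over-fetch by a multiplier, capped at a maximum."""
    return min(page_limit * multiplier, max_limit)


# Compare the old and new defaults for a hypothetical page size of 100.
old = prefetch_limit(100, multiplier=20, max_limit=10000)
new = prefetch_limit(100, multiplier=5, max_limit=500)
print(old, new)
```

Under that assumption, a 100-item page would prefetch 2000 candidates with the old defaults and 500 with the new ones, which is the kind of recall difference the comment above is flagging.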

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

mbertrand

Works well, and the contentfile search at http://open.odl.local:8065/api/v0/vector_content_files_search/?q=testing&resource_readable_id=caebd8f2-d6d9-496a-b112-4487e7d6d3d4 returns a valid result for the course's marketing page. But there is a conflict with main that needs to be resolved/rebased, and it might be good to bring back the deleted frontpage test.

mbertrand · 2026-04-16T14:11:04Z

-    const hideCountText = screen.queryByText("700 results")
-    expect(hideCountText).toBeNull()
-  })
-


Agree with the Copilot review; it is probably better to update this test rather than remove it altogether, if possible.

restored and updated

mbertrand

👍

shanbady and others added 28 commits April 8, 2026 13:37

adding aggregation generation method

878907c

adding aggregations to response

d011419

adding some optimizations and aggregations to response

8e9aa44

regen spec

f36d1cb

add published back to learning resources serializer

ab34a23

spec update

22c8d43

show facets on frontend

518c5a1

fixing aggregation counts

73a080a

fix test

f3168d6

fix typechecks

937ddf7

remove unused test

1c1ea40

adding tests for aggregations

11dfdc8

Update vector_search/serializers.py

23d9bb4

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update vector_search/views.py

7abd6f9

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fixing 'with_payload' for group by

45a3011

switch to safe getter

ab4c0f8

correct comment about dropping admin params

d310643

switching collection param map to constant

ef6c227

adding aggregation params for contentfiles

3e1ed7a

regenerate spec

d09c4b5

adding fix for hybrid search offset

493fc35

fix tests for new expected response

c1b848f

fix contentfile metadata

de908ef

make hits and get_results same for both serializers

2668365

fixing skip with relation to offsets

a17581b

tune prefetch multiplier

15663a3

gather count with hits

7d1c544

adding fix for fields returned by contentfile endpoint

9a8f7db

Merge branch 'main' into shanbady/vector-search-facets

803baf0

shanbady marked this pull request as ready for review April 15, 2026 23:19

Copilot AI review requested due to automatic review settings April 15, 2026 23:19

shanbady added the Needs Review An open Pull Request that is ready for review label Apr 15, 2026

Copilot started reviewing on behalf of shanbady April 15, 2026 23:20 View session

sentry Bot reviewed Apr 15, 2026

Comment thread vector_search/serializers.py Outdated

Copilot AI reviewed Apr 15, 2026

shanbady and others added 2 commits April 15, 2026 19:26

default hits to list

027114c

Update vector_search/utils.py

d654892

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

mbertrand self-assigned this Apr 16, 2026

mbertrand reviewed Apr 16, 2026

shanbady added 4 commits April 16, 2026 10:23

Merge branch 'main' into shanbady/vector-search-facets

f58178d

restore and update js test for vector hybrid search facet results

13c0593

move published to resource specific serializer field

2348970

update spec

7fa59dd

mbertrand approved these changes Apr 16, 2026

mbertrand added Waiting on author and removed Needs Review An open Pull Request that is ready for review labels Apr 16, 2026

shanbady merged commit 041edaf into main Apr 16, 2026
14 checks passed

shanbady deleted the shanbady/vector-search-facets branch April 16, 2026 16:02

This was referenced Apr 16, 2026

Release 0.63.6 #3219

Closed

Release 0.63.7 #3224

Closed

Release 0.64.0 #3226

Merged


3 participants
