Skip to content

Facet counts and aggregations for Vector search#3210

Merged
shanbady merged 35 commits into
mainfrom
shanbady/vector-search-facets
Apr 16, 2026
Merged

Facet counts and aggregations for Vector search#3210
shanbady merged 35 commits into
mainfrom
shanbady/vector-search-facets

Conversation

@shanbady
Copy link
Copy Markdown
Contributor

@shanbady shanbady commented Apr 15, 2026

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/10641

Description (What does it do?)

This PR adds aggregations and returning facet counts in the learning resources and contentfiles endpoints. The request and response format is 1-1 with the existing opensearch endpoint (so we can simply switch the endpoint and the UI will work for both)

How can this be tested?

  1. checkout this branch
  2. make sure you have resources loaded and embeddings (atleast for learning resources). if not - run python manage.py generate_embeddings --courses --skip-contentfiles --recreate-collections
  3. login as a root/admin user locally
  4. visit your local search page on the lift panel you should see an option to enable "vector hybrid search". Once enabled, try a few searches you should see facets and counts working as they do with the search endpoints.

Additional Fixes

There was a bug that occurred due to vector_search.constants.CONTENT_FILES_RETRIEVE_PAYLOAD being set to only ["key", "run_readable_id"] that is specific to the contentfiles endpoint. Unlike the learning resources endpoint where we expect a 1-1 mapping of items in the qdrant collection to learning resources in the database, the vector search collection contains many items that do not map to contentfiles in the database (like the course metadata document) - also the vector endpoint returns "content_chunk" instead of "content" in the payload response.

The fix was to revert to previous payload behavior on just the contentfile collection by setting CONTENT_FILES_RETRIEVE_PAYLOAD to True so it returns all the payload from qdrant which it then falls back to during the serialization process.

The fix is in this one commit

testing the new bug fix

The easiest way to test this issue+fix is:

  1. checkout the commit just before the bug fix git checkout 9b381d3d528136dd5aee7f16606b4b210c608298 (this is a commit on main with the issue just prior to me reverting the changes that caused it)
  2. make sure you have some contentfiles embedded locally
  3. visit your local qdrant contentfiles collection in the dashboard
  4. find some random point in the collection
  5. click the "edit payload" button Screenshot 2026-04-15 at 7 08 02 PM
  6. change the "resource_readable_id", "key", and "run_readable_id" to "test123" (so they dont map to any real contentfile in the database) and save
  7. visit the contentfiles endpoint (include some query q= param and key=test123) link
  8. note that it only returns 1 record with the "key" attribute
  9. checkout this branch - restart your web container and visit the same link and note that all the payload from qdrant is returned

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 15, 2026

OpenAPI Changes

3 changes: 0 error, 0 warning, 3 info

View full changelog

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

@gitguardian
Copy link
Copy Markdown

gitguardian Bot commented Apr 15, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
29838531 Triggered reCAPTCHA Key 803baf0 env/shared.local.example.env View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@shanbady shanbady marked this pull request as ready for review April 15, 2026 23:19
Copilot AI review requested due to automatic review settings April 15, 2026 23:19
@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Apr 15, 2026
Comment thread vector_search/serializers.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds facet aggregations (facet counts) to the Qdrant/vector search endpoints for learning resources and content files, matching the response shape used by the existing OpenSearch endpoints so the UI can switch between them.

Changes:

  • Add Qdrant facet aggregation support and return metadata.aggregations from vector search endpoints.
  • Adjust vector search payload retrieval behavior (notably for content files) and fix scroll mocking/offset handling.
  • Update frontend vector-search query params (including aggregations) and regenerate OpenAPI + TS client.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
vector_search/views_test.py Update Qdrant scroll mock return shape to (points, next_offset)
vector_search/views.py Add aggregation fetching and adjust payload/search params + offset/scroll behavior
vector_search/utils_test.py Add unit tests for async_qdrant_aggregations
vector_search/utils.py Implement async_qdrant_aggregations, introduce COLLECTION_PARAM_MAP, tweak hits extraction
vector_search/serializers.py Add aggregations request param, return metadata.aggregations, add published param
vector_search/constants.py Add new param-map keys and retrieval payload constants + COLLECTION_PARAM_MAP
openapi/specs/v0.yaml Document new query params (aggregations, published) in vector endpoints
main/settings.py Change default hybrid prefetch tuning values
frontends/main/src/page-components/SearchDisplay/SearchDisplay.tsx Pass aggregations through to vector search requests and unify query option selection
frontends/main/src/app-pages/SearchPage/SearchPage.test.tsx Remove a vector search behavior test (needs replacement/update)
frontends/api/src/generated/v0/api.ts Regenerated client types/params for new vector endpoint query params

Comment thread vector_search/serializers.py
Comment thread vector_search/serializers.py
Comment thread vector_search/utils.py
Comment thread vector_search/utils.py Outdated
Comment on lines 155 to 156
test("Toggling facets", async () => {
setMockApiResponses({
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR removes the test covering vector search request construction/behavior. Since vector search params now include aggregations (and the count UI behavior changed), this should be updated rather than deleted to preserve coverage that the correct endpoint/params are used and the UI renders expected count/facets.

Copilot generated this review using guidance from repository custom instructions.
Comment thread vector_search/serializers.py Outdated
Comment on lines +42 to +52
aggregation_choices = [
(key, key.replace("_", " ").title()) for key in QDRANT_RESOURCE_PARAM_MAP
]
aggregations = serializers.ListField(
required=False,
child=serializers.ChoiceField(choices=aggregation_choices),
help_text=(
f"aggregations for facet counts \
\n\n{build_choice_description_list(aggregation_choices)}"
),
)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aggregations ChoiceField is currently built from all keys in QDRANT_RESOURCE_PARAM_MAP, which diverges from the OpenSearch API’s allowed aggregation set (LEARNING_RESOURCE_AGGREGATIONS in learning_resources_search/serializers.py). If the goal is 1:1 request compatibility, consider restricting this list to the same aggregation keys as OpenSearch (and/or explicitly documenting any intentional differences), since allowing high-cardinality fields like readable_id, title, or url can be expensive to facet on.

Copilot uses AI. Check for mistakes.
Comment on lines +221 to +231
aggregation_choices = [
(key, key.replace("_", " ").title()) for key in QDRANT_CONTENT_FILE_PARAM_MAP
]
aggregations = serializers.ListField(
required=False,
child=serializers.ChoiceField(choices=aggregation_choices),
help_text=(
f"aggregations for facet counts \
\n\n{build_choice_description_list(aggregation_choices)}"
),
)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aggregations ChoiceField is currently built from all keys in QDRANT_CONTENT_FILE_PARAM_MAP, which diverges from the OpenSearch content-file search API’s aggregation allowlist (CONTENT_FILE_AGGREGATIONS in learning_resources_search/serializers.py). If compatibility is the goal, consider restricting these choices to the same keys to avoid exposing expensive/high-cardinality facet fields (e.g., checksum/description) and to keep client expectations aligned.

Copilot uses AI. Check for mistakes.
Comment thread main/settings.py
Comment on lines 824 to 829
VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER = get_int(
name="VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER", default=20
name="VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER", default=5
)
VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT = get_int(
name="VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT", default=10000
name="VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT", default=500
)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defaults for VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER (20 → 5) and VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT (10000 → 500) change hybrid vector search behavior globally and may affect result quality/recall. If this is an intentional tuning change, it would help to document the rationale (or keep the previous defaults and rely on env overrides) to avoid surprising behavior changes across environments.

Copilot uses AI. Check for mistakes.
shanbady and others added 2 commits April 15, 2026 19:26
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@mbertrand mbertrand self-assigned this Apr 16, 2026
Copy link
Copy Markdown
Member

@mbertrand mbertrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works well and the contenfile search at http://open.odl.local:8065/api/v0/vector_content_files_search/?q=testing&resource_readable_id=caebd8f2-d6d9-496a-b112-4487e7d6d3d4 returns a valid result for the course's marketing page. But there is a conflict with main that needs to be resolves/rebased and it might be good to bring back the deleted frontpage test

const hideCountText = screen.queryByText("700 results")
expect(hideCountText).toBeNull()
})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the copilot review, probably better to update this test rather than remove it altogether, if possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored and updated

Copy link
Copy Markdown
Member

@mbertrand mbertrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mbertrand mbertrand added Waiting on author and removed Needs Review An open Pull Request that is ready for review labels Apr 16, 2026
@shanbady shanbady merged commit 041edaf into main Apr 16, 2026
14 checks passed
@shanbady shanbady deleted the shanbady/vector-search-facets branch April 16, 2026 16:02
This was referenced Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants