Facet counts and aggregations for Vector search#3210
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
OpenAPI Changes3 changes: 0 error, 0 warning, 3 info Unexpected changes? Ensure your branch is up-to-date with |
|
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 29838531 | Triggered | reCAPTCHA Key | 803baf0 | env/shared.local.example.env | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future consider
- following these best practices for managing and storing secrets including API keys and other credentials
- install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
There was a problem hiding this comment.
Pull request overview
Adds facet aggregations (facet counts) to the Qdrant/vector search endpoints for learning resources and content files, matching the response shape used by the existing OpenSearch endpoints so the UI can switch between them.
Changes:
- Add Qdrant facet aggregation support and return
metadata.aggregationsfrom vector search endpoints. - Adjust vector search payload retrieval behavior (notably for content files) and fix scroll mocking/offset handling.
- Update frontend vector-search query params (including
aggregations) and regenerate OpenAPI + TS client.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| vector_search/views_test.py | Update Qdrant scroll mock return shape to (points, next_offset) |
| vector_search/views.py | Add aggregation fetching and adjust payload/search params + offset/scroll behavior |
| vector_search/utils_test.py | Add unit tests for async_qdrant_aggregations |
| vector_search/utils.py | Implement async_qdrant_aggregations, introduce COLLECTION_PARAM_MAP, tweak hits extraction |
| vector_search/serializers.py | Add aggregations request param, return metadata.aggregations, add published param |
| vector_search/constants.py | Add new param-map keys and retrieval payload constants + COLLECTION_PARAM_MAP |
| openapi/specs/v0.yaml | Document new query params (aggregations, published) in vector endpoints |
| main/settings.py | Change default hybrid prefetch tuning values |
| frontends/main/src/page-components/SearchDisplay/SearchDisplay.tsx | Pass aggregations through to vector search requests and unify query option selection |
| frontends/main/src/app-pages/SearchPage/SearchPage.test.tsx | Remove a vector search behavior test (needs replacement/update) |
| frontends/api/src/generated/v0/api.ts | Regenerated client types/params for new vector endpoint query params |
| test("Toggling facets", async () => { | ||
| setMockApiResponses({ |
There was a problem hiding this comment.
This PR removes the test covering vector search request construction/behavior. Since vector search params now include aggregations (and the count UI behavior changed), this should be updated rather than deleted to preserve coverage that the correct endpoint/params are used and the UI renders expected count/facets.
| aggregation_choices = [ | ||
| (key, key.replace("_", " ").title()) for key in QDRANT_RESOURCE_PARAM_MAP | ||
| ] | ||
| aggregations = serializers.ListField( | ||
| required=False, | ||
| child=serializers.ChoiceField(choices=aggregation_choices), | ||
| help_text=( | ||
| f"aggregations for facet counts \ | ||
| \n\n{build_choice_description_list(aggregation_choices)}" | ||
| ), | ||
| ) |
There was a problem hiding this comment.
The aggregations ChoiceField is currently built from all keys in QDRANT_RESOURCE_PARAM_MAP, which diverges from the OpenSearch API’s allowed aggregation set (LEARNING_RESOURCE_AGGREGATIONS in learning_resources_search/serializers.py). If the goal is 1:1 request compatibility, consider restricting this list to the same aggregation keys as OpenSearch (and/or explicitly documenting any intentional differences), since allowing high-cardinality fields like readable_id, title, or url can be expensive to facet on.
| aggregation_choices = [ | ||
| (key, key.replace("_", " ").title()) for key in QDRANT_CONTENT_FILE_PARAM_MAP | ||
| ] | ||
| aggregations = serializers.ListField( | ||
| required=False, | ||
| child=serializers.ChoiceField(choices=aggregation_choices), | ||
| help_text=( | ||
| f"aggregations for facet counts \ | ||
| \n\n{build_choice_description_list(aggregation_choices)}" | ||
| ), | ||
| ) |
There was a problem hiding this comment.
The aggregations ChoiceField is currently built from all keys in QDRANT_CONTENT_FILE_PARAM_MAP, which diverges from the OpenSearch content-file search API’s aggregation allowlist (CONTENT_FILE_AGGREGATIONS in learning_resources_search/serializers.py). If compatibility is the goal, consider restricting these choices to the same keys to avoid exposing expensive/high-cardinality facet fields (e.g., checksum/description) and to keep client expectations aligned.
| VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER = get_int( | ||
| name="VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER", default=20 | ||
| name="VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER", default=5 | ||
| ) | ||
| VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT = get_int( | ||
| name="VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT", default=10000 | ||
| name="VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT", default=500 | ||
| ) |
There was a problem hiding this comment.
The defaults for VECTOR_HYBRID_SEARCH_PREFETCH_MULTIPLIER (20 → 5) and VECTOR_HYBRID_SEARCH_PREFETCH_MAX_LIMIT (10000 → 500) change hybrid vector search behavior globally and may affect result quality/recall. If this is an intentional tuning change, it would help to document the rationale (or keep the previous defaults and rely on env overrides) to avoid surprising behavior changes across environments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
mbertrand
left a comment
There was a problem hiding this comment.
Works well and the contenfile search at http://open.odl.local:8065/api/v0/vector_content_files_search/?q=testing&resource_readable_id=caebd8f2-d6d9-496a-b112-4487e7d6d3d4 returns a valid result for the course's marketing page. But there is a conflict with main that needs to be resolves/rebased and it might be good to bring back the deleted frontpage test
| const hideCountText = screen.queryByText("700 results") | ||
| expect(hideCountText).toBeNull() | ||
| }) | ||
|
|
There was a problem hiding this comment.
Agree with the copilot review, probably better to update this test rather than remove it altogether, if possible.
There was a problem hiding this comment.
restored and updated
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/10641
Description (What does it do?)
This PR adds aggregations and returning facet counts in the learning resources and contentfiles endpoints. The request and response format is 1-1 with the existing opensearch endpoint (so we can simply switch the endpoint and the UI will work for both)
How can this be tested?
python manage.py generate_embeddings --courses --skip-contentfiles --recreate-collectionsAdditional Fixes
There was a bug that occurred due to vector_search.constants.CONTENT_FILES_RETRIEVE_PAYLOAD being set to only
["key", "run_readable_id"]that is specific to the contentfiles endpoint. Unlike the learning resources endpoint where we expect a 1-1 mapping of items in the qdrant collection to learning resources in the database, the vector search collection contains many items that do not map to contentfiles in the database (like the course metadata document) - also the vector endpoint returns "content_chunk" instead of "content" in the payload response.The fix was to revert to previous payload behavior on just the contentfile collection by setting CONTENT_FILES_RETRIEVE_PAYLOAD to True so it returns all the payload from qdrant which it then falls back to during the serialization process.
The fix is in this one commit
testing the new bug fix
The easiest way to test this issue+fix is:
git checkout 9b381d3d528136dd5aee7f16606b4b210c608298(this is a commit on main with the issue just prior to me reverting the changes that caused it)