Use Case
I'm building an AI agent system that uses Hindsight's recall API for memory retrieval. My application needs to make intelligent decisions about which retrieved results to act on — not just consume everything returned, but selectively use results based on their confidence/relevance level. Today I have to enable `trace: true` and parse debug output just to get any score signal, which is verbose and not suitable for production use.
Problem Statement
Currently, the recall API does not return any numeric score alongside results. The documentation acknowledges that scores exist internally but intentionally omits them from the response, citing that "raw retrieval scores are not meaningful on an absolute scale." While I understand the reasoning around relative ordering, this design creates three concrete problems:
1. **Transparency:** Users cannot see why certain results ranked higher than others. The ordering alone gives no insight into the confidence gap between results.
2. **Debugging:** Diagnosing poor retrieval quality requires enabling `trace: true`, which dumps verbose internal debug data. This is not practical in production pipelines and leaks internal implementation details.
3. **Custom filtering:** Advanced users cannot implement threshold-based filtering logic. For example, in an agent system, I may want to discard results below a certain relevance confidence rather than blindly passing all `top_k` results to the LLM context.
Scores *are* already computed internally (cross-encoder score, temporal proximity boost, recency boost, final combined score) and *are* already exposed via `trace: true`. The gap is simply that there's no lightweight way to get the final combined score without opting into full trace mode.
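To make the gap concrete, here is a minimal sketch of today's workaround versus the proposal. The endpoint path, the response envelope, and especially the trace key path are assumptions for illustration only, not Hindsight's documented wire format:

```python
import requests

# Hypothetical endpoint and payload shapes, for illustration only.
RECALL_URL = "http://localhost:8080/recall"

# Today: the only way to get any score signal is full trace mode,
# then digging the combined score out of verbose debug output.
# (The trace key path below is a placeholder, not documented structure.)
resp = requests.post(RECALL_URL, json={"query": "...", "top_k": 5, "trace": True})
for result in resp.json()["results"]:
    combined = result.get("trace", {}).get("combined_score")

# Proposed: one lightweight flag returns the final score directly.
resp = requests.post(RECALL_URL, json={"query": "...", "top_k": 5, "include_scores": True})
for result in resp.json()["results"]:
    combined = result["score"]  # normalized 0.0-1.0 per this proposal
```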
How This Feature Would Help
With this feature, I would be able to:
- Filter low-confidence results in my agent pipeline before passing context to an LLM, reducing hallucination risk from irrelevant memories
- Understand retrieval quality at a glance without needing to enable trace mode or parse verbose debug payloads
- Implement dynamic `top_k` logic — e.g., take up to 10 results but stop early if the score drops below a threshold (see the sketch after this list)
- Monitor retrieval health in production by logging score distributions over time, enabling proactive detection of embedding drift or recall degradation
- Build user-facing explanations that show why a particular memory was surfaced in an AI application
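As a concrete sketch of the filtering and dynamic `top_k` patterns above: everything here is hypothetical (`select_context`, `SCORE_THRESHOLD`, and `MAX_RESULTS` are names I made up), and the `score` field is the proposed addition, not an existing one:

```python
SCORE_THRESHOLD = 0.6   # application-specific tuning value
MAX_RESULTS = 10

def select_context(results):
    """Pick memories to pass to the LLM from a recall response.

    Assumes `results` is already sorted best-first and that each
    item carries the proposed `score` field (normalized 0.0-1.0).
    Takes up to MAX_RESULTS items but stops early once the score
    drops below the threshold, i.e., the dynamic top_k pattern.
    """
    selected = []
    for result in results[:MAX_RESULTS]:
        if result["score"] < SCORE_THRESHOLD:
            break  # results are ordered, so everything after is weaker
        selected.append(result)
    return selected
```

The same `score` values could also be logged to a metrics system for the score-distribution monitoring described above.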
Proposed API change (non-breaking):
Add an optional `include_scores` parameter (default: `false`) to the recall request. When enabled, each result object includes a `score` field with the final normalized combined score (0.0–1.0):
```json
{
  "results": [
    {
      "id": "...",
      "content": "...",
      "score": 0.87
    }
  ]
}
```
This is strictly additive — it does not change existing behavior and preserves the current default experience. Advanced users who want the full breakdown can still use `trace: true`.
Proposed Solution
It would be great if Hindsight could expose scores in the recall response via an opt-in flag (e.g., `include_scores: true`). The score returned should be the final normalized combined score already computed by the ranking pipeline — no new computation needed.
Additionally, an optional `score_threshold` request parameter would allow server-side filtering, so users don't have to over-fetch and filter client-side:
```json
{
  "query": "...",
  "top_k": 10,
  "include_scores": true,
  "score_threshold": 0.6
}
```
This keeps the API ergonomic for simple use cases (no scores by default) while unlocking powerful patterns for advanced users building production agent systems.
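Putting the two proposed parameters together, a client call might look like the following. The endpoint URL and response envelope are assumptions for illustration; only `include_scores` and `score_threshold` are the fields actually being proposed:

```python
import requests

# Hypothetical endpoint; adjust to however the recall API is exposed.
resp = requests.post(
    "http://localhost:8080/recall",
    json={
        "query": "notes about the quarterly launch",
        "top_k": 10,
        "include_scores": True,
        "score_threshold": 0.6,  # server-side: drop weaker results
    },
)
resp.raise_for_status()

for result in resp.json()["results"]:
    # Every surviving result carries the final combined score, so
    # downstream code can log it or act on it without trace mode.
    print(result["id"], result["score"])
```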
Alternatives Considered
No response
Priority
Nice to have
Additional Context
No response
Checklist