From 83d0bf0f74d002ab7c344da7c23c2c9c7e0ff2a6 Mon Sep 17 00:00:00 2001 From: jazairi <16103405+jazairi@users.noreply.github.com> Date: Thu, 7 Aug 2025 10:35:15 -0400 Subject: [PATCH 1/4] Add ADR for surfacing Primo CDI records in results Why these changes are being introduced: The proposed unified search interface would display records from Primo CDI (via Primo Search API) and Alma (via TIMDEX API) in the same results list. TIMDEX UI does not currently have a means to combine results from multiple APIs in this way. Relevant ticket(s): N/A How this addresses that need: This adds an ADR that outlines a proposed solution to this problem, by introducing a search orchestration layer that will handle API calls and results normalization. Side effects of this change: There are additional decisions to be made around the architecture of the search orchestrator, such as how to manage relevance normalization. These decisions are noted in the ADR and will be explored in future ADRs. --- ...03-surface-primo-cdi-records-in-results.md | 135 ++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md diff --git a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md new file mode 100644 index 00000000..dc13cc32 --- /dev/null +++ b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md @@ -0,0 +1,135 @@ +# 3. Surface Primo CDI records in results + +Date: 2025-08-07 + +## Status + +Accepted + +## Context + +The Libraries' unified search strategy calls for a discovery interface that surfaces results from +both Primo Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the current [Bento UI](https://github.com/MITLibraries/bento). +In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would +interleave CDI and TIMDEX records in the same results list. 
+ +We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX +API. However, this approach is not feasible for many reasons: + +- **Cost**: CDI contains over 5 billion records. Harvesting and storing these records would be impractical and expensive, both in terms of financial and compute resources. +- **Performance**: Expanding TIMDEX API at such a scale is likely to dramatically reduce the efficiency of our OpenSearch index. +- **Data availability**: Because Primo does not expose CDI records in OAI-PMH, we would need to harvest using the Primo Search API, making the process needlessly complex and perhaps impossible. +- **Licensing**: Harvesting CDI records for TIMDEX likely has licensing implications. Ex Libris seems to discourage the practice, as Primo does not provide OAI-PMH support, and the Search API caps records per request at 5,000 via the [`offset` parameter](https://developers.exlibrisgroup.com/primo/apis/docs/primoSearch/R0VUIC9wcmltby92MS9zZWFyY2g=/#output:~:text=Note%3A%20The%20Primo%20search%20API%20has%20a%20hardcoded%20offset%20limitation%20parameter%20of%205000.). + +## Decision + +We will surface CDI results in TIMDEX UI by querying the Primo Search API directly at runtime and +interleaving results with TIMDEX API results in the unified search interface. + +To achieve this, we will implement a search orchestrator that receives a query from TIMDEX UI and +dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator will normalize and +interleave the results before returning them to the UI. + +This approach aligns with the unified search strategy's goal to display all known results from +CDI and TIMDEX in the same interface. It also enables us to add the desired intelligent user +guidance, because we can render search interventions from TACOS and other external systems as +needed. 
+ +### Proposed architecture + +```mermaid +sequenceDiagram + +participant UI as TIMDEX UI (frontend) +participant Orchestrator as Search Orchestrator (middleware) +participant TIMDEX as TIMDEX API (OpenSearch) +participant Primo as Primo Search API (CDI) +participant TACOS as TACOS (query enhancer) + +UI-->>Orchestrator: User submits search query +UI-->>TACOS: Send query to TACOS +TACOS-->>UI: Return patterns identified in query (e.g., suggested resources, citations, journal titles) +Orchestrator-->>TIMDEX: Send query to TIMDEX API +Orchestrator-->>Primo: Send query to Primo CDI API +TIMDEX-->>Orchestrator: Return TIMDEX results +Primo-->>Orchestrator: Return CDI results +Orchestrator->>Orchestrator: Normalize & interleave results +Orchestrator-->>UI: Return unified result set +UI->>UI: Render interventions based on TACOS response +UI->>UI: Render results in a single list +``` + +Search form submissions will be sent in parallel to the search orchestrator and TACOS (possibly +using Turbo frames, but implementation details are TBD). This will allow us to continue rendering +TACOS interventions rapidly, likely before results are returned to the UI. + +The orchestrator will make asynchronous calls to the TIMDEX and Primo Search APIs. Records in each +response will be normalized and interleaved into a unified set of results, then returned back to +TIMDEX UI. In addition to record metadata, relevance scores must also be normalized due to the +disparate sources. (See 'Relevance normalization' below for more details.) + +This architecture abstracts out most of the added complexity to the search orchestrator. The UI +will be responsible only for sending queries to external systems and rendering the returned data. +This abstraction will improve our discovery environment's maintainability by avoiding excessively +complex codebases. + +### Relevance normalization + +The interleaving of results from TIMDEX and CDI introduces the problem of relevance normalization. 
+While it is beyond the scope of this ADR to identify a solution to this problem, it is something we +should consider as an important future step. + +Primo uses an opaque, proprietary relevance algorithm. While the algorithm is +[somewhat customizable](https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/020Primo_VE/Primo_VE_(English)/040Search_Configurations/Configuring_the_Ranking_of_Search_Results_in_Primo_VE), +we cannot assume any correlation between Primo scores and Okapi BM25 scores. + +Premature optimization is a risk here. If we normalize scores without understanding what results +are actually useful, we might miss an opportunity to improve the search experience. Therefore, we +should avoid implementing relevance normalization until we have useful analytics. These might +include: + +- Score distribution from each source +- User interaction data (e.g., do users click on CDI records more than TIMDEX records?) +- Usability testing data + +We could begin by implementing rank-based interleaving (i.e., the first two results in the unified +list would be the first two results from each source). While naive, such an algorithm would provide +an heuristic against which to measure future normalization attempts. + +Once we have more information, we could then evaluate different normalization strategies. Techniques +like [min-max](https://opensearch.org/blog/how-does-the-rank-normalization-work-in-hybrid-search/#:~:text=3.%20Min%2Dmax%20normalization%20technique) +or [z-score](https://spotintelligence.com/2025/02/14/z-score-normalization/) would be relatively +easy to implement. However, in order to make scores semantically comparable, it seems likely that we +would need an ML-backed approach that could also help with reranking. + +To that end, **we should strongly consider writing the search orchestrator in Python**, due to +greater availability of ML libraries. 
Alternatively, we can write the orchestrator in Rails and +tack on the normalization component as a Python microservice. + +## Consequences + +### Pros + +- Avoids duplicating CDI data or violating licensing terms. +- Enables real-time access to CDI content via Primo Search API. +- Supports the unified search vision without overloading TIMDEX API. + +### Cons + +- Requires runtime integration with Primo Search API, which may introduce latency or complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.) +- Limits computational access to CDI records (no bulk access via TIMDEX). +- Mixed-source results may confuse end users. + +### Future Considerations + +Usability testing and analytics will inform how we refine this feature. Depending on how users +interact with the single-stream UI, we may need visual clarification of each record's source API, or +separate tabs for TIMDEX and Primo records. + +Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we +should not assume this to be a long-term solution. + +As previously mentioned, this solution does not provide computational access to CDI records via +TIMDEX. We should connect with the MIT research community to determine whether such access would +be useful. If there is a need, we could consider harvesting a subset of CDI records relevant to the +use case. 
\ No newline at end of file From f22382804bd5140da5e7154b8c16000ed81821cf Mon Sep 17 00:00:00 2001 From: jazairi <16103405+jazairi@users.noreply.github.com> Date: Mon, 11 Aug 2025 09:05:00 -0400 Subject: [PATCH 2/4] Revisions based on initial feedback --- ...03-surface-primo-cdi-records-in-results.md | 91 +++++++++++-------- 1 file changed, 51 insertions(+), 40 deletions(-) diff --git a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md index dc13cc32..965877e9 100644 --- a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md +++ b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md @@ -13,22 +13,46 @@ both Primo Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the cu In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would interleave CDI and TIMDEX records in the same results list. +## Options considered + +### Harvest Primo CDI data + We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX -API. However, this approach is not feasible for many reasons: +API. This would allow us to normalize CDI records as we do with other TIMDEX sources. Querying a +single API for Alma and CDI records would facilitate a single-stream view as desired in the unified +UI. Interleaving would no longer be necessary, as all records would be stored in OpenSearch. + +The harvester model would value beyond the scope of the TIMDEX UI redesign. By storing CDI records +in TIMDEX API, we could facilitate computational access to a massive corpus of data. + +Unfortunately, this approach is not feasible for many reasons: - **Cost**: CDI contains over 5 billion records. Harvesting and storing these records would be impractical and expensive, both in terms of financial and compute resources. 
- **Performance**: Expanding TIMDEX API at such a scale is likely to dramatically reduce the efficiency of our OpenSearch index. - **Data availability**: Because Primo does not expose CDI records in OAI-PMH, we would need to harvest using the Primo Search API, making the process needlessly complex and perhaps impossible. - **Licensing**: Harvesting CDI records for TIMDEX likely has licensing implications. Ex Libris seems to discourage the practice, as Primo does not provide OAI-PMH support, and the Search API caps records per request at 5,000 via the [`offset` parameter](https://developers.exlibrisgroup.com/primo/apis/docs/primoSearch/R0VUIC9wcmltby92MS9zZWFyY2g=/#output:~:text=Note%3A%20The%20Primo%20search%20API%20has%20a%20hardcoded%20offset%20limitation%20parameter%20of%205000.). -## Decision +### Display separate result streams in tabbed views + +This option would essentially be a different take on the Bento design. On the results page, a user +could tab between Alma results (labeled 'Books', 'MIT Catalog', etc.) and CDI results ('Articles'). -We will surface CDI results in TIMDEX UI by querying the Primo Search API directly at runtime and -interleaving results with TIMDEX API results in the unified search interface. +While arguably an improvement on Bento, this design does not deliver the combined Alma/CDI results +view as envisioned in the unified UI. -To achieve this, we will implement a search orchestrator that receives a query from TIMDEX UI and -dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator will normalize and -interleave the results before returning them to the UI. +### Implement external search orchestrator + +In this approach, we would surface CDI records in TIMDEX UI by querying the Primo Search API +directly at runtime and interleaving results with TIMDEX API results in the unified search +interface. 
+ +To achieve this, we would implement a search orchestrator that receives a query from TIMDEX UI and +dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator would normalize and +interleave the results before returning them to the UI. This would allow us to display Alma and +CDI results in the same results list, without the feasibility concerns inherent in ingesting CDI +records into TIMDEX API. + +## Decision This approach aligns with the unified search strategy's goal to display all known results from CDI and TIMDEX in the same interface. It also enables us to add the desired intelligent user @@ -38,35 +62,22 @@ needed. ### Proposed architecture ```mermaid -sequenceDiagram - -participant UI as TIMDEX UI (frontend) -participant Orchestrator as Search Orchestrator (middleware) -participant TIMDEX as TIMDEX API (OpenSearch) -participant Primo as Primo Search API (CDI) -participant TACOS as TACOS (query enhancer) - -UI-->>Orchestrator: User submits search query -UI-->>TACOS: Send query to TACOS -TACOS-->>UI: Return patterns identified in query (e.g., suggested resources, citations, journal titles) -Orchestrator-->>TIMDEX: Send query to TIMDEX API -Orchestrator-->>Primo: Send query to Primo CDI API -TIMDEX-->>Orchestrator: Return TIMDEX results -Primo-->>Orchestrator: Return CDI results -Orchestrator->>Orchestrator: Normalize & interleave results -Orchestrator-->>UI: Return unified result set -UI->>UI: Render interventions based on TACOS response -UI->>UI: Render results in a single list +flowchart TD + A[User] -->|Submit search query| B[TIMDEX UI] + B -->|Send query| C[TACOS] + B -->|Send query| D[TIMDEX Search Orchestrator] + D -->|Send query| E[TIMDEX API] + D -->|Send query| F[Primo Search API] + E -->|Return results| D + F -->|Return results| D + D -->|Normalize & interleave results| B + C -->|Return interventions| B ``` -Search form submissions will be sent in parallel to the search orchestrator and TACOS (possibly -using Turbo frames, 
but implementation details are TBD). This will allow us to continue rendering -TACOS interventions rapidly, likely before results are returned to the UI. - -The orchestrator will make asynchronous calls to the TIMDEX and Primo Search APIs. Records in each -response will be normalized and interleaved into a unified set of results, then returned back to -TIMDEX UI. In addition to record metadata, relevance scores must also be normalized due to the -disparate sources. (See 'Relevance normalization' below for more details.) +The UI will dispatch the query in parallel to TACOS and the search orchestrator. TACOS responses are +then rendered immediately. The orchestrator waits for both TIMDEX and CDI responses, normalizes and +interleaves them, and returns a unified result set. This separation of concerns allows TACOS to +operate independently while the orchestrator handles result merging. This architecture abstracts out most of the added complexity to the search orchestrator. The UI will be responsible only for sending queries to external systems and rendering the returned data. @@ -94,7 +105,7 @@ include: We could begin by implementing rank-based interleaving (i.e., the first two results in the unified list would be the first two results from each source). While naive, such an algorithm would provide -an heuristic against which to measure future normalization attempts. +a baseline heuristic against which to measure future normalization attempts. Once we have more information, we could then evaluate different normalization strategies. Techniques like [min-max](https://opensearch.org/blog/how-does-the-rank-normalization-work-in-hybrid-search/#:~:text=3.%20Min%2Dmax%20normalization%20technique) @@ -117,7 +128,7 @@ tack on the normalization component as a Python microservice. ### Cons - Requires runtime integration with Primo Search API, which may introduce latency or complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.) 
-- Limits computational access to CDI records (no bulk access via TIMDEX). +- Limits computational access to CDI records (no bulk access via TIMDEX). While not a TIMDEX UI concern, this is worthy of consideration in the broader context of the TIMDEX ecosystem. - Mixed-source results may confuse end users. ### Future Considerations @@ -129,7 +140,7 @@ separate tabs for TIMDEX and Primo records. Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we should not assume this to be a long-term solution. -As previously mentioned, this solution does not provide computational access to CDI records via -TIMDEX. We should connect with the MIT research community to determine whether such access would -be useful. If there is a need, we could consider harvesting a subset of CDI records relevant to the -use case. \ No newline at end of file +We should connect with the MIT research community to determine whether computational access to CDI +would be useful. If there is a need, we could consider harvesting a subset of CDI records relevant +to the use case. This would be a significant undertaking beyond the scope of the unified search +interface, but it aligns with the Libraries' mission, vision, and goals. 
\ No newline at end of file From f080a2019233adcc8cb2204a3da63e8150846011 Mon Sep 17 00:00:00 2001 From: jazairi <16103405+jazairi@users.noreply.github.com> Date: Mon, 11 Aug 2025 09:52:59 -0400 Subject: [PATCH 3/4] Incorporate initial feedback from Jeremy --- ...03-surface-primo-cdi-records-in-results.md | 40 +++++++++---------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md index 965877e9..c66f28bb 100644 --- a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md +++ b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md @@ -9,23 +9,18 @@ Accepted ## Context The Libraries' unified search strategy calls for a discovery interface that surfaces results from -both Primo Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the current [Bento UI](https://github.com/MITLibraries/bento). +both Ex Libris Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the current +[Bento UI](https://github.com/MITLibraries/bento). In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would -interleave CDI and TIMDEX records in the same results list. +interleave CDI and TIMDEX records in the same results list, providing affordances (likely tabs) +to display CDI or TIMDEX results separately. ## Options considered ### Harvest Primo CDI data We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX -API. This would allow us to normalize CDI records as we do with other TIMDEX sources. Querying a -single API for Alma and CDI records would facilitate a single-stream view as desired in the unified -UI. Interleaving would no longer be necessary, as all records would be stored in OpenSearch. - -The harvester model would value beyond the scope of the TIMDEX UI redesign. 
By storing CDI records -in TIMDEX API, we could facilitate computational access to a massive corpus of data. - -Unfortunately, this approach is not feasible for many reasons: +API. This approach is not feasible for many reasons: - **Cost**: CDI contains over 5 billion records. Harvesting and storing these records would be impractical and expensive, both in terms of financial and compute resources. - **Performance**: Expanding TIMDEX API at such a scale is likely to dramatically reduce the efficiency of our OpenSearch index. @@ -38,7 +33,8 @@ This option would essentially be a different take on the Bento design. On the re could tab between Alma results (labeled 'Books', 'MIT Catalog', etc.) and CDI results ('Articles'). While arguably an improvement on Bento, this design does not deliver the combined Alma/CDI results -view as envisioned in the unified UI. +view as envisioned in the unified UI. A superior design would include an 'Everything' tab as the +default, with TIMDEX and CDI tabs for users that want to refine further. ### Implement external search orchestrator @@ -54,10 +50,13 @@ records into TIMDEX API. ## Decision -This approach aligns with the unified search strategy's goal to display all known results from -CDI and TIMDEX in the same interface. It also enables us to add the desired intelligent user -guidance, because we can render search interventions from TACOS and other external systems as -needed. +We will implement an external search orchestrator that interleaves results from CDI and TIMDEX. +This combined results list will become the default display in TIMDEX UI. The UI will also provide +the option to display results from a single source. + +This approach aligns with the unified search strategy's goal to display all known results from CDI +and TIMDEX in the same interface. It also enables us to add the desired intelligent user guidance, +because we can render search interventions from TACOS and other external systems as needed. 
### Proposed architecture @@ -127,7 +126,7 @@ tack on the normalization component as a Python microservice. ### Cons -- Requires runtime integration with Primo Search API, which may introduce latency or complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.) +- Requires runtime integration with Primo Search API, which will introduce latency and complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.) - Limits computational access to CDI records (no bulk access via TIMDEX). While not a TIMDEX UI concern, this is worthy of consideration in the broader context of the TIMDEX ecosystem. - Mixed-source results may confuse end users. @@ -140,7 +139,8 @@ separate tabs for TIMDEX and Primo records. Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we should not assume this to be a long-term solution. -We should connect with the MIT research community to determine whether computational access to CDI -would be useful. If there is a need, we could consider harvesting a subset of CDI records relevant -to the use case. This would be a significant undertaking beyond the scope of the unified search -interface, but it aligns with the Libraries' mission, vision, and goals. \ No newline at end of file +We should connect with the MIT research community to determine their needs regarding computational +access to library data. While we cannot harvest CDI data for the aforementioned reasons, there may +be an alternative to CDI that could better support our users. Conducting this research would be a +significant undertaking beyond the scope of the unified search interface, but it aligns with the +Libraries' mission, vision, and goals. 
\ No newline at end of file From 2be361efe526c4a501aeeb532ae3e70744b5ea13 Mon Sep 17 00:00:00 2001 From: jazairi <16103405+jazairi@users.noreply.github.com> Date: Mon, 11 Aug 2025 09:55:07 -0400 Subject: [PATCH 4/4] Add note about requirement for orchestrator adr --- .../0003-surface-primo-cdi-records-in-results.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md index c66f28bb..f4fdc7e7 100644 --- a/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md +++ b/docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md @@ -58,6 +58,9 @@ This approach aligns with the unified search strategy's goal to display all know and TIMDEX in the same interface. It also enables us to add the desired intelligent user guidance, because we can render search interventions from TACOS and other external systems as needed. +An overview of the proposed architecture is below, but an additional ADR will be needed to explore +the implementation details. + ### Proposed architecture ```mermaid