Service Catalog Matching: Architectural Approaches

Chip Morningstar edited this page Mar 31, 2026 · 1 revision

Problem Statement

Given a catalog of service descriptions — each consisting of natural language prose plus a formal API type descriptor (e.g., a TypeScript interface) — implement a matching service that accepts a short natural language query describing a problem, need, or set of requirements, and returns candidate services whose APIs might satisfy those desiderata.

The catalog may be large enough to overflow an LLM context window, ruling out a naive "feed everything to a prompt" approach.


Core Architecture: Two-Stage Retrieve-then-Rerank

The fundamental pattern is:

  1. Fast first-stage retrieval — get a candidate set (top ~20–50) from a large index cheaply
  2. Expensive second-stage reranking — use an LLM to deeply score candidates against the query

This lets the LLM context window focus on a manageable number of candidates rather than the whole catalog.
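A minimal sketch of the two stages. The names here are illustrative: a trivial term-overlap scorer stands in for real index retrieval, and the rerank step is a stub where a production system would make an LLM call.

```python
def retrieve(query: str, catalog: dict, top_k: int = 50) -> list:
    """Stage 1: cheap broad retrieval over the whole catalog.
    A term-overlap count stands in for a dense/sparse index lookup."""
    terms = set(query.lower().split())
    scored = sorted(
        ((len(terms & set(text.lower().split())), sid)
         for sid, text in catalog.items()),
        reverse=True,
    )
    return [sid for score, sid in scored[:top_k] if score > 0]

def rerank(query: str, candidates: list, catalog: dict) -> list:
    """Stage 2: expensive narrow scoring. In production this would be an
    LLM call scoring each candidate 0-10; here it just keeps the order."""
    return candidates

def match_services(query: str, catalog: dict) -> list:
    return rerank(query, retrieve(query, catalog), catalog)
```

Only `retrieve` ever sees the full catalog; `rerank` sees at most `top_k` entries, which is what keeps the LLM context bounded.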


Indexing Pipeline (Offline)

The most important design decision is what you embed. Raw embedding of the original text is suboptimal because:

  • NL descriptions and TypeScript interfaces live in different semantic spaces
  • Embedding models vary widely in how well they handle code vs. prose

Better approach: generate multiple index representations per entry at ingest time.

LLM-Generated Query-Facing Summaries

For each catalog entry, prompt an LLM: "Given this service description and its API type, what kinds of problems, requirements, or queries should this service match?" Embed the answer. This normalizes descriptions into a retrieval-optimized form — essentially teaching the index to answer the question "what problem does this solve?" rather than "what is this thing?"
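One way this could be wired up at ingest time. `llm` and `embed` are hypothetical hooks for your model clients, not real APIs; the prompt wording follows the question quoted above.

```python
SUMMARY_PROMPT = (
    "Given this service description and its API type, what kinds of "
    "problems, requirements, or queries should this service match?\n\n"
    "Description:\n{description}\n\nAPI type:\n{api_type}\n"
)

def build_summary_prompt(description: str, api_type: str) -> str:
    return SUMMARY_PROMPT.format(description=description, api_type=api_type)

def index_entry(entry: dict, llm, embed) -> dict:
    # llm() returns the query-facing summary text; embed() turns it
    # into the vector that actually goes into the index.
    summary = llm(build_summary_prompt(entry["description"], entry["api_type"]))
    return {"id": entry["id"], "summary": summary, "vector": embed(summary)}
```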

Separate Embeddings per Facet

Generate distinct embeddings for:

  • The NL description (use a general text embedding model)
  • An LLM-generated prose rendering of the TypeScript interface (e.g., "this function takes a user ID and returns a promise of their account balance")
  • The combined LLM summary above

Store all of these in your vector DB. At query time, search all facets and fuse results.
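A sketch of the query-time side against a toy in-memory store, using cosine similarity over plain lists; a real system would push each facet search down into the vector DB. Facet names are illustrative.

```python
import math

FACETS = ("description", "type_prose", "summary")

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_all_facets(query_vec, entries: dict, top_k: int = 10) -> dict:
    """entries: {service_id: {facet_name: vector}}.
    Returns one ranked ID list per facet, ready for downstream fusion."""
    results = {}
    for facet in FACETS:
        ranked = sorted(entries,
                        key=lambda sid: cosine(query_vec, entries[sid][facet]),
                        reverse=True)
        results[facet] = ranked[:top_k]
    return results
```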

BM25 / Full-Text Index

Also maintain a keyword index alongside the vector index. Queries containing specific type names, method names, or technical terms will recall better via BM25 than dense retrieval. Hybrid search (dense + sparse, fused via Reciprocal Rank Fusion or similar) consistently outperforms either alone.
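Reciprocal Rank Fusion itself is only a few lines; k=60 is the constant commonly used in the literature.

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Fuse ranked ID lists (best first): each document accumulates
    1 / (k + rank) from every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The dense and BM25 result lists go in as-is; RRF needs no score normalization across the two systems, which is its main practical appeal.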


Query Processing (Online)

Query Expansion

Before embedding the query, use an LLM to expand it: generate 3–5 alternative phrasings, infer implied requirements, and extract key concepts. Embed each variant and union the top-K results. This trades latency for recall and is especially valuable for short or underspecified queries.
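The expand-then-union step might look like this, with `expand()` standing in for the LLM call that produces the alternative phrasings:

```python
def retrieve_expanded(query: str, expand, retrieve, top_k: int = 20) -> list:
    """Run retrieval once per query variant and union the results,
    preserving first-seen order across variants."""
    seen, merged = set(), []
    for variant in [query] + expand(query):
        for doc_id in retrieve(variant, top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

The latency cost is one LLM call plus N retrievals instead of one, which is why this pays off mainly when the base query is short or underspecified.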

TypeScript-Aware Query Analysis

If the query contains or implies type constraints ("I need something that converts X to Y", "I need a function that takes an array of..."), parse these out and use them as a separate retrieval signal. Structural type matching against indexed interface schemas can complement semantic matching.
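A deliberately naive sketch of pulling a "converts X to Y" constraint out of the query text; a real analyzer would cover many more phrasings (and probably use an LLM rather than regexes).

```python
import re

def extract_conversion_constraint(query: str):
    """Match phrases like 'converts X to Y' / 'convert X into Y' and
    return (input_type, output_type), or None if no such shape is implied."""
    m = re.search(r"converts?\s+(\w+)\s+(?:in)?to\s+(\w+)", query, re.IGNORECASE)
    return (m.group(1), m.group(2)) if m else None
```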


Reranking Stage

Take the top ~20–50 candidates from retrieval and feed them to an LLM with the original query. Ask it to:

  1. Score each candidate on relevance (0–10)
  2. Explain why — this becomes useful output to the user
  3. Identify partial matches (e.g., "this API does most of what you need but lacks X")

The LLM context at this stage is manageable: 20–50 short entries fit comfortably within a standard context window.
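Asking the reranking LLM for structured JSON makes the scores, explanations, and gap notes easy to consume downstream. A sketch, assuming the model is instructed to reply with a JSON array (prompt wording and field names are illustrative):

```python
import json

def build_rerank_prompt(query: str, candidates: list) -> str:
    """candidates: list of {"id": ..., "text": ...} catalog entries."""
    listing = "\n\n".join(f"[{c['id']}]\n{c['text']}" for c in candidates)
    return (
        f"Query: {query}\n\nCandidates:\n{listing}\n\n"
        'For each candidate, reply with JSON: [{"id": ..., "score": 0-10, '
        '"reason": ..., "gaps": ...}]'
    )

def parse_rerank_reply(reply: str, min_score: int = 5) -> list:
    """Sort the LLM's judgments by score and drop weak matches."""
    judged = json.loads(reply)
    judged.sort(key=lambda j: j["score"], reverse=True)
    return [j for j in judged if j["score"] >= min_score]
```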


Type-Specific Considerations

The TypeScript interface is an underutilized signal. Techniques to exploit it:

  • Generate NL documentation from the interface at index time. LLMs are good at this, and it makes type signatures searchable via NL queries.
  • Extract structural metadata — arity, parameter types, return type, method names — and use these as filterable facets alongside the vector index (e.g., "only return services with async methods").
  • Type compatibility matching — if the query implies a specific signature shape, attempt structural or nominal subtype matching against indexed interfaces as a pre-filter.
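Extracting structural facets can start as simply as a regex pass over the interface source; a sketch under that assumption (production code would use the TypeScript compiler API for real parsing, which this does not attempt).

```python
import re

# Naive pattern for members of the form "name(params): ReturnType"
METHOD_RE = re.compile(r"(\w+)\s*\(([^)]*)\)\s*:\s*([\w<>\[\], .]+)")

def extract_method_facets(ts_interface: str) -> list:
    """Pull (name, arity, return type, async-ness) per method, where
    'async' here means the declared return type is a Promise."""
    facets = []
    for name, params, ret in METHOD_RE.findall(ts_interface):
        arity = len([p for p in params.split(",") if p.strip()])
        facets.append({
            "name": name,
            "arity": arity,
            "return_type": ret.strip(),
            "is_async": ret.strip().startswith("Promise<"),
        })
    return facets
```

These facets then become filterable metadata columns next to the vectors, enabling queries like the "only services with async methods" example above.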

Technology Choices

Component          Options
Vector DB          pgvector (if you already have Postgres), Qdrant, Weaviate
Embedding model    OpenAI text-embedding-3-large, Cohere embed-v3, or a local model via Ollama
Hybrid fusion      Reciprocal Rank Fusion (RRF) is simple and effective
Reranker           Cohere Rerank API, or a direct LLM call

Key Insights

  1. Don't embed raw content — embed LLM-synthesized query-facing summaries. This is the single highest-leverage optimization.
  2. Use hybrid (dense + sparse) retrieval. Neither alone is as good as the combination.
  3. Separate facets for NL and type structure — the two halves of each catalog entry benefit from different embedding strategies.
  4. The LLM belongs in the reranking stage, not the retrieval stage — that's where its reasoning ability earns its cost.
  5. Query expansion before retrieval significantly improves recall for short, underspecified queries.

The two-stage pipeline (cheap broad retrieval → expensive narrow reranking) scales to catalogs of any size, since approximate nearest-neighbor retrieval is sublinear in catalog size (roughly logarithmic for HNSW-style indexes).