Service Catalog Matching: Architectural Approaches
Given a catalog of service descriptions — each consisting of natural language prose plus a formal API type descriptor (e.g., a TypeScript interface) — implement a matching service that accepts a short natural language query describing a problem, need, or set of requirements, and returns candidate services whose APIs might satisfy those desiderata.
The catalog may be large enough to overflow an LLM context window, ruling out a naive "feed everything to a prompt" approach.
The fundamental pattern is:
- Fast first-stage retrieval — get a candidate set (top ~20–50) from a large index cheaply
- Expensive second-stage reranking — use an LLM to deeply score candidates against the query
This lets the LLM context window focus on a manageable number of candidates rather than the whole catalog.
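The two-stage shape can be sketched as a tiny pipeline skeleton. The `Candidate` type and the `match`, `retrieve`, and `rerank` names here are illustrative assumptions, not from any particular library:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    service_id: str
    score: float

def match(query: str,
          retrieve: Callable[[str, int], list[Candidate]],
          rerank: Callable[[str, list[Candidate]], list[Candidate]],
          k: int = 30) -> list[Candidate]:
    """Two-stage matching: cheap broad retrieval, then expensive narrow reranking."""
    candidates = retrieve(query, k)   # stage 1: fast, index-backed, top ~20-50
    return rerank(query, candidates)  # stage 2: LLM scores only those candidates
```

The value of the split is that `retrieve` can be swapped (dense, sparse, hybrid) without touching the reranker, and vice versa.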
The most important design decision is what you embed. Raw embedding of the original text is suboptimal because:
- NL descriptions and TypeScript interfaces live in different semantic spaces
- Embedding models vary widely in how well they handle code vs. prose
Better approach: generate multiple index representations per entry at ingest time.
For each catalog entry, prompt an LLM: "Given this service description and its API type, what kinds of problems, requirements, or queries should this service match?" Embed the answer. This normalizes descriptions into a retrieval-optimized form — essentially teaching the index to answer the question "what problem does this solve?" rather than "what is this thing?"
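One possible shape for that ingest-time prompt; the exact wording and the `build_ingest_prompt` helper are assumptions to be tuned against your catalog, not a fixed recipe:

```python
INGEST_PROMPT = """\
Given this service description and its API type, list the kinds of
problems, requirements, or queries this service should match.

Description:
{description}

API type:
{api_type}
"""

def build_ingest_prompt(description: str, api_type: str) -> str:
    # The LLM's *answer* to this prompt is what gets embedded, not the raw entry.
    return INGEST_PROMPT.format(description=description, api_type=api_type)
```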
Generate distinct embeddings for:
- The NL description (use a general text embedding model)
- An LLM-generated prose rendering of the TypeScript interface (e.g., "this function takes a user ID and returns a promise of their account balance")
- The query-facing LLM summary generated at ingest time (described above)
Store all of these in your vector DB. At query time, search all facets and fuse results.
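A toy sketch of per-entry facet storage and query-time search across all facets, keeping each service's best facet score. The hand-written 2-d vectors and the `billing-api` entry stand in for real embedding-model output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# One entry, three facet vectors (stand-ins for real embeddings).
index = {
    "billing-api": {
        "description": [1.0, 0.0],
        "interface_prose": [0.0, 1.0],
        "summary": [0.7, 0.7],
    },
}

def search_all_facets(query_vec: list[float], top_k: int = 10):
    # Score every facet of every entry; a service is as good as its best facet.
    best: dict[str, float] = {}
    for service, facets in index.items():
        for vec in facets.values():
            best[service] = max(best.get(service, 0.0), cosine(query_vec, vec))
    return sorted(best.items(), key=lambda kv: -kv[1])[:top_k]
```

Taking the max over facets means a query only has to land near one representation of an entry to recall it.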
Also maintain a keyword index alongside the vector index. Queries containing specific type names, method names, or technical terms will recall better via BM25 than dense retrieval. Hybrid search (dense + sparse, fused via Reciprocal Rank Fusion or similar) consistently outperforms either alone.
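RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranking it appears in, with k = 60 as the conventional constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes rank positions, the dense and sparse scores never need to be calibrated against each other.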
Before embedding the query, use an LLM to expand it: generate 3–5 alternative phrasings, infer implied requirements, and extract key concepts. Embed each variant and union the top-K results. This trades latency for recall and is especially valuable for short or underspecified queries.
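The expand-then-union step might look like the sketch below, where `expand` stands in for the LLM call that generates variants:

```python
from typing import Callable

def expand_and_retrieve(query: str,
                        expand: Callable[[str], list[str]],
                        retrieve: Callable[[str, int], list[str]],
                        k: int = 20) -> list[str]:
    """Retrieve with the original query plus LLM-generated variants, then union."""
    variants = [query] + expand(query)   # expand() wraps an LLM call
    seen: set[str] = set()
    results: list[str] = []
    for v in variants:
        for doc in retrieve(v, k):       # each variant queried independently
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results
```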
If the query contains or implies type constraints ("I need something that converts X to Y", "I need a function that takes an array of..."), parse these out and use them as a separate retrieval signal. Structural type matching against indexed interface schemas can complement semantic matching.
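A deliberately minimal example of pulling a "converts X to Y" constraint out of a query. The regex is purely illustrative; a production version would need more patterns and more robust parsing:

```python
import re

CONVERT_RE = re.compile(r"converts?\s+(\w+)\s+to\s+(\w+)", re.IGNORECASE)

def extract_type_constraint(query: str):
    """Pull an implied input -> output type pair out of a query, if present."""
    m = CONVERT_RE.search(query)
    return (m.group(1), m.group(2)) if m else None
```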
Take the top ~20–50 candidates from retrieval and feed them to an LLM with the original query. Ask it to:
- Score each candidate on relevance (0–10)
- Explain why — this becomes useful output to the user
- Identify partial matches (e.g., "this API does most of what you need but lacks X")
The LLM context at this stage is manageable: 20–50 short entries fit comfortably within a standard context window.
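Assuming the reranking LLM is asked to reply with a JSON array of `{id, score, reason}` objects (a format choice made here for illustration, not a given), handling the response might look like:

```python
import json

def parse_rerank_response(raw: str, min_score: int = 5):
    """Parse the LLM's JSON scores and keep candidates at or above a cutoff."""
    # expected shape: [{"id": ..., "score": 0-10, "reason": ...}, ...]
    items = json.loads(raw)
    kept = [it for it in items if it["score"] >= min_score]
    return sorted(kept, key=lambda it: -it["score"])
```

The `reason` field flows straight through to the user, so the reranker's explanations double as result annotations.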
The TypeScript interface is an underutilized signal. Techniques to exploit it:
- Generate NL documentation from the interface at index time. LLMs are good at this, and it makes type signatures searchable via NL queries.
- Extract structural metadata — arity, parameter types, return type, method names — and use these as filterable facets alongside the vector index (e.g., "only return services with async methods").
- Type compatibility matching — if the query implies a specific signature shape, attempt structural or nominal subtype matching against indexed interfaces as a pre-filter.
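A rough sketch of the structural-metadata extraction. A real implementation would use the TypeScript compiler API rather than regexes, but this shows the facets being captured:

```python
import re

# Matches "name(params): ReturnType" inside an interface body (simplified).
METHOD_RE = re.compile(r"(\w+)\s*\(([^)]*)\)\s*:\s*([\w<>\[\], ]+)")

def extract_metadata(ts_interface: str) -> list[dict]:
    """Pull method names, arity, and return types out of a TypeScript interface."""
    methods = []
    for name, params, ret in METHOD_RE.findall(ts_interface):
        arity = 0 if not params.strip() else params.count(",") + 1
        methods.append({
            "name": name,
            "arity": arity,
            "returns": ret.strip(),
            "is_async": ret.strip().startswith("Promise<"),
        })
    return methods
```

Each field becomes a filterable facet: `is_async` supports queries like "only services with async methods", and `name` feeds the keyword (BM25) index.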
| Component | Options |
|---|---|
| Vector DB | pgvector (if you already have Postgres), Qdrant, Weaviate |
| Embedding model | OpenAI text-embedding-3-large, Cohere embed-v3, or a local model via Ollama |
| Hybrid fusion | Reciprocal Rank Fusion (RRF) is simple and effective |
| Reranker | Cohere Rerank API, or a direct LLM call |
- Don't embed raw content — embed LLM-synthesized query-facing summaries. This is the single highest-leverage optimization.
- Use hybrid (dense + sparse) retrieval. Neither alone is as good as the combination.
- Separate facets for NL and type structure — the two halves of each catalog entry benefit from different embedding strategies.
- The LLM belongs in the reranking stage, not the retrieval stage — that's where its reasoning ability earns its cost.
- Query expansion before retrieval significantly improves recall for short, underspecified queries.
The two-stage pipeline (cheap broad retrieval → expensive narrow reranking) scales to catalogs of any size, since approximate nearest-neighbor retrieval is sublinear in catalog size (roughly O(log n) for HNSW-style indexes).