Commit baa3553

Author: Isha Gupta
Merge remote-tracking branch 'origin/main' into issue-4587
2 parents 1a85890 + c0ac784 commit baa3553

39 files changed

Lines changed: 6784 additions & 11 deletions

core/src/main/java/org/opensearch/sql/planner/Planner.java

Lines changed: 30 additions & 1 deletion
@@ -14,6 +14,7 @@
 import org.opensearch.sql.planner.optimizer.LogicalPlanOptimizer;
 import org.opensearch.sql.planner.physical.PhysicalPlan;
 import org.opensearch.sql.storage.Table;
+import org.opensearch.sql.storage.read.TableScanBuilder;
 
 /** Planner that plans and chooses the optimal physical plan. */
 @RequiredArgsConstructor
@@ -34,7 +35,35 @@ public PhysicalPlan plan(LogicalPlan plan) {
     if (table == null) {
       return plan.accept(new DefaultImplementor<>(), null);
     }
-    return table.implement(table.optimize(optimize(plan)));
+    LogicalPlan optimized = table.optimize(optimize(plan));
+    // Give scan builders a chance to reject shapes that push-down alone cannot express safely
+    // (e.g. operators that land above the scan but outside its push-down contract).
+    validateScanBuilders(optimized);
+    return table.implement(optimized);
+  }
+
+  /**
+   * Walk the optimized plan and invoke {@link TableScanBuilder#validatePlan(LogicalPlan)} on every
+   * scan builder, passing the fully optimized root so scan builders can inspect their ancestors.
+   */
+  private void validateScanBuilders(LogicalPlan optimized) {
+    optimized.accept(
+        new LogicalPlanNodeVisitor<Void, Object>() {
+          @Override
+          public Void visitNode(LogicalPlan node, Object context) {
+            for (LogicalPlan child : node.getChild()) {
+              child.accept(this, context);
+            }
+            return null;
+          }
+
+          @Override
+          public Void visitTableScanBuilder(TableScanBuilder node, Object context) {
+            node.validatePlan(optimized);
+            return null;
+          }
+        },
+        null);
   }
 
   private Table findTable(LogicalPlan plan) {

core/src/main/java/org/opensearch/sql/storage/read/TableScanBuilder.java

Lines changed: 13 additions & 0 deletions
@@ -119,6 +119,19 @@ public boolean pushDownPageSize(LogicalPaginate paginate) {
     return false;
   }
 
+  /**
+   * Post-optimization validation hook. Called once by the planner after all push-down rules have
+   * run, with the fully optimized plan root. Subclasses may inspect the ancestors of this scan
+   * builder to reject planner shapes that push-down alone cannot express safely (for example,
+   * operators that land above the scan but outside its push-down contract and would be executed
+   * after the scan has already returned a bounded result set). Default is no-op.
+   *
+   * @param root the fully optimized logical plan containing this scan builder
+   */
+  public void validatePlan(LogicalPlan root) {
+    // no-op by default
+  }
+
   @Override
   public <R, C> R accept(LogicalPlanNodeVisitor<R, C> visitor, C context) {
     return visitor.visitTableScanBuilder(this, context);
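
To illustrate how a scan builder might use this hook, here is a minimal, self-contained sketch. It does not use the real `LogicalPlan`, `TableScanBuilder`, or visitor classes: `Node`, `Filter`, `Limit`, `Scan`, `BoundedScan`, and `PlanValidation` are hypothetical stand-ins, and the rule "only `Limit` may sit above the scan" is an assumed example of a push-down contract, not the contract any real storage engine in this commit enforces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a logical plan node (not the real LogicalPlan).
abstract class Node {
  final List<Node> children;

  Node(Node... children) {
    this.children = Arrays.asList(children);
  }
}

class Filter extends Node {
  Filter(Node child) {
    super(child);
  }
}

class Limit extends Node {
  Limit(Node child) {
    super(child);
  }
}

// Stand-in for a scan builder; validatePlan is a no-op by default.
class Scan extends Node {
  void validatePlan(Node root) {}
}

// Assumed contract for this sketch: only Limit may appear above the scan.
class BoundedScan extends Scan {
  @Override
  void validatePlan(Node root) {
    List<Node> ancestors = new ArrayList<>();
    if (!PlanValidation.findPath(root, this, ancestors)) {
      return; // this scan is not part of the given plan
    }
    for (Node ancestor : ancestors) {
      if (!(ancestor instanceof Limit)) {
        throw new IllegalStateException(
            ancestor.getClass().getSimpleName() + " is not allowed above a bounded scan");
      }
    }
  }
}

final class PlanValidation {
  private PlanValidation() {}

  // Collects the ancestors of target (root first) into path; returns true if target was found.
  static boolean findPath(Node current, Node target, List<Node> path) {
    if (current == target) {
      return true;
    }
    path.add(current);
    for (Node child : current.children) {
      if (findPath(child, target, path)) {
        return true;
      }
    }
    path.remove(path.size() - 1);
    return false;
  }

  // Mirrors the planner step: visit every node and hand each scan the plan root.
  static void validateScans(Node root) {
    validate(root, root);
  }

  private static void validate(Node node, Node root) {
    if (node instanceof Scan) {
      ((Scan) node).validatePlan(root);
    }
    for (Node child : node.children) {
      validate(child, root);
    }
  }
}
```

Passing the root rather than the parent is what lets the scan see its full ancestor chain; in this sketch `PlanValidation.validateScans(new Limit(new BoundedScan()))` returns normally, while wrapping the same scan in `Filter` throws.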

docs/user/dql/vector-search.rst

Lines changed: 331 additions & 0 deletions
@@ -0,0 +1,331 @@
+
+==============================
+Vector Search [Experimental]
+==============================
+
+.. rubric:: Table of contents
+
+.. contents::
+   :local:
+   :depth: 2
+
+Introduction
+============
+
+``vectorSearch()`` is an experimental feature. Syntax, options, and
+pushdown behavior may change in future releases based on feedback.
+
+The ``vectorSearch()`` table function runs a k-NN query against a ``knn_vector``
+field and exposes the matching documents as a relation in the ``FROM`` clause.
+It relies on the OpenSearch `k-NN plugin
+<https://docs.opensearch.org/latest/vector-search/>`_. The target index must
+map the vector field as ``knn_vector`` and the index must be created with
+``index.knn: true``.
+
+The SQL layer translates ``vectorSearch()`` into an OpenSearch search
+request whose body is native k-NN query DSL; the query vector is parsed
+into a numeric array before that DSL is emitted.
+
+Relevance is expressed through the OpenSearch ``_score`` metadata field, and
+results are returned ordered by ``_score DESC`` by default.
+
+vectorSearch
+============
+
+Description
+-----------
+
+``vectorSearch(table='<index>', field='<vector-field>', vector='<array>', option='<key=value[,key=value]*>')``
+
+All four arguments are required and must be passed by name as string
+literals. Positional arguments, or a mix of positional and named
+arguments, are not supported. For example, the following is invalid::
+
+    FROM vectorSearch('my_vectors', field='embedding',
+                      vector='[0.1,0.2]', option='k=5') AS v
+
+A table alias is required. Projected fields are referenced through the
+alias (``v._id``, ``v._score``, ``v.category``).
+
+If the ``opensearch-knn`` plugin is not installed on the target cluster,
+query execution fails with a ``vectorSearch() requires the k-NN plugin``
+error. ``_explain`` continues to work without the plugin.
+
+Arguments
+---------
+
+- ``table``: single concrete index or alias to search. Wildcards
+  (``*``), comma-separated multi-index targets, ``_all``, ``.``, and
+  ``..`` are not supported. The target index must have
+  ``index.knn: true`` and map the target field as ``knn_vector``. A
+  normal alias name is accepted. If the alias resolves to multiple
+  backing indices, the SQL layer does not prevalidate that every
+  backing index has a compatible ``knn_vector`` mapping, dimension, or
+  engine; OpenSearch execution remains the source of truth for those
+  checks.
+- ``field``: name of the ``knn_vector`` field.
+- ``vector``: query vector as a JSON-style array of numbers, passed as a
+  string (for example, ``'[0.1, 0.2, 0.3]'``). Components must be
+  comma-separated finite numbers. Semicolon, colon, and pipe separators
+  are not supported, and empty components (for example, ``'[1.0,,2.0]'``
+  or ``'[1.0,]'``) return an error. The vector dimension must match the
+  ``knn_vector`` mapping on the target index.
+- ``option``: comma-separated ``key=value`` pairs. Exactly one of ``k``,
+  ``max_distance``, or ``min_score`` is required. ``filter_type`` is
+  optional.
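
The validation rules for the ``vector`` argument can be sketched as follows. This is an illustrative parser only, written under the assumption that plain decimal notation (with an optional exponent) is what "finite numbers" means here; `VectorLiteral` and its `parse` method are hypothetical names, and the parser that ships with the feature may accept or reject slightly different inputs.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not the parser used by the SQL plugin.
final class VectorLiteral {
  // Assumed numeric syntax: optional sign, decimal digits, optional exponent.
  private static final Pattern NUMBER =
      Pattern.compile("[-+]?(\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?");

  private VectorLiteral() {}

  // Parses a string such as "[0.1, 0.2, 0.3]" into a float array.
  static float[] parse(String text) {
    String trimmed = text.trim();
    if (trimmed.length() < 2 || !trimmed.startsWith("[") || !trimmed.endsWith("]")) {
      throw new IllegalArgumentException("vector must be enclosed in [ ]");
    }
    String body = trimmed.substring(1, trimmed.length() - 1);
    // A limit of -1 keeps trailing empty strings, so "[1.0,]" yields an empty last component.
    String[] parts = body.split(",", -1);
    float[] values = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      String part = parts[i].trim();
      if (part.isEmpty()) {
        throw new IllegalArgumentException("empty vector component at index " + i);
      }
      // Rejects other separators ("1;2"), NaN, Infinity, and type suffixes such as "1.0f".
      if (!NUMBER.matcher(part).matches()) {
        throw new IllegalArgumentException("not a number: " + part);
      }
      float value = Float.parseFloat(part);
      // Values that overflow float (for example 1e40) parse to Infinity and are rejected.
      if (Float.isNaN(value) || Float.isInfinite(value)) {
        throw new IllegalArgumentException("vector component is not finite: " + part);
      }
      values[i] = value;
    }
    return values;
  }
}
```

The dimension check against the ``knn_vector`` mapping is deliberately left out, since it needs index metadata that a string parser does not have.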
+
+Supported option keys
+---------------------
+
+Option keys are lower-case and case-sensitive. ``K=5`` or
+``Filter_Type=post`` returns an "Unknown option key" error.
+
+- ``k``: top-k mode. Integer between 1 and 10000. The query returns up to
+  ``k`` nearest neighbors.
+- ``max_distance``: radial mode. Non-negative number. Matches documents
+  within the given distance of the query vector. ``LIMIT`` is required and
+  caps the returned rows.
+- ``min_score``: radial mode. Non-negative number. Matches documents with
+  score at or above the given threshold. ``LIMIT`` is required and caps
+  the returned rows.
+- ``filter_type``: ``post`` or ``efficient``. Controls how a ``WHERE``
+  clause is applied. See `Filtering`_.
+
+``k``, ``max_distance``, and ``min_score`` are mutually exclusive; specify
+exactly one.
+
+Native k-NN tuning options (for example, ``method_parameters.ef_search``,
+``method_parameters.nprobes``, ``rescore.oversample_factor``) are not
+supported through ``vectorSearch()`` and return an "Unknown option
+key" error.
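
The option rules above can be expressed as a small validator. The following is a sketch, not the plugin's implementation: `VectorSearchOptions` is a hypothetical class, and two behaviors are assumptions that the text does not state, namely that whitespace around keys and values is trimmed and that a repeated key is rejected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical validator for the option string; names and error texts are illustrative.
final class VectorSearchOptions {
  private VectorSearchOptions() {}

  static Map<String, String> parse(String option) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String pair : option.split(",", -1)) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Expected key=value but got: " + pair);
      }
      // Assumption: surrounding whitespace is ignored.
      String key = pair.substring(0, eq).trim();
      String value = pair.substring(eq + 1).trim();
      if (key.isEmpty() || value.isEmpty()) {
        throw new IllegalArgumentException("Expected key=value but got: " + pair);
      }
      validate(key, value);
      // Assumption: a key may appear only once.
      if (parsed.put(key, value) != null) {
        throw new IllegalArgumentException("Duplicate option key: " + key);
      }
    }
    int modes = 0;
    for (String mode : new String[] {"k", "max_distance", "min_score"}) {
      if (parsed.containsKey(mode)) {
        modes++;
      }
    }
    if (modes != 1) {
      throw new IllegalArgumentException("Exactly one of k, max_distance, min_score is required");
    }
    return parsed;
  }

  // Keys are matched exactly, so "K" or "Filter_Type" falls through to the default branch.
  private static void validate(String key, String value) {
    switch (key) {
      case "k":
        int k;
        try {
          k = Integer.parseInt(value);
        } catch (NumberFormatException e) {
          throw new IllegalArgumentException("k must be an integer: " + value, e);
        }
        if (k < 1 || k > 10000) {
          throw new IllegalArgumentException("k must be between 1 and 10000");
        }
        break;
      case "max_distance":
      case "min_score":
        double number;
        try {
          number = Double.parseDouble(value);
        } catch (NumberFormatException e) {
          throw new IllegalArgumentException(key + " must be a number: " + value, e);
        }
        // !(number >= 0) also rejects NaN.
        if (!(number >= 0) || Double.isInfinite(number)) {
          throw new IllegalArgumentException(key + " must be a non-negative number");
        }
        break;
      case "filter_type":
        if (!value.equals("post") && !value.equals("efficient")) {
          throw new IllegalArgumentException("filter_type must be post or efficient");
        }
        break;
      default:
        throw new IllegalArgumentException("Unknown option key: " + key);
    }
  }
}
```

Because unknown keys are rejected in one place, native tuning keys such as ``method_parameters.ef_search`` fail with the same "Unknown option key" message as a mis-cased ``K``.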
+
+Syntax
+------
+
+::
+
+    SELECT <projection>
+    FROM vectorSearch(
+      table='<index>',
+      field='<vector-field>',
+      vector='<array>',
+      option='<key=value[,key=value]*>'
+    ) AS <alias>
+    [WHERE <predicate on alias non-vector fields>]
+    [ORDER BY <alias>._score DESC]
+    [LIMIT <n>]
+
+Example 1: Top-k
+----------------
+
+Return the five nearest neighbors of a query vector::
+
+    POST /_plugins/_sql
+    {
+      "query" : """
+        SELECT v._id, v._score
+        FROM vectorSearch(
+          table='my_vectors',
+          field='embedding',
+          vector='[0.1, 0.2, 0.3]',
+          option='k=5'
+        ) AS v
+      """
+    }
+
+In top-k mode, the request size defaults to ``k``; adding ``LIMIT n`` further
+reduces the row count, but ``n`` must not exceed ``k``.
+
+Example 2: Radial search (``max_distance``)
+-------------------------------------------
+
+Return up to the specified ``LIMIT`` documents within a maximum distance
+of the query vector. ``LIMIT`` is required for radial searches; without
+it the result set would be unbounded::
+
+    POST /_plugins/_sql
+    {
+      "query" : """
+        SELECT v._id, v._score
+        FROM vectorSearch(
+          table='my_vectors',
+          field='embedding',
+          vector='[0.1, 0.2, 0.3]',
+          option='max_distance=0.5'
+        ) AS v
+        LIMIT 100
+      """
+    }
+
+Example 3: Radial search (``min_score``)
+----------------------------------------
+
+Return up to the specified ``LIMIT`` documents whose score is at or
+above the given threshold. ``LIMIT`` is required for radial searches;
+without it the result set would be unbounded::
+
+    POST /_plugins/_sql
+    {
+      "query" : """
+        SELECT v._id, v._score
+        FROM vectorSearch(
+          table='my_vectors',
+          field='embedding',
+          vector='[0.1, 0.2, 0.3]',
+          option='min_score=0.8'
+        ) AS v
+        LIMIT 100
+      """
+    }
+
+Filtering
+=========
+
+A ``WHERE`` clause on non-vector fields of the ``vectorSearch()`` alias is
+pushed down to OpenSearch when it can be translated to an OpenSearch filter.
+Two placement strategies are available via the ``filter_type`` option:
+
+- ``efficient`` (default): the ``WHERE`` predicate is embedded directly
+  inside the k-NN query (``knn.filter``), enabling native efficient
+  k-NN filtering during vector search. Efficient filtering depends on
+  native k-NN engine and method support; if the target index does not
+  support ``knn.filter`` for the configured engine and method, set
+  ``filter_type=post``. See the `k-NN filtering guide
+  <https://docs.opensearch.org/latest/vector-search/filter-search-knn/efficient-knn-filtering/>`_
+  for engine and method requirements.
+- ``post``: the k-NN query is placed in a scoring (``bool.must``)
+  context and the ``WHERE`` predicate is placed as a non-scoring
+  ``bool.filter`` outside the k-NN clause. This is Boolean filter
+  placement, not the REST ``post_filter`` parameter, and may return
+  fewer than ``k`` rows when the filter is selective.
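
The difference between the two placements is easiest to see in the shape of the generated query. The sketch below builds both shapes as JSON strings, following the general layout of the OpenSearch k-NN query DSL (``knn`` keyed by field name, with ``vector``, ``k``, and an optional ``filter``). The exact request body that the SQL layer emits is not shown in this commit, so treat the output as an approximation; `KnnQueryShapes` is a hypothetical helper, and `filterJson` is assumed to be an already translated filter clause.

```java
import java.util.Arrays;

// Hypothetical helper that renders the two filter placements as JSON text.
final class KnnQueryShapes {
  private KnnQueryShapes() {}

  // filter_type=efficient: the filter lives inside the k-NN clause.
  static String efficient(String field, float[] vector, int k, String filterJson) {
    return "{\"knn\":{\"" + field + "\":{"
        + "\"vector\":" + Arrays.toString(vector)
        + ",\"k\":" + k
        + ",\"filter\":" + filterJson
        + "}}}";
  }

  // filter_type=post: the k-NN clause scores under bool.must and the
  // filter sits beside it under the non-scoring bool.filter.
  static String post(String field, float[] vector, int k, String filterJson) {
    String knnClause = "{\"knn\":{\"" + field + "\":{"
        + "\"vector\":" + Arrays.toString(vector)
        + ",\"k\":" + k
        + "}}}";
    return "{\"bool\":{\"must\":[" + knnClause + "],\"filter\":[" + filterJson + "]}}";
  }
}
```

With the ``post`` shape the k-NN clause selects its ``k`` candidates without seeing the filter, which is why a selective predicate can leave fewer than ``k`` rows.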
+
+Full-text predicates (``match``, ``match_phrase``, ``multi_match``, and
+the rest of the full-text family) under a ``WHERE`` clause are used as
+filters, not as hybrid keyword-vector score fusion. Their placement
+follows ``filter_type``: the default (``efficient``) embeds supported
+full-text predicates under ``knn.filter``, while ``post`` places them
+in ``bool.filter`` outside the k-NN clause. In both cases they restrict
+which candidates are retained but their text relevance score does not
+combine with the vector ``_score``. ``vectorSearch()`` is not a hybrid
+vector + text relevance scorer.
+
+Behavior depends on whether ``filter_type`` is specified:
+
+- **Omitted (default, ``efficient``)**: the ``WHERE`` predicate is
+  embedded under ``knn.filter`` so the k-NN engine applies native
+  efficient filtering during vector search. A query with no ``WHERE``
+  clause is valid. ``efficient`` supports simple native filters:
+  ``term``, ``range``, ``wildcard``, ``exists``, full-text family
+  (``match``, ``match_phrase``, ``match_phrase_prefix``,
+  ``match_bool_prefix``, ``multi_match``, ``query_string``,
+  ``simple_query_string``), and boolean combinations of those filters.
+  Predicates that compile to script queries (arithmetic, function calls
+  on indexed fields, ``CASE``, date math), nested predicates, and other
+  query shapes are not supported under ``knn.filter`` and return an
+  error. Set ``filter_type=post`` to apply such predicates after the
+  k-NN search. If the predicate cannot be translated to an OpenSearch
+  filter query at all (a distinct translation failure from the
+  unsupported-shape cases above), the default path falls back to
+  evaluating the ``WHERE`` clause in memory after the k-NN results are
+  returned.
+- **Explicit ``efficient``**: same contract as the default. Specifying
+  it is useful when a query should be explicit about the placement
+  strategy and should fail if the predicate cannot be safely embedded
+  under ``knn.filter``.
+- **Explicit ``post``**: a ``WHERE`` clause is required and must be
+  translatable to an OpenSearch filter query. Predicates that translate
+  to native OpenSearch queries are pushed down as a ``bool.filter``
+  alongside the k-NN query. Predicates that do not have a native
+  equivalent (for example, arithmetic or function calls on indexed
+  fields) are pushed down as an OpenSearch script query and evaluated
+  server-side. If predicate translation itself fails, the query returns
+  an error; there is no silent in-memory fallback under explicit
+  ``post``. Use ``filter_type=post`` when the predicate shape is not
+  supported by efficient filtering.
+
+Example 4: Default efficient filtering (no ``filter_type``)
+-----------------------------------------------------------
+
+::
+
+    POST /_plugins/_sql
+    {
+      "query" : """
+        SELECT v._id, v._score, v.category
+        FROM vectorSearch(
+          table='my_vectors',
+          field='embedding',
+          vector='[0.1, 0.2, 0.3]',
+          option='k=10'
+        ) AS v
+        WHERE v.category = 'books'
+      """
+    }
+
+The predicate is embedded under ``knn.filter`` so the k-NN engine
+applies native efficient filtering during vector search.
+
+Example 5: Post-filtering for predicates not supported by efficient mode
+------------------------------------------------------------------------
+
+Use ``filter_type=post`` for predicates that do not fit the ``efficient``
+allow-list, such as arithmetic or function calls on indexed fields::
+
+    POST /_plugins/_sql
+    {
+      "query" : """
+        SELECT v._id, v._score, v.category
+        FROM vectorSearch(
+          table='my_vectors',
+          field='embedding',
+          vector='[0.1, 0.2, 0.3]',
+          option='k=10,filter_type=post'
+        ) AS v
+        WHERE v.price * 1.1 < 100
+      """
+    }
+
+Scoring, sorting, and limits
+============================
+
+- ``vectorSearch()`` exposes the OpenSearch ``_score`` metadata field on the
+  alias. For an alias ``v``, select it as ``v._score``.
+- ``_score`` can be selected and referenced in ``ORDER BY``, but it cannot
+  appear in ``WHERE``. Use ``option='min_score=...'`` for score-threshold
+  vector search.
+- Results are returned in ``_score DESC`` order by default. The only
+  supported ``ORDER BY`` expression is ``<alias>._score DESC`` (for
+  example, ``v._score DESC``).
+- In top-k mode (``k=N``), ``LIMIT n`` is optional; when present, ``n`` must
+  be ``≤ k``.
+- In radial mode (``max_distance`` or ``min_score``), ``LIMIT`` is required.
+- ``OFFSET`` is not supported on ``vectorSearch()``. Use ``LIMIT`` only.
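
The ``LIMIT`` and ``OFFSET`` rules in the list above reduce to a short decision procedure for the number of rows to request. The sketch below is illustrative only: `VectorSearchLimits`, `Mode`, and `resolveSize` are hypothetical names, and treating an offset of zero as "no ``OFFSET``" is an assumption.

```java
// Hypothetical helper that applies the LIMIT/OFFSET rules to pick a request size.
final class VectorSearchLimits {
  enum Mode {
    TOP_K,
    RADIAL
  }

  private VectorSearchLimits() {}

  // limit is null when the query has no LIMIT clause; k is only read in top-k mode.
  static int resolveSize(Mode mode, Integer k, Integer limit, int offset) {
    if (offset != 0) {
      throw new IllegalArgumentException("OFFSET is not supported on vectorSearch()");
    }
    if (mode == Mode.RADIAL) {
      if (limit == null) {
        throw new IllegalArgumentException("LIMIT is required for radial search");
      }
      return limit;
    }
    if (limit == null) {
      return k; // request size defaults to k
    }
    if (limit > k) {
      throw new IllegalArgumentException("LIMIT must not exceed k");
    }
    return limit;
  }
}
```

For example, ``k=10`` with ``LIMIT 3`` resolves to 3 rows, ``k=5`` with ``LIMIT 6`` is rejected, and a radial query without ``LIMIT`` is rejected before any request is built.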
+
+Limitations
+===========
+
+The following are not supported on ``vectorSearch()``:
+
+- ``GROUP BY`` and aggregations directly over a ``vectorSearch()``
+  relation are not supported and return an error.
+- Operators wrapped around a ``vectorSearch()`` subquery are rejected
+  when they would run after ``vectorSearch()`` has already produced a
+  finite result set, because they can silently yield zero, skipped, or
+  incorrectly ordered rows. Specifically, an outer ``WHERE``,
+  ``ORDER BY``, ``OFFSET`` (non-zero), ``GROUP BY``, aggregation, or
+  ``DISTINCT`` applied to a ``vectorSearch()`` subquery returns an
+  error. Place ``WHERE`` predicates inside the subquery, directly on
+  the ``vectorSearch()`` alias, so that they participate in ``WHERE``
+  pushdown. A plain outer ``LIMIT`` (without ``OFFSET``) wrapping a
+  ``vectorSearch()`` subquery is allowed and caps the returned rows.
+- ``JOIN`` between a ``vectorSearch()`` relation and another relation is
+  not supported.
+- ``UNION`` / ``INTERSECT`` / ``EXCEPT`` combining a ``vectorSearch()``
+  relation with another relation is not supported.
+- Multiple ``vectorSearch()`` calls in the same query are not supported.
+- The query vector must be supplied as a literal. Parameterized vectors
+  (for example, values bound from another column) are not supported.
+- Indexes that define a user field named ``_score`` cannot be queried
+  with ``vectorSearch()`` because ``_score`` is reserved for the
+  synthetic vector score exposed on the alias. Rename the field or query
+  the index with a plain ``SELECT``.
