|
| 1 | +# `spath` command on the analytics-engine route — current status |
| 2 | + |
| 3 | +Snapshot of `CalcitePPLSpathCommandIT` against the analytics-engine path |
| 4 | +(`-Dtests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true` |
| 5 | +on `:integ-test:integTestRemote`), as of 2026-05-14. |
| 6 | + |
| 7 | +## Pass / fail summary |
| 8 | + |
| 9 | +| IT | Run | Pass | Fail | Notes | |
| 10 | +|---|---|---|---|---| |
| 11 | +| `CalcitePPLSpathCommandIT` | Baseline (upstream/main + coverage-bundle toggles) | 0 / 16 | 16 | 15 rejected at plan time on `JSON_EXTRACT_ALL` capability gap; 1 (`testSimpleSpath`) hit `EngineBackedIndexer.acquireReader` because PUT-created indices weren't parquet-backed | |
| 12 | +| `CalcitePPLSpathCommandIT` | After IT-infra refactor only | 1 / 16 | 15 | `testSimpleSpath` passes (uses `JSON_EXTRACT`, already in `STANDARD_PROJECT_OPS`); remaining 15 still on `JSON_EXTRACT_ALL` | |
| 13 | +| `CalcitePPLSpathCommandIT` | After Rust UDF + Java wiring | 10 / 16 | 6 | Auto-extract variants pass; 5 ITEM tests blocked on `map_extract` substrait + CHAR(N) key coercion; 1 WHERE test on `Unrecognized field type [MAP]` | |
| 14 | +| `CalcitePPLSpathCommandIT` | After ITEM/map_extract/CHAR→VARCHAR coercion + FieldType.MAP + filter capability | **16 / 16** | 0 | Full parity. v2/Calcite default path also 16/16 — no regression. | |
| 15 | + |
| 16 | +Reproduce: |
| 17 | + |
| 18 | +```bash |
| 19 | +./gradlew :integ-test:integTestRemote \ |
| 20 | + -Dtests.rest.cluster=localhost:9200 \ |
| 21 | + -Dtests.cluster=localhost:9300 \ |
| 22 | + -Dtests.clustername=runTask \ |
| 23 | + -Dtests.analytics.force_routing=true \ |
| 24 | + -Dtests.analytics.parquet_indices=true \ |
| 25 | + --tests "org.opensearch.sql.calcite.remote.CalcitePPLSpathCommandIT" |
| 26 | +``` |
| 27 | + |
| 28 | +## Bucket |
| 29 | + |
| 30 | +**Mixed Bucket 3 / M (Rust UDF) and Bucket 1 / S0 (capability registration), plus |
| 31 | +a small SQL-plugin-side IT-infrastructure fix.** Closing `spath` parity touches |
| 32 | +both halves of the stack — Layer 2 (capability registry) and Layer 3 (DataFusion |
| 33 | +runtime) — because `JSON_EXTRACT_ALL` is a PPL-internal UDF with no native |
| 34 | +DataFusion equivalent. The fix is shaped like a one-time investment that |
| 35 | +unblocks every future map-returning PPL function (the `LinkedHashMap<String, V>` |
| 36 | +pattern in `core/.../jsonUDF/`). |
| 37 | + |
| 38 | +## What landed |
| 39 | + |
| 40 | +The diff spans two repos: |
| 41 | + |
| 42 | +### `opensearch-project/OpenSearch` — analytics-engine side |
| 43 | + |
| 44 | +| File | Purpose | |
| 45 | +|---|---| |
| 46 | +| `sandbox/plugins/analytics-backend-datafusion/rust/src/udf/json_extract_all.rs` *(new, ~550 lines)* | Rust UDF returning `Map<Utf8, Utf8>`; mirrors `JsonExtractAllFunctionImpl` semantics (dot-path flatten, `{}` array marker, `[a, b, c]` merge format, `"null"` literal, malformed → empty map, top-level scalar → NULL). 16 unit tests. | |
| 47 | +| `sandbox/plugins/analytics-backend-datafusion/rust/src/udf/mod.rs` | Register `json_extract_all` in `register_all()` + module-info log message. | |
| 48 | +| `sandbox/plugins/analytics-backend-datafusion/src/main/resources/opensearch_scalar_functions.yaml` | Substrait extension entries for `json_extract_all(string) -> any1` and `map_extract(map<any1, any2>, any1) -> list<any2>`. | |
| 49 | +| `sandbox/libs/analytics-framework/.../ScalarFunction.java` | Add `JSON_EXTRACT_ALL(Category.SCALAR, SqlKind.OTHER_FUNCTION)` enum constant. | |
| 50 | +| `sandbox/libs/analytics-framework/.../FieldType.java` | Add `MAP("map")` enum constant and `case MAP -> FieldType.MAP` in `fromSqlTypeName`. | |
| 51 | +| `sandbox/plugins/analytics-backend-datafusion/.../JsonFunctionAdapters.java` | Add `JsonExtractAllAdapter` declaring `LOCAL_JSON_EXTRACT_ALL_OP` SqlFunction (return-type placeholder — `AbstractNameMappingAdapter#adapt` preserves the original PPL call's MAP RelDataType). | |
| 52 | +| `sandbox/plugins/analytics-backend-datafusion/.../ArrayElementAdapter.java` | Add `LOCAL_MAP_EXTRACT_OP` and a MAP-input branch in `adapt`: route `ITEM(map, key)` to `array_element(map_extract(map, CAST(key AS VARCHAR)), 1)`. The `array_element(…, 1)` wrap projects DataFusion's `List<value>` return down to a scalar; the CHAR→VARCHAR cast satisfies the substrait `any1` type-variable unification rule. | |
| 53 | +| `sandbox/plugins/analytics-backend-datafusion/.../DataFusionFragmentConvertor.java` | Register `json_extract_all` and `map_extract` in `ADDITIONAL_SCALAR_SIGS` so isthmus emits the right function reference. | |
| 54 | +| `sandbox/plugins/analytics-backend-datafusion/.../DataFusionAnalyticsBackendPlugin.java` | New `MAP_RETURNING_PROJECT_OPS` set (mirrors `ARRAY_RETURNING_PROJECT_OPS`) registered with `FieldType.MAP`. Register every operator in `STANDARD_FILTER_OPS` against `FieldType.MAP` so `where doc.user.name = 'John'` (which references the MAP column through ITEM) survives the filter-rule's field-index-keyed viability check. Adapter binding `JSON_EXTRACT_ALL → JsonExtractAllAdapter`. | |
| 55 | +| `sandbox/plugins/analytics-engine/.../ArrowValues.java` | Detect Arrow `MapVector` ahead of the ListVector branch and reassemble entries into a `LinkedHashMap<String, Object>` (Text→String normalization on both keys and values). Without this, the SQL plugin's response marshaller saw the raw Arrow Map layout (a `List<Map{keys, values}>` of entry structs) and rejected the values as `unsupported object class org.apache.arrow.vector.util.Text`. | |
| 56 | +| `gradle/run.gradle` | `arrow-flight-rpc` plugin block now also sets `opensearch.experimental.feature.transport.stream.enabled=true` so the analytics-engine + SQL-plugin co-install boots without the duplicate-handler Guice failure. | |
| 57 | + |
| 58 | +### `opensearch-project/sql` — SQL plugin side |
| 59 | + |
| 60 | +| File | Purpose | |
| 61 | +|---|---| |
| 62 | +| `core/.../data/model/ExprValueUtils.java` | (a) `tupleValue` widens its parameter type to `Map<?, Object>` so keys are normalized to strings regardless of the underlying Arrow `Text` type; (b) `fromObjectValue` gains a FQN-keyed branch for `org.apache.arrow.vector.util.Text` (string-decoded via `toString()`) so the response marshaller doesn't reject Arrow Map values. The FQN match keeps `core/` free of an Arrow dependency. | |
| 63 | +| `integ-test/.../calcite/remote/CalcitePPLSpathCommandIT.java` | Refactor `init()` to call `TestUtils.createIndexByRestClient` with an explicit `keyword` mapping for each of the four test indices (`test_spath`, `test_spath_auto`, `test_spath_cmd`, `test_spath_null`). Without explicit createIndex, the dynamic-mapping route bypassed `tests.analytics.parquet_indices=true`'s parquet-injection and `EngineBackedIndexer.acquireReader` failed at runtime. Idempotency via `TestUtils.isIndexExist` so the cluster-reuse pattern between `@Test` methods keeps working. | |
| 64 | + |
| 65 | +## Knock-on coverage |
| 66 | + |
| 67 | +The Rust UDF + capability registration work is reusable across the entire PPL |
| 68 | +map-returning function family. The same pattern (MAP_RETURNING_PROJECT_OPS, |
| 69 | +ITEM-on-Map dispatch via map_extract+array_element, MAP filter capability) now |
| 70 | +exists in the codebase, so a future PPL command that emits a `Map<VARCHAR, V>` |
| 71 | +column (e.g. `mapfrom` if ever lowered to a Calcite MAP RelDataType) is a |
| 72 | +one-line registration on top. |
| 73 | + |
| 74 | +The `ArrowValues.MapVector` branch and the `ExprValueUtils.Text` branch are |
| 75 | +also generic — they unlock any future UDF returning `Map<Utf8, Utf8>` through |
| 76 | +the analytics-engine route without further SQL-plugin work. |
| 77 | + |
| 78 | +## What was tricky |
| 79 | + |
| 80 | +- **Field-name alignment between Rust MapBuilder and Arrow Java's MapVector.** |
| 81 | + Arrow's Java `MapVector.KEY_NAME` / `VALUE_NAME` are `"key"` / `"value"` |
| 82 | + (singular). My initial Rust code used `"keys"` / `"values"` (plural), which |
| 83 | + serialized fine through the wire but turned `MapVector.getObject(i)` into a |
| 84 | + generic `JsonStringArrayList` of `Map<String, Object>` entries instead of a |
| 85 | + proper key-value map. Aligning the Rust `MapFieldNames` to singular fixed |
| 86 | + the round-trip. |
| 87 | +- **`map_extract` returns a list.** DataFusion's `map_extract(map, key)` returns |
| 88 | + `List<value>` because map semantics permit duplicate keys. PPL's |
| 89 | + `result.user.name` syntax assumes a scalar, so the ITEM adapter has to wrap |
| 90 | + the `map_extract` call in `array_element(…, 1)` to project the list down to |
| 91 | + a scalar. |
| 92 | +- **CHAR(N) literals don't unify with `any1` substrait type variables.** When |
| 93 | + the YAML signature uses `map<any1, any2>, any1` for the key-typed slot, a |
| 94 | + Calcite-emitted `'a.b'` CHAR(3) literal doesn't satisfy the type-variable |
| 95 | + binding. The adapter has to insert an explicit `CAST(key AS VARCHAR)` |
| 96 | + before the `map_extract` call. |
| 97 | +- **`MAP` had to be added to `FieldType` before `OpenSearchFilterRule` would |
| 98 | + even reach the EQUALS lookup.** Once added, `EQUALS on FieldType.MAP` still |
| 99 | + needed its own capability registration — the filter rule keys on the |
| 100 | + field's declared type, not the predicate's actual operand types, so without |
| 101 | + the explicit MAP entry the registry returned no viable backends. |
| 102 | +- **Two competing `:run` invocations stomp on the same data dir.** Repeated |
| 103 | + restart attempts during debugging accumulated multiple OpenSearch JVMs all |
| 104 | + pointing at `build/testclusters/runTask-0/data/...`, producing |
| 105 | + `Failed to initialize Parquet writer: No such file or directory` errors |
| 106 | + even though the data was being written. Killing all gradle wrappers + JVMs |
| 107 | + and starting one clean cluster cleared it. (Future task-runners should |
| 108 | + guard against this — gradle's testCluster bind to the same data dir |
| 109 | + regardless of which gradle wrapper started it.) |
| 110 | + |
| 111 | +## Reproduction artefacts |
| 112 | + |
| 113 | +Worktree: |
| 114 | +`/Users/ahkcs/IdeaProjects/sql/.claude/worktrees/spath-analytics-route` |
| 115 | +(branch `feature/spath-analytics-route`, based off |
| 116 | +`feature/ppl-coverage-bundle`). |
| 117 | + |
| 118 | +Run output (analytics-engine route): |
| 119 | +`integ-test/build/test-results/integTestRemote/TEST-org.opensearch.sql.calcite.remote.CalcitePPLSpathCommandIT.xml` |
| 120 | + |
| 121 | +Cluster log: |
| 122 | +`OpenSearch/build/testclusters/runTask-0/logs/runTask.log` |
| 123 | + |
| 124 | +## Files to read once before sending the PR |
| 125 | + |
| 126 | +The analytics-engine side is the bulk of the change. Open these in order to |
| 127 | +review the boundary contracts: |
| 128 | + |
| 129 | +- `OpenSearch/sandbox/plugins/analytics-backend-datafusion/rust/src/udf/json_extract_all.rs` |
| 130 | +- `OpenSearch/sandbox/plugins/analytics-backend-datafusion/src/main/resources/opensearch_scalar_functions.yaml` (the `map_extract` and `json_extract_all` entries) |
| 131 | +- `OpenSearch/sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/ArrayElementAdapter.java` (the MAP branch and CAST) |
| 132 | +- `OpenSearch/sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionAnalyticsBackendPlugin.java` (the `MAP_RETURNING_PROJECT_OPS` set and the MAP filter capability registration) |
| 133 | +- `OpenSearch/sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/ArrowValues.java` (MapVector flattening) |
| 134 | +- `sql/core/src/main/java/org/opensearch/sql/data/model/ExprValueUtils.java` (Arrow Text marshaller) |
| 135 | +- `sql/integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java` (parquet-backed index init) |
| 136 | + |
| 137 | +## PR shape |
| 138 | + |
| 139 | +For `opensearch-project/OpenSearch`, the natural commit-order is: |
| 140 | + |
| 141 | +1. Add `FieldType.MAP` + `ScalarFunction.JSON_EXTRACT_ALL` enum constants |
| 142 | + (framework SPI, no behavior change on its own). |
| 143 | +2. Add the `json_extract_all` Rust UDF (drop-in, isolated module). |
| 144 | +3. Register the Rust UDF in `udf/mod.rs` and the substrait extension YAML. |
| 145 | +4. Wire `JsonExtractAllAdapter` + `MAP_RETURNING_PROJECT_OPS` + |
| 146 | + `STANDARD_FILTER_OPS×FieldType.MAP` in the backend plugin. |
| 147 | +5. Add the `map_extract`/`array_element` ITEM-on-MAP dispatch in |
| 148 | + `ArrayElementAdapter` + YAML entry. |
| 149 | +6. `ArrowValues.MapVector` flattening. |
| 150 | +7. `gradle/run.gradle` — `transport.stream.enabled` flag. |
| 151 | + |
| 152 | +For `opensearch-project/sql`, a single PR: |
| 153 | + |
| 154 | +- `ExprValueUtils` — Arrow Text marshaller branch. |
| 155 | +- `CalcitePPLSpathCommandIT` — parquet-backed test indices. |
| 156 | + |
| 157 | +## Reference fixture |
| 158 | + |
| 159 | +The reference PR shape for a similar map-returning-function workstream |
| 160 | +mirrors this status doc. The fillnull worked example |
| 161 | +([`opensearch-project/OpenSearch#21472`](https://github.com/opensearch-project/OpenSearch/pull/21472)) |
| 162 | +is a strict Bucket-1 / S0 (one-line capability addition); spath is a |
| 163 | +mixed bucket that closes the missing infrastructure for any future |
| 164 | +map-returning PPL function at once. |
0 commit comments