refactor: store property values as typed RDF literals, drop fn/parse_literal by HexaField · Pull Request #842 · coasys/ad4m

HexaField · 2026-06-04T00:26:49Z

Summary

Stacked on #837. Completes the storage-layer optimisation: link targets for literal-typed property values move from URI-shaped <literal:string:foo> NamedNodes to typed RDF terms ("foo"^^xsd:string, "42"^^xsd:integer, "3.14"^^xsd:decimal, "true"^^xsd:boolean, "{…}"^^<ad4m://json>). The custom fn/parse_literal SPARQL function is gone — WHERE / projection / Ops filters all compile to native typed comparisons. SDK wire format is unchanged; the storage layer round-trips typed literals back to literal:string:X URLs at query_links / query() boundaries so anything that reads link targets keeps seeing the same strings.

Why bother, since #837 already gives us indexed POS-probe equality? Two things:

Ops comparisons get the index too. gt / lt / between / contains / not were the one branch that #837 left wrapped in fn/parse_literal (the BIND + STR wrapper had to materialise every row to a typed value before comparing). With typed literals in storage, those filters compile to native xsd comparisons (?val > "42"^^xsd:integer) and use the same index lookups equality does.
fn/parse_literal registration goes away. The custom function exists today only because the old IRI shape needed it. After this PR there is no callsite — string equality reads typed literals directly; numeric comparison uses xsd ordering; envelope unwrap is handled by the migration on first boot.

What changed

Storage layer (`sparql_store.rs`)

target_to_storage_term(target: &str) -> Term parses literal: prefixes and returns the appropriate typed RDF literal (xsd:string / :integer / :decimal / :boolean / ad4m://json). Non-literal targets stay as NamedNodes. Inverse storage_term_to_target_string(&Term) -> String rebuilds the URL form for the SDK wire.
make_direct_triple now returns (NamedNode, NamedNode, Term); insert_link_triples / remove_link / for_each_matched_link / link_from_solution / query() updated accordingly.
Reifier IRI hash continues to derive from the wire-format target string, so identities are stable across the storage flip — the migration rewrites in place without orphaning reifiers.
query() serialises typed literals back to wire form only when the bound SPARQL variable is ?target / ?t, preserving STR(?x) = "true" filter semantics for non-target variables and keeping COUNT() results integer-string-shaped.
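
The round trip described above can be sketched as follows. This is a minimal model under stated assumptions, not the code in `sparql_store.rs`: the `Term` enum is a hypothetical stand-in for Oxigraph's term type, `serialise_binding` is an invented name for the `?target` / `?t` rule in `query()`, percent-decoding of wire values is omitted, and the integer/decimal split is guessed from the datatypes listed in the summary.

```rust
/// Simplified stand-in for an RDF term (hypothetical; the real code uses Oxigraph's types).
#[derive(Debug, Clone, PartialEq)]
pub enum Term {
    NamedNode(String),
    Literal { value: String, datatype: &'static str },
}

pub const XSD_STRING: &str = "http://www.w3.org/2001/XMLSchema#string";
pub const XSD_INTEGER: &str = "http://www.w3.org/2001/XMLSchema#integer";
pub const XSD_DECIMAL: &str = "http://www.w3.org/2001/XMLSchema#decimal";
pub const XSD_BOOLEAN: &str = "http://www.w3.org/2001/XMLSchema#boolean";
pub const AD4M_JSON: &str = "ad4m://json";

fn lit(value: &str, datatype: &'static str) -> Term {
    Term::Literal { value: value.to_string(), datatype }
}

/// Wire string -> storage term. Anything that is not a recognised
/// `literal:` form stays a NamedNode.
pub fn target_to_storage_term(target: &str) -> Term {
    if let Some(v) = target.strip_prefix("literal:string:") {
        return lit(v, XSD_STRING);
    }
    if let Some(v) = target.strip_prefix("literal:json:") {
        return lit(v, AD4M_JSON);
    }
    if let Some(v) = target.strip_prefix("literal:boolean:") {
        if v == "true" || v == "false" {
            return lit(v, XSD_BOOLEAN);
        }
    }
    if let Some(v) = target.strip_prefix("literal:number:") {
        if v.parse::<i64>().is_ok() {
            return lit(v, XSD_INTEGER);
        }
        // Assumption: any other finite number is stored as xsd:decimal;
        // exponent notation is not treated specially in this sketch.
        if v.parse::<f64>().map_or(false, f64::is_finite) {
            return lit(v, XSD_DECIMAL);
        }
    }
    Term::NamedNode(target.to_string())
}

/// Storage term -> wire string (inverse of the function above).
pub fn storage_term_to_target_string(term: &Term) -> String {
    match term {
        Term::NamedNode(iri) => iri.clone(),
        Term::Literal { value, datatype } => {
            let kind = match *datatype {
                XSD_INTEGER | XSD_DECIMAL => "number",
                XSD_BOOLEAN => "boolean",
                AD4M_JSON => "json",
                _ => "string",
            };
            format!("literal:{kind}:{value}")
        }
    }
}

/// Only `?target` / `?t` bindings are rebuilt into wire form; every other
/// variable gets the plain lexical value, so COUNT() results stay "7".
pub fn serialise_binding(var: &str, term: &Term) -> String {
    match (var, term) {
        ("target" | "t", _) => storage_term_to_target_string(term),
        (_, Term::Literal { value, .. }) => value.clone(),
        (_, Term::NamedNode(iri)) => iri.clone(),
    }
}
```

In this model `literal:number:3.14` becomes a `xsd:decimal` literal and serialises back to the same wire string, while a non-literal target such as `ad4m://x` passes through untouched in both directions.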

Migrations

v3 (migrate_signed_envelopes_to_plain_literals, from #837) now lands directly on typed-literal storage instead of literal:string:-IRI form.
v4 (migrate_iri_literals_to_typed_literals, new) walks reifiers, finds any quad whose object is a Term::NamedNode matching literal:(string|number|boolean|json):.*, and rewrites the triple + reifier triple-term to use typed-literal storage. Idempotent.
Both wired into initialize_from_db; per-step migration log messages preserved.
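
To illustrate why the v4 step is idempotent, here is a hedged sketch over an in-memory list of objects. The `Object` enum, `legacy_to_typed`, and `migrate_v4` are hypothetical names, and the real migration walks reifiers in the store and also rewrites the reifier triple-term, which this model leaves out.

```rust
/// Hypothetical model of a quad's object position.
#[derive(Debug, Clone, PartialEq)]
pub enum Object {
    Iri(String),
    Typed { lexical: String, datatype: &'static str },
}

/// Returns the typed form for a legacy `literal:(string|number|boolean|json):…` IRI,
/// or None when the IRI is not a legacy literal.
fn legacy_to_typed(iri: &str) -> Option<Object> {
    let rest = iri.strip_prefix("literal:")?;
    let (kind, value) = rest.split_once(':')?;
    let datatype = match kind {
        "string" => "xsd:string",
        // Assumption: integers and other numbers are split the same way as on write.
        "number" if value.parse::<i64>().is_ok() => "xsd:integer",
        "number" => "xsd:decimal",
        "boolean" => "xsd:boolean",
        "json" => "ad4m://json",
        _ => return None,
    };
    Some(Object::Typed { lexical: value.to_string(), datatype })
}

/// Rewrites legacy IRIs in place and reports how many objects changed.
/// Already-typed objects never match, so a second run rewrites nothing.
pub fn migrate_v4(objects: &mut [Object]) -> usize {
    let mut rewritten = 0;
    for obj in objects.iter_mut() {
        let replacement = match &*obj {
            Object::Iri(iri) => legacy_to_typed(iri),
            Object::Typed { .. } => None,
        };
        if let Some(new_obj) = replacement {
            *obj = new_obj;
            rewritten += 1;
        }
    }
    rewritten
}
```

Running `migrate_v4` twice over the same slice rewrites the legacy entries on the first pass and reports zero on the second, which is the property the first-boot migration relies on.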

WHERE / projection / Ops

is_literal_prop equality emits ?source <pred> "X"^^xsd:string (or :integer / :decimal / :boolean); arrays use VALUES ?x { …typed-literal terms… }. The constructor-default raw-IRI UNION fallback is preserved for the few resolveLanguage='literal' properties that hold raw URIs.
Ops branch runs native typed comparisons:
- gt / gte / lt / lte / between use format_literal_number + xsd:integer or xsd:decimal typed values.
- contains uses STR(?val) (works on both typed literals and IRIs, defensive for the upgrade window).
- not / not_array use direct != / NOT IN against typed literals.
The BIND(STR(<ad4m://fn/parse_literal>(?_pw_X)) AS …) line is gone. No per-row function calls anywhere in WHERE.
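
A rough sketch of what the Ops emission might look like after this change. The `Op` enum, `ops_filter`, and the signature given to `format_literal_number` are assumptions made for illustration (only the helper's name appears in this PR), and treating `between` as inclusive is also a guess.

```rust
/// Hypothetical subset of the Ops conditions described above.
#[derive(Debug, Clone, Copy)]
pub enum Op<'a> {
    Gt(f64),
    Gte(f64),
    Lt(f64),
    Lte(f64),
    Between(f64, f64),
    Contains(&'a str),
    Not(&'a str),
}

/// Assumed shape of `format_literal_number`: whole numbers become xsd:integer,
/// other finite numbers xsd:decimal, and non-finite values yield None.
fn format_literal_number(n: f64) -> Option<String> {
    if !n.is_finite() {
        return None;
    }
    if n.fract() == 0.0 && n.abs() < 9.0e15 {
        Some(format!("\"{}\"^^xsd:integer", n as i64))
    } else {
        Some(format!("\"{}\"^^xsd:decimal", n))
    }
}

/// Escapes backslashes and double quotes for a SPARQL string literal.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Emits a native FILTER for one Ops condition on `?var`.
pub fn ops_filter(var: &str, op: &Op) -> String {
    let cmp = |sym: &str, n: f64| match format_literal_number(n) {
        Some(lit) => format!("FILTER(?{var} {sym} {lit})"),
        // Non-finite bounds can never match, so short-circuit.
        None => "FILTER(false)".to_string(),
    };
    match *op {
        Op::Gt(n) => cmp(">", n),
        Op::Gte(n) => cmp(">=", n),
        Op::Lt(n) => cmp("<", n),
        Op::Lte(n) => cmp("<=", n),
        // Assumption: both bounds are inclusive.
        Op::Between(lo, hi) => match (format_literal_number(lo), format_literal_number(hi)) {
            (Some(lo), Some(hi)) => format!("FILTER(?{var} >= {lo} && ?{var} <= {hi})"),
            _ => "FILTER(false)".to_string(),
        },
        // STR() works on typed literals and on residual NamedNode targets.
        Op::Contains(s) => format!(
            "FILTER(CONTAINS(LCASE(STR(?{var})), \"{}\"))",
            escape(&s.to_lowercase())
        ),
        Op::Not(s) => format!("FILTER(?{var} != \"{}\"^^xsd:string)", escape(s)),
    }
}
```

The point of the shape is that every branch produces a plain comparison against a typed constant, so nothing in the emitted WHERE clause calls a custom function per row.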

Cleanup

parse_literal_fn definition + registration removed. The function no longer needs to exist.
bench_indexed_iri_vs_fn_parse_literal_filter deleted — the pre-refactor form it benchmarked against doesn't exist anymore. The original perf delta is documented in #837's description.
MCP mention-waker subscription (mcp/tools/subscriptions.rs) now compares against STR(?target) directly.
Property-sort path drops the SUBSTR(STR(…), 16) literal-IRI slice — STR() returns the lexical form for typed literals directly.
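
For readers wondering where the 16 in the dropped `SUBSTR(STR(…), 16)` came from: SPARQL's `SUBSTR` is 1-based and the `literal:string:` prefix is 15 characters long, so position 16 is the first character of the value. A small sketch of the old and new sort keys (function names are illustrative, not from the codebase):

```rust
const LEGACY_PREFIX: &str = "literal:string:";

/// Old sort key: mimics SPARQL `SUBSTR(STR(?val), 16)` on the IRI form by
/// skipping the first 15 bytes, whatever they happen to be.
pub fn legacy_sort_key(iri: &str) -> &str {
    iri.get(LEGACY_PREFIX.len()..).unwrap_or("")
}

/// New sort key: the lexical form of a typed literal is already the value.
pub fn typed_sort_key(lexical: &str) -> &str {
    lexical
}
```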

Last-write-wins / scalar aggregation pushdown

Explored in #846. The straightforward nested-aggregate plan hits an Oxigraph 0.5.8 planner cliff — confirmed empirically at ~23,000× regression on test_perf_flux_message_parent_scope_paginated, exactly the failure mode the ac57680b9 commit message warned about. The most plausible unblock is storage-level partitioning from named graphs (#812) — once each subject instance has its own graph, the inner MAX(?ts) GROUP BY ?source ?predicate aggregate operates on a partitioned working set rather than scanning the whole reifier index. Re-attempt with GRAPH ?g scoping once #812 lands.

Correcting earlier framing: SPARQL itself does not introduce window functions in either 1.1 or the W3C 1.2 draft, so the wait is on Oxigraph planner / dataset semantics rather than a SPARQL spec release.

Wind tunnel — S8 (Subject Class Queries) vs `dev`

Three-branch run on Apple Silicon (48 GB / 14 CPU) with a fresh CUSTOM_DENO_SNAPSHOT.bin regenerated against each branch's own Deno deps. Mean of 5 runs per query, lower is faster; <1.0× is an improvement over dev.

========= TIER: small (~1.9k links) =========
query                       dev          #837         #842        837/dev  842/dev
totalItemCount             0.32 ms      0.32 ms      0.28 ms      1.00x    0.88x
allItems                   0.97 ms      0.96 ms      0.94 ms      0.99x    0.97x
unprocessedItems           0.37 ms      0.37 ms      0.35 ms      1.00x    0.95x
recentConversations        0.23 ms      0.22 ms      0.18 ms      0.96x    0.78x
pinnedConversations        0.11 ms      0.11 ms      0.08 ms      1.00x    0.73x
subgroupItemsData          0.21 ms      0.22 ms      0.19 ms      1.05x    0.90x
subgroupTopics             0.11 ms      0.11 ms      0.08 ms      1.00x    0.73x
messageHydration           0.12 ms      0.12 ms      0.09 ms      1.00x    0.75x
paginatedMessages          1.15 ms      1.15 ms      1.14 ms      1.00x    0.99x

========= TIER: medium (~58k links) =========
query                       dev          #837         #842        837/dev  842/dev
totalItemCount             3.09 ms      3.08 ms      3.06 ms      1.00x    0.99x
allItems                  23.94 ms     26.19 ms     23.82 ms      1.09x    0.99x
unprocessedItems           7.16 ms      7.27 ms      7.17 ms      1.02x    1.00x
recentConversations        0.53 ms      0.52 ms      0.52 ms      0.98x    0.98x
pinnedConversations        0.13 ms      0.12 ms      0.13 ms      0.92x    1.00x
subgroupItemsData          0.28 ms      0.26 ms      0.27 ms      0.93x    0.96x
subgroupTopics             0.14 ms      0.12 ms      0.13 ms      0.86x    0.93x
messageHydration           0.14 ms      0.14 ms      0.14 ms      1.00x    1.00x
paginatedMessages         39.39 ms     40.66 ms     39.37 ms      1.03x    1.00x

Headlines:

No regressions. Across 18 query/tier combinations, the largest slowdown is allItems at medium tier on #837 (1.09×, +2.25 ms on a 24 ms query), which sits within run-to-run noise; the same query on #842 lands at 0.99×.
#842 is consistently fastest at the small tier (0.73×–0.99× across the board), with double-digit wins on the property-light queries (pinnedConversations, subgroupTopics, messageHydration, recentConversations all 22–27% faster). These are exactly the queries where eliminating fn/parse_literal's per-row BIND+FILTER overhead matters most relative to total query cost.
At the medium tier, #842 trends to neutral: reifier-metadata join cost dominates and amortises away the per-row literal handling.
Write throughput is unchanged: ~15.5s to seed 58k links across all three branches.
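
The ratio columns and the "% faster" figures in the headlines are plain divisions of the table's mean timings. A minimal sketch, with hypothetical helper names:

```rust
/// Ratio of a branch's mean time to dev's mean time; below 1.0 means faster than dev.
pub fn ratio_vs_dev(branch_ms: f64, dev_ms: f64) -> f64 {
    branch_ms / dev_ms
}

/// Converts a ratio into "percent faster than dev".
pub fn percent_faster(ratio: f64) -> f64 {
    (1.0 - ratio) * 100.0
}
```

For pinnedConversations at the small tier, 0.08 ms against 0.11 ms gives a ratio that rounds to 0.73, which is the 27% figure quoted above.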

How to read this vs the 200×–500× microbench in #837: the microbench measured an isolated WHERE-filter on 10k literal-string targets where the filter cost was 99% of the query time. S8 measures full Flux community queries dominated by reifier metadata joins and result hydration — the WHERE filter is one operation of many, and its cost amortises into the rest. The microbench number characterises the per-operation speedup honestly; these S8 numbers characterise the user-visible workload impact honestly.

Test plan

cargo check --tests clean
cargo test --lib perspectives::sparql_store — 80 pass (incl. new typed-literal storage + v3/v4 migration assertions)
cargo test --lib perspectives::model_query — 146 pass
cargo test --lib perspectives:: — 323/324 pass; the single failing test (test_perspective_persistence_roundtrip) is a pre-existing AgentService init-order flake on the base branch, not a regression here (reproduces on refactor/literal-channel-v-separation directly)
Full CircleCI suite green
Manual smoke: spin executor against an existing dataset with pre-migration IRI-shaped literal targets, verify both migration logs fire on first boot + post-migration queries return the same wire-format strings

Adds TYPED_LITERAL_MIGRATION.md design notes alongside the storage module and a small forward-compatible read addition to parse_literal_fn: when the argument is already a typed RDF literal (non-empty datatype, not xsd:string, value not starting with literal:), pass it through unchanged. This lets queries see correctly-typed values once writes start producing them — without disturbing today's literal:URI path. Read-only change; existing 78 sparql_store tests pass.

coderabbitai · 2026-06-04T00:26:57Z

Important

Review skipped

Draft detected.


Move link-target storage from URI-shaped `<literal:string:X>` NamedNodes to typed RDF literals (`"X"^^xsd:string`, `"42"^^xsd:integer`, etc.) so the POS index can use Oxigraph's native xsd comparison and value lookup plans. Round-trips through the wire-format string the SDK expects.

- `target_to_storage_term` / `storage_term_to_target_string` translate between `literal:string|number|boolean|json:…` wire values and typed RDF terms, with JSON payloads carrying the `ad4m://json` datatype.
- `make_direct_triple` / `insert_link_triples` / `remove_link` / `for_each_matched_link` / `link_from_solution` all flow targets through the new term type, including in the reifier's quoted triple.
- `make_reifier_iri` still hashes the wire-format target so the reifier identity stays stable across the typed-literal migration.
- `query()` serialises typed literals bound to `?target` / `?t` back to the wire form so hydration's `parse_literal_value` keeps decoding type info. Other variables emit lexical form so SPARQL `STR()` and `COUNT` consumers continue to see raw values.
- Drop the `ad4m://fn/parse_literal` SPARQL custom function (and its Rust implementation): typed literals carry their value in the lexical form and their type in the datatype IRI, so the function is a no-op for new storage and the WHERE / Ops paths now run native xsd comparisons.
- The mention-waker subscription query also switches to `STR(?target)` directly, with no parse_literal needed.

Adds `add_link_with_raw_iri_target` for migration tests that need to seed pre-typed-literal data shapes.

…ive Ops filters

For properties tagged `resolveLanguage: literal` the WHERE builder and the projection where-pattern builder now emit `"X"^^xsd:string` / `"42"^^xsd:integer` / `"true"^^xsd:boolean` / VALUES sets of the same, matching the typed-literal storage form so Oxigraph can probe the POS index directly.

The `WhereCondition::Ops` branch drops the `fn/parse_literal` / `xsd:double` BIND chain and runs native SPARQL filters against the bound target:

- `gt` / `gte` / `lt` / `lte` / `between` → typed numeric comparisons, using the same xsd:integer vs xsd:decimal split as the storage layer. Non-finite filter values short-circuit to `FILTER(false)`.
- `contains` → `CONTAINS(LCASE(STR(?val)), …)` (STR handles both typed literals and any residual NamedNode targets).
- `not` (scalar / array) → typed-literal `!=` and `NOT IN` lists.

The absolute-IRI UNION fallback for String / StringArray on literal properties stays: constructor-seeded raw URIs on a `resolveLanguage='literal'` property are still kept as NamedNodes in storage and need to match.

The property-sort sub-query also drops the `SUBSTR(STR(…), 16)` slice that assumed the `literal:string:` prefix; `STR(?val)` returns the lexical form for typed literals directly.

- Drop the `fn/parse_literal` vs indexed-IRI benchmark: the function it compared against is gone and the indexed shape is now the only path the WHERE builder emits.
- Legacy envelope migration tests use `add_link_with_raw_iri_target` so the v3 migration sees pre-typed-literal data shapes; the regular `add_link` path now normalises envelopes on the way in.
- Tests that used `literal:string:X` URIs as triple subjects switch to real IRIs (`ad4m://…`). Typed-literal storage means the same value can't simultaneously be a subject (IRI) and a target (typed literal) the way it could when both sides were NamedNodes.
- Assertions over canonically-encoded targets pick up the `NON_ALPHANUMERIC` percent-encoding that `literal_encode` already used: underscores round-trip as `%5F`.
- `literal_percent_encode` is now test-only, marked `#[allow(dead_code)]`.
- The signed-envelope migration assertion checks for the canonical underscore-encoded form post-migration.
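
To make the `%5F` detail concrete, here is a standard-library-only sketch of the encoding rule behind the `percent_encoding` crate's `NON_ALPHANUMERIC` set: every byte that is not an ASCII letter or digit is escaped. The function name is hypothetical, and the project's own `literal_encode` / `literal_percent_encode` helpers may differ in detail.

```rust
/// Percent-encodes every byte that is not an ASCII letter or digit,
/// using uppercase hex (so '_' becomes "%5F" and ' ' becomes "%20").
pub fn encode_non_alphanumeric(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        if byte.is_ascii_alphanumeric() {
            out.push(byte as char);
        } else {
            out.push_str(&format!("%{byte:02X}"));
        }
    }
    out
}
```

Under this rule a value such as `snake_case` is encoded as `snake%5Fcase`, which is why assertions over canonically-encoded targets changed.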

The plan-as-artifact is gone now that the storage layer, WHERE / Ops emission, and v3+v4 migrations have landed.

…ped-rdf-literals-and-fn-cleanup

The helper exists only to synthesise raw `literal:string:X` IRIs in integration tests that seed pre-migration storage shapes. Production paths construct typed RDF literals directly and never need to round-trip through the URI form. Move the function (and its imports) behind cfg(test) so it drops out of release builds entirely instead of riding along under allow(dead_code).

HexaField added 4 commits June 4, 2026 11:09

docs: remove typed-literal migration plan after implementation

6e0c7bb

HexaField changed the title ~~refactor: typed RDF literals + remove fn/parse_literal (stacked on #837)~~ refactor: typed RDF literals on the wire + remove fn/parse_literal Jun 4, 2026

HexaField mentioned this pull request Jun 4, 2026

refactor: deterministic literal property targets + indexed WHERE filters #837

Draft

11 tasks

HexaField changed the title ~~refactor: typed RDF literals on the wire + remove fn/parse_literal~~ refactor: store property values as typed RDF literals, drop fn/parse_literal Jun 4, 2026

HexaField mentioned this pull request Jun 4, 2026

fix(executor): cache CUSTOM_DENO_SNAPSHOT.bin by deno_runtime revision coasys/ad4m-wind-tunnel#4

Merged

HexaField added 3 commits June 4, 2026 13:31

Merge branch 'refactor/literal-channel-v-separation' into refactor/ty…

99e90e6

…ped-rdf-literals-and-fn-cleanup

ci: re-trigger after dev merge

d890d9a

HexaField wants to merge 8 commits into refactor/literal-channel-v-separation from refactor/typed-rdf-literals-and-fn-cleanup
