Skip to content

refactor: store property values as typed RDF literals, drop fn/parse_literal#842

Draft
HexaField wants to merge 8 commits into
refactor/literal-channel-v-separationfrom
refactor/typed-rdf-literals-and-fn-cleanup
Draft

refactor: store property values as typed RDF literals, drop fn/parse_literal#842
HexaField wants to merge 8 commits into
refactor/literal-channel-v-separationfrom
refactor/typed-rdf-literals-and-fn-cleanup

Conversation

@HexaField
Copy link
Copy Markdown
Contributor

@HexaField HexaField commented Jun 4, 2026

Summary

Stacked on #837. Completes the storage-layer optimisation: link targets for literal-typed property values move from URI-shaped <literal:string:foo> NamedNodes to typed RDF terms ("foo"^^xsd:string, "42"^^xsd:integer, "3.14"^^xsd:decimal, "true"^^xsd:boolean, "{…}"^^<ad4m://json>). The custom fn/parse_literal SPARQL function is gone — WHERE / projection / Ops filters all compile to native typed comparisons. SDK wire format is unchanged; the storage layer round-trips typed literals back to literal:string:X URLs at query_links / query() boundaries so anything that reads link targets keeps seeing the same strings.

Why bother, since #837 already gives us indexed POS-probe equality? Two things:

  1. Ops comparisons get the index too. gt / lt / between / contains / not were the one branch refactor: deterministic literal property targets + indexed WHERE filters #837 left wrapped in fn/parse_literal (the BIND + STR wrapper had to materialise every row to a typed value before comparing). With typed literals in storage, those filters compile to native xsd comparisons (?val > "42"^^xsd:integer) and use the same index lookups equality does.
  2. fn/parse_literal registration goes away. The custom function exists today only because the old IRI shape needed it. After this PR there is no callsite — string equality reads typed literals directly; numeric comparison uses xsd ordering; envelope unwrap is handled by the migration on first boot.

What changed

Storage layer (sparql_store.rs)

  • target_to_storage_term(target: &str) -> Term parses literal: prefixes and returns the appropriate typed RDF literal (xsd:string / :integer / :decimal / :boolean / ad4m://json). Non-literal targets stay as NamedNodes. Inverse storage_term_to_target_string(&Term) -> String rebuilds the URL form for the SDK wire.
  • make_direct_triple now returns (NamedNode, NamedNode, Term); insert_link_triples / remove_link / for_each_matched_link / link_from_solution / query() updated accordingly.
  • Reifier IRI hash continues to derive from the wire-format target string, so identities are stable across the storage flip — the migration rewrites in place without orphaning reifiers.
  • query() serialises typed literals back to wire form only when the bound SPARQL variable is ?target / ?t, preserving STR(?x) = "true" filter semantics for non-target variables and keeping COUNT() results integer-string-shaped.

Migrations

  • v3 (migrate_signed_envelopes_to_plain_literals, from refactor: deterministic literal property targets + indexed WHERE filters #837) now lands directly on typed-literal storage instead of literal:string:-IRI form.
  • v4 (migrate_iri_literals_to_typed_literals, new) walks reifiers, finds any quad whose object is a Term::NamedNode matching literal:(string|number|boolean|json):.*, and rewrites the triple + reifier triple-term to use typed-literal storage. Idempotent.
  • Both wired into initialize_from_db; per-step migration log messages preserved.

WHERE / projection / Ops

  • is_literal_prop equality emits ?source <pred> "X"^^xsd:string (or :integer / :decimal / :boolean); arrays use VALUES ?x { …typed-literal terms… }. The constructor-default raw-IRI UNION fallback is preserved for the few resolveLanguage='literal' properties that hold raw URIs.
  • Ops branch runs native typed comparisons:
    • gt / gte / lt / lte / between use format_literal_number + xsd:integer or xsd:decimal typed values.
    • contains uses STR(?val) (works on both typed literals and IRIs, defensive for the upgrade window).
    • not / not_array use direct != / NOT IN against typed literals.
  • The BIND(STR(<ad4m://fn/parse_literal>(?_pw_X)) AS …) line is gone. No per-row function calls anywhere in WHERE.

Cleanup

  • parse_literal_fn definition + registration removed. The function no longer needs to exist.
  • bench_indexed_iri_vs_fn_parse_literal_filter deleted — the pre-refactor form it benchmarked against doesn't exist anymore. The original perf delta is documented in #837's description.
  • MCP mention-waker subscription (mcp/tools/subscriptions.rs) now compares against STR(?target) directly.
  • Property-sort path drops the SUBSTR(STR(…), 16) literal-IRI slice — STR() returns the lexical form for typed literals directly.

Last-write-wins / scalar aggregation pushdown

Explored in #846. The straightforward nested-aggregate plan hits an Oxigraph 0.5.8 planner cliff — confirmed empirically at ~23,000× regression on test_perf_flux_message_parent_scope_paginated, exactly the failure mode the ac57680b9 commit message warned about. The most plausible unblock is storage-level partitioning from named graphs (#812) — once each subject instance has its own graph, the inner MAX(?ts) GROUP BY ?source ?predicate aggregate operates on a partitioned working set rather than scanning the whole reifier index. Re-attempt with GRAPH ?g scoping once #812 lands.

Correcting earlier framing: SPARQL itself does not introduce window functions in either 1.1 or the W3C 1.2 draft, so the wait is on Oxigraph planner / dataset semantics rather than a SPARQL spec release.

Wind tunnel — S8 (Subject Class Queries) vs dev

Three-branch run on Apple Silicon (48 GB / 14 CPU) with a fresh CUSTOM_DENO_SNAPSHOT.bin regenerated against each branch's own Deno deps. Mean of 5 runs per query, lower is faster; <1.0× is an improvement over dev.

========= TIER: small (~1.9k links) =========
query                       dev          #837         #842        837/dev  842/dev
totalItemCount             0.32 ms      0.32 ms      0.28 ms      1.00x    0.88x
allItems                   0.97 ms      0.96 ms      0.94 ms      0.99x    0.97x
unprocessedItems           0.37 ms      0.37 ms      0.35 ms      1.00x    0.95x
recentConversations        0.23 ms      0.22 ms      0.18 ms      0.96x    0.78x
pinnedConversations        0.11 ms      0.11 ms      0.08 ms      1.00x    0.73x
subgroupItemsData          0.21 ms      0.22 ms      0.19 ms      1.05x    0.90x
subgroupTopics             0.11 ms      0.11 ms      0.08 ms      1.00x    0.73x
messageHydration           0.12 ms      0.12 ms      0.09 ms      1.00x    0.75x
paginatedMessages          1.15 ms      1.15 ms      1.14 ms      1.00x    0.99x

========= TIER: medium (~58k links) =========
query                       dev          #837         #842        837/dev  842/dev
totalItemCount             3.09 ms      3.08 ms      3.06 ms      1.00x    0.99x
allItems                  23.94 ms     26.19 ms     23.82 ms      1.09x    0.99x
unprocessedItems           7.16 ms      7.27 ms      7.17 ms      1.02x    1.00x
recentConversations        0.53 ms      0.52 ms      0.52 ms      0.98x    0.98x
pinnedConversations        0.13 ms      0.12 ms      0.13 ms      0.92x    1.00x
subgroupItemsData          0.28 ms      0.26 ms      0.27 ms      0.93x    0.96x
subgroupTopics             0.14 ms      0.12 ms      0.13 ms      0.86x    0.93x
messageHydration           0.14 ms      0.14 ms      0.14 ms      1.00x    1.00x
paginatedMessages         39.39 ms     40.66 ms     39.37 ms      1.03x    1.00x

Headlines:

How to read this vs the 200×–500× microbench in #837: the microbench measured an isolated WHERE-filter on 10k literal-string targets where the filter cost was 99% of the query time. S8 measures full Flux community queries dominated by reifier metadata joins and result hydration — the WHERE filter is one operation of many, and its cost amortises into the rest. The microbench number characterises the per-operation speedup honestly; these S8 numbers characterise the user-visible workload impact honestly.

Test plan

  • cargo check --tests clean
  • cargo test --lib perspectives::sparql_store — 80 pass (incl. new typed-literal storage + v3/v4 migration assertions)
  • cargo test --lib perspectives::model_query — 146 pass
  • cargo test --lib perspectives:: — 323/324 pass; the single failing test (test_perspective_persistence_roundtrip) is a pre-existing AgentService init-order flake on the base branch, not a regression here (reproduces on refactor/literal-channel-v-separation directly)
  • Full CircleCI suite green
  • Manual smoke: spin executor against an existing dataset with pre-migration IRI-shaped literal targets, verify both migration logs fire on first boot + post-migration queries return the same wire-format strings

Adds TYPED_LITERAL_MIGRATION.md design notes alongside the storage
module and a small forward-compatible read addition to parse_literal_fn:
when the argument is already a typed RDF literal (non-empty datatype,
not xsd:string, value not starting with literal:), pass it through
unchanged. This lets queries see correctly-typed values once writes
start producing them — without disturbing today's literal:URI path.

Read-only change; existing 78 sparql_store tests pass.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a1fdbc1-bb46-42c3-89fb-8352c10bc472

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/typed-rdf-literals-and-fn-cleanup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

HexaField added 4 commits June 4, 2026 11:09
Move link-target storage from URI-shaped `<literal:string:X>` NamedNodes
to typed RDF literals (`"X"^^xsd:string`, `"42"^^xsd:integer`, etc.) so
the POS index can use Oxigraph's native xsd comparison and value lookup
plans.  Round-trips through the wire-format string the SDK expects.

- `target_to_storage_term` / `storage_term_to_target_string` translate
  between `literal:string|number|boolean|json:…` wire values and typed
  RDF terms, with JSON payloads carrying the `ad4m://json` datatype.
- `make_direct_triple` / `insert_link_triples` / `remove_link` /
  `for_each_matched_link` / `link_from_solution` all flow targets
  through the new term type, including in the reifier's quoted triple.
- `make_reifier_iri` still hashes the wire-format target so the
  reifier identity stays stable across the typed-literal migration.
- `query()` serialises typed literals bound to `?target` / `?t` back to
  the wire form so hydration's `parse_literal_value` keeps decoding type
  info.  Other variables emit lexical form so SPARQL `STR()` and `COUNT`
  consumers continue to see raw values.
- Drop the `ad4m://fn/parse_literal` SPARQL custom function (and its
  Rust implementation): typed literals carry their value in the lexical
  form and their type in the datatype IRI, so the function is a no-op
  for new storage and the WHERE / Ops paths now run native xsd
  comparisons.
- The mention-waker subscription query also switches to `STR(?target)`
  directly — no parse_literal needed.

Adds `add_link_with_raw_iri_target` for migration tests that need to
seed pre-typed-literal data shapes.
…ive Ops filters

For properties tagged `resolveLanguage: literal` the WHERE builder and
the projection where-pattern builder now emit `"X"^^xsd:string` /
`"42"^^xsd:integer` / `"true"^^xsd:boolean` / VALUES sets of the same,
matching the typed-literal storage form so Oxigraph can probe the POS
index directly.

The `WhereCondition::Ops` branch drops the `fn/parse_literal` /
`xsd:double` BIND chain and runs native SPARQL filters against the
bound target:

- `gt` / `gte` / `lt` / `lte` / `between` → typed numeric comparisons,
  using the same xsd:integer vs xsd:decimal split as the storage layer.
  Non-finite filter values short-circuit to `FILTER(false)`.
- `contains` → `CONTAINS(LCASE(STR(?val)), …)` (STR handles both typed
  literals and any residual NamedNode targets).
- `not` (scalar / array) → typed-literal `!=` and `NOT IN` lists.

The absolute-IRI UNION fallback for String / StringArray on literal
properties stays — constructor-seeded raw URIs on a
`resolveLanguage='literal'` property are still kept as NamedNodes in
storage and need to match.

The property-sort sub-query also drops the `SUBSTR(STR(…), 16)` slice
that assumed the `literal:string:` prefix; `STR(?val)` returns the
lexical form for typed literals directly.
- Drop the `fn/parse_literal` vs indexed-IRI benchmark — the function
  it compared against is gone and the indexed shape is now the only
  path the WHERE builder emits.
- Legacy envelope migration tests use `add_link_with_raw_iri_target`
  so the v3 migration sees pre-typed-literal data shapes; the regular
  `add_link` path now normalises envelopes on the way in.
- Tests that used `literal:string:X` URIs as triple subjects switch to
  real IRIs (`ad4m://…`).  Typed-literal storage means the same value
  can't simultaneously be a subject (IRI) and a target (typed literal)
  the way it could when both sides were NamedNodes.
- Assertions over canonically-encoded targets pick up the
  `NON_ALPHANUMERIC` percent-encoding that `literal_encode` already
  used — underscores round-trip as `%5F`.
- `literal_percent_encode` is now test-only, marked `#[allow(dead_code)]`.
- The signed-envelope migration assertion checks for the canonical
  underscore-encoded form post-migration.
The plan-as-artifact is gone now that the storage layer, WHERE / Ops
emission, and v3+v4 migrations have landed.
@HexaField HexaField changed the title refactor: typed RDF literals + remove fn/parse_literal (stacked on #837) refactor: typed RDF literals on the wire + remove fn/parse_literal Jun 4, 2026
@HexaField HexaField changed the title refactor: typed RDF literals on the wire + remove fn/parse_literal refactor: store property values as typed RDF literals, drop fn/parse_literal Jun 4, 2026
HexaField added 3 commits June 4, 2026 13:31
The helper exists only to synthesise raw `literal:string:X` IRIs in
integration tests that seed pre-migration storage shapes. Production
paths construct typed RDF literals directly and never need to round-trip
through the URI form. Move the function (and its imports) behind
cfg(test) so it drops out of release builds entirely instead of riding
along under allow(dead_code).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant