refactor: store property values as typed RDF literals, drop fn/parse_literal #842
Draft
HexaField wants to merge 8 commits into
Adds TYPED_LITERAL_MIGRATION.md design notes alongside the storage module and a small forward-compatible read addition to parse_literal_fn: when the argument is already a typed RDF literal (non-empty datatype, not xsd:string, value not starting with literal:), pass it through unchanged. This lets queries see correctly-typed values once writes start producing them — without disturbing today's literal:URI path. Read-only change; existing 78 sparql_store tests pass.
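The pass-through condition described above can be sketched as a small predicate. This is only an illustration: `is_already_typed` and the `XSD_STRING` constant are hypothetical names, and the real `parse_literal_fn` operates on Oxigraph terms rather than plain strings.

```rust
/// Full IRI for xsd:string (the text above only calls it `xsd:string`).
const XSD_STRING: &str = "http://www.w3.org/2001/XMLSchema#string";

/// Hypothetical helper: returns true when the argument should be passed
/// through unchanged, i.e. it has a non-empty datatype, that datatype is
/// not xsd:string, and the value does not start with the `literal:` prefix.
fn is_already_typed(value: &str, datatype: &str) -> bool {
    !datatype.is_empty() && datatype != XSD_STRING && !value.starts_with("literal:")
}
```

Anything that fails the predicate would continue down the existing `literal:` URI decoding path.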
Move link-target storage from URI-shaped `<literal:string:X>` NamedNodes to typed RDF literals (`"X"^^xsd:string`, `"42"^^xsd:integer`, etc.) so the POS index can use Oxigraph's native xsd comparison and value lookup plans. Round-trips through the wire-format string the SDK expects.

- `target_to_storage_term` / `storage_term_to_target_string` translate between `literal:string|number|boolean|json:…` wire values and typed RDF terms, with JSON payloads carrying the `ad4m://json` datatype.
- `make_direct_triple` / `insert_link_triples` / `remove_link` / `for_each_matched_link` / `link_from_solution` all flow targets through the new term type, including in the reifier's quoted triple.
- `make_reifier_iri` still hashes the wire-format target so the reifier identity stays stable across the typed-literal migration.
- `query()` serialises typed literals bound to `?target` / `?t` back to the wire form so hydration's `parse_literal_value` keeps decoding type info. Other variables emit lexical form so SPARQL `STR()` and `COUNT` consumers continue to see raw values.
- Drop the `ad4m://fn/parse_literal` SPARQL custom function (and its Rust implementation): typed literals carry their value in the lexical form and their type in the datatype IRI, so the function is a no-op for new storage and the WHERE / Ops paths now run native xsd comparisons.
- The mention-waker subscription query also switches to `STR(?target)` directly, with no parse_literal needed.

Adds `add_link_with_raw_iri_target` for migration tests that need to seed pre-typed-literal data shapes.
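As a rough illustration of the translation described in this commit, the sketch below models the wire-to-storage mapping on a simplified stand-in type. Assumptions: `StorageTerm`, `typed`, `serialise_binding`, and the datatype constants are hypothetical (the real code works on Oxigraph's `Term`), percent-decoding of the payload is omitted, and the integer/decimal split shown here is a guess at the rule, not the confirmed one.

```rust
/// Simplified stand-in for an RDF term (hypothetical; not Oxigraph's `Term`).
#[derive(Debug, Clone, PartialEq)]
enum StorageTerm {
    NamedNode(String),
    Literal { lexical: String, datatype: &'static str },
}

// Prefixed names used as datatype tags for brevity in this sketch.
const XSD_STRING: &str = "xsd:string";
const XSD_INTEGER: &str = "xsd:integer";
const XSD_DECIMAL: &str = "xsd:decimal";
const XSD_BOOLEAN: &str = "xsd:boolean";
const AD4M_JSON: &str = "ad4m://json";

fn typed(lexical: &str, datatype: &'static str) -> StorageTerm {
    StorageTerm::Literal { lexical: lexical.to_string(), datatype }
}

/// Wire value -> storage term. Non-`literal:` targets stay as named nodes.
fn target_to_storage_term(target: &str) -> StorageTerm {
    if let Some(v) = target.strip_prefix("literal:string:") {
        typed(v, XSD_STRING)
    } else if let Some(v) = target.strip_prefix("literal:number:") {
        // Assumed split: values that parse as i64 become xsd:integer,
        // everything else xsd:decimal (no further validation here).
        if v.parse::<i64>().is_ok() { typed(v, XSD_INTEGER) } else { typed(v, XSD_DECIMAL) }
    } else if let Some(v) = target.strip_prefix("literal:boolean:") {
        typed(v, XSD_BOOLEAN)
    } else if let Some(v) = target.strip_prefix("literal:json:") {
        typed(v, AD4M_JSON)
    } else {
        StorageTerm::NamedNode(target.to_string())
    }
}

/// Storage term -> wire value, the inverse of the function above.
fn storage_term_to_target_string(term: &StorageTerm) -> String {
    match term {
        StorageTerm::NamedNode(iri) => iri.clone(),
        StorageTerm::Literal { lexical, datatype } => {
            let kind = match *datatype {
                XSD_INTEGER | XSD_DECIMAL => "number",
                XSD_BOOLEAN => "boolean",
                AD4M_JSON => "json",
                _ => "string",
            };
            format!("literal:{kind}:{lexical}")
        }
    }
}

/// Only `target` / `t` bindings go back to wire form; other variables
/// expose the lexical form, so STR() and COUNT consumers see raw values.
fn serialise_binding(variable: &str, term: &StorageTerm) -> String {
    match (variable, term) {
        ("target" | "t", _) => storage_term_to_target_string(term),
        (_, StorageTerm::Literal { lexical, .. }) => lexical.clone(),
        (_, StorageTerm::NamedNode(iri)) => iri.clone(),
    }
}
```

In this sketch a value such as `literal:number:42` becomes an `xsd:integer` literal in storage and is rebuilt as the same wire string on the way out, while a non-target binding of the same term would surface as plain `42`.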
…ive Ops filters

For properties tagged `resolveLanguage: literal` the WHERE builder and the projection where-pattern builder now emit `"X"^^xsd:string` / `"42"^^xsd:integer` / `"true"^^xsd:boolean` / VALUES sets of the same, matching the typed-literal storage form so Oxigraph can probe the POS index directly.

The `WhereCondition::Ops` branch drops the `fn/parse_literal` / `xsd:double` BIND chain and runs native SPARQL filters against the bound target:

- `gt` / `gte` / `lt` / `lte` / `between` → typed numeric comparisons, using the same xsd:integer vs xsd:decimal split as the storage layer. Non-finite filter values short-circuit to `FILTER(false)`.
- `contains` → `CONTAINS(LCASE(STR(?val)), …)` (STR handles both typed literals and any residual NamedNode targets).
- `not` (scalar / array) → typed-literal `!=` and `NOT IN` lists.

The absolute-IRI UNION fallback for String / StringArray on literal properties stays: constructor-seeded raw URIs on a `resolveLanguage='literal'` property are still kept as NamedNodes in storage and need to match.

The property-sort sub-query also drops the `SUBSTR(STR(…), 16)` slice that assumed the `literal:string:` prefix; `STR(?val)` returns the lexical form for typed literals directly.
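A minimal sketch of what the native filter emission might look like follows. The function names (`typed_number`, `comparison_filter`, `contains_filter`) are hypothetical, the `xsd:` prefix is assumed to be declared in the surrounding query, and the whole-number test used for the integer/decimal split is an assumption rather than the rule the PR actually uses.

```rust
/// Renders a finite f64 as a typed SPARQL numeric literal.
/// Returns None for NaN / infinity so the caller can short-circuit.
fn typed_number(value: f64) -> Option<String> {
    if !value.is_finite() {
        return None;
    }
    // Assumed split: whole numbers within the exactly-representable
    // range become xsd:integer, everything else xsd:decimal.
    if value.fract() == 0.0 && value.abs() < 9_007_199_254_740_992.0 {
        Some(format!("\"{}\"^^xsd:integer", value as i64))
    } else {
        Some(format!("\"{}\"^^xsd:decimal", value))
    }
}

/// Emits a native comparison filter, or FILTER(false) for non-finite input.
fn comparison_filter(var: &str, op: &str, value: f64) -> String {
    match typed_number(value) {
        Some(lit) => format!("FILTER({var} {op} {lit})"),
        None => "FILTER(false)".to_string(),
    }
}

/// Emits a case-insensitive substring filter over the lexical form.
fn contains_filter(var: &str, needle: &str) -> String {
    // Escape backslashes and double quotes for a SPARQL string literal.
    let escaped = needle.to_lowercase().replace('\\', "\\\\").replace('"', "\\\"");
    format!("FILTER(CONTAINS(LCASE(STR({var})), \"{escaped}\"))")
}
```

Because the emitted comparison is an ordinary SPARQL operator over a typed literal, no per-row custom function call is involved.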
- Drop the `fn/parse_literal` vs indexed-IRI benchmark: the function it compared against is gone and the indexed shape is now the only path the WHERE builder emits.
- Legacy envelope migration tests use `add_link_with_raw_iri_target` so the v3 migration sees pre-typed-literal data shapes; the regular `add_link` path now normalises envelopes on the way in.
- Tests that used `literal:string:X` URIs as triple subjects switch to real IRIs (`ad4m://…`). Typed-literal storage means the same value can't simultaneously be a subject (IRI) and a target (typed literal) the way it could when both sides were NamedNodes.
- Assertions over canonically-encoded targets pick up the `NON_ALPHANUMERIC` percent-encoding that `literal_encode` already used, so underscores round-trip as `%5F`.
- `literal_percent_encode` is now test-only, marked `#[allow(dead_code)]`.
- The signed-envelope migration assertion checks for the canonical underscore-encoded form post-migration.
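To make the `%5F` point concrete, here is a hand-rolled sketch of encoding every byte that is not ASCII alphanumeric. `encode_non_alphanumeric` is a hypothetical name; the commit refers to `literal_encode` with a `NON_ALPHANUMERIC` set, which this snippet only approximates with the standard library.

```rust
/// Percent-encodes every byte that is not an ASCII letter or digit,
/// using uppercase hex (so `_` becomes `%5F`).
fn encode_non_alphanumeric(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        if byte.is_ascii_alphanumeric() {
            out.push(byte as char);
        } else {
            out.push_str(&format!("%{byte:02X}"));
        }
    }
    out
}
```

Under this scheme `my_value` encodes to `my%5Fvalue`, which is why assertions over canonically-encoded targets see the underscore in escaped form.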
The plan-as-artifact is gone now that the storage layer, WHERE / Ops emission, and v3+v4 migrations have landed.
11 tasks
…ped-rdf-literals-and-fn-cleanup
The helper exists only to synthesise raw `literal:string:X` IRIs in integration tests that seed pre-migration storage shapes. Production paths construct typed RDF literals directly and never need to round-trip through the URI form. Move the function (and its imports) behind cfg(test) so it drops out of release builds entirely instead of riding along under allow(dead_code).
Summary
Stacked on #837. Completes the storage-layer optimisation: link targets for literal-typed property values move from URI-shaped `<literal:string:foo>` NamedNodes to typed RDF terms (`"foo"^^xsd:string`, `"42"^^xsd:integer`, `"3.14"^^xsd:decimal`, `"true"^^xsd:boolean`, `"{…}"^^<ad4m://json>`). The custom `fn/parse_literal` SPARQL function is gone: WHERE / projection / `Ops` filters all compile to native typed comparisons. SDK wire format is unchanged; the storage layer round-trips typed literals back to `literal:string:X` URLs at `query_links` / `query()` boundaries so anything that reads link targets keeps seeing the same strings.

Why bother, since #837 already gives us indexed POS-probe equality? Two things:

- `Ops` comparisons get the index too. `gt` / `lt` / `between` / `contains` / `not` were the one branch #837 left wrapped in `fn/parse_literal` (the BIND + STR wrapper had to materialise every row to a typed value before comparing). With typed literals in storage, those filters compile to native xsd comparisons (`?val > "42"^^xsd:integer`) and use the same index lookups equality does.
- `fn/parse_literal` registration goes away. The custom function exists today only because the old IRI shape needed it. After this PR there is no callsite: string equality reads typed literals directly; numeric comparison uses xsd ordering; envelope unwrap is handled by the migration on first boot.

What changed
Storage layer (`sparql_store.rs`)

- `target_to_storage_term(target: &str) -> Term` parses `literal:` prefixes and returns the appropriate typed RDF literal (`xsd:string` / `:integer` / `:decimal` / `:boolean` / `ad4m://json`). Non-literal targets stay as `NamedNode`s. Inverse `storage_term_to_target_string(&Term) -> String` rebuilds the URL form for the SDK wire.
- `make_direct_triple` now returns `(NamedNode, NamedNode, Term)`; `insert_link_triples` / `remove_link` / `for_each_matched_link` / `link_from_solution` / `query()` updated accordingly.
- `query()` serialises typed literals back to wire form only when the bound SPARQL variable is `?target` / `?t`, preserving `STR(?x) = "true"` filter semantics for non-target variables and keeping `COUNT()` results integer-string-shaped.

Migrations
- v3 (`migrate_signed_envelopes_to_plain_literals`, from #837) now lands directly on typed-literal storage instead of `literal:string:`-IRI form.
- v4 (`migrate_iri_literals_to_typed_literals`, new) walks reifiers, finds any quad whose object is a `Term::NamedNode` matching `literal:(string|number|boolean|json):.*`, and rewrites the triple + reifier triple-term to use typed-literal storage. Idempotent.
- … `initialize_from_db`; per-step migration log messages preserved.

WHERE / projection / Ops
- `is_literal_prop` equality emits `?source <pred> "X"^^xsd:string` (or `:integer` / `:decimal` / `:boolean`); arrays use `VALUES ?x { …typed-literal terms… }`. The constructor-default raw-IRI UNION fallback is preserved for the few `resolveLanguage='literal'` properties that hold raw URIs.
- `Ops` branch runs native typed comparisons:
  - `gt` / `gte` / `lt` / `lte` / `between` use `format_literal_number` + `xsd:integer` or `xsd:decimal` typed values.
  - `contains` uses `STR(?val)` (works on both typed literals and IRIs, defensive for the upgrade window).
  - `not` / `not_array` use direct `!=` / `NOT IN` against typed literals.
- `BIND(STR(<ad4m://fn/parse_literal>(?_pw_X)) AS …)` line is gone. No per-row function calls anywhere in WHERE.

Cleanup
- `parse_literal_fn` definition + registration removed. The function no longer needs to exist.
- `bench_indexed_iri_vs_fn_parse_literal_filter` deleted: the pre-refactor form it benchmarked against doesn't exist anymore. The original perf delta is documented in #837's description.
- Mention-waker subscription query (`mcp/tools/subscriptions.rs`) now compares against `STR(?target)` directly.
- Property-sort sub-query drops the `SUBSTR(STR(…), 16)` literal-IRI slice; `STR()` returns the lexical form for typed literals directly.

Last-write-wins / scalar aggregation pushdown
Explored in #846. The straightforward nested-aggregate plan hits an Oxigraph 0.5.8 planner cliff, confirmed empirically at ~23,000× regression on `test_perf_flux_message_parent_scope_paginated`, exactly the failure mode the `ac57680b9` commit message warned about. The most plausible unblock is storage-level partitioning from named graphs (#812): once each subject instance has its own graph, the inner `MAX(?ts) GROUP BY ?source ?predicate` aggregate operates on a partitioned working set rather than scanning the whole reifier index. Re-attempt with `GRAPH ?g` scoping once #812 lands.

Correcting earlier framing: SPARQL itself does not introduce window functions in either 1.1 or the W3C 1.2 draft, so the wait is on Oxigraph planner / dataset semantics rather than a SPARQL spec release.
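For readers unfamiliar with the last-write-wins aggregate discussed in this section, the sketch below shows what `MAX(?ts) GROUP BY ?source ?predicate` computes, expressed over an in-memory slice rather than SPARQL. The `Link` struct, its field names, and the tie-breaking rule (the first link seen wins on equal timestamps) are assumptions made purely for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical flattened view of a link plus its reifier timestamp.
struct Link {
    source: String,
    predicate: String,
    target: String,
    timestamp: u64,
}

/// For each (source, predicate) group, keeps the target with the highest
/// timestamp: the in-memory equivalent of grouping and taking MAX(?ts).
fn last_write_wins(links: &[Link]) -> HashMap<(String, String), (u64, String)> {
    let mut latest: HashMap<(String, String), (u64, String)> = HashMap::new();
    for link in links {
        let key = (link.source.clone(), link.predicate.clone());
        // Strictly-newer check: on a timestamp tie the earlier entry stays.
        let is_newer = latest.get(&key).map_or(true, |(ts, _)| link.timestamp > *ts);
        if is_newer {
            latest.insert(key, (link.timestamp, link.target.clone()));
        }
    }
    latest
}
```

This single pass touches every link once. The planner cliff described above concerns how Oxigraph evaluates the equivalent nested aggregate, which is a separate matter from the semantics shown here.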
Wind tunnel: S8 (Subject Class Queries) vs `dev`

Three-branch run on Apple Silicon (48 GB / 14 CPU) with a fresh `CUSTOM_DENO_SNAPSHOT.bin` regenerated against each branch's own Deno deps. Mean of 5 runs per query, lower is faster; `<1.0×` is an improvement over dev.

Headlines:

- … `allItems` at medium tier on #837 (1.09×, +2.25 ms on a 24 ms query) which sits within run-to-run noise; the same query on #842 lands at 0.99×.
- … `pinnedConversations`, `subgroupTopics`, `messageHydration`, `recentConversations` all 22–27% faster). These are exactly the queries where eliminating `fn/parse_literal`'s per-row BIND+FILTER overhead matters most relative to total query cost.

How to read this vs the 200×–500× microbench in #837: the microbench measured an isolated WHERE-filter on 10k literal-string targets where the filter cost was 99% of the query time. S8 measures full Flux community queries dominated by reifier metadata joins and result hydration; the WHERE filter is one operation of many, and its cost amortises into the rest. The microbench number characterises the per-operation speedup honestly; these S8 numbers characterise the user-visible workload impact honestly.
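The amortisation argument can be written out with the usual Amdahl-style relation. The symbols here are introduced only for illustration: $f$ is the fraction of a query's runtime spent in the filter and $s$ is the speedup of the filter alone. The value $f = 0.25$ is an illustrative assumption, not a measurement from this run.

```latex
\[
  \frac{T_{\text{after}}}{T_{\text{before}}} = (1 - f) + \frac{f}{s}
\]
% Illustrative numbers only: f = 0.25 (assumed), s = 200 (low end of the microbench range)
\[
  (1 - 0.25) + \frac{0.25}{200} = 0.75 + 0.00125 \approx 0.751
\]
```

Under that assumption, a 200× per-operation speedup yields a runtime ratio of about 0.75×. That is the same order as the 22–27% improvements reported above, if "faster" is read as a reduction in mean runtime.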
Test plan
- `cargo check --tests` clean
- `cargo test --lib perspectives::sparql_store`: 80 pass (incl. new typed-literal storage + v3/v4 migration assertions)
- `cargo test --lib perspectives::model_query`: 146 pass
- `cargo test --lib perspectives::`: 323/324 pass; the single failing test (`test_perspective_persistence_roundtrip`) is a pre-existing AgentService init-order flake on the base branch, not a regression here (reproduces on `refactor/literal-channel-v-separation` directly)