Skip to content

Commit f4e4601

Browse files
authored
Merge pull request #135 from bluedynamics/feat/strip-path-from-idx
Strip path/parent_path/path_depth from idx JSONB
2 parents 11edd2c + b87bf6f commit f4e4601

17 files changed

Lines changed: 2158 additions & 95 deletions

CHANGES.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,32 @@
11
# Changelog
22

3+
## 1.0.0b54
4+
5+
### Changed
6+
7+
- Stop duplicating `path`, `path_parent`, and `path_depth` between the typed
8+
columns on `object_state` and the `idx` JSONB. These three fields now live
9+
exclusively in their typed columns (`path`, `parent_path`, `path_depth`) —
10+
previously identical values were stored in both places, wasting ~10 % of
11+
JSONB storage and (more importantly) blocking the planner from collecting
12+
selectivity statistics on path-subtree filters. Indexes and extended
13+
statistics on these fields have been migrated to reference the typed columns
14+
directly. Custom `PATH`-type indexes (e.g. `tgpath`) are unaffected and
15+
continue to store their data in `idx`.
16+
17+
**Migration:** Schema and writer changes are picked up automatically on
18+
startup (the eight affected indexes and three extended-statistics objects
19+
are reissued with idempotent `DROP … IF EXISTS` / `CREATE … IF NOT EXISTS`
20+
pairs). To strip the obsolete keys from existing JSONB on large catalogs,
21+
run:
22+
23+
```python
24+
from plone.pgcatalog.migrations.strip_path_keys import run
25+
run(conn, batch_size=5000)
26+
```
27+
28+
Safe to run online, idempotent, batched. Issue #132.
29+
330
## 1.0.0b53
431

532
### Fixed

docs/plans/2026-04-15-strip-path-from-idx-jsonb.md

Lines changed: 1094 additions & 0 deletions
Large diffs are not rendered by default.

docs/superpowers/specs/2026-04-15-suggestions-engine-pr-alpha-design.md

Lines changed: 325 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
# PR β (Suggestions Engine) — Pre-brainstorm Notes
2+
3+
> **This is NOT a design spec.** It captures decisions and open questions agreed during the PR α brainstorm (2026-04-15) that belong to PR β but were deferred. Use this as the starting material for the PR β brainstorm; don't re-derive these answers.
4+
5+
**Parent spec:** `2026-04-15-suggestions-engine-pr-alpha-design.md`
6+
**Issue:** [bluedynamics/plone-pgcatalog#122](https://github.com/bluedynamics/plone-pgcatalog/issues/122)
7+
8+
## Scope of PR β
9+
10+
Build on PR α's Bundle data model to add:
11+
12+
1. **EXPLAIN-driven grading** — live-DB insight per slow-query row and per bundle.
13+
2. **JSON endpoint** — server renders grades + plans as JSON; UI fetches async.
14+
3. **ZMI UI refactor** — DTML → JSON + vanilla JS for the Slow Queries tab (no heavy framework, per project rule).
15+
4. **Opt-in ANALYZE** — user-triggered deep inspection per row.
16+
17+
## Decisions already locked
18+
19+
### Q5 (EXPLAIN's role) — two-stage: rules propose, EXPLAIN grades + annotates
20+
21+
Bundle candidates come from PR α's rule-based dispatcher. For each slow-query group, PR β runs `EXPLAIN (FORMAT JSON)` on the representative row and:
22+
23+
- Extracts the chosen plan (indexes used, top cost nodes, estimated rows, filter predicates).
24+
- Grades each candidate bundle against the baseline plan.
25+
- Attaches a diagnostic panel: the top filter-cost node (node type, estimated Rows Removed by Filter when inferable from plan shape), so users see **why** the query is slow even if no bundle changes it.
26+
27+
### Q7 (EXPLAIN trigger) — eager plain + opt-in ANALYZE
28+
29+
- **Default on tab load:** JS fetches plain `EXPLAIN (FORMAT JSON)` per visible slow-query row. Cheap (~10 ms per plan). Grades computed on the server, returned with the plan.
30+
- **On click:** per-row "deep inspect" button triggers `EXPLAIN (ANALYZE, TIMING off, BUFFERS off)` with `SET LOCAL statement_timeout = '10s'`. UI shows a spinner. On completion, Rows Removed by Filter and actual row counts replace the estimates. Timeout → user sees "query too slow to analyze; here are the estimates we have".
31+
32+
### Grading semantics
33+
34+
Three states per bundle (plus the row-level row-is-in-slow-queries fact):
35+
36+
| Grade | Meaning | Signal |
37+
|---|---|---|
38+
| `already_fast` | Query wouldn't be in `pgcatalog_slow_queries`. Not shown — the row wouldn't be listed. | N/A |
39+
| `covered_but_slow` | Current plan uses indexes that match the bundle's target predicates, but the query is still observed slow (it's in the slow-queries table definitionally). | Planner's chosen indexes cover the filter set; Rows Removed by Filter low / low-ish; something else bites (stats drift, bloat, correlation, or the GIN is too big → partial GIN bundle would help). |
40+
| `uncovered` | Current plan has a Seq Scan on `object_state`, OR chosen indexes don't cover the bundle's target predicates. Bundle would plausibly help. | Seq Scan node present, OR index-set difference between chosen and target non-empty. |
41+
42+
Grade applies per-bundle, not per-row. A single slow query may yield multiple bundles each with its own grade.
43+
44+
### Async JS fetch model
45+
46+
- Page initial load: fast (bundles only, no per-row plans).
47+
- On DOMContentLoaded: JS iterates visible slow-query rows, issues `fetch()` to the per-row plan endpoint. Fills in grades and diagnostic panels as responses arrive.
48+
- Per-row "deep inspect" button triggers a second fetch with `?analyze=true`. Separate endpoint or query-param on the same endpoint — design open.
49+
50+
### HypoPG as optional enhancement, graceful degradation
51+
52+
- At startup, check `SELECT 1 FROM pg_extension WHERE extname = 'hypopg'`.
53+
- If present, bundle grading gains a "what-if plan" mode: `SELECT hypopg_create_index('<ddl without CONCURRENTLY>')`, run `EXPLAIN` again, compare. Gives confidence that applying the bundle would actually change the plan. Wrap in a read-only transaction that gets rolled back so hypothetical index disappears.
54+
- If absent, grade only from the baseline plan + index-set comparison heuristic.
55+
56+
### UI refactor scope
57+
58+
- Replace `manage_slow_queries.dtml` with a small JSON+JS page (or keep DTML as HTML shell + JS doing the rendering — **decide in PR β brainstorm**).
59+
- No heavy framework. Vanilla JS, optional lightweight helpers (alpine.js is acceptable per user's earlier "leichtes ok"; React/Vue are not).
60+
- Per-row expansion → shows plan tree, bundles with grade badges, "deep inspect" button.
61+
62+
### PR α → PR β contract
63+
64+
PR β's UI consumes what PR α's `manage_get_slow_query_stats` puts under `row["suggestions_bundles"]`:
65+
66+
- `bundles[*].name` — stable identifier for UI keyed updates.
67+
- `bundles[*].rationale` — rendered as bundle description.
68+
- `bundles[*].shape_classification` — shown as a small badge.
69+
- `bundles[*].members[*]` — rendered as DDL code blocks with Apply / Drop buttons.
70+
71+
## Open questions deferred to PR β brainstorm
72+
73+
1. **JSON endpoint URL shape.** Per-row endpoint keyed by `query_keys` hash? Single batch endpoint? Separate endpoints for baseline plan vs. ANALYZE? Affects caching.
74+
2. **Plan caching.** Key by `(query_keys, representative_params_hash)`? TTL? Invalidate on index apply/drop? How do we handle plans that are valid for 5 seconds but not 5 hours (stats drift)?
75+
3. **Deep-inspect authorization + UX.** ANALYZE actually runs the slow query. Should the UI show a confirmation dialog ("this may take up to 10 s — proceed?") or silently trigger with a spinner? Who is allowed to trigger — `Manage portal` only, or any ZMI user?
76+
4. **Handling missing `query_text`.** PR 2's slow-query log stores the SQL. If an old row has null `query_text` (logged under a pre-PR-2 schema?), we can't EXPLAIN it. Skip? Reconstruct from `query_keys + params`? How do we reconstruct safely?
77+
5. **`_derived_attname_for_key` resolution for MCV stats** — carried over from PR α. If PR α ships COUNT-only, PR β may want to improve it once we know which pg_stats path is reachable for expression indexes.
78+
6. **Bundle-level grade aggregation.** A bundle has N members, each with their own `status` (new / already_covered). Bundle grade aggregates: what's the rule when some members are new and some already covered? Needed for the UI badge.
79+
7. **Rate limiting.** 50 slow queries × 1 EXPLAIN each on every tab load. At 10 ms each that's 500 ms total. Acceptable. If EXPLAIN is ever slow (stats issues), do we need a global timeout / circuit breaker?
80+
8. **Drop-bundle flow.** If a bundle's members were applied together, should there be a "drop bundle" action (drops all members in sequence)? Or only per-member?
81+
9. **HypoPG production readiness.** Is `hypopg` available in typical Plone PG deployments? Is asking customers to install an extension acceptable? (Extended statistics from PR 1 didn't need an extension; this would be new.)
82+
10. **Auto-apply opt-in.** Far future — a "trust the engine" mode where highly-confident uncovered-grade bundles get applied automatically during low-traffic windows? Definitely not PR β, but worth capturing the ergonomic target.
83+
84+
## Non-goals for PR β
85+
86+
- New templates beyond PR α's T1/T3/T4/T5/T6.
87+
- Changing the shape classifier or the partial-scoping threshold.
88+
- Cross-pod coordination for EXPLAIN caching (each pod EXPLAINs independently; that's fine).
89+
- Writing to `pgcatalog_slow_queries` from PR β (read-only).
90+
91+
## Recommended next actions when PR β becomes the active work
92+
93+
1. Re-invoke `superpowers:brainstorming` with this notes file as context.
94+
2. Answer the open questions above in the same Q-by-Q flow as PR α.
95+
3. Produce a proper design spec (`YYYY-MM-DD-suggestions-engine-pr-beta-design.md`).
96+
4. Feed spec into `superpowers:writing-plans`.
97+
5. Execute via `superpowers:subagent-driven-development`.
98+
99+
All PR α infrastructure (Bundle dataclass, `suggestions_bundles` catalog.py output key, test helpers) is ready for PR β to consume — no re-plumbing needed.

src/plone/pgcatalog/catalog.py

Lines changed: 4 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,6 @@
2828
from plone.pgcatalog.backends import BM25Backend
2929
from plone.pgcatalog.backends import get_backend
3030
from plone.pgcatalog.brain import CatalogSearchResults
31-
from plone.pgcatalog.columns import compute_path_info
3231
from plone.pgcatalog.columns import get_registry
3332
from plone.pgcatalog.extraction import extract_from_translators
3433
from plone.pgcatalog.extraction import extract_idx
@@ -496,12 +495,11 @@ def _set_pg_annotation(self, obj, uid=None):
496495
wrapper = self._wrap_object(obj)
497496
idx = self._extract_idx(wrapper)
498497
searchable_text = self._extract_searchable_text(wrapper)
499-
parent_path, path_depth = compute_path_info(uid)
500498

501-
# Store built-in path data in idx JSONB for unified path queries
502-
idx["path"] = uid
503-
idx["path_parent"] = parent_path
504-
idx["path_depth"] = path_depth
499+
# Path data lives in typed columns only. The typed `path` column
500+
# is set below via pending_data["path"]; parent_path and path_depth
501+
# are derived from it by CatalogStateProcessor.
502+
# See: docs/plans/2026-04-15-strip-path-from-idx-jsonb.md (#132)
505503

506504
pending_data = {
507505
"path": uid,

src/plone/pgcatalog/indexing.py

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -40,10 +40,8 @@ def catalog_object(conn, zoid, path, idx, searchable_text=None, language="simple
4040
"""
4141
parent_path, path_depth = compute_path_info(path)
4242

43-
# Store path data in idx JSONB for unified path queries
44-
idx.setdefault("path", path)
45-
idx.setdefault("path_parent", parent_path)
46-
idx.setdefault("path_depth", path_depth)
43+
# Path data lives in typed columns only (path, parent_path, path_depth).
44+
# See: docs/plans/2026-04-15-strip-path-from-idx-jsonb.md (#132)
4745

4846
# Extract registered extra idx columns (pops from idx → dedicated columns)
4947
extra = extract_extra_idx_columns(idx)
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
"""Migration helpers for evolving plone-pgcatalog schema/data in place."""
Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
"""One-shot migration: remove path/path_parent/path_depth from idx JSONB.
2+
3+
After issue #132, these three keys live exclusively in typed columns
4+
(path, parent_path, path_depth). This script removes them from existing
5+
rows that were written before the cleanup.
6+
7+
Idempotent and batched. Safe to re-run; safe to interrupt.
8+
9+
Usage (from a shell with a configured psycopg connection):
10+
11+
from plone.pgcatalog.migrations.strip_path_keys import run
12+
result = run(conn, batch_size=5000)
13+
print(result) # {"batches": N, "rows_updated": M}
14+
"""
15+
16+
import logging
17+
18+
19+
log = logging.getLogger(__name__)
20+
21+
# Rows that still have any of the three keys present in idx
22+
_DIRTY_PREDICATE = "idx ?| ARRAY['path', 'path_parent', 'path_depth']"
23+
24+
_BATCH_SQL = f"""
25+
WITH batch AS (
26+
SELECT zoid
27+
FROM object_state
28+
WHERE zoid > %(after_zoid)s
29+
AND idx IS NOT NULL
30+
AND {_DIRTY_PREDICATE}
31+
ORDER BY zoid
32+
LIMIT %(batch_size)s
33+
)
34+
UPDATE object_state os
35+
SET idx = idx - 'path' - 'path_parent' - 'path_depth'
36+
FROM batch
37+
WHERE os.zoid = batch.zoid
38+
RETURNING os.zoid
39+
"""
40+
41+
42+
def run(conn, batch_size: int = 5000) -> dict:
43+
"""Strip path keys from idx in batches.
44+
45+
The connection is switched to autocommit for the duration of the
46+
migration so each batch commits independently; the prior autocommit
47+
state is restored on return (even if an exception escapes). If the
48+
caller had an open transaction, ``conn.commit()`` is called first to
49+
flush pending work before flipping autocommit on.
50+
51+
Args:
52+
conn: psycopg connection. Each batch commits independently --
53+
the caller's transaction state (if any) is committed first
54+
before switching to autocommit, and the original autocommit
55+
flag is restored on exit.
56+
batch_size: rows per batch. Default 5000 keeps each UPDATE under
57+
~10 MB WAL and ~1 s on a typical pod.
58+
59+
Returns: {"batches": int, "rows_updated": int}
60+
"""
61+
original_autocommit = conn.autocommit
62+
if not original_autocommit:
63+
conn.commit() # flush any pending work; migration needs per-batch commits
64+
conn.autocommit = True
65+
66+
after_zoid = -1
67+
batches = 0
68+
total = 0
69+
70+
try:
71+
while True:
72+
with conn.cursor() as cur:
73+
cur.execute(
74+
_BATCH_SQL,
75+
{
76+
"after_zoid": after_zoid,
77+
"batch_size": batch_size,
78+
},
79+
)
80+
zoids = [
81+
row[0] if isinstance(row, tuple) else row["zoid"]
82+
for row in cur.fetchall()
83+
]
84+
85+
if not zoids:
86+
break
87+
88+
batches += 1
89+
total += len(zoids)
90+
after_zoid = max(zoids)
91+
log.info(
92+
"strip_path_keys: batch %d, %d rows, last zoid=%d, total=%d",
93+
batches,
94+
len(zoids),
95+
after_zoid,
96+
total,
97+
)
98+
99+
log.info("strip_path_keys: done. %d batches, %d rows updated.", batches, total)
100+
return {"batches": batches, "rows_updated": total}
101+
finally:
102+
conn.autocommit = original_autocommit

src/plone/pgcatalog/processor.py

Lines changed: 17 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
"""
77

88
from plone.pgcatalog.backends import get_backend
9+
from plone.pgcatalog.columns import compute_path_info
910
from plone.pgcatalog.columns import extract_extra_idx_columns
1011
from plone.pgcatalog.columns import get_extra_idx_columns
1112
from plone.pgcatalog.pending import _MISSING
@@ -228,10 +229,19 @@ def process(self, zoid, class_mod, class_name, state):
228229
idx = pending.get("idx")
229230
extra_values = extract_extra_idx_columns(idx)
230231

232+
# Path data lives in typed columns only. parent_path and path_depth
233+
# are derived from the canonical `path` field — not from idx.
234+
# See: docs/plans/2026-04-15-strip-path-from-idx-jsonb.md (#132)
235+
path = pending.get("path")
236+
if path:
237+
parent_path, path_depth = compute_path_info(path)
238+
else:
239+
parent_path, path_depth = None, None
240+
231241
result = {
232-
"path": pending.get("path"),
233-
"parent_path": idx.get("path_parent") if idx else None,
234-
"path_depth": idx.get("path_depth") if idx else None,
242+
"path": path,
243+
"parent_path": parent_path,
244+
"path_depth": path_depth,
235245
"idx": Json(idx) if idx else None,
236246
"searchable_text": pending.get("searchable_text"),
237247
**extra_values,
@@ -275,25 +285,18 @@ def finalize(self, cursor):
275285
{"zoid": zoid, "patch": Json(idx_updates), **extra_params},
276286
)
277287

278-
# Execute bulk path moves (one SQL per moved subtree)
288+
# Execute bulk path moves (one SQL per moved subtree).
289+
# Touches typed columns only — idx no longer carries path keys.
290+
# See: docs/plans/2026-04-15-strip-path-from-idx-jsonb.md (#132)
279291
moves = pop_all_pending_moves()
280292
for old_prefix, new_prefix, depth_delta in moves:
281293
cursor.execute(
282294
"""
283295
UPDATE object_state SET
284296
path = %(new)s || substring(path FROM length(%(old)s) + 1),
285297
parent_path = %(new)s || substring(parent_path FROM length(%(old)s) + 1),
286-
path_depth = path_depth + %(dd)s,
287-
idx = idx || jsonb_build_object(
288-
'path',
289-
%(new)s || substring(idx->>'path' FROM length(%(old)s) + 1),
290-
'path_parent',
291-
%(new)s || substring(idx->>'path_parent' FROM length(%(old)s) + 1),
292-
'path_depth',
293-
(idx->>'path_depth')::int + %(dd)s
294-
)
298+
path_depth = path_depth + %(dd)s
295299
WHERE path LIKE %(like)s
296-
AND idx IS NOT NULL
297300
""",
298301
{
299302
"old": old_prefix,

src/plone/pgcatalog/query.py

Lines changed: 18 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -514,12 +514,19 @@ def _handle_path(self, name, idx_key, spec):
514514

515515
paths = [_validate_path(p) for p in paths] # validates AND normalizes
516516

517-
# All path indexes (built-in "path" and additional like "tgpath")
518-
# store their data in idx JSONB and query via expression indexes.
519-
key = name if idx_key is None else idx_key
520-
expr_path = f"idx->>'{key}'"
521-
expr_parent = f"idx->>'{key}_parent'"
522-
expr_depth = f"(idx->>'{key}_depth')::integer"
517+
# Dispatch: the built-in "path" index lives in typed columns
518+
# (path, parent_path, path_depth). Custom path indexes
519+
# (e.g. "tgpath") still store their data in idx JSONB.
520+
# See: docs/plans/2026-04-15-strip-path-from-idx-jsonb.md (#132)
521+
if idx_key is None and name == "path":
522+
expr_path = "path"
523+
expr_parent = "parent_path"
524+
expr_depth = "path_depth"
525+
else:
526+
key = name if idx_key is None else idx_key
527+
expr_path = f"idx->>'{key}'"
528+
expr_parent = f"idx->>'{key}_parent'"
529+
expr_depth = f"(idx->>'{key}_depth')::integer"
523530

524531
if navtree:
525532
self._path_navtree(expr_path, expr_parent, paths[0], depth, navtree_start)
@@ -652,6 +659,11 @@ def _process_sort(self, sort_on_list, sort_order_list):
652659
continue
653660

654661
idx_type, idx_key, _source_attrs = entry
662+
# Built-in "path" sort lives in the typed `path` column (#132).
663+
# Custom PATH indexes (e.g. "tgpath") still store data in idx JSONB.
664+
if idx_type == IndexType.PATH and idx_key is None and sort_on == "path":
665+
parts.append(f"path {direction}")
666+
continue
655667
if idx_key is None:
656668
if idx_type == IndexType.PATH:
657669
idx_key = sort_on

0 commit comments

Comments
 (0)