Commit 09fc4ee

jensens and claude authored
fix: PG-driven clearFindAndRebuild — flat memory on large sites (#39) (#40)
* fix: replace ZopeFindAndApply with PG-driven iteration in clearFindAndRebuild (#39)

  Query object_state directly, filtering out known non-content classes (BTrees, Blob, field values, etc.; ~96% of rows). Load remaining objects via jar.get(), check for reindexObject, index if yes. No acquisition parent chains on the call stack means cacheMinimize() can ghost all objects, which keeps memory flat on large sites.

* fix: use astral-sh/setup-uv@v7 (v8 major tag doesn't exist yet)

* fix: use pool connection for rebuild cursor, not MVCC read connection

  The MVCC read connection has a REPEATABLE READ snapshot that may predate clear_catalog_data. Use a fresh pool connection so the server-side cursor sees current data. Also fixes try/finally for pool connection cleanup.

* fix: use client-side cursor for rebuild query

  Server-side cursors (name=...) require a transaction block, but pool connections are in autocommit mode. Use a plain cursor with fetchall() instead; the query only returns integer zoids with ~96% filtered out, so the result set is small.

* fix: use dict key access for pool cursor rows

  Pool connections use dict_row factory, not tuple rows.

* docs: design spec for skipping portal_transforms when Tika is active

* fix: use row[0] for pool cursor (storage pool uses tuple rows, not dict)

  The zodb-pgjsonb storage pool does not use dict_row factory. Use index-based access which works for both tuple and dict rows.

* docs: implementation plan for skip-transforms-tika (#41)

* ci: trigger rebuild after row[0] fix

* fix: use row['zoid'] for pool cursor (both pools use dict_row)

  Both zodb-pgjsonb storage pool and pgcatalog fallback pool use psycopg dict_row factory. Integer index access gives KeyError.

* debug: add warning logging for rebuild failures

* fix: snapshot paths before clearing, use unrestrictedTraverse

  jar.get() returns unwrapped objects without acquisition context, but plone.indexer adapters and getPhysicalPath() need acquisition. Now snapshots paths from object_state BEFORE clearing, then uses unrestrictedTraverse to load acquisition-wrapped objects.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 62a1722 commit 09fc4ee
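
The last item of the commit message (snapshot the paths first, clear, then traverse and reindex) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from this commit: `rebuild_from_snapshot`, `snapshot_paths`, `clear_catalog`, and `traverse` are hypothetical names, and the callables are injected so the sketch runs without Zope or PostgreSQL. In a real site, `traverse` would presumably be something like the site's `unrestrictedTraverse`.

```python
def rebuild_from_snapshot(snapshot_paths, clear_catalog, traverse):
    """Snapshot paths, clear the catalog, then reindex what can be reindexed.

    All three arguments are hypothetical callables supplied by the caller.
    """
    # Take the snapshot BEFORE clearing, as the commit message describes.
    paths = list(snapshot_paths())
    clear_catalog()

    indexed = 0
    for path in paths:
        try:
            obj = traverse(path)
        except (KeyError, AttributeError):
            # The object vanished or the path no longer resolves: skip it.
            continue
        reindex = getattr(obj, "reindexObject", None)
        if callable(reindex):
            reindex()
            indexed += 1
    return indexed


class FakeContent:
    """Stand-in for a content object that supports reindexObject()."""

    def __init__(self):
        self.reindexed = False

    def reindexObject(self):
        self.reindexed = True


# A dict plays the role of the site; object() has no reindexObject.
site = {"/plone/doc": FakeContent(), "/plone/tool": object()}
calls = []
count = rebuild_from_snapshot(
    snapshot_paths=lambda: ["/plone/doc", "/plone/tool", "/plone/gone"],
    clear_catalog=lambda: calls.append("clear"),
    traverse=site.__getitem__,
)
```

Only the object that exposes `reindexObject` is counted; the tool-like object and the missing path are skipped without aborting the loop.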

8 files changed

Lines changed: 580 additions & 157 deletions

.github/workflows/docs.yaml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ jobs:
         with:
           python-version: "3.13"

-      - uses: astral-sh/setup-uv@v8
+      - uses: astral-sh/setup-uv@v7

       - name: Install docs dependencies
         working-directory: docs

.github/workflows/qa.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ jobs:
1414
steps:
1515
- uses: actions/checkout@v5
1616

17-
- uses: astral-sh/setup-uv@v8
17+
- uses: astral-sh/setup-uv@v7
1818

1919
- name: Run ruff check
2020
run: uvx ruff check .

.github/workflows/tests.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -54,7 +54,7 @@ jobs:
5454
with:
5555
fetch-depth: 0
5656

57-
- uses: astral-sh/setup-uv@v8
57+
- uses: astral-sh/setup-uv@v7
5858
with:
5959
enable-cache: true
6060
cache-dependency-glob: "pyproject.toml"
@@ -93,7 +93,7 @@ jobs:
9393
steps:
9494
- uses: actions/checkout@v5
9595

96-
- uses: astral-sh/setup-uv@v8
96+
- uses: astral-sh/setup-uv@v7
9797
with:
9898
enable-cache: true
9999
cache-dependency-glob: "pyproject.toml"

CHANGES.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,15 @@
11
# Changelog
22

3+
## 1.0.0b24
4+
5+
### Changed
6+
7+
- `clearFindAndRebuild` now uses PG-driven iteration instead of
8+
`ZopeFindAndApply`. Queries `object_state` directly, filtering out
9+
known non-content classes (~96% of rows). No acquisition parent
10+
chains on the call stack means `cacheMinimize()` can ghost all
11+
objects — flat memory on large sites. Fixes #39.
12+
313
## 1.0.0b23
414

515
### Fixed
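
The filtering step from the changelog entry (query `object_state`, skip known non-content classes, read back integer zoids with a plain cursor and `fetchall()`) can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the query from the commit: SQLite from the Python standard library stands in for PostgreSQL so the example runs anywhere, and the `class_mod`/`class_name` columns, the `content_zoids` helper, and the skip list are hypothetical.

```python
import sqlite3

# Hypothetical skip list; the real set of non-content classes is not shown here.
NON_CONTENT_CLASSES = [
    ("BTrees.OOBTree", "OOBTree"),
    ("ZODB.blob", "Blob"),
]


def content_zoids(conn):
    """Return the zoids of rows whose class is not in the skip list."""
    # sqlite3.Row allows row["zoid"], similar in spirit to dict-style rows.
    conn.row_factory = sqlite3.Row
    where = " AND ".join(
        "NOT (class_mod = ? AND class_name = ?)" for _ in NON_CONTENT_CLASSES
    )
    params = [value for pair in NON_CONTENT_CLASSES for value in pair]
    sql = f"SELECT zoid FROM object_state WHERE {where} ORDER BY zoid"
    # Plain (client-side) cursor plus fetchall(): the result is only integers.
    rows = conn.execute(sql, params).fetchall()
    return [row["zoid"] for row in rows]


# Assumed, simplified table layout for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object_state ("
    "zoid INTEGER PRIMARY KEY, class_mod TEXT, class_name TEXT)"
)
conn.executemany(
    "INSERT INTO object_state VALUES (?, ?, ?)",
    [
        (1, "BTrees.OOBTree", "OOBTree"),
        (2, "plone.app.contenttypes.content", "Document"),
        (3, "ZODB.blob", "Blob"),
        (4, "plone.app.contenttypes.content", "File"),
    ],
)
zoids = content_zoids(conn)
conn.close()
```

Accessing the column by name (`row["zoid"]`) mirrors the final choice recorded in the commit message, where integer index access failed against dict-style rows.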
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+# Skip portal_transforms for IFile when Tika is active
+
+**Date:** 2026-04-01
+**Status:** Approved
+**Issue:** N/A (performance improvement for Tika-enabled sites)
+
+## Problem
+
+When Plone indexes a `File` object, the `SearchableText_file` indexer
+(from `plone.app.contenttypes`) calls `portal_transforms` to extract
+text from the blob's binary data (PDF, DOCX, etc.). This is:
+
+1. **Expensive:** spawns external processes (pdftotext, wv, etc.)
+   synchronously during the request.
+2. **Redundant when Tika is configured:** the async Tika worker
+   already extracts text from blobs and merges it into
+   `searchable_text` via `pgcatalog_merge_extracted_text`.
+3. **Wasteful even when transforms are missing:** `_findPath()` does a
+   full BFS graph traversal of the transform registry before
+   concluding no path exists, so it is not a cheap dict lookup.
+
+## Scope
+
+Only `SearchableText_file` (registered for `IFile`) calls
+`portal_transforms`. All other Plone SearchableText indexers
+(IDocument, INewsItem, ICollection, IFolder, ILink) only concatenate
+text fields, with no transforms involved.
+
+`IImage` does NOT extend `IFile` and has no transform-based indexer.
+
+## Design
+
+### New file: `src/plone/pgcatalog/indexers.py`
+
+A `SearchableText` indexer adapter registered for `IFile`:
+
+- **When `PGCATALOG_TIKA_URL` is set:** return `SearchableText(obj)`
+  (Title + Description only). No `_findPath`, no blob I/O, no
+  transform call. The Tika worker fills in the blob text
+  asynchronously as weight 'C' in the tsvector.
+- **When `PGCATALOG_TIKA_URL` is NOT set:** delegate to the original
+  `plone.app.contenttypes.indexers.SearchableText_file` so the full
+  transform pipeline runs as before.
+
+### ZCML registration
+
+Register in `overrides.zcml` to override the `plone.app.contenttypes`
+registration for `IFile`.
+
+### What doesn't change
+
+- `portal_transforms` is untouched: no unregister/re-register.
+- The Tika enqueue pipeline in `processor.py`, which already works.
+- Custom SearchableText indexers for other interfaces, which are
+  unaffected (adapter specificity ensures more specific registrations win).
+- Tsvector weighting: Title 'A', Description 'B', body 'D',
+  Tika-extracted text 'C'.
+
+### Fallback behavior
+
+When `PGCATALOG_TIKA_URL` is NOT set, the override delegates to the
+original indexer. Zero impact for sites not using Tika.
+
+## Custom types with blob fields
+
+The override only covers `IFile`. If a custom content type has blob
+fields and uses its own `SearchableText` indexer that calls
+`portal_transforms`, it will NOT be automatically short-circuited.
+
+Developers with such custom types should either:
+
+1. Make their type provide `IFile` (then the override applies), or
+2. Register a similar conditional indexer for their custom interface
+   that checks `PGCATALOG_TIKA_URL` and skips transforms when set.
+
+This should be documented in the package's how-to section.
+
+## Implementation
+
+1. Create `src/plone/pgcatalog/indexers.py` with the conditional
+   indexer function.
+2. Add the adapter registration to `overrides.zcml`.
+3. Add tests: with Tika URL set (returns Title+Description only),
+   without Tika URL (delegates to original).
+4. Add documentation section about custom blob types.
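
The conditional behavior in the design spec above could look roughly like the following. This is a minimal sketch, not the contents of `indexers.py`: the function names, the injected `original_indexer` and `environ` parameters, and the simplified Title + Description join are assumptions made so the example runs without Plone installed. A real adapter would be registered through `plone.indexer` and ZCML as the spec describes.

```python
import os

TIKA_ENV_VAR = "PGCATALOG_TIKA_URL"  # environment variable named in the spec


def title_and_description(obj):
    """Join the non-empty title and description (simplified stand-in)."""
    parts = (getattr(obj, "title", ""), getattr(obj, "description", ""))
    return " ".join(part for part in parts if part)


def searchable_text_file(obj, original_indexer, environ=None):
    """Skip the transform pipeline when a Tika URL is configured.

    `original_indexer` is a hypothetical callable standing in for the
    stock file indexer that runs portal_transforms.
    """
    environ = os.environ if environ is None else environ
    if environ.get(TIKA_ENV_VAR):
        # Tika extracts the blob text asynchronously, so only cheap
        # metadata is indexed here.
        return title_and_description(obj)
    # No Tika configured: behave exactly as before.
    return original_indexer(obj)


class FakeFile:
    """Stand-in for a File content object."""

    title = "Report"
    description = "Q1 figures"


def expensive_indexer(obj):
    """Pretend transform-based indexer, used only for the demonstration."""
    return "full text from transforms"


with_tika = searchable_text_file(
    FakeFile(), expensive_indexer, {TIKA_ENV_VAR: "http://tika:9998"}
)
without_tika = searchable_text_file(FakeFile(), expensive_indexer, {})
```

Passing the environment as a parameter is a choice made here to keep both branches easy to exercise in tests, matching the two test cases listed in the implementation plan.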
