| 1 | +# Skip portal_transforms for IFile when Tika is active |
| 2 | + |
| 3 | +**Date:** 2026-04-01 |
| 4 | +**Status:** Approved |
| 5 | +**Issue:** N/A (performance improvement for Tika-enabled sites) |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +When Plone indexes a `File` object, the `SearchableText_file` indexer |
| 10 | +(from `plone.app.contenttypes`) calls `portal_transforms` to extract |
| 11 | +text from the blob's binary data (PDF, DOCX, etc.). This is: |
| 12 | + |
| 13 | +1. **Expensive:** spawns external processes (pdftotext, wv, etc.) |
| 14 | + synchronously during the request. |
| 15 | +2. **Redundant when Tika is configured:** the async Tika worker |
| 16 | + already extracts text from blobs and merges it into |
| 17 | + `searchable_text` via `pgcatalog_merge_extracted_text`. |
| 18 | +3. **Wasteful even when transforms are missing:** `_findPath()` does a |
| 19 | + full BFS graph traversal of the transform registry before |
| 20 | + concluding no path exists — not a cheap dict lookup. |
| 21 | + |
| 22 | +## Scope |
| 23 | + |
| 24 | +Only `SearchableText_file` (registered for `IFile`) calls |
| 25 | +`portal_transforms`. All other Plone SearchableText indexers |
| 26 | +(IDocument, INewsItem, ICollection, IFolder, ILink) only concatenate |
| 27 | +text fields — no transforms involved. |
| 28 | + |
| 29 | +`IImage` does NOT extend `IFile` and has no transform-based indexer. |
| 30 | + |
| 31 | +## Design |
| 32 | + |
| 33 | +### New file: `src/plone/pgcatalog/indexers.py` |
| 34 | + |
| 35 | +A `SearchableText` indexer adapter registered for `IFile`: |
| 36 | + |
| 37 | +- **When `PGCATALOG_TIKA_URL` is set:** return `SearchableText(obj)` |
| 38 | + (Title + Description only). No `_findPath`, no blob I/O, no |
| 39 | + transform call. The Tika worker fills in the blob text |
| 40 | + asynchronously as weight 'C' in the tsvector. |
| 41 | +- **When `PGCATALOG_TIKA_URL` is NOT set:** delegate to the original |
| 42 | + `plone.app.contenttypes.indexers.SearchableText_file` so the full |
| 43 | + transform pipeline runs as before. |
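The branching described above can be sketched as follows. This is a minimal illustration, not the actual contents of `indexers.py`. It assumes `PGCATALOG_TIKA_URL` is read from the process environment. `light_text` and `full_text` are hypothetical stand-ins for `SearchableText(obj)` and `SearchableText_file`, and the Tika URL is a placeholder, so the sketch runs without Plone installed.

```python
import os
from types import SimpleNamespace


def light_text(obj):
    """Hypothetical stand-in for SearchableText(obj): Title + Description."""
    return " ".join(part for part in (obj.title, obj.description) if part)


def full_text(obj):
    """Hypothetical stand-in for SearchableText_file: also adds blob text."""
    return " ".join(part for part in (light_text(obj), obj.blob_text) if part)


def searchable_text_file(obj, environ=None):
    """Skip the transform pipeline when a Tika URL is configured."""
    environ = os.environ if environ is None else environ
    # Assumption: a missing or empty value means Tika is not configured.
    if environ.get("PGCATALOG_TIKA_URL"):
        return light_text(obj)
    return full_text(obj)


# Placeholder object and URL, for illustration only.
doc = SimpleNamespace(title="Report", description="Q1 numbers", blob_text="pdf body")
with_tika = searchable_text_file(doc, {"PGCATALOG_TIKA_URL": "http://tika:9998"})
without_tika = searchable_text_file(doc, {})
```

With the URL set, only the cheap fields are returned; without it, the stand-in for the original indexer runs unchanged.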
| 44 | + |
| 45 | +### ZCML registration |
| 46 | + |
| 47 | +Register in `overrides.zcml` to override the `plone.app.contenttypes` |
| 48 | +registration for `IFile`. |
| 49 | + |
| 50 | +### What doesn't change |
| 51 | + |
| 52 | +- `portal_transforms` is untouched — no unregister/re-register. |
| 53 | +- The Tika enqueue pipeline in `processor.py` is left as-is; it already works. |
| 54 | +- Custom SearchableText indexers for other interfaces — unaffected |
| 55 | + (adapter specificity ensures more specific registrations win). |
| 56 | +- Tsvector weighting: Title 'A', Description 'B', body 'D', |
| 57 | + Tika-extracted text 'C'. |
| 58 | + |
| 59 | +### Fallback behavior |
| 60 | + |
| 61 | +When `PGCATALOG_TIKA_URL` is NOT set, the override delegates to the |
| 62 | +original indexer. Sites that do not use Tika are unaffected. |
| 63 | + |
| 64 | +## Custom types with blob fields |
| 65 | + |
| 66 | +The override only covers `IFile`. If a custom content type has blob |
| 67 | +fields and uses its own `SearchableText` indexer that calls |
| 68 | +`portal_transforms`, it will NOT be automatically short-circuited. |
| 69 | + |
| 70 | +Developers with such custom types should either: |
| 71 | + |
| 72 | +1. Make their type provide `IFile` (then the override applies), or |
| 73 | +2. Register a similar conditional indexer for their custom interface |
| 74 | + that checks `PGCATALOG_TIKA_URL` and skips transforms when set. |
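For option 2, one possible shape is a small factory that wraps a type's existing indexer. This is a sketch under assumptions: `skip_transforms_when_tika`, `my_type_light` and `my_type_full` are hypothetical names that the package does not provide, and `PGCATALOG_TIKA_URL` is assumed to be an environment variable.

```python
import os


def skip_transforms_when_tika(light_indexer, full_indexer, environ=None):
    """Build an indexer that uses light_indexer when Tika is configured."""

    def indexer(obj):
        env = os.environ if environ is None else environ
        if env.get("PGCATALOG_TIKA_URL"):
            return light_indexer(obj)
        return full_indexer(obj)

    return indexer


# Hypothetical indexers for a custom type with a blob field.
def my_type_light(obj):
    return obj["title"]


def my_type_full(obj):
    # Stands in for an indexer that calls portal_transforms on the blob.
    return obj["title"] + " " + obj["attachment_text"]


item = {"title": "Spec", "attachment_text": "extracted words"}
tika_on = skip_transforms_when_tika(
    my_type_light, my_type_full, {"PGCATALOG_TIKA_URL": "http://tika:9998"}
)
tika_off = skip_transforms_when_tika(my_type_light, my_type_full, {})
```

In a real site the returned callable would still need to be registered as the `SearchableText` indexer for the custom interface.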
| 75 | + |
| 76 | +This should be documented in the package's how-to section. |
| 77 | + |
| 78 | +## Implementation |
| 79 | + |
| 80 | +1. Create `src/plone/pgcatalog/indexers.py` with the conditional |
| 81 | + indexer function. |
| 82 | +2. Add the adapter registration to `overrides.zcml`. |
| 83 | +3. Add tests: with Tika URL set (returns Title+Description only), |
| 84 | + without Tika URL (delegates to original). |
| 85 | +4. Add documentation section about custom blob types. |
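The two test cases in step 3 might look roughly like this. It is a sketch only: `indexer_under_test` is a hypothetical stand-in for the real indexer, and the environment is patched with `unittest.mock.patch.dict` on the assumption that `PGCATALOG_TIKA_URL` is an environment variable.

```python
import os
import unittest
from unittest import mock


def indexer_under_test(obj):
    """Hypothetical stand-in for the conditional indexer."""
    base = obj["title"] + " " + obj["description"]
    if os.environ.get("PGCATALOG_TIKA_URL"):
        return base
    return base + " " + obj["blob_text"]


FILE = {"title": "Report", "description": "Q1", "blob_text": "pdf body"}


class ConditionalIndexerTest(unittest.TestCase):
    def test_tika_url_set_returns_title_and_description_only(self):
        env = {"PGCATALOG_TIKA_URL": "http://tika:9998"}
        with mock.patch.dict(os.environ, env):
            self.assertEqual(indexer_under_test(FILE), "Report Q1")

    def test_tika_url_unset_delegates_to_full_indexer(self):
        # clear=True empties os.environ inside the block and restores it after.
        with mock.patch.dict(os.environ, clear=True):
            self.assertEqual(indexer_under_test(FILE), "Report Q1 pdf body")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConditionalIndexerTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```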