feat: cache reader content images for offline access #2595
Draft
anpryl wants to merge 21 commits into
When CRAWLER_STORE_CONTENT_IMAGES is enabled, images found in extracted reader HTML are downloaded, stored as local assets, and the HTML is rewritten to reference local asset URLs. This allows reader content to render fully offline. Includes per-bookmark image count/size limits, storage quota enforcement, and bounded download concurrency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
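The rewrite step described above can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: the helper name `rewriteImageSources` and the `assetIdBySourceUrl` map are hypothetical, and a regex is used only to keep the example dependency-free (the real worker may well use an HTML parser). Only the `/api/assets/{assetId}` URL shape comes from the PR description.

```typescript
// Hypothetical helper: swaps remote <img src> URLs for local asset URLs.
// assetIdBySourceUrl maps an original image URL to the ID of its stored asset.
function rewriteImageSources(
  html: string,
  assetIdBySourceUrl: Map<string, string>,
): string {
  // Simplification: only handles double-quoted src attributes on <img> tags.
  // The leading \s keeps attributes such as data-src from matching.
  return html.replace(
    /(<img\b[^>]*?\ssrc=")([^"]*)(")/gi,
    (match: string, prefix: string, url: string, suffix: string) => {
      const assetId = assetIdBySourceUrl.get(url);
      // Images that were not cached keep their original URL.
      return assetId ? `${prefix}/api/assets/${assetId}${suffix}` : match;
    },
  );
}
```

Images that failed to download stay on their original URL, so a partially cached page still renders when online.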
Remove unnecessary ZAssetType import and hiddenAssetType intermediate variable — TypeScript already validates string literals through the asset type's own type definition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hitecture docs The file picker accept attributes had ".jgp" instead of ".jpg", preventing users from uploading lowercase .jpg files for banner images and replacements. Also added the content image caching worker to the architecture documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The doc said "three job types" but listed four after adding content image caching. Changed to "some of the job types include" since the doc only lists a subset of the ~13 actual worker types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use deterministic asset IDs (SHA-256 of bookmarkId:sourceUrl) to enable skip-if-exists on retries and idempotent re-crawls via upsert. Download images sequentially with per-image exponential backoff (10 retries, 30s cap) to handle 429 rate limiting. Add browser-like headers (Chrome 146 UA, Accept image/*, Referer) to reduce rate limiting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
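A rough sketch of the two ideas in this commit, under stated assumptions. The 32-character truncation follows the `SHA-256(bookmarkId:sourceUrl).slice(0,32)` formula in the PR description; hex encoding of the digest, the function names, and the 1-second base delay are assumptions made for illustration. Only the 30-second cap is taken from the commit message.

```typescript
import { createHash } from "node:crypto";

// Same bookmark + same source URL always yields the same asset ID, so a retry
// can check whether the asset already exists and skip the download.
// Assumption: the digest is hex-encoded before truncation.
function contentImageAssetId(bookmarkId: string, sourceUrl: string): string {
  return createHash("sha256")
    .update(`${bookmarkId}:${sourceUrl}`)
    .digest("hex")
    .slice(0, 32);
}

// Exponential backoff with a hard cap. attempt is zero-based.
// Assumption: 1s base delay; the 30s cap matches the commit message.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With these defaults the delay doubles from 1s up to 16s on the fifth attempt and stays at 30s from the sixth attempt onward.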
Add queue stats and "Cache Content Images for All Links" button to the Background Jobs admin page, allowing admins to trigger bulk content image caching for all existing bookmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only show the Content Image Jobs card in the admin panel when CRAWLER_STORE_CONTENT_IMAGES is enabled, to avoid confusing admins who haven't opted in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Accept header already requested these types from servers but the content-type allowlist rejected them. Add a worker-scoped CONTENT_IMAGE_ASSET_TYPES set so these formats are cached without changing the global IMAGE_ASSET_TYPES used for user uploads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add contentImageStatus column to bookmarkLinks so the admin panel can show pending/failed counters for content image jobs, matching the pattern already used by crawlStatus for the crawler.
- Schema: nullable contentImageStatus (pending | failure | success)
- Worker: set success in onComplete, failure in onError (final retry)
- Enqueue: set pending when queueing from crawler and admin bulk action
- Admin API: query pending/failed counts, expose in stats response
- Frontend: pass full stats to Content Image Jobs card
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some servers respond with non-image Content-Type headers (text/html, application/octet-stream) for actual images. Fall back to checking file signatures (JPEG, PNG, GIF, WebP, SVG, AVIF) before rejecting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
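The signature fallback could look roughly like this. It is a sketch, not the PR's implementation: the function name `sniffImageType` is hypothetical, and the SVG check is a deliberately loose text heuristic. The byte signatures themselves are the well-known ones for each format.

```typescript
// Hypothetical helper: returns a MIME type guessed from the leading bytes,
// or null when no known image signature matches.
function sniffImageType(bytes: Uint8Array): string | null {
  const hasBytes = (sig: number[], offset = 0): boolean =>
    bytes.length >= offset + sig.length &&
    sig.every((b, i) => bytes[offset + i] === b);
  const ascii = (s: string): number[] => Array.from(s, (c) => c.charCodeAt(0));

  if (hasBytes([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (hasBytes([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (hasBytes(ascii("GIF87a")) || hasBytes(ascii("GIF89a"))) return "image/gif";
  // WebP is a RIFF container with "WEBP" at byte offset 8.
  if (hasBytes(ascii("RIFF")) && hasBytes(ascii("WEBP"), 8)) return "image/webp";
  // AVIF is an ISO-BMFF file: "ftyp" at offset 4, then the major brand.
  if (
    hasBytes(ascii("ftyp"), 4) &&
    (hasBytes(ascii("avif"), 8) || hasBytes(ascii("avis"), 8))
  ) {
    return "image/avif";
  }
  // SVG is text, so inspect the start of the document instead of magic bytes.
  const head = new TextDecoder()
    .decode(bytes.subarray(0, 512))
    .replace(/^\s+/, "")
    .toLowerCase();
  if (head.startsWith("<svg") || (head.startsWith("<?xml") && head.includes("<svg"))) {
    return "image/svg+xml";
  }
  return null;
}
```

A caller would consult this only after the `Content-Type` check fails, and reject the download when it returns `null`.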
Add a "Recache Failed Links Only" button to the Content Image Jobs card, matching the pattern from the Crawler Jobs card. The existing bulk action now requires an explicit contentImageStatus filter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After successfully downloading new images and rewriting HTML, delete content image assets that are no longer referenced in the current HTML. This prevents orphaned files from accumulating when page content changes between crawls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
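One way to picture the stale-asset check is a set difference between the asset IDs still referenced in the rewritten HTML and the content image assets already stored for the bookmark. The helper name `findStaleAssetIds` and the assumed ID character set are illustrative only; how the worker actually lists and deletes assets is not shown in this PR text.

```typescript
// Hypothetical helper: returns the stored asset IDs that the current HTML
// no longer references and that are therefore candidates for deletion.
function findStaleAssetIds(html: string, existingAssetIds: string[]): string[] {
  // Assumption: asset IDs consist of URL-safe characters only.
  const pattern = /\/api\/assets\/([A-Za-z0-9_-]+)/g;
  const referenced = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    referenced.add(match[1]);
  }
  return existingAssetIds.filter((id) => !referenced.has(id));
}
```

Running this only after a successful download and rewrite, as the commit describes, avoids deleting assets when a crawl fails partway through.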
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd SVG Align contentImageWorker's LAZY_SRC_ATTRS with the preprocessor's list to handle plugin paths (e.g. Reddit) that bypass normalizeLazyLoadImages. Add srcset/data-srcset/data-lazy-srcset parsing (picks largest candidate), high-res attrs (data-hi-res-src, data-highres, data-full-src), and SVG <image href/xlink:href> extraction and rewriting. Refactor downloadImage to accept an options object with configurable maxRetries, fixing two tests that timed out with the hardcoded MAX_RETRIES=10. Add integration tests with a real HTTP server serving minimal valid images for all supported patterns (23 URLs across 7 formats, 12 lazy-load attrs, 3 srcset variants, and SVG image elements). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
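The "picks largest candidate" behaviour for `srcset` might be sketched as below. This is a simplified illustration with a hypothetical function name: it splits candidates on a comma followed by whitespace, so it does not cover every form the HTML specification allows, and it compares descriptor numbers directly on the assumption that a single `srcset` does not mix `w` and `x` descriptors.

```typescript
// Hypothetical helper: returns the URL with the largest width (w) or
// density (x) descriptor, or null when nothing parses.
function pickLargestFromSrcset(srcset: string): string | null {
  let bestUrl: string | null = null;
  let bestScore = -Infinity;
  // Simplification: candidates are separated by a comma plus whitespace.
  for (const part of srcset.split(/,\s+/)) {
    const tokens = part.trim().replace(/,$/, "").split(/\s+/);
    const url = tokens[0];
    if (!url) continue;
    // A candidate without a descriptor counts as 1x.
    const descriptor = tokens[1] ?? "1x";
    const parsed = /^(\d+(?:\.\d+)?)([wx])$/.exec(descriptor);
    if (!parsed) continue;
    const score = parseFloat(parsed[1]);
    if (score > bestScore) {
      bestScore = score;
      bestUrl = url;
    }
  }
  return bestUrl;
}
```

For example, `pickLargestFromSrcset("a.jpg 320w, b.jpg 1280w, c.jpg 640w")` selects `b.jpg`, so only one image per `srcset` counts against the per-bookmark limit.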
ee71d3b to 0fc3876
The saveAsset function validates content types against a hardcoded allowlist that excludes SVG, AVIF, and APNG. Content images with these types would be downloaded but fail to persist silently. Add SUPPORTED_CONTENT_IMAGE_TYPES set extending the base allowlist, and an optional supportedTypes parameter to saveAsset so callers can widen the allowlist without changing defaults for uploads/bookmarks. SVGs are safe here because serveAsset applies sandbox CSP headers that block script execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bookmark debugger only had bulk recache actions. This adds a per-bookmark "Re-cache images" button so admins can trigger content image caching for a single bookmark without recaching everything. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Surfaces the per-bookmark contentImageStatus field in the Status section so admins can see whether image caching succeeded, failed, or is pending. Only shown when the field is non-null (i.e. image crawling has been triggered for that bookmark). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mobile app and other API-key clients can't resolve relative /api/assets/ URLs or attach auth headers to <img> requests in rendered HTML. When the request comes through Bearer token auth and includeContent is true, replace asset URLs with base64 data URIs by reading assets from storage at serve time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
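A minimal sketch of the serve-time replacement, assuming a Node.js runtime. The `ReadAsset` callback and the function name are hypothetical stand-ins for whatever storage API the server uses; only the `/api/assets/` URL shape and the base64 data URI output come from the commit message.

```typescript
// Hypothetical storage accessor: resolves to null when the asset is missing.
type ReadAsset = (
  assetId: string,
) => Promise<{ contentType: string; data: Uint8Array } | null>;

// Replaces every /api/assets/{id} reference with an inline data URI so that
// clients without cookie auth can render the HTML as-is.
async function inlineAssetsAsDataUris(
  html: string,
  readAsset: ReadAsset,
): Promise<string> {
  // Assumption: asset IDs consist of URL-safe characters only.
  const pattern = /\/api\/assets\/([A-Za-z0-9_-]+)/g;
  const ids = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    ids.add(match[1]);
  }

  // Read each distinct asset once, even if it is referenced several times.
  const dataUris = new Map<string, string>();
  for (const id of Array.from(ids)) {
    const asset = await readAsset(id);
    if (asset) {
      const base64 = Buffer.from(asset.data).toString("base64");
      dataUris.set(id, `data:${asset.contentType};base64,${base64}`);
    }
  }

  // Missing assets keep their original URL.
  return html.replace(pattern, (url: string, id: string) => dataUris.get(id) ?? url);
}
```

Base64 inflates each image by roughly a third, so this trades response size for the ability to render without authenticated image requests.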
Summary
Adds a new `contentImageWorker` that downloads images referenced in extracted reader content HTML, saves them as `CONTENT_IMAGE` assets, and rewrites `<img src>` URLs to point to `/api/assets/{assetId}`, so reader view images survive even when the original source goes offline or blocks hotlinking.

Closes #1563
Closes #2363
Key design decisions
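- Opt-in: off by default (`CRAWLER_STORE_CONTENT_IMAGES=false` env var) so existing users aren't surprised by extra storage/bandwidth
- Deterministic asset IDs: `SHA-256(bookmarkId:sourceUrl).slice(0,32)` enables skip-if-exists on retries and idempotent re-crawls
- Browser-like request headers: `Accept: image/*`, `Referer` from source page
- File-signature fallback when the server sends a non-image `Content-Type`
- Additional image formats are worker-scoped (`CONTENT_IMAGE_ASSET_TYPES`, does not alter global `IMAGE_ASSET_TYPES`)
- Content image assets are hidden from `AttachmentBox`; `isAllowedToAttach/Detach` both return `false`

Changes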
New files
- `apps/workers/workers/contentImageWorker.ts`: the worker (extraction, download, rewrite, stale cleanup)
- `apps/workers/workers/contentImageWorker.test.ts`: 91 unit tests
- `apps/workers/workers/contentImageWorker.integration.test.ts`: 5 integration tests with real HTTP server
- `packages/db/drizzle/0081_add_content_image_status.sql`: adds nullable `contentImageStatus` column to `bookmarkLinks`

Modified files
- `packages/db/schema.ts`: `contentImageStatus` column (pending/failure/success), `CONTENT_IMAGE` asset type
- `packages/shared/config.ts`: 5 new env vars for feature configuration
- `packages/shared/assetdb.ts`: `SUPPORTED_CONTENT_IMAGE_TYPES` set, optional `supportedTypes` param on `saveAsset` (both store implementations)
- `packages/shared-server/src/queues.ts`: `ContentImageQueue` definition
- `packages/trpc/routers/admin.ts`: content image stats, bulk `recacheContentImages` mutation, per-bookmark `adminRecacheContentImagesBookmark` mutation, `contentImageStatus` in debug info output
- `packages/trpc/models/bookmarks.ts`: `contentImageStatus` in `buildDebugInfo` response
- `packages/trpc/stats.ts`: content image queue stats
- `packages/trpc/index.ts`: export content image queue
- `packages/trpc/lib/attachments.ts`: `CONTENT_IMAGE` not user-attachable
- `apps/workers/workers/crawlerWorker.ts`: enqueue content image job after crawl completes
- `apps/workers/index.ts`: register content image worker
- `apps/web/components/admin/BackgroundJobs.tsx`: Content Image Jobs card (hidden when feature disabled) with pending/failed counters, bulk recache actions
- `apps/web/components/admin/BookmarkDebugger.tsx`: per-bookmark "Re-cache images" action button, Image Crawl Status badge in status section
- `apps/web/components/dashboard/preview/AttachmentBox.tsx`: filter `CONTENT_IMAGE` assets from user-facing attachment list
- `apps/web/lib/i18n/locales/en/translation.json`: i18n strings
- `apps/web/lib/i18n/locales/en_US/translation.json`: i18n strings
- `docs/docs/03-configuration/01-environment-variables.md`: new env var docs
- `docs/docs/08-development/04-architecture.md`: updated worker count

Image extraction capabilities
- Lazy-load and high-res attributes: `data-src`, `data-actualsrc`, `data-srv`, `data-original`, `data-lazy`, `data-lazy-src`, `data-lazyload`, `data-img-src`, `data-url`, `data-hi-res-src`, `data-highres`, `data-full-src`
- `srcset`, `data-srcset`, `data-lazy-srcset`: picks largest candidate by width/density
- SVG `<image href>` and `<image xlink:href>` extraction and rewriting
- `<source>` inside `<picture>`, cleans lazy-load attrs after rewriting

Configuration
| Env var | Default |
| --- | --- |
| `CRAWLER_STORE_CONTENT_IMAGES` | `false` |
| `CRAWLER_CONTENT_IMAGE_MAX_COUNT` | `50` |
| `CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB` | `5` |
| `CONTENT_IMAGE_NUM_WORKERS` | `1` |
| `CONTENT_IMAGE_JOB_TIMEOUT_SEC` | `120` |

Test plan
Known limitations
- [...] `/api/assets/` URLs, not external URLs). Use Re-crawl first to fetch fresh HTML.
- `maxCount` truncation may cause images beyond the limit to never be cached
- [...] `tidyAssets` maintenance)
- `background-image` not handled (intentionally scoped out)

Migration
DB migration `0081_add_content_image_status` adds a nullable `contentImageStatus` column to `bookmarkLinks`. No data migration needed: existing rows default to `NULL` (not processed).

🤖 Generated with Claude Code