
feat: cache reader content images for offline access#2595

Draft
anpryl wants to merge 21 commits into karakeep-app:main from anpryl:cache-reader-content-images

@anpryl anpryl commented Mar 18, 2026

Summary

Adds a new contentImageWorker that downloads images referenced in extracted reader content HTML, saves them as CONTENT_IMAGE assets, and rewrites <img src> URLs to point to /api/assets/{assetId} — so reader view images survive even when the original source goes offline or blocks hotlinking.

Closes #1563
Closes #2363

Key design decisions

  • Default OFF — gated behind CRAWLER_STORE_CONTENT_IMAGES=false env var so existing users aren't surprised by extra storage/bandwidth
  • Separate worker — not inline in the crawler, for reliability isolation (a single bad image doesn't block the crawl)
  • Sequential downloads per bookmark to reduce rate-limiting from origin servers
  • Deterministic asset IDs — SHA-256(bookmarkId:sourceUrl).slice(0,32) enables skip-if-exists on retries and idempotent re-crawls
  • Per-image exponential backoff — up to 10 retries, 1s initial / 30s max cap (~3 min worst case per image)
  • Browser-like request headers — Chrome UA, Accept: image/*, Referer from source page
  • Magic bytes detection — falls back to file signature detection when servers return wrong Content-Type
  • Extended format support — JPEG, PNG, GIF, WebP + SVG, AVIF, APNG (worker-scoped CONTENT_IMAGE_ASSET_TYPES, does not alter global IMAGE_ASSET_TYPES)
  • Content images are system-managed — hidden from AttachmentBox, isAllowedToAttach/Detach both return false
  • Stale image cleanup on re-crawl — after successfully downloading new images and rewriting HTML, assets no longer referenced in the current HTML are deleted (both file and DB row). Only runs after at least one image was successfully cached so partial failures preserve old working images.
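
To make the deterministic asset ID decision concrete, here is a minimal sketch of how such an ID could be derived. The helper name `contentImageAssetId` and the choice of hex encoding are assumptions for illustration; the PR only states SHA-256(bookmarkId:sourceUrl).slice(0,32).

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (not taken from the PR): derives a stable 32-character
// ID from the bookmark ID and the image's source URL. Hex encoding is assumed.
function contentImageAssetId(bookmarkId: string, sourceUrl: string): string {
  return createHash("sha256")
    .update(`${bookmarkId}:${sourceUrl}`)
    .digest("hex")
    .slice(0, 32);
}

// The same inputs always map to the same ID, so a retry can check whether
// the asset already exists before downloading again.
const firstId = contentImageAssetId("bookmark-1", "https://example.com/a.png");
const retryId = contentImageAssetId("bookmark-1", "https://example.com/a.png");
console.log(firstId === retryId);
```

Because the ID depends only on its inputs, a re-crawl of the same bookmark and URL would target the same asset row, which is what makes an upsert idempotent.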

Changes

New files

  • apps/workers/workers/contentImageWorker.ts — the worker (extraction, download, rewrite, stale cleanup)
  • apps/workers/workers/contentImageWorker.test.ts — 91 unit tests
  • apps/workers/workers/contentImageWorker.integration.test.ts — 5 integration tests with real HTTP server
  • packages/db/drizzle/0081_add_content_image_status.sql — adds nullable contentImageStatus column to bookmarkLinks

Modified files

  • packages/db/schema.ts — contentImageStatus column (pending/failure/success), CONTENT_IMAGE asset type
  • packages/shared/config.ts — 5 new env vars for feature configuration
  • packages/shared/assetdb.ts — SUPPORTED_CONTENT_IMAGE_TYPES set, optional supportedTypes param on saveAsset (both store implementations)
  • packages/shared-server/src/queues.ts — ContentImageQueue definition
  • packages/trpc/routers/admin.ts — content image stats, bulk recacheContentImages mutation, per-bookmark adminRecacheContentImagesBookmark mutation, contentImageStatus in debug info output
  • packages/trpc/models/bookmarks.ts — contentImageStatus in buildDebugInfo response
  • packages/trpc/stats.ts — content image queue stats
  • packages/trpc/index.ts — export content image queue
  • packages/trpc/lib/attachments.ts — CONTENT_IMAGE not user-attachable
  • apps/workers/workers/crawlerWorker.ts — enqueue content image job after crawl completes
  • apps/workers/index.ts — register content image worker
  • apps/web/components/admin/BackgroundJobs.tsx — Content Image Jobs card (hidden when feature disabled) with pending/failed counters, bulk recache actions
  • apps/web/components/admin/BookmarkDebugger.tsx — per-bookmark "Re-cache images" action button, Image Crawl Status badge in status section
  • apps/web/components/dashboard/preview/AttachmentBox.tsx — filter CONTENT_IMAGE assets from user-facing attachment list
  • apps/web/lib/i18n/locales/en/translation.json — i18n strings
  • apps/web/lib/i18n/locales/en_US/translation.json — i18n strings
  • docs/docs/03-configuration/01-environment-variables.md — new env var docs
  • docs/docs/08-development/04-architecture.md — updated worker count

Image extraction capabilities

  • Lazy-load attributes (12 total): data-src, data-actualsrc, data-srv, data-original, data-lazy, data-lazy-src, data-lazyload, data-img-src, data-url, data-hi-res-src, data-highres, data-full-src
  • Srcset parsing: srcset, data-srcset, data-lazy-srcset — picks largest candidate by width/density
  • SVG support: <image href> and <image xlink:href> extraction and rewriting
  • Cleanup: strips srcset attrs, removes <source> inside <picture>, cleans lazy-load attrs after rewriting
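
As an illustration of the srcset handling described above, the following sketch picks the candidate with the largest width or density descriptor. The function name `pickLargestFromSrcset` and the regular expression are assumptions, not the worker's actual code, and the parsing is simpler than the full HTML srcset algorithm (a candidate without a descriptor is treated as 1x).

```typescript
// Hypothetical sketch: returns the URL of the largest srcset candidate.
function pickLargestFromSrcset(srcset: string): string | undefined {
  // A URL, an optional "<number>w" or "<number>x" descriptor, then a comma or the end.
  const candidate = /(\S+)(?:\s+(\d+(?:\.\d+)?)([wx]))?\s*(?:,|$)/g;
  let best: { url: string; score: number } | undefined;
  let match: RegExpExecArray | null;
  while ((match = candidate.exec(srcset)) !== null) {
    // A missing descriptor counts as 1x. Mixing "w" and "x" in one srcset is
    // invalid HTML, so the two kinds are not compared specially here.
    const score = match[2] !== undefined ? Number(match[2]) : 1;
    if (best === undefined || score > best.score) {
      best = { url: match[1], score };
    }
  }
  return best?.url;
}

console.log(pickLargestFromSrcset("small.jpg 480w, medium.jpg 800w, large.jpg 1200w"));
// prints "large.jpg"
```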

Configuration

| Env var | Default | Description |
| --- | --- | --- |
| CRAWLER_STORE_CONTENT_IMAGES | false | Enable/disable content image caching |
| CRAWLER_CONTENT_IMAGE_MAX_COUNT | 50 | Max images to cache per bookmark |
| CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB | 5 | Max size per image in MB |
| CONTENT_IMAGE_NUM_WORKERS | 1 | Number of content image worker instances |
| CONTENT_IMAGE_JOB_TIMEOUT_SEC | 120 | Job timeout in seconds |
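
For readers who want to see how these defaults might be applied, here is a small sketch that reads the five variables from an environment object. The `ContentImageConfig` shape and both helper functions are hypothetical; the real parsing in packages/shared/config.ts may differ, for example in which boolean spellings it accepts.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface ContentImageConfig {
  enabled: boolean;
  maxCount: number;
  maxSizeMb: number;
  numWorkers: number;
  jobTimeoutSec: number;
}

type Env = Record<string, string | undefined>;

// Parses a positive integer, falling back to the default on missing or bad input.
function positiveInt(env: Env, name: string, fallback: number): number {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function loadContentImageConfig(env: Env): ContentImageConfig {
  return {
    // Assumption: only the literal string "true" enables the feature.
    enabled: env["CRAWLER_STORE_CONTENT_IMAGES"] === "true",
    maxCount: positiveInt(env, "CRAWLER_CONTENT_IMAGE_MAX_COUNT", 50),
    maxSizeMb: positiveInt(env, "CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB", 5),
    numWorkers: positiveInt(env, "CONTENT_IMAGE_NUM_WORKERS", 1),
    jobTimeoutSec: positiveInt(env, "CONTENT_IMAGE_JOB_TIMEOUT_SEC", 120),
  };
}

console.log(loadContentImageConfig({ CRAWLER_STORE_CONTENT_IMAGES: "true" }));
```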

Test plan

  • 91 unit tests covering extraction, rewriting, download, magic bytes, run pipeline, stale cleanup
  • 5 integration tests with real HTTP server serving minimal valid images
  • Manual test: enable feature, bookmark a page with images, verify images are cached and visible in reader view
  • Manual test: take source page offline, verify cached images still render
  • Manual test: re-crawl a bookmark, verify stale images are cleaned up and new images are cached
  • Manual test: admin panel shows correct pending/failed counters and bulk recache works
  • Manual test: bookmark debugger shows Image Crawl Status and per-bookmark Re-cache images button works

Known limitations

  • Re-cache images without a preceding re-crawl is a no-op if images were already rewritten (HTML contains /api/assets/ URLs, not external URLs). Use Re-crawl first to fetch fresh HTML.
  • maxCount truncation may cause images beyond the limit to never be cached
  • Bookmark deletion cascades DB rows but not asset files (handled by existing tidyAssets maintenance)
  • Video poster and CSS background-image not handled (intentionally scoped out)

Migration

DB migration 0081_add_content_image_status adds a nullable contentImageStatus column to bookmarkLinks. No data migration needed — existing rows default to NULL (not processed).

🤖 Generated with Claude Code


coderabbitai Bot commented Mar 18, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c97fbae4-cf7c-407f-ab9b-4b28cc206ef1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


anpryl and others added 17 commits March 19, 2026 10:34
When CRAWLER_STORE_CONTENT_IMAGES is enabled, images found in extracted
reader HTML are downloaded, stored as local assets, and the HTML is
rewritten to reference local asset URLs. This allows reader content to
render fully offline. Includes per-bookmark image count/size limits,
storage quota enforcement, and bounded download concurrency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unnecessary ZAssetType import and hiddenAssetType intermediate
variable — TypeScript already validates string literals through the
asset type's own type definition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hitecture docs

The file picker accept attributes had ".jgp" instead of ".jpg", preventing
users from uploading lowercase .jpg files for banner images and replacements.
Also added the content image caching worker to the architecture documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The doc said "three job types" but listed four after adding content
image caching. Changed to "some of the job types include" since the
doc only lists a subset of the ~13 actual worker types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use deterministic asset IDs (SHA-256 of bookmarkId:sourceUrl) to enable
skip-if-exists on retries and idempotent re-crawls via upsert. Download
images sequentially with per-image exponential backoff (10 retries,
30s cap) to handle 429 rate limiting. Add browser-like headers
(Chrome 146 UA, Accept image/*, Referer) to reduce rate limiting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
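
The retry schedule in the commit above can be sketched as follows. The constant names, the doubling factor, and the `withRetries` helper are assumptions chosen for illustration; the PR states only the retry count, the 1s initial delay, and the 30s cap, and it does not say whether jitter is applied.

```typescript
// Hypothetical constants mirroring the numbers stated in the PR.
const MAX_RETRIES = 10;
const INITIAL_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

// Delay before retry number `attempt` (0-based): doubles each time, then caps.
function backoffDelayMs(attempt: number): number {
  return Math.min(INITIAL_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Runs `fn` until it succeeds or the retries are exhausted. `sleep` is
// injectable so the loop can be exercised without real waiting.
async function withRetries<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  maxRetries: number = MAX_RETRIES,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Delays: 1s, 2s, 4s, 8s, 16s, then 30s five times.
const schedule = Array.from({ length: MAX_RETRIES }, (_, i) => backoffDelayMs(i));
const worstCaseMs = schedule.reduce((sum, delay) => sum + delay, 0);
console.log(worstCaseMs); // 181000
```

Under these assumptions the waits add up to 181 seconds, which agrees with the "~3 min worst case per image" figure in the summary.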
Add queue stats and "Cache Content Images for All Links" button to the
Background Jobs admin page, allowing admins to trigger bulk content
image caching for all existing bookmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only show the Content Image Jobs card in the admin panel when
CRAWLER_STORE_CONTENT_IMAGES is enabled, to avoid confusing admins
who haven't opted in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Accept header already requested these types from servers but the
content-type allowlist rejected them. Add a worker-scoped
CONTENT_IMAGE_ASSET_TYPES set so these formats are cached without
changing the global IMAGE_ASSET_TYPES used for user uploads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add contentImageStatus column to bookmarkLinks so the admin panel can
show pending/failed counters for content image jobs, matching the
pattern already used by crawlStatus for the crawler.

- Schema: nullable contentImageStatus (pending | failure | success)
- Worker: set success in onComplete, failure in onError (final retry)
- Enqueue: set pending when queueing from crawler and admin bulk action
- Admin API: query pending/failed counts, expose in stats response
- Frontend: pass full stats to Content Image Jobs card

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some servers respond with non-image Content-Type headers (text/html,
application/octet-stream) for actual images. Fall back to checking
file signatures (JPEG, PNG, GIF, WebP, SVG, AVIF) before rejecting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
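
A signature-based fallback like the one this commit describes might look roughly like the sketch below. The function name `sniffImageType` is hypothetical and the checks are deliberately minimal; the byte signatures themselves are the well-known ones for each format.

```typescript
// Hypothetical sketch: guesses an image MIME type from the leading bytes.
function sniffImageType(bytes: Uint8Array): string | null {
  const ascii = (text: string): number[] => Array.from(text, (ch) => ch.charCodeAt(0));
  const hasAt = (sig: number[], offset = 0): boolean =>
    bytes.length >= offset + sig.length && sig.every((b, i) => bytes[offset + i] === b);

  if (hasAt([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (hasAt([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (hasAt(ascii("GIF87a")) || hasAt(ascii("GIF89a"))) return "image/gif";
  // WebP is a RIFF container with "WEBP" at byte offset 8.
  if (hasAt(ascii("RIFF")) && hasAt(ascii("WEBP"), 8)) return "image/webp";
  // AVIF is an ISO BMFF file: "ftyp" at offset 4, then the major brand.
  if (hasAt(ascii("ftyp"), 4) && (hasAt(ascii("avif"), 8) || hasAt(ascii("avis"), 8))) {
    return "image/avif";
  }
  // SVG is text, so inspect the start of the decoded document instead.
  const head = new TextDecoder().decode(bytes.subarray(0, 512)).replace(/^\s+/, "").toLowerCase();
  if (head.startsWith("<svg") || (head.startsWith("<?xml") && head.includes("<svg"))) {
    return "image/svg+xml";
  }
  return null;
}

console.log(sniffImageType(new Uint8Array([0xff, 0xd8, 0xff, 0xe0]))); // "image/jpeg"
```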
Add a "Recache Failed Links Only" button to the Content Image Jobs
card, matching the pattern from the Crawler Jobs card. The existing
bulk action now requires an explicit contentImageStatus filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After successfully downloading new images and rewriting HTML,
delete content image assets that are no longer referenced in
the current HTML. This prevents orphaned files from accumulating
when page content changes between crawls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
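
The stale-asset selection in this commit amounts to a set difference, sketched below. Both function names and the asset ID pattern in the regular expression are assumptions; the guard on `cachedCount` reflects the rule from the summary that cleanup only runs after at least one image was cached successfully.

```typescript
// Hypothetical sketch: collects asset IDs referenced as /api/assets/{assetId}.
function referencedAssetIds(html: string): Set<string> {
  const ids = new Set<string>();
  const pattern = /\/api\/assets\/([A-Za-z0-9_-]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    ids.add(match[1]);
  }
  return ids;
}

// Returns the existing content image asset IDs that the rewritten HTML no
// longer references. Nothing is considered stale when no image was cached in
// this run, so a fully failed run keeps the old working images.
function findStaleAssetIds(existingIds: string[], html: string, cachedCount: number): string[] {
  if (cachedCount === 0) return [];
  const live = referencedAssetIds(html);
  return existingIds.filter((id) => !live.has(id));
}

const html = '<img src="/api/assets/aaa"><img src="/api/assets/ccc">';
console.log(findStaleAssetIds(["aaa", "bbb", "ccc"], html, 2)); // [ 'bbb' ]
```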
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd SVG

Align contentImageWorker's LAZY_SRC_ATTRS with the preprocessor's list
to handle plugin paths (e.g. Reddit) that bypass normalizeLazyLoadImages.
Add srcset/data-srcset/data-lazy-srcset parsing (picks largest candidate),
high-res attrs (data-hi-res-src, data-highres, data-full-src), and SVG
<image href/xlink:href> extraction and rewriting.

Refactor downloadImage to accept an options object with configurable
maxRetries, fixing two tests that timed out with the hardcoded MAX_RETRIES=10.

Add integration tests with a real HTTP server serving minimal valid images
for all supported patterns (23 URLs across 7 formats, 12 lazy-load attrs,
3 srcset variants, and SVG image elements).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anpryl anpryl force-pushed the cache-reader-content-images branch from ee71d3b to 0fc3876 Compare March 19, 2026 08:38
anpryl and others added 4 commits March 19, 2026 13:18
The saveAsset function validates content types against a hardcoded
allowlist that excludes SVG, AVIF, and APNG. Content images with these
types would be downloaded but fail to persist silently.

Add SUPPORTED_CONTENT_IMAGE_TYPES set extending the base allowlist,
and an optional supportedTypes parameter to saveAsset so callers can
widen the allowlist without changing defaults for uploads/bookmarks.
SVGs are safe here because serveAsset applies sandbox CSP headers
that block script execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
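
The optional allowlist parameter from this commit can be pictured with the sketch below. `BASE_TYPES` is a stand-in for the existing allowlist, whose real contents are not shown in this PR, and `assertSupportedType` is a hypothetical helper rather than the actual saveAsset signature.

```typescript
// Stand-in for the existing allowlist (assumed contents, for illustration only).
const BASE_TYPES: ReadonlySet<string> = new Set<string>([
  "image/jpeg",
  "image/png",
  "image/gif",
  "image/webp",
]);

// Widened set used only for content images, as the commit describes.
const SUPPORTED_CONTENT_IMAGE_TYPES: ReadonlySet<string> = new Set<string>(
  Array.from(BASE_TYPES).concat(["image/svg+xml", "image/avif", "image/apng"]),
);

// Hypothetical validation step: callers that pass nothing keep the default
// allowlist, so uploads and bookmarks behave as before.
function assertSupportedType(
  contentType: string,
  supportedTypes: ReadonlySet<string> = BASE_TYPES,
): void {
  if (!supportedTypes.has(contentType)) {
    throw new Error(`Unsupported asset type: ${contentType}`);
  }
}

assertSupportedType("image/png");
assertSupportedType("image/svg+xml", SUPPORTED_CONTENT_IMAGE_TYPES);
```

Keeping the wider set behind an explicit argument means the default path cannot silently start accepting SVG.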
The bookmark debugger only had bulk recache actions. This adds a
per-bookmark "Re-cache images" button so admins can trigger content
image caching for a single bookmark without recaching everything.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Surfaces the per-bookmark contentImageStatus field in the Status
section so admins can see whether image caching succeeded, failed,
or is pending. Only shown when the field is non-null (i.e. image
crawling has been triggered for that bookmark).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mobile app and other API-key clients can't resolve relative /api/assets/
URLs or attach auth headers to <img> requests in rendered HTML. When the
request comes through Bearer token auth and includeContent is true, replace
asset URLs with base64 data URIs by reading assets from storage at serve
time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
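
The serve-time rewrite described in this commit could be sketched as below. The `StoredAsset` shape and `inlineAssetUrls` are hypothetical, and the sketch assumes the caller has already loaded the referenced assets from storage into a map; it also replaces every /api/assets/ URL it finds, not only those inside <img> tags.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical in-memory view of an asset already read from storage.
interface StoredAsset {
  contentType: string;
  data: Uint8Array;
}

// Replaces /api/assets/{assetId} URLs with base64 data URIs so clients that
// cannot attach auth headers to image requests can still render the HTML.
function inlineAssetUrls(html: string, assets: ReadonlyMap<string, StoredAsset>): string {
  return html.replace(/\/api\/assets\/([A-Za-z0-9_-]+)/g, (match: string, assetId: string) => {
    const asset = assets.get(assetId);
    if (asset === undefined) return match; // leave unknown assets untouched
    const base64 = Buffer.from(asset.data).toString("base64");
    return `data:${asset.contentType};base64,${base64}`;
  });
}

const assets = new Map<string, StoredAsset>([
  ["abc", { contentType: "image/png", data: new Uint8Array([0x68, 0x69]) }],
]);
console.log(inlineAssetUrls('<img src="/api/assets/abc">', assets));
// prints <img src="data:image/png;base64,aGk=">
```

Inlining trades response size for portability: base64 adds roughly a third to each image's byte size.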