feat: cache reader content images for offline access by anpryl · Pull Request #2595 · karakeep-app/karakeep

anpryl · 2026-03-18T00:07:54Z

Summary

Adds a new contentImageWorker that downloads images referenced in extracted reader content HTML, saves them as CONTENT_IMAGE assets, and rewrites <img src> URLs to point to /api/assets/{assetId} — so reader view images survive even when the original source goes offline or blocks hotlinking.

Closes #1563
Closes #2363
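The rewrite step described in the summary can be pictured with a minimal sketch. It assumes a pre-built map from original image URL to asset ID; the function name `rewriteImgSrc` and the regex-based approach are illustrative assumptions, not the worker's actual implementation (which may well use an HTML parser).

```typescript
// Hypothetical helper: replace <img src> values that have a cached asset
// with the local /api/assets/{assetId} URL. URLs without a cached asset
// are left untouched so the original link keeps working.
function rewriteImgSrc(html: string, urlToAssetId: Map<string, string>): string {
  // "\ssrc=" requires whitespace before "src", so data-src is not matched.
  const imgSrc = /(<img\b[^>]*?\ssrc=)(["'])(.*?)\2/gi;
  return html.replace(
    imgSrc,
    (match: string, prefix: string, quote: string, url: string) => {
      const assetId = urlToAssetId.get(url);
      if (assetId === undefined) return match;
      return `${prefix}${quote}/api/assets/${assetId}${quote}`;
    },
  );
}
```

A regex keeps the sketch short; real-world HTML with unquoted attributes or unusual spacing would need a proper parser.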

Key design decisions

Default OFF — gated behind CRAWLER_STORE_CONTENT_IMAGES=false env var so existing users aren't surprised by extra storage/bandwidth
Separate worker — not inline in the crawler, for reliability isolation (a single bad image doesn't block the crawl)
Sequential downloads per bookmark to reduce rate-limiting from origin servers
Deterministic asset IDs — SHA-256(bookmarkId:sourceUrl).slice(0,32) enables skip-if-exists on retries and idempotent re-crawls
Per-image exponential backoff — up to 10 retries, 1s initial / 30s max cap (~3 min worst case per image)
Browser-like request headers — Chrome UA, Accept: image/*, Referer from source page
Magic bytes detection — falls back to file signature detection when servers return wrong Content-Type
Extended format support — JPEG, PNG, GIF, WebP + SVG, AVIF, APNG (worker-scoped CONTENT_IMAGE_ASSET_TYPES, does not alter global IMAGE_ASSET_TYPES)
Content images are system-managed — hidden from AttachmentBox, isAllowedToAttach/Detach both return false
Stale image cleanup on re-crawl — after successfully downloading new images and rewriting HTML, assets no longer referenced in the current HTML are deleted (both file and DB row). Only runs after at least one image was successfully cached so partial failures preserve old working images.
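As an illustration of the deterministic asset ID decision above, here is a minimal sketch. It assumes the digest is hex-encoded before slicing (the description only states `SHA-256(bookmarkId:sourceUrl).slice(0,32)`), and the function name is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a stable 32-character ID from the bookmark ID
// and the image's source URL. Hex encoding is an assumption of this sketch.
function contentImageAssetId(bookmarkId: string, sourceUrl: string): string {
  return createHash("sha256")
    .update(`${bookmarkId}:${sourceUrl}`)
    .digest("hex")
    .slice(0, 32);
}
```

Because the same (bookmark, URL) pair always maps to the same ID, a retry can check whether the asset already exists and skip the download.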

Changes

New files

apps/workers/workers/contentImageWorker.ts — the worker (extraction, download, rewrite, stale cleanup)
apps/workers/workers/contentImageWorker.test.ts — 91 unit tests
apps/workers/workers/contentImageWorker.integration.test.ts — 5 integration tests with real HTTP server
packages/db/drizzle/0081_add_content_image_status.sql — adds nullable contentImageStatus column to bookmarkLinks

Modified files

packages/db/schema.ts — contentImageStatus column (pending/failure/success), CONTENT_IMAGE asset type
packages/shared/config.ts — 5 new env vars for feature configuration
packages/shared/assetdb.ts — SUPPORTED_CONTENT_IMAGE_TYPES set, optional supportedTypes param on saveAsset (both store implementations)
packages/shared-server/src/queues.ts — ContentImageQueue definition
packages/trpc/routers/admin.ts — content image stats, bulk recacheContentImages mutation, per-bookmark adminRecacheContentImagesBookmark mutation, contentImageStatus in debug info output
packages/trpc/models/bookmarks.ts — contentImageStatus in buildDebugInfo response
packages/trpc/stats.ts — content image queue stats
packages/trpc/index.ts — export content image queue
packages/trpc/lib/attachments.ts — CONTENT_IMAGE not user-attachable
apps/workers/workers/crawlerWorker.ts — enqueue content image job after crawl completes
apps/workers/index.ts — register content image worker
apps/web/components/admin/BackgroundJobs.tsx — Content Image Jobs card (hidden when feature disabled) with pending/failed counters, bulk recache actions
apps/web/components/admin/BookmarkDebugger.tsx — per-bookmark "Re-cache images" action button, Image Crawl Status badge in status section
apps/web/components/dashboard/preview/AttachmentBox.tsx — filter CONTENT_IMAGE assets from user-facing attachment list
apps/web/lib/i18n/locales/en/translation.json — i18n strings
apps/web/lib/i18n/locales/en_US/translation.json — i18n strings
docs/docs/03-configuration/01-environment-variables.md — new env var docs
docs/docs/08-development/04-architecture.md — updated worker count

Image extraction capabilities

Lazy-load attributes (12 total): data-src, data-actualsrc, data-srv, data-original, data-lazy, data-lazy-src, data-lazyload, data-img-src, data-url, data-hi-res-src, data-highres, data-full-src
Srcset parsing: srcset, data-srcset, data-lazy-srcset — picks largest candidate by width/density
SVG support: <image href> and <image xlink:href> extraction and rewriting
Cleanup: strips srcset attrs, removes <source> inside <picture>, cleans lazy-load attrs after rewriting
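The "picks largest candidate" behaviour for srcset can be sketched as follows. This is a simplified parser written for illustration only: the function names are hypothetical, it compares width (`w`) or density (`x`) descriptors numerically, and it does not handle every edge case of the HTML srcset grammar.

```typescript
interface SrcsetCandidate {
  url: string;
  width?: number;   // from a "480w" descriptor
  density?: number; // from a "2x" descriptor
}

// Parse "url descriptor, url descriptor, ..." into candidates.
function parseSrcset(srcset: string): SrcsetCandidate[] {
  const re = /(\S+)(?:\s+(\d+(?:\.\d+)?)([wx]))?\s*(?:,|$)/g;
  const out: SrcsetCandidate[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(srcset)) !== null) {
    const candidate: SrcsetCandidate = { url: m[1] };
    if (m[3] === "w") candidate.width = Number(m[2]);
    if (m[3] === "x") candidate.density = Number(m[2]);
    out.push(candidate);
  }
  return out;
}

// Return the URL with the largest descriptor, or null for an empty srcset.
// A candidate without a descriptor counts as 1x.
function pickLargestFromSrcset(srcset: string): string | null {
  const candidates = parseSrcset(srcset);
  if (candidates.length === 0) return null;
  const score = (c: SrcsetCandidate): number => c.width ?? c.density ?? 1;
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best)).url;
}
```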

Configuration

| Env var | Default | Description |
| --- | --- | --- |
| `CRAWLER_STORE_CONTENT_IMAGES` | `false` | Enable/disable content image caching |
| `CRAWLER_CONTENT_IMAGE_MAX_COUNT` | `50` | Max images to cache per bookmark |
| `CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB` | `5` | Max size per image in MB |
| `CONTENT_IMAGE_NUM_WORKERS` | `1` | Number of content image worker instances |
| `CONTENT_IMAGE_JOB_TIMEOUT_SEC` | `120` | Job timeout in seconds |
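To make the defaults above concrete, here is a small sketch of how the five variables could be read from an environment object. The `ContentImageConfig` shape and `readContentImageConfig` are hypothetical names for illustration; the actual parsing in packages/shared/config.ts is not shown in this PR description and may differ.

```typescript
interface ContentImageConfig {
  storeContentImages: boolean;
  maxCount: number;
  maxSizeBytes: number;
  numWorkers: number;
  jobTimeoutSec: number;
}

// Hypothetical reader: applies the documented defaults when a variable is unset.
function readContentImageConfig(
  env: Record<string, string | undefined>,
): ContentImageConfig {
  const int = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === "") return fallback;
    const n = Number(raw);
    if (!Number.isInteger(n) || n < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    return n;
  };
  return {
    // Only the literal string "true" enables the feature in this sketch.
    storeContentImages: env["CRAWLER_STORE_CONTENT_IMAGES"] === "true",
    maxCount: int("CRAWLER_CONTENT_IMAGE_MAX_COUNT", 50),
    maxSizeBytes: int("CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB", 5) * 1024 * 1024,
    numWorkers: int("CONTENT_IMAGE_NUM_WORKERS", 1),
    jobTimeoutSec: int("CONTENT_IMAGE_JOB_TIMEOUT_SEC", 120),
  };
}
```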

Test plan

91 unit tests covering extraction, rewriting, download, magic bytes, run pipeline, stale cleanup
5 integration tests with real HTTP server serving minimal valid images
Manual test: enable feature, bookmark a page with images, verify images are cached and visible in reader view
Manual test: take source page offline, verify cached images still render
Manual test: re-crawl a bookmark, verify stale images are cleaned up and new images are cached
Manual test: admin panel shows correct pending/failed counters and bulk recache works
Manual test: bookmark debugger shows Image Crawl Status and per-bookmark Re-cache images button works

Known limitations

Re-cache images without a preceding re-crawl is a no-op if images were already rewritten (HTML contains /api/assets/ URLs, not external URLs). Use Re-crawl first to fetch fresh HTML.
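Images beyond the CRAWLER_CONTENT_IMAGE_MAX_COUNT limit are truncated from the download list, so on pages with more images than the limit those extra images may never be cached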
Bookmark deletion cascades DB rows but not asset files (handled by existing tidyAssets maintenance)
Video poster and CSS background-image not handled (intentionally scoped out)

Migration

DB migration 0081_add_content_image_status adds a nullable contentImageStatus column to bookmarkLinks. No data migration needed — existing rows default to NULL (not processed).

🤖 Generated with Claude Code

coderabbitai · 2026-03-18T00:08:03Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


When CRAWLER_STORE_CONTENT_IMAGES is enabled, images found in extracted reader HTML are downloaded, stored as local assets, and the HTML is rewritten to reference local asset URLs. This allows reader content to render fully offline. Includes per-bookmark image count/size limits, storage quota enforcement, and bounded download concurrency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove unnecessary ZAssetType import and hiddenAssetType intermediate variable — TypeScript already validates string literals through the asset type's own type definition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…hitecture docs The file picker accept attributes had ".jgp" instead of ".jpg", preventing users from uploading lowercase .jpg files for banner images and replacements. Also added the content image caching worker to the architecture documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The doc said "three job types" but listed four after adding content image caching. Changed to "some of the job types include" since the doc only lists a subset of the ~13 actual worker types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use deterministic asset IDs (SHA-256 of bookmarkId:sourceUrl) to enable skip-if-exists on retries and idempotent re-crawls via upsert. Download images sequentially with per-image exponential backoff (10 retries, 30s cap) to handle 429 rate limiting. Add browser-like headers (Chrome 146 UA, Accept image/*, Referer) to reduce rate limiting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
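The retry schedule in this commit message (10 retries, 1s initial delay, 30s cap) can be sketched as below. The helper names and the injectable `sleep` are assumptions for illustration, not the worker's actual code.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, 16s, then 30s.
function backoffDelayMs(attempt: number, initialMs = 1000, maxMs = 30000): number {
  return Math.min(initialMs * Math.pow(2, attempt), maxMs);
}

// Hypothetical wrapper: run `fn`, retrying on any thrown error up to
// `maxRetries` times and waiting backoffDelayMs(attempt) between tries.
async function withRetries<T>(
  fn: () => Promise<T>,
  opts: { maxRetries?: number; sleep?: (ms: number) => Promise<void> } = {},
): Promise<T> {
  const maxRetries = opts.maxRetries ?? 10;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Summing the ten delays gives 1 + 2 + 4 + 8 + 16 + 5 × 30 = 181 seconds, which matches the "~3 min worst case per image" figure in the PR description. Making `maxRetries` configurable also lets tests run with a small retry count.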

Add queue stats and "Cache Content Images for All Links" button to the Background Jobs admin page, allowing admins to trigger bulk content image caching for all existing bookmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Only show the Content Image Jobs card in the admin panel when CRAWLER_STORE_CONTENT_IMAGES is enabled, to avoid confusing admins who haven't opted in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Accept header already requested these types from servers but the content-type allowlist rejected them. Add a worker-scoped CONTENT_IMAGE_ASSET_TYPES set so these formats are cached without changing the global IMAGE_ASSET_TYPES used for user uploads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add contentImageStatus column to bookmarkLinks so the admin panel can show pending/failed counters for content image jobs, matching the pattern already used by crawlStatus for the crawler.
- Schema: nullable contentImageStatus (pending | failure | success)
- Worker: set success in onComplete, failure in onError (final retry)
- Enqueue: set pending when queueing from crawler and admin bulk action
- Admin API: query pending/failed counts, expose in stats response
- Frontend: pass full stats to Content Image Jobs card
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Some servers respond with non-image Content-Type headers (text/html, application/octet-stream) for actual images. Fall back to checking file signatures (JPEG, PNG, GIF, WebP, SVG, AVIF) before rejecting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
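A file-signature fallback of this kind could look roughly like the sketch below. The function name is hypothetical and the checks use the well-known signatures for each format. APNG shares the PNG signature, so this sketch reports it as `image/png`.

```typescript
// Guess a MIME type from the first bytes of a file, or return null.
function sniffImageType(buf: Uint8Array): string | null {
  const startsWith = (sig: number[]): boolean =>
    buf.length >= sig.length && sig.every((b, i) => buf[i] === b);
  // Read a byte range as ASCII; out-of-range indexes are clamped by subarray.
  const ascii = (start: number, end: number): string =>
    String.fromCharCode(...Array.from(buf.subarray(start, end)));

  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (ascii(0, 6) === "GIF87a" || ascii(0, 6) === "GIF89a") return "image/gif";
  if (ascii(0, 4) === "RIFF" && ascii(8, 12) === "WEBP") return "image/webp";
  if (ascii(4, 8) === "ftyp" && (ascii(8, 12) === "avif" || ascii(8, 12) === "avis")) {
    return "image/avif";
  }
  // SVG is text, so inspect the start of the document instead of fixed bytes.
  const head = new TextDecoder().decode(buf.subarray(0, 512)).trimStart().toLowerCase();
  if (head.startsWith("<svg") || (head.startsWith("<?xml") && head.includes("<svg"))) {
    return "image/svg+xml";
  }
  return null;
}
```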

Add a "Recache Failed Links Only" button to the Content Image Jobs card, matching the pattern from the Crawler Jobs card. The existing bulk action now requires an explicit contentImageStatus filter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After successfully downloading new images and rewriting HTML, delete content image assets that are no longer referenced in the current HTML. This prevents orphaned files from accumulating when page content changes between crawls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
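The cleanup rule can be expressed as a small pure function. This is a sketch under assumptions: the function name, the asset ID character set in the regex, and the `newlyCachedCount` guard parameter are illustrative, and the actual deletion of files and DB rows is left out.

```typescript
// Return the IDs of existing content image assets that the rewritten HTML
// no longer references. Returns nothing when no image was cached in this
// run, so a fully failed run keeps the previously working images.
function findStaleAssetIds(
  rewrittenHtml: string,
  existingAssetIds: string[],
  newlyCachedCount: number,
): string[] {
  if (newlyCachedCount === 0) return [];
  const referenced = new Set<string>();
  const re = /\/api\/assets\/([A-Za-z0-9_-]+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(rewrittenHtml)) !== null) {
    referenced.add(m[1]);
  }
  return existingAssetIds.filter((id) => !referenced.has(id));
}
```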

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nd SVG Align contentImageWorker's LAZY_SRC_ATTRS with the preprocessor's list to handle plugin paths (e.g. Reddit) that bypass normalizeLazyLoadImages. Add srcset/data-srcset/data-lazy-srcset parsing (picks largest candidate), high-res attrs (data-hi-res-src, data-highres, data-full-src), and SVG <image href/xlink:href> extraction and rewriting. Refactor downloadImage to accept an options object with configurable maxRetries, fixing two tests that timed out with the hardcoded MAX_RETRIES=10. Add integration tests with a real HTTP server serving minimal valid images for all supported patterns (23 URLs across 7 formats, 12 lazy-load attrs, 3 srcset variants, and SVG image elements). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The saveAsset function validates content types against a hardcoded allowlist that excludes SVG, AVIF, and APNG. Content images with these types would be downloaded but fail to persist silently. Add SUPPORTED_CONTENT_IMAGE_TYPES set extending the base allowlist, and an optional supportedTypes parameter to saveAsset so callers can widen the allowlist without changing defaults for uploads/bookmarks. SVGs are safe here because serveAsset applies sandbox CSP headers that block script execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
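The optional allowlist parameter can be illustrated with a stand-in validator. Everything here is an assumption for the sake of the example: the set names, the exact contents of the base allowlist, and the function `isSupportedType` are hypothetical and do not reproduce the real `saveAsset` signature.

```typescript
// Assumed base allowlist (the real one lives in packages/shared/assetdb.ts).
const BASE_IMAGE_TYPES: ReadonlySet<string> = new Set([
  "image/jpeg",
  "image/png",
  "image/gif",
  "image/webp",
]);

// Wider allowlist for content images: base types plus SVG, AVIF and APNG.
const CONTENT_IMAGE_TYPES: ReadonlySet<string> = new Set([
  ...Array.from(BASE_IMAGE_TYPES),
  "image/svg+xml",
  "image/avif",
  "image/apng",
]);

// Callers that pass nothing keep the default behaviour; the content image
// worker would pass the wider set explicitly.
function isSupportedType(
  contentType: string,
  supportedTypes: ReadonlySet<string> = BASE_IMAGE_TYPES,
): boolean {
  // Strip parameters such as "; charset=utf-8" before comparing.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return supportedTypes.has(mime);
}
```

Defaulting the parameter to the base set means existing call sites for uploads and bookmarks need no changes.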

The bookmark debugger only had bulk recache actions. This adds a per-bookmark "Re-cache images" button so admins can trigger content image caching for a single bookmark without recaching everything. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Surfaces the per-bookmark contentImageStatus field in the Status section so admins can see whether image caching succeeded, failed, or is pending. Only shown when the field is non-null (i.e. image crawling has been triggered for that bookmark). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Mobile app and other API-key clients can't resolve relative /api/assets/ URLs or attach auth headers to <img> requests in rendered HTML. When the request comes through Bearer token auth and includeContent is true, replace asset URLs with base64 data URIs by reading assets from storage at serve time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
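A minimal sketch of the serve-time replacement, assuming the referenced assets have already been read from storage into a map keyed by asset ID. The `StoredAsset` shape and `inlineAssetsAsDataUris` are hypothetical names, and the regex only matches quoted attribute values that are exactly `/api/assets/{id}`.

```typescript
import { Buffer } from "node:buffer";

interface StoredAsset {
  contentType: string;
  data: Uint8Array;
}

// Replace quoted /api/assets/{id} URLs with base64 data URIs. Assets that
// are missing from the map are left as-is rather than producing a broken URI.
function inlineAssetsAsDataUris(html: string, assets: Map<string, StoredAsset>): string {
  const assetUrl = /(["'])\/api\/assets\/([A-Za-z0-9_-]+)\1/g;
  return html.replace(assetUrl, (match: string, quote: string, id: string) => {
    const asset = assets.get(id);
    if (asset === undefined) return match;
    const b64 = Buffer.from(asset.data).toString("base64");
    return `${quote}data:${asset.contentType};base64,${b64}${quote}`;
  });
}
```

Inlining grows the response by roughly a third per image because of base64 encoding, which is one reason to limit it to requests that cannot load `/api/assets/` URLs directly.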

anpryl and others added 17 commits March 19, 2026 10:34

wip

7d88148

feat: simplify contentImage filter in AttachmentBox

6db2264

Remove unnecessary ZAssetType import and hiddenAssetType intermediate variable — TypeScript already validates string literals through the asset type's own type definition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: verify acceptance criteria for content image caching changes

657dbf2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

move completed plan: 2026-03-16-pr-review-fixes-content-image-caching.md

6f3afac

feat: add content image jobs card to admin panel

1b5c069

Add queue stats and "Cache Content Images for All Links" button to the Background Jobs admin page, allowing admins to trigger bulk content image caching for all existing bookmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: hide content image jobs card when feature is disabled

a22a60f

Only show the Content Image Jobs card in the admin panel when CRAWLER_STORE_CONTENT_IMAGES is enabled, to avoid confusing admins who haven't opted in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: clarify stale cleanup comment to explain placement rationale

94bf385

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

anpryl force-pushed the cache-reader-content-images branch from ee71d3b to 0fc3876 on March 19, 2026 08:38

anpryl and others added 4 commits March 19, 2026 13:18

anpryl wants to merge 21 commits into karakeep-app:main from anpryl:cache-reader-content-images
