Skip to content

Add proxy cache refresh support with safe key handling and refresh accounting#60

Open
yaoge123 wants to merge 43 commits into
nginx-modules:masterfrom
yaoge123:replay
Open

Add proxy cache refresh support with safe key handling and refresh accounting#60
yaoge123 wants to merge 43 commits into
nginx-modules:masterfrom
yaoge123:replay

Conversation

@yaoge123
Copy link
Copy Markdown

@yaoge123 yaoge123 commented Apr 24, 2026

Summary

  • add a full proxy cache refresh flow on top of upstream purge support, including exact refresh, partial refresh, and refresh_all
  • split dedicated refresh configuration from purge configuration, add runtime method routing, and document the supported deployment layouts
  • harden refresh for large batches, concurrent requests, client aborts, transport failures, cache-key safety, and add detailed accounting / observability

Recommended Review Order

  1. Refresh core and batch invalidation foundation
  2. Execution model and directive split
  3. Observability and accounting
  4. Cache-key safety and request-target inference
  5. Hardening and bug fixes
  6. Repository hygiene, tests, and docs

Main Behavior

This stack adds refresh as conditional cache validation for existing cached objects. For each cached entry in scope, the module probes upstream and keeps the cache entry when upstream reports it is unchanged, but purges the cache entry when upstream reports it changed or disappeared.

Current refresh status policy is:

  • 304: keep the cached entry
  • 200: purge the cached entry because upstream content changed
  • 404 / 410: purge the cached entry because the upstream object is gone
  • other HTTP statuses: keep the cached entry
  • transport / internal failures: keep the cached entry and count a real error

Configuration Model

This series ends with two supported operator-facing models:

  • dedicated refresh endpoints via proxy_cache_refresh
  • migration-friendly single-entry routing where exact / partial requests can switch between purge and refresh by HTTP method

refresh_all remains explicitly configured through proxy_cache_refresh and is not enabled on a plain purge location by accident.

Safety Rules

Refresh only runs for proxy cache paths and only when the evaluated cache key keeps the request-target tail unambiguous. The series includes:

  • strict cache-key tail validation
  • rejection of ambiguous partial refresh requests with query args
  • support for safe prefixed cache keys when the request-target tail can still be reconstructed safely
  • runtime rejection of refresh on purge entrypoints that do not actually define proxy_cache

Hardening Included In This Stack

The series also fixes and hardens a number of failure cases discovered during development and regression work:

  • large-batch refresh scanning and invalidation stability
  • concurrent refresh race handling
  • stale cache struct and merge-loc-conf crash fixes
  • client-abort safety for refresh requests
  • correct transport failure accounting
  • temp pool / fd buildup cleanup during large invalidations
  • relaxed scanned-node replacement checks to avoid false positives in refresh flows

Observability

Refresh responses and logs now expose richer accounting:

  • X-Cache-Action response header for purge vs refresh actions
  • top-level kept / purged / errors
  • dynamic status_counts
  • dynamic status_bytes
  • total_bytes / kept_bytes / purged_bytes
  • refresh summary logs at error_log ... notice

Verification

Validation completed on the fork before proposing upstream review, including:

  • Docker-based Test::Nginx suite via make test
  • broader regression coverage via repo-level scripts/run_regression.sh
  • compatibility runs on:
    • 1.20.2
    • 1.22.1
    • 1.24.0
    • 1.26.3
    • 1.28.3
    • 1.29.8

yaoge123 added 30 commits April 24, 2026 18:01
- Compare invalidation metadata directly against in-memory cache struct
  instead of re-reading the cache file, eliminating a TOCTOU race window
- Add inode (uniq) and body_start cross-checks for race detection
- Defer temp pool destruction via enqueue/drain pattern to prevent
  use-after-free when subrequests reference pool-allocated data
- Register pool cleanup handler to guarantee resource release on
  request termination (timeout, client disconnect)
- Retire chunk pools instead of immediate destruction so in-flight
  subrequests retain valid file references across chunk boundaries
- Add active count underflow guard in refresh_done normal path
- Add args.len underflow guard for malformed cache keys
- Use case-sensitive ngx_strncmp for cache key matching (keys are
  exact byte sequences, not case-insensitive identifiers)
- Add bounds checks for config parsing (from keyword position)
- Fix memory leak: free key_buf on short read in collect_path
- Track dispatched subrequest count separately from queue cursor
  for accurate timeout skip accounting
Root cause: ngx_http_file_cache_open registers a pool cleanup
(ngx_http_file_cache_cleanup) that holds a pointer to the
ngx_http_cache_t struct.  When this struct was a stack variable in
invalidate_item, the pointer became dangling after the function
returned.  Later, drain_temp_pools destroyed the pool, triggering
the cleanup which accessed the stale stack address -> SIGSEGV.

Fix:
- Allocate ngx_http_cache_t on the pool instead of the stack so it
  survives until pool destruction
- Call ngx_http_file_cache_free after invalidate_opened_cache to
  properly release the node reference (sets c->updated=1, making
  the pool cleanup a no-op)
- Remove count-- from invalidate_opened_cache to avoid premature
  node free when concurrent refreshes race on the same cache entry

Verified: 3-5 concurrent refresh requests to overlapping cache zones
no longer crash.  Full regression suite and 100k perf pass.
- Add finalized guard in refresh_done to prevent use-after-free on timeout
- Fix item_matches_cache to allow invalidate when all metadata fields are zero
- Set pd->handled in fire_subrequest error paths to prevent double error counting
- Set file.name in read_item and collect_path for proper error logging
- Add ngx_memzero for batch_ctx in purge_all and purge_partial
- Add detailed comment on open_temp_cache shallow copy risks
- Add comprehensive refresh documentation to README.md
Normal (WAITED) subrequests accumulate in r->postponed and only finalize
when they become the active subrequest via the postpone filter.  Since
our write_event_handler never outputs data, the postpone filter never
runs, causing r->main->count to grow unboundedly and overflow at 64535
for file sets larger than ~64k entries.

Switch to NGX_HTTP_SUBREQUEST_BACKGROUND which bypasses the postpone
list entirely.  Each background subrequest finalizes independently via
ngx_http_close_request, properly decrementing main->count.

Since background subrequests do not post the parent request upon
finalization, manually call ngx_http_post_request() in refresh_done
to re-trigger the dispatch loop on the next event cycle.

Remove the previous count-limit throttle which degraded throughput
from ~1900 fps to ~330 fps.

Verified: 100k files refresh completes in 8.5s (~11700 fps) with
total=100000 kept=50000 purged=50000 errors=0.  Full regression
suite (10/10) passes.
nginx's cache_convert_head converts HEAD to GET by setting u->method
in ngx_http_upstream_cache() before the cache_bypass predicate check.
Since proxy_create_request() prioritizes u->method over r->method_name,
our HEAD setting was being overridden to GET.

Fix by clearing u->method in the $cache_purge_refresh_bypass variable
handler, which is evaluated by ngx_http_test_predicates() during the
cache_bypass check — after u->method is set but before create_request()
reads it.  This makes proxy_create_request() fall through to
r->method_name (HEAD).

Only affects refresh subrequests (checked via parent ctx and
r->method == NGX_HTTP_HEAD).  Normal requests are unaffected.

Verified: upstream access log confirms HEAD requests with 0 body bytes.
304 responses have no body; 200 responses also transfer 0 bytes.
Full regression suite (10/10) passes.
Copilot AI review requested due to automatic review settings April 24, 2026 10:47
@yaoge123 yaoge123 changed the title tmp pr probe Add proxy cache refresh support with safe key handling and refresh accounting Apr 24, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the module and documentation to support a new refresh capability (conditional cache validation) alongside existing purge behavior, and refreshes the nginx compatibility/testing matrix.

Changes:

  • Add proxy_cache_refresh support with refresh-time controls (cache_purge_refresh_concurrency, cache_purge_refresh_timeout) and method-driven runtime routing between purge/refresh.
  • Rework cache invalidation helpers to safely invalidate scanned entries and add refresh subrequest orchestration + summary responses.
  • Update docs/testing scaffolding: new refresh documentation (EN + zh-CN), expanded nginx version matrix, and add ignore rules for build/test artifacts.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ngx_cache_purge_module.c Implements refresh feature, new directives, runtime dispatch, and refactors invalidation logic.
t/Makefile Updates default nginx version and expands the test compatibility version list.
README.md Documents refresh directives/rules and updates compatibility/testing guidance.
README.zh-CN.md Adds detailed deployment documentation for purge/refresh behavior in Chinese.
.gitignore Ignores .worktrees/.
.dockerignore Excludes test/cache directories from Docker build context.

Comment thread ngx_cache_purge_module.c Outdated
tree.log = ngx_cycle->log;

ngx_walk_tree(&tree, &cache->path->name);
return ctx.files_deleted;
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ngx_http_cache_purge_partial(), ngx_http_cache_purge_drain_temp_pools(&ctx.temp_pools) is unreachable because the function returns immediately after ngx_walk_tree(). This leaves dead code and (if temp_pools is ever populated) would skip cleanup. Move the drain before the return and keep a single return statement.

Suggested change
return ctx.files_deleted;

Copilot uses AI. Check for mistakes.
Comment thread ngx_cache_purge_module.c Outdated
Comment on lines +5296 to +5305
* Note: nginx's cache_convert_head (on by default) converts HEAD to GET
* in ngx_http_upstream_cache() before the cache_bypass check, so the
* upstream actually receives a GET request. However, header_only = 1
* ensures nginx finalizes the subrequest after receiving response headers
* without reading the body (ngx_http_upstream.c:3280 fast-path when
* header_only && !cacheable && !store). The real bandwidth savings come
* from conditional headers (If-None-Match / If-Modified-Since) which
* elicit 304 responses with no body. Forcing true HEAD would require
* hooking into the upstream init phase or setting proxy_cache_convert_head
* off globally, both unacceptably invasive.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above the subrequest method setup says the upstream "actually receives a GET" due to cache_convert_head, but the $cache_purge_refresh_bypass variable handler explicitly clears r->upstream->method to restore HEAD before proxy_create_request() runs. These two comments contradict; update this block to reflect the current behavior (and/or reference the bypass variable hack) so future readers don’t get misled.

Suggested change
* Note: nginx's cache_convert_head (on by default) converts HEAD to GET
* in ngx_http_upstream_cache() before the cache_bypass check, so the
* upstream actually receives a GET request. However, header_only = 1
* ensures nginx finalizes the subrequest after receiving response headers
* without reading the body (ngx_http_upstream.c:3280 fast-path when
* header_only && !cacheable && !store). The real bandwidth savings come
* from conditional headers (If-None-Match / If-Modified-Since) which
* elicit 304 responses with no body. Forcing true HEAD would require
* hooking into the upstream init phase or setting proxy_cache_convert_head
* off globally, both unacceptably invasive.
* nginx's cache_convert_head (on by default) rewrites HEAD to GET in
* ngx_http_upstream_cache() before the cache_bypass check. For refresh
* subrequests, this module compensates via the $cache_purge_refresh_bypass
* variable handler, which clears r->upstream->method before
* proxy_create_request() runs so the upstream request is restored to
* HEAD. header_only = 1 still matters because nginx finalizes the
* subrequest after receiving response headers without reading the body
* (ngx_http_upstream.c fast-path when header_only && !cacheable &&
* !store). The real bandwidth savings come from conditional headers
* (If-None-Match / If-Modified-Since) which elicit 304 responses with
* no body. Using the bypass-variable hook avoids more invasive changes
* such as upstream-phase hooks or disabling proxy_cache_convert_head
* globally.

Copilot uses AI. Check for mistakes.
Comment thread README.md

```
Syntax: cache_purge_refresh_concurrency <number>
Default: 1
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README states cache_purge_refresh_concurrency has a default of 1, but the module code merges refresh_concurrency with a default of 32 (ngx_cache_purge_module.c:4315-4316). Please align documentation and implementation (either update the documented default or change the code default).

Suggested change
Default: 1
Default: 32

Copilot uses AI. Check for mistakes.
Comment thread ngx_cache_purge_module.c Outdated
Comment on lines 6063 to 6086
static ngx_int_t
ngx_http_cache_purge_refresh(ngx_http_request_t *r,
ngx_http_file_cache_t *cache)
{
ngx_http_cache_purge_refresh_ctx_t *ctx;
ngx_http_cache_purge_loc_conf_t *cplcf;
ngx_pool_cleanup_t *cln;
ngx_str_t *keys;
ngx_str_t targets[4];
ngx_str_t key;
ngx_uint_t ntargets;

ngx_log_debug(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
"cache purge refresh in %s", cache->path->name.data);

/* Allocate refresh context */
ctx = ngx_pcalloc(r->pool, sizeof(ngx_http_cache_purge_refresh_ctx_t));
if (ctx == NULL) {
return NGX_HTTP_INTERNAL_SERVER_ERROR;
}
# endif

# if (NGX_HTTP_UWSGI)
ngx_http_cache_purge_merge_conf(&conf->uwsgi, &prev->uwsgi);
ngx_queue_init(&ctx->temp_pools);
ngx_queue_init(&ctx->retired_chunk_pools);

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces substantial new refresh behavior (proxy_cache_refresh directive, conditional subrequests, cache scanning/invalidation), but there are no Test::Nginx cases covering REFRESH requests yet (no matches for "REFRESH" under t/). Add integration tests for core scenarios (304 keeps, 200/404/410 invalidates, capability mismatch 400, partial vs refresh_all) to prevent regressions across nginx versions.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the current replay branch.

Added Test::Nginx coverage in t/basic.t for the core refresh paths this PR introduces:

  • exact refresh keeps cached content on 304
  • exact refresh invalidates on 200
  • exact refresh invalidates on 404
  • exact refresh invalidates on 410
  • partial refresh purges only changed matches
  • refresh_all refreshes changed entries across the cache
  • runtime capability mismatch returns 400 for refresh_all on a purge-only location

Fresh verification: make test-version VERSION=1.29.8 now passes with the new cases included.

Comment thread ngx_cache_purge_module.c Outdated
Comment on lines +86 to +87
"<html><head><title>Successful purge</title></head>"
"<body bgcolor=\"white\"><center><h1>Successful purge</h1>"
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTML response template hard-codes "Successful purge" in the <title>/

, but this handler can also return statuses like "queued" (background purge) and potentially other non-success states. This makes the response body misleading; consider using a neutral heading (e.g., "Cache purge") or deriving the heading from the actual status string.

Suggested change
"<html><head><title>Successful purge</title></head>"
"<body bgcolor=\"white\"><center><h1>Successful purge</h1>"
"<html><head><title>Cache purge</title></head>"
"<body bgcolor=\"white\"><center><h1>Cache purge</h1>"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

The shared HTML template now uses the neutral heading Cache purge instead of Successful purge, so queued/background responses no longer claim success in the title.

Added TEST 4a in t/background_queue.t to cover the queued HTML response explicitly, and verified it with a fresh make test-version VERSION=1.29.8 run.

denji and others added 3 commits April 25, 2026 21:37
The cache_purge_throttle_ms directive uses ngx_conf_set_msec_slot, which
treats bare integers as seconds per the standard nginx time-value
contract. This caused a value of "10" to be interpreted as 10
seconds instead of the intended 10 milliseconds.

This patch updates the documentation, internal comments, and test suites
to explicitly use the "ms" suffix (e.g., "10ms"). It also
clarifies that while the directive requires a suffix for millisecond
precision, the internal default constant is still handled as raw
milliseconds.

Resolved nginx-modules#61.
The cache_purge_response_type directive had a flawed duplicate detection
guard that only checked for NGX_HTTP_LOC_CONF. This silently allowed
duplicate directives in http and server contexts to overwrite previous
values without throwing a configuration error. Additionally, a defunct
guard attempting to check for NGX_HTTP_MODULE was comparing context
flags against module-type magic bits, rendering it dead code.

Remove the invalid NGX_HTTP_MODULE check and make duplicate detection
unconditional. The directive is intentionally valid in http, server,
and location contexts, and merge_loc_conf propagates the value downward
appropriately. The README context documentation is updated to reflect
this correctly.

Furthermore, rename the internal configuration variable from resptype
to response_type for clarity. Fix a typo in the response type macro
definitions (NGX_REPONSE_TYPE_* to NGX_CACHE_PURGE_RESPONSE_TYPE_*)
and properly namespace them.

Clarify the documentation and internal comments for the
cache_purge_throttle_ms directive. Bare integers are evaluated as
milliseconds natively to match the directive's name, which differs
from the standard nginx time parser, logging a startup warning.

Add comprehensive test coverage in config.t to ensure proper context
inheritance, duplicate config rejection, and explicit time suffix parsing.
@denji
Copy link
Copy Markdown
Member

denji commented Apr 26, 2026

@yaoge123 Thanks for the massive work here! I'm currently reviewing it. One architectural thought: rather than adding 2000+ lines to ngx_cache_purge_module.c, what do you think about splitting the new refresh mechanics into its own file (e.g., ngx_http_cache_purge_refresh.c or ngx_http_cache_refresh.c) with a shared .h header?

This would isolate the complex subrequest and validation logic from the legacy purge code, reducing the risk of regressions. We'll still compile it down to a single ngx_http_cache_purge_module.so. Let me know if you're open to this refactoring before the final review.

@yaoge123
Copy link
Copy Markdown
Author

I checked the current refresh implementation structure carefully and agree that a split will make the review and long-term maintenance safer.

I’m going to take the narrowest refactor first: move the refresh execution engine (scan/collect/subrequest/timeout/finalize flow) into a dedicated source file, while keeping the existing module commands, config parsing, access routing, and shared purge/invalidate glue in the main module file.

That should reduce the size and complexity of ngx_cache_purge_module.c without turning this into a broad architectural rewrite in the same PR. After the split I’ll rerun the existing Test::Nginx coverage before updating the PR again.

@yaoge123
Copy link
Copy Markdown
Author

Implemented the narrow split.The refresh execution engine now lives in ngx_http_cache_refresh.c, while module commands, config parsing, access routing, and the shared purge/invalidate glue remain in ngx_cache_purge_module.c.Fresh verification after the split:- make test-version VERSION=1.29.8- make test-compatThe full compatibility matrix passed after the refactor (1.20.2, 1.22.1, 1.24.0, 1.26.3, 1.28.3, 1.30.0, 1.29.8).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants