Skip to content

fix(apiserver): resolve VD GET cache-miss via live RD re-read (aligns with RD GET)#172

Merged
Andrei Kvapil (kvaps) merged 1 commit into
mainfrom
harden/store-cache-miss-fallback
Jul 2, 2026
Merged

fix(apiserver): resolve VD GET cache-miss via live RD re-read (aligns with RD GET)#172
Andrei Kvapil (kvaps) merged 1 commit into
mainfrom
harden/store-cache-miss-fallback

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jul 1, 2026

Copy link
Copy Markdown
Member

What

Volume-definitions are embedded in the RD CRD spec, so after the v0.1.15 RD-404 fix the same committed RD answered differently by path on a multi-replica apiserver: GET /v1/resource-definitions/{rd} resolved through the RD store's uncached fallback, while GET .../volume-definitions/{vn} read only the lagging informer cache and 404'd — observed live on the cozystack e2e stand (GET .../volume-definitions/0 -> 404 with the REST cache-retry budget exhausted). volumeDefinitions.Get now does one live re-read via the existing fetchRDLive on a cache miss OR a stale cached RD revision, mirroring resourceDefinitions.Get.

Why the scope is VD-only (down from the first revision)

The first revision extended the uncached fallback to all stores (resources/nodes/resource-groups/storage-pools/snapshots). CI caught a real regression: the snapshot-pagination test failed (page2: got 1 entries, want 2) because the store-level fallback short-circuits the REST layer's get*WithCacheRetry convergence wait — the create flow's read-back resolved uncached while the cached List still under-reported. The architecture is deliberately two-tier:

  • raw store Gets are cache-only: REST existence probes rely on a fast cached NotFound, and read-your-writes flows rely on Get/List coherence;
  • cross-replica lag is absorbed at the REST layer by get*WithCacheRetry, which POLLS the cached read so subsequent cached Lists stay coherent.

The VD case is the exception because it is a consistency alignment with the already-released RD behaviour (same underlying CRD object), and its 404 was observed live with the REST retry budget exhausted. NewWithAPIReader's doc comment now pins this boundary for future readers.

Testing

L1 regression tests (fail-on-bug proven: both FAIL on the pre-fix tree, PASS with the fix): the stale-RD embedded-VD mode, the RD-cache-miss mode, and nil-reader NotFound preservation. Full pkg/store/k8s green, lint 0 issues; the two integration tests that failed on the first revision's CI run (TestGroupJ/CSIListSnapshotsPagination, TestGroupKWFSpawnAndDependentReAutoplace) pass 3/3 locally with this revision.

Note on tiers: store-layer robustness change, not a CLI-verb change — the cache-lag race cannot be reproduced deterministically on the single-replica L6/L7 stand, so L1 internal tests are the appropriate tier (same as the v0.1.15 RD-404 fix).

Summary by CodeRabbit

  • Bug Fixes

    • Improved volume-definition lookups to better handle stale cache misses and return the latest available data when possible.
    • Reduced false “not found” results when cached data lags behind recent changes.
    • Preserved existing behavior when live re-read access is unavailable or when the error is unrelated to a cache miss.
  • Tests

    • Added coverage for cache-miss and stale-cache scenarios to verify lookup behavior with and without live fallback access.
  • Documentation

    • Clarified lookup fallback behavior and consistency expectations in the related API notes.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 55fe7b37-0627-4c03-bec6-7dcbf25d71fd

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9501e and 29e71d5.

📒 Files selected for processing (3)
  • pkg/store/k8s/k8s.go
  • pkg/store/k8s/store_cache_miss_fallback_internal_test.go
  • pkg/store/k8s/volume_definitions.go

📝 Walkthrough

Walkthrough

Adds a cache-miss fallback to volumeDefinitions.Get that retries with a live (uncached) RD read via apiReader when the cached read returns NotFound, introduces a vdByNumber lookup helper, expands related documentation, and adds internal unit tests covering stale-RD and RD-cache-miss scenarios.

Changes

Cache-miss fallback for VolumeDefinitions.Get

Layer / File(s) Summary
Fallback logic and lookup helper
pkg/store/k8s/volume_definitions.go
Get now retries with a live uncached RD re-read when the cached fetch returns NotFound and apiReader is configured, using new vdByNumber helper to locate the volume definition on both cached and live RD reads.
Fallback behavior documentation
pkg/store/k8s/k8s.go
Expands the NewWithAPIReader documentation comment to explain the uncached fallback's purpose and constraints, clarifying it should not be generalized to other store Get operations.
Unit tests for fallback scenarios
pkg/store/k8s/store_cache_miss_fallback_internal_test.go
Adds fake-client test helpers and two tests validating fallback behavior for a stale RD cache and a missing RD in cache, both with and without apiReader configured.

Estimated code review effort: 3 (Moderate) | ~20 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant VolumeDefinitionsGet as volumeDefinitions.Get
  participant CachedClient
  participant APIReader

  Caller->>VolumeDefinitionsGet: Get(volumeNumber)
  VolumeDefinitionsGet->>CachedClient: fetch RD (cached)
  CachedClient-->>VolumeDefinitionsGet: NotFound / stale RD
  alt apiReader configured and NotFound
    VolumeDefinitionsGet->>APIReader: live re-read RD
    APIReader-->>VolumeDefinitionsGet: fresh RD
    VolumeDefinitionsGet->>VolumeDefinitionsGet: vdByNumber(freshRD)
  else no apiReader or other error
    VolumeDefinitionsGet->>VolumeDefinitionsGet: return error immediately
  end
  VolumeDefinitionsGet-->>Caller: VolumeDefinition or ErrNotFound
Loading
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch harden/store-cache-miss-fallback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements cache-miss fallback logic across various Kubernetes store implementations (nodes, resourceGroups, resources, snapshots, storagePools, and volumeDefinitions) to prevent spurious NotFound errors caused by multi-replica apiserver cache lag. When a resource is not found in the informer cache, the store now falls back to a direct live read using the apiReader. A comprehensive set of unit tests has been added to verify this fallback behavior. I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

… with RD GET)

Volume-definitions are embedded in the RD CRD spec, so after the
v0.1.15 RD-404 fix the SAME committed RD answered differently by
path on a multi-replica apiserver: `GET /v1/resource-definitions/{rd}`
resolved through the RD store's uncached fallback while
`GET .../volume-definitions/{vn}` read only the lagging informer cache
and 404'd — observed live on the cozystack e2e stand
(`GET .../volume-definitions/0 -> 404`, REST cache-retry budget
exhausted). volumeDefinitions.Get now does one live re-read via the
existing fetchRDLive on a cache miss OR a stale cached RD revision,
mirroring resourceDefinitions.Get.

Deliberately NOT extended to the other stores: raw store Gets are
cache-only by design — REST existence probes rely on a fast cached
NotFound, and get*WithCacheRetry absorbs cross-replica lag at the REST
layer while preserving read-your-writes for subsequent cached Lists.
A store-level fallback short-circuits that convergence wait: trying it
across all stores made the CI snapshot-pagination test regress (the
create flow's read-back resolved uncached while the cached List still
under-reported). NewWithAPIReader's doc now pins this boundary.

Regression tests (L1, fail-on-bug proven — both FAIL on the pre-fix
tree): the stale-RD embedded-VD mode, the RD-cache-miss mode, and
nil-reader NotFound preservation.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) force-pushed the harden/store-cache-miss-fallback branch from acf3160 to 29e71d5 Compare July 2, 2026 00:14
@kvaps Andrei Kvapil (kvaps) changed the title fix(apiserver): complete the cache-miss NotFound fallback across all store read paths fix(apiserver): resolve VD GET cache-miss via live RD re-read (aligns with RD GET) Jul 2, 2026
@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review July 2, 2026 01:03
@kvaps Andrei Kvapil (kvaps) merged commit 56dadb3 into main Jul 2, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant