feat: SQL database-backed memoization cache #15938
Force-pushed from 9022071 to 7674c7f
@coderabbitai review
✅ Actions performed: Review triggered.

📝 Walkthrough

Introduces SQL database-backed memoization caching for PostgreSQL/MySQL as an alternative to Kubernetes ConfigMaps. Adds configuration structures, database migrations, query operations, cache factory integration, controller initialization, garbage collection scheduling, documentation, manifest examples, and end-to-end tests.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Workflow as Workflow Node
    participant Controller as Controller
    participant Cache as Cache Factory
    participant DB as SQL Database
    Workflow->>Controller: Execute node with memoize
    Controller->>Cache: GetCache(name)
    alt Database configured
        Cache->>DB: Load(namespace, cacheName, key)
        DB-->>Cache: CacheRecord or nil
        Cache-->>Controller: Entry or nil
    else No database
        Cache->>Cache: Return ConfigMap cache
    end
    alt Cache hit
        Controller-->>Workflow: Return memoized output
    else Cache miss
        Workflow->>Controller: Execute and produce output
        Controller->>Cache: Save(namespace, cacheName, key, outputs)
        Cache->>DB: Save record with outputs
        DB-->>Cache: Confirmed
    end
```

```mermaid
sequenceDiagram
    participant GC as GC Goroutine
    participant Cache as Queries
    participant DB as SQL Database
    loop Every MEMO_CACHE_GC_PERIOD
        GC->>Cache: Prune(maxAge=cacheTTL)
        Cache->>DB: DELETE WHERE last_hit_at < now - maxAge
        DB-->>Cache: Row count deleted
        Cache-->>GC: Rows pruned
        GC->>GC: Log result
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (4)
workflow/controller/cache/cache.go (1)
78-81: Stale comment for cache type constant.

The comment "Only config maps are currently supported for caching" is no longer accurate now that SQL database caching is also supported.
📝 Suggested fix
```diff
 const (
-	// Only config maps are currently supported for caching
+	// ConfigMapCache is a cache type identifier used as a key prefix in the cache map.
+	// Despite the name, when a session proxy is configured, SQL-backed caching is used instead.
 	ConfigMapCache Type = "ConfigMapCache"
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@workflow/controller/cache/cache.go` around lines 78 - 81, The comment above the cache type constant is stale: update the comment for the const block (where Type and ConfigMapCache are defined) to reflect that both ConfigMapCache and SQL/database-backed caching are supported now instead of saying "Only config maps are currently supported for caching"; edit the comment to accurately describe the supported cache types (e.g., config maps and SQL/database caches) so it matches the implementation.

docs/memoization.md (1)
106-106: Minor style improvement.

Consider a more concise phrasing.
📝 Suggested fix
```diff
-In order to use memoization with the ConfigMap backend it is necessary to add the verbs `create` and `update` to the `configmaps` resource for the appropriate (cluster) roles.
+To use memoization with the ConfigMap backend, add the verbs `create` and `update` to the `configmaps` resource for the appropriate (cluster) roles.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/memoization.md` at line 106, Reword the sentence about ConfigMap memoization to be more concise: replace the long sentence with a shorter one that states that using memoization with the ConfigMap backend requires adding the create and update verbs to the configmaps resource on the appropriate role (argo-cluster-role for cluster installs, argo-role for namespace installs), and note that this is unnecessary for the SQL backend; update the text around the symbols `configmaps`, `create`, `update`, `argo-cluster-role`, and `argo-role` accordingly.

test/e2e/sqldb_memoize_test.go (1)
142-142: Consider removing the unused import workaround.

The `_ = db.ErrNoMoreRows` statement exists solely to satisfy the import checker. If the `db` package isn't otherwise needed, consider removing the import entirely. If it's needed for clarity or documentation purposes, this workaround is acceptable but non-idiomatic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/sqldb_memoize_test.go` at line 142, The test contains a non-idiomatic unused-import workaround `_ = db.ErrNoMoreRows`; remove the workaround by deleting the import of the `db` package if it's not used elsewhere in test/e2e/sqldb_memoize_test.go, or if `db` is intentionally kept for documentation, replace the assignment with a comment explaining the reasons and ensure no references to `db.ErrNoMoreRows` remain; locate the import and the `_ = db.ErrNoMoreRows` line and either remove the import and the assignment or add a clear comment and use a real reference to `db` in the test to justify keeping the import.

util/memo/db/queries_test.go (1)
127-136: Consider increasing sleep duration to reduce potential test flakiness.

The 5ms sleep may not provide sufficient timing separation on slower CI runners or under load. PostgreSQL's timestamp precision is microseconds, but the actual DB round-trip latency could exceed 5ms in some cases, causing the `After` assertion to fail sporadically.

💡 Suggested improvement
```diff
 // Wait a moment so the second load gets a later timestamp.
-time.Sleep(5 * time.Millisecond)
+time.Sleep(50 * time.Millisecond)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@util/memo/db/queries_test.go` around lines 127 - 136, The 5ms pause in the test may be too short and causes flakiness; update the sleep in the test (the time.Sleep call used before calling q.Load to get `second`) to a larger duration (e.g., 50–100ms) or replace it with a small retry/wait loop that polls until `second.LastHitAt` is strictly after `first.LastHitAt`; target the time.Sleep invocation and the subsequent calls to q.Load/LastHitAt in this test so the assertion assert.True(second.LastHitAt.After(first.LastHitAt), ...) reliably passes.
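If a longer fixed sleep still proves flaky, a polling loop removes the timing assumption entirely. A sketch only, meant to drop into the existing test file (it assumes the file's `testing`, `time`, `context`, and `require` imports, and the `Queries`/`CacheRecord` shapes are guesses):

```go
// waitForLaterHit reloads the entry until its LastHitAt advances strictly
// past prev, then returns it; it fails the test if that never happens
// before the deadline. Each Load is assumed to bump last_hit_at, so the
// loop terminates on the first read with a later timestamp.
func waitForLaterHit(ctx context.Context, t *testing.T, q *Queries, ns, cacheName, key string, prev time.Time) *CacheRecord {
	t.Helper()
	deadline := time.Now().Add(2 * time.Second)
	for {
		rec, err := q.Load(ctx, ns, cacheName, key)
		require.NoError(t, err)
		if rec != nil && rec.LastHitAt.After(prev) {
			return rec
		}
		if time.Now().After(deadline) {
			t.Fatal("last_hit_at never advanced past the first load")
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```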
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/config.go`:
- Around line 360-372: The doc comment for MemoizationConfig.CacheTTL is
incorrect: current garbage collector logic in memoizationCacheGarbageCollector
treats TTL==0 as "use default (90 days)" and only negative values disable GC;
update the comment on the CacheTTL field (type TTL) to state that 0 means "use
default 90 days" and that a negative value disables GC (or alternatively, if you
prefer behavior from the comment, change memoizationCacheGarbageCollector to
treat 0 as disable and adjust tests). Reference MemoizationConfig, CacheTTL (TTL
type) and memoizationCacheGarbageCollector when making the change.
In `@docs/memoization.md`:
- Around line 57-58: Update the docs text for the cacheTTL behavior to match the
controller: state that cacheTTL defaults to "2160h" (90 days), that setting it
to "0" will use the default TTL, and that only negative values disable automatic
pruning; reference the cacheTTL field name so readers can correlate it with the
controller implementation.
In `@util/memo/db/config.go`:
- Around line 31-36: ConfigFromConfig may dereference a nil cfg via
cfg.SkipMigration; update ConfigFromConfig to guard against nil by checking if
cfg == nil and using a sensible default (e.g., false) for SkipMigration while
still calling TableName(cfg) (which already handles nil). Ensure you reference
ConfigFromConfig and the cfg.SkipMigration access when adding the nil-check and
setting the default SkipMigration value.
In `@util/memo/db/migrate.go`:
- Around line 11-25: The migrate function concatenates tableName into raw SQL
(sqldb.AnsiSQLChange) which risks SQL injection and creates a likely-redundant
index; update migrate so tableName is validated and/or safely quoted before
being used (e.g., allow only a strict regex like ^[A-Za-z0-9_]+$ and then use a
safe identifier-quoting helper or sqldb-provided quoting function if available)
and remove the redundant imemo_namespace_cache index change (the primary key
already covers (namespace, cache_name)); reference the migrate function, the
tableName parameter, sqldb.AnsiSQLChange invocations, and versionTable when
making these changes.
In `@workflow/controller/controller.go`:
- Around line 727-771: In memoizationCacheGarbageCollector, treat cache TTL == 0
as "disable GC" (log and return) instead of falling back to defaultCacheTTL;
preserve the existing negative-TTL disable behavior and only use defaultCacheTTL
when the configured TTL is omitted/unspecified, then ensure cfg :=
memodb.ConfigFromConfig(memoCfg) and queries := memodb.NewQueries(cfg.TableName)
are created once outside the for loop (so queries.Prune is called on the same
queries object each tick) — adjust the ttl logic around variables
defaultCacheTTL and ttl and move the cfg/queries setup outside the loop in the
memoizationCacheGarbageCollector function.
---
Nitpick comments:
In `@docs/memoization.md`:
- Line 106: Reword the sentence about ConfigMap memoization to be more concise:
replace the long sentence with a shorter one that states that using memoization
with the ConfigMap backend requires adding the create and update verbs to the
configmaps resource on the appropriate role (argo-cluster-role for cluster
installs, argo-role for namespace installs), and note that this is unnecessary
for the SQL backend; update the text around the symbols `configmaps`, `create`,
`update`, `argo-cluster-role`, and `argo-role` accordingly.
In `@test/e2e/sqldb_memoize_test.go`:
- Line 142: The test contains a non-idiomatic unused-import workaround `_ =
db.ErrNoMoreRows`; remove the workaround by deleting the import of the `db`
package if it's not used elsewhere in test/e2e/sqldb_memoize_test.go, or if `db`
is intentionally kept for documentation, replace the assignment with a comment
explaining the reasons and ensure no references to `db.ErrNoMoreRows` remain;
locate the import and the `_ = db.ErrNoMoreRows` line and either remove the
import and the assignment or add a clear comment and use a real reference to
`db` in the test to justify keeping the import.
In `@util/memo/db/queries_test.go`:
- Around line 127-136: The 5ms pause in the test may be too short and causes
flakiness; update the sleep in the test (the time.Sleep call used before calling
q.Load to get `second`) to a larger duration (e.g., 50–100ms) or replace it with
a small retry/wait loop that polls until `second.LastHitAt` is strictly after
`first.LastHitAt`; target the time.Sleep invocation and the subsequent calls to
q.Load/LastHitAt in this test so the assertion
assert.True(second.LastHitAt.After(first.LastHitAt), ...) reliably passes.
In `@workflow/controller/cache/cache.go`:
- Around line 78-81: The comment above the cache type constant is stale: update
the comment for the const block (where Type and ConfigMapCache are defined) to
reflect that both ConfigMapCache and SQL/database-backed caching are supported
now instead of saying "Only config maps are currently supported for caching";
edit the comment to accurately describe the supported cache types (e.g., config
maps and SQL/database caches) so it matches the implementation.
📒 Files selected for processing (17)
- .features/pending/sql-memoization-cache.md
- .spelling
- config/config.go
- docs/environment-variables.md
- docs/memoization.md
- docs/workflow-controller-configmap.md
- manifests/components/postgres/overlays/workflow-controller-configmap.yaml
- manifests/quick-start-postgres.yaml
- test/e2e/sqldb_memoize_test.go
- util/memo/db/config.go
- util/memo/db/migrate.go
- util/memo/db/queries.go
- util/memo/db/queries_test.go
- workflow/controller/cache/cache.go
- workflow/controller/cache/sqldb_cache.go
- workflow/controller/config.go
- workflow/controller/controller.go
Claude Code review

Found 18 issues:
- argo-workflows/docs/memoization.md, lines 41 to 55 in 0d9402b
- argo-workflows/util/memo/db/config.go, lines 23 to 36 in 0d9402b
- argo-workflows/workflow/controller/controller.go, lines 748 to 756 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 75 to 88 in 0d9402b
- argo-workflows/workflow/controller/config.go, lines 80 to 96 in 0d9402b
- argo-workflows/util/memo/db/config.go, lines 58 to 75 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 88 to 119 in 0d9402b
- argo-workflows/util/memo/db/migrate.go, lines 11 to 26 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 36 to 73 in 0d9402b
- argo-workflows/workflow/controller/cache/cache.go, lines 52 to 74 in 0d9402b
- argo-workflows/workflow/controller/cache/cache.go, lines 76 to 124 in 0d9402b
- argo-workflows/.features/pending/sql-memoization-cache.md, lines 1 to 10 in 0d9402b
- argo-workflows/test/e2e/sqldb_memoize_test.go, lines 140 to 144 in 0d9402b
- argo-workflows/docs/environment-variables.md, lines 27 to 31 in 0d9402b
- argo-workflows/workflow/controller/controller.go, lines 738 to 748 in 0d9402b
- argo-workflows/docs/memoization.md, lines 56 to 60 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 87 to 119 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 75 to 88 in 0d9402b
- argo-workflows/util/memo/db/config.go, lines 38 to 56 in 0d9402b
- argo-workflows/util/memo/db/migrate.go, lines 13 to 25 in 0d9402b

🤖 Generated with Claude Code
All 18 issues from the review have been addressed in 93b5b19:
Force-pushed from c9f9395 to c2d4d16

Force-pushed from e09b1c8 to 66573a9
Some important design questions that we should probably discuss:
We've already got that here, haven't we? Well, you can use the same one, or a different one, depending upon how you configure it.
You appear to have implemented multiple features here. It's big enough and AI enough as is without automatic key generation - please don't do that in this PR.
It's way bigger than I'd expect. We already have maxAge, which isn't perfect. Again, split this into a separate feature if you'd like TTL. It should apply to any backend equally and can be reasoned about separately.
Force-pushed from 60c89c4 to bb0f4fb
Add a SQL database-backed memoization cache alongside the existing ConfigMap-based implementation. This includes the database query and migration layer, controller wiring, auto-derived cache keys based on inputs and executor fingerprinting, per-entry expiration handling, and workflow-namespace scoping for cache entries. The branch also includes the supporting artifact checksum and ETag plumbing, generated API/docs updates, and follow-up fixes for error handling, nil-safety, and CI/documentation issues discovered during review.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
```yaml
passwordSecret:
  name: argo-postgres-config
  key: password
memoization:
```
Added memoization to quick-start-postgres in order to run the e2e sqldb_memoize_test.go test. Let me know if you prefer for that to happen in a dedicated overlay, e.g. quick-start-postgres-with-caching or something to that effect.
I'd like to e2e test this on both postgres and mysql. Happy to leave it configured here though, but needs duplicating for mysql.
```go
}

func (s *SQLDBMemoizeSuite) TestSQLDBMemoize() {
	if s.Config.Memoization == nil {
```
Can we drop this - always test the feature. Dangerous to have the only test for a feature accidentally stop running.
Manual validation evidence

Validate ConfigMap

I deployed AWF on my local orbstack-backed K8s cluster with the changes in this PR. Set up port-forwarding:

```shell
kubectl -n argo port-forward svc/argo-server 2746:2746
```

I ran the following e2e test:

```shell
GOFLAGS='-mod=mod' go test ./test/e2e -tags corefunctional -run 'TestFunctionalSuite/TestStepLevelMemoize$' -count=1 -v
```

It succeeded, and I verified that a CM was created.

Validate SQL-backed approach

Now that I've verified the default, CM-based approach works as expected, I moved on to the SQL-based approach. I enabled it with:

```shell
kubectl apply --server-side --force-conflicts \
  -f manifests/cluster-install-no-crds/workflow-controller-rbac/workflow-controller-clusterrole.yaml
make install PROFILE=postgres
kubectl -n argo scale deploy/workflow-controller deploy/argo-server --replicas=1
kubectl -n argo set env deploy/argo-server CI_ONLY_DISABLE_ARTIFACT_SERVER_CHECKS=true
kubectl -n argo rollout status deploy/postgres --timeout=300s
kubectl -n argo rollout status deploy/workflow-controller --timeout=300s
kubectl -n argo rollout status deploy/argo-server --timeout=300s
```

I validated that the SQL memoization was enabled in the config by running:

```shell
kubectl -n argo get cm workflow-controller-configmap -o jsonpath='{.data.memoization}'; echo
kubectl -n argo logs deploy/workflow-controller --tail=100 | egrep 'Memoization database configuration enabled|Running memoization db migration'
```

I port-forwarded the postgres service to local:

```shell
kubectl -n argo port-forward svc/postgres 5432:5432 --address localhost
```

I then kicked off the SQL-backed e2e test:

```shell
GOFLAGS='-mod=mod' go test ./test/e2e -tags sqldbmemoize -run '^TestSQLDBMemoizeSuite$' -count=1 -v
```

It succeeded. I ran the following command to verify that the cache key is stored in the postgres database:

```shell
kubectl -n argo exec svc/postgres -- \
  psql -U postgres -d postgres \
  -c "SELECT namespace, cache_name, cache_key, created_at, expires_at FROM cache_entries ORDER BY created_at DESC LIMIT 20;"
```
/retest
Summary
Add an opt-in SQL database backend for memoization cache storage while keeping the existing ConfigMap backend as the default.
When `workflow-controller-configmap` includes `memoization:`, memoization entries are stored in PostgreSQL or MySQL instead of ConfigMaps. When SQL is configured but currently unavailable, the controller now treats memoization as unavailable for that access (cache miss / skip save) instead of silently falling back to ConfigMaps, and retries backend initialization on later accesses.

Motivation
ConfigMap-backed memoization is simple, but it inherits Kubernetes object size limits and is awkward for larger or longer-lived caches. This change adds a database-backed path that supports larger payloads, controller-managed expiry/pruning, and namespace-scoped cache records without changing the workflow-side `memoize` API.

Modifications
Controller config and backend lifecycle
- Adds a `memoization:` controller config with the standard `DBConfig`, `tableName`, and `skipMigration` fields (sketched below)
- Retries backend initialization on later `getMemoizationCache()` calls if the database was unavailable during config load
- Returns a `nil` cache when SQL is configured but unavailable so callers log and treat the operation as a miss / skip-save instead of writing to ConfigMaps
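Taken together, those fields suggest a config shape roughly like the following Go sketch. Only the three field names come from this description; the struct name, JSON tags, and the embedded placement of the shared `DBConfig` type are assumptions.

```go
// MemoizationConfig is a hypothetical rendering of the memoization: block.
type MemoizationConfig struct {
	// DBConfig carries the standard database connection settings
	// (host, port, database, credentials) shared with other SQL features.
	DBConfig `json:",inline"`
	// TableName overrides the default cache table name.
	TableName string `json:"tableName,omitempty"`
	// SkipMigration disables automatic schema migration at startup.
	SkipMigration bool `json:"skipMigration,omitempty"`
}
```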
SQL storage model

- New `util/memo/db` migration and query layer
- Single table (default `cache_entries`) with: `namespace`, `cache_name`, `cache_key`, `node_id`, `outputs`, `created_at`, `expires_at` (approximate DDL sketched below)
- `(namespace, cache_name, cache_key)` as the primary key
- `expires_at` index for pruning
- Table names validated against `^[A-Za-z0-9_]+$`
- Postgres upsert (`ON CONFLICT`) and MySQL upsert (`ON DUPLICATE KEY UPDATE`)
- Resets `created_at` and `expires_at` when an existing cache entry is replaced
- Per-table migration versioning so a custom `tableName` does not reuse another table's schema history
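The columns and keys above imply DDL along these lines. This is a sketch only: column types, sizes, and the index name are assumptions, and the actual migration is dialect-aware rather than a single literal statement.

```go
// createCacheEntries approximates the schema described above.
const createCacheEntries = `
CREATE TABLE IF NOT EXISTS cache_entries (
    namespace  VARCHAR(253) NOT NULL,
    cache_name VARCHAR(253) NOT NULL,
    cache_key  VARCHAR(253) NOT NULL,
    node_id    VARCHAR(253) NOT NULL,
    outputs    TEXT         NOT NULL,
    created_at TIMESTAMP    NOT NULL,
    expires_at TIMESTAMP    NOT NULL,
    PRIMARY KEY (namespace, cache_name, cache_key)
);
CREATE INDEX idx_cache_entries_expires_at ON cache_entries (expires_at);`
```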
Cache routing and TTL behavior

- ConfigMap caching remains the default when `memoization:` is omitted
- Keeps `memoize.cache.configMap.name` as the logical cache name for both backends; it becomes `cache_name` in the database
- Extends `MemoizationCache.Save()` to accept `maxAge`
- Entry TTL comes from `memoize.maxAge`, or `DEFAULT_MAX_AGE`, or a 30-day default for the SQL backend (see the sketch below)
- SQL entries record `expires_at` at write time
- A GC goroutine runs every `MEMO_CACHE_GC_PERIOD` (default 24h) and prunes rows where `expires_at < now`
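The TTL resolution reads as first-match-wins; a sketch follows. The function and variable names are assumptions, and the real code may parse the duration strings differently than `time.ParseDuration` does.

```go
import (
	"os"
	"time"
)

// resolveEntryTTL picks the SQL entry TTL: the node's memoize.maxAge if set,
// else the DEFAULT_MAX_AGE env var, else a 30-day fallback.
func resolveEntryTTL(memoizeMaxAge string) time.Duration {
	if memoizeMaxAge != "" {
		if d, err := time.ParseDuration(memoizeMaxAge); err == nil {
			return d
		}
	}
	if v := os.Getenv("DEFAULT_MAX_AGE"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 30 * 24 * time.Hour
}

// expires_at is then computed at write time: time.Now().Add(ttl).
```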
Workflow execution paths

- Threads `maxAge` through `operator.go`, `dag.go`, `steps.go`, and `taskset.go`
- Invalid `maxAge` or DB errors still surface as node/controller errors

Mermaid diagrams
Backend selection / routing
```mermaid
flowchart TD
    A[Workflow node uses memoize] --> B[get memoization cache]
    B --> C{memoization config set}
    C -->|no| D[use ConfigMap cache]
    C -->|yes| E[ensure SQL backend]
    E --> F{SQL backend ready}
    F -->|yes| G[use SQL cache]
    F -->|no| H[return nil cache]
    D --> I[load or save via ConfigMap]
    G --> J[load or save via SQL]
    H --> K[treat lookup as cache miss]
    H --> L[skip cache save]
```
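In Go terms, that routing might look roughly like this. All identifiers here are assumptions drawn from the flowchart, not the PR's actual code.

```go
// getMemoizationCache routes between backends per the flowchart: ConfigMap
// when no memoization: config exists, SQL when the backend is ready, and a
// nil cache (miss / skip-save) when SQL is configured but unavailable.
func (c *Controller) getMemoizationCache(ns, name string) MemoizationCache {
	if c.config.Memoization == nil {
		return c.cacheFactory.GetConfigMapCache(ns, name)
	}
	if err := c.ensureMemoizationBackend(); err != nil {
		// SQL configured but unavailable: callers treat the lookup as a
		// miss and skip saving, rather than falling back to ConfigMaps.
		c.log.WithError(err).Warn("SQL memoization backend unavailable")
		return nil
	}
	return c.cacheFactory.GetSQLCache(ns, name)
}
```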
ConfigMap path

```mermaid
sequenceDiagram
    participant N as Workflow node
    participant C as WorkflowController
    participant F as cacheFactory
    participant M as ConfigMapCache
    participant K as Kubernetes API
    N->>C: getMemoizationCache(ns, cacheName)
    C->>F: GetCache(ConfigMapCache, ns, cacheName)
    F-->>C: ConfigMapCache
    alt load
        C->>M: Load(key)
        M->>K: GET ConfigMap
        K-->>M: ConfigMap / NotFound
        alt entry exists
            M->>K: UPDATE lastHitTimestamp
            M-->>C: Entry
        else miss
            M-->>C: nil
        end
    else save
        Note over M: maxAge is ignored for ConfigMap storage
        C->>M: Save(key, nodeID, outputs, maxAge)
        M->>K: GET or CREATE ConfigMap
        M->>K: UPDATE entry JSON
        M-->>C: ok / error
    end
```
SQL path

```mermaid
sequenceDiagram
    participant N as Workflow node
    participant C as WorkflowController
    participant F as cacheFactory
    participant S as sqlDBCache
    participant D as SQL database
    participant G as GC ticker
    N->>C: getMemoizationCache(ns, cacheName)
    C->>C: ensureMemoizationBackend()
    alt backend not ready yet
        C->>D: connect and run migration
        D-->>C: ready or unavailable
    end
    alt SQL backend ready
        C->>F: GetCache(ns, cacheName)
        F-->>C: sqlDBCache
        alt load
            C->>S: Load(key)
            S->>D: select active entry
            D-->>S: row or no row
            S-->>C: entry or nil
        else save
            C->>S: Save(key, nodeID, outputs, maxAge)
            S->>D: upsert row with created_at and expires_at
            D-->>S: ok
            S-->>C: ok or error
        end
        G->>D: delete expired rows on each GC tick
    else SQL configured but unavailable
        C-->>N: nil cache
        C-->>N: caller treats it as miss or skip-save
    end
```
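The save step's upsert differs per dialect, as the storage-model bullets note. Roughly, it might read as below; the placeholders, column order, and update list are assumptions, not the PR's literal queries.

```go
// Dialect-specific upserts implied by the description. A replaced entry
// gets fresh created_at/expires_at values, matching the reset behavior
// described under the storage model.
const upsertPostgres = `
INSERT INTO cache_entries (namespace, cache_name, cache_key, node_id, outputs, created_at, expires_at)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT (namespace, cache_name, cache_key) DO UPDATE SET
    node_id = EXCLUDED.node_id, outputs = EXCLUDED.outputs,
    created_at = EXCLUDED.created_at, expires_at = EXCLUDED.expires_at`

const upsertMySQL = `
INSERT INTO cache_entries (namespace, cache_name, cache_key, node_id, outputs, created_at, expires_at)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    node_id = VALUES(node_id), outputs = VALUES(outputs),
    created_at = VALUES(created_at), expires_at = VALUES(expires_at)`
```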
Current semantics

- Behavior is unchanged when `memoization:` is not configured
- `memoize.key` and `memoize.cache.configMap.name` remain the workflow-side inputs
- With the SQL backend, `cache.configMap.name` becomes a logical cache group name; no ConfigMap is created
- Even when `memoize.maxAge` is omitted, SQL entries still get a default TTL (`DEFAULT_MAX_AGE` or 30 days)

Verification
- Unit tests for `util/memo/db/queries.go` (Postgres and MySQL save/load/upsert/expiry/prune)
- Unit tests for `workflow/controller/cache/sqldb_cache.go`
- Tests for `workflow/controller/config.go` retry behavior

Documentation
- Updated `docs/memoization.md` with SQL backend usage, TTL behavior, and env vars
- Updated `docs/environment-variables.md` for `DEFAULT_MAX_AGE` and `MEMO_CACHE_GC_PERIOD`
- Updated `docs/workflow-controller-configmap.md` with `MemoizationConfig`

AI
Claude Code was used in earlier preparation of this PR. GitHub Copilot CLI was used to update this PR description.