feat: SQL database-backed memoization cache #15938
Force-pushed from 9022071 to 7674c7f
@coderabbitai review
✅ Actions performed: Review triggered.

📝 Walkthrough

Introduces SQL database-backed memoization caching for PostgreSQL/MySQL as an alternative to Kubernetes ConfigMaps. Adds configuration structures, database migrations, query operations, cache factory integration, controller initialization, garbage collection scheduling, documentation, manifest examples, and end-to-end tests.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Workflow as Workflow Node
    participant Controller as Controller
    participant Cache as Cache Factory
    participant DB as SQL Database
    Workflow->>Controller: Execute node with memoize
    Controller->>Cache: GetCache(name)
    alt Database configured
        Cache->>DB: Load(namespace, cacheName, key)
        DB-->>Cache: CacheRecord or nil
        Cache-->>Controller: Entry or nil
    else No database
        Cache->>Cache: Return ConfigMap cache
    end
    alt Cache hit
        Controller-->>Workflow: Return memoized output
    else Cache miss
        Workflow->>Controller: Execute and produce output
        Controller->>Cache: Save(namespace, cacheName, key, outputs)
        Cache->>DB: Save record with outputs
        DB-->>Cache: Confirmed
    end
```

```mermaid
sequenceDiagram
    participant GC as GC Goroutine
    participant Cache as Queries
    participant DB as SQL Database
    loop Every MEMO_CACHE_GC_PERIOD
        GC->>Cache: Prune(maxAge=cacheTTL)
        Cache->>DB: DELETE WHERE last_hit_at < now - maxAge
        DB-->>Cache: Row count deleted
        Cache-->>GC: Rows pruned
        GC->>GC: Log result
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (4)
workflow/controller/cache/cache.go (1)
78-81: Stale comment for cache type constant.

The comment "Only config maps are currently supported for caching" is no longer accurate now that SQL database caching is also supported.
📝 Suggested fix
```diff
 const (
-	// Only config maps are currently supported for caching
+	// ConfigMapCache is a cache type identifier used as a key prefix in the cache map.
+	// Despite the name, when a session proxy is configured, SQL-backed caching is used instead.
 	ConfigMapCache Type = "ConfigMapCache"
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@workflow/controller/cache/cache.go` around lines 78 - 81, The comment above the cache type constant is stale: update the comment for the const block (where Type and ConfigMapCache are defined) to reflect that both ConfigMapCache and SQL/database-backed caching are supported now instead of saying "Only config maps are currently supported for caching"; edit the comment to accurately describe the supported cache types (e.g., config maps and SQL/database caches) so it matches the implementation.

docs/memoization.md (1)
106-106: Minor style improvement.

Consider a more concise phrasing.
📝 Suggested fix
```diff
-In order to use memoization with the ConfigMap backend it is necessary to add the verbs `create` and `update` to the `configmaps` resource for the appropriate (cluster) roles.
+To use memoization with the ConfigMap backend, add the verbs `create` and `update` to the `configmaps` resource for the appropriate (cluster) roles.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/memoization.md` at line 106, Reword the sentence about ConfigMap memoization to be more concise: replace the long sentence with a shorter one that states that using memoization with the ConfigMap backend requires adding the create and update verbs to the configmaps resource on the appropriate role (argo-cluster-role for cluster installs, argo-role for namespace installs), and note that this is unnecessary for the SQL backend; update the text around the symbols `configmaps`, `create`, `update`, `argo-cluster-role`, and `argo-role` accordingly.

test/e2e/sqldb_memoize_test.go (1)
142-142: Consider removing the unused import workaround.

The `_ = db.ErrNoMoreRows` statement exists solely to satisfy the import checker. If the `db` package isn't otherwise needed, consider removing the import entirely. If it's needed for clarity or documentation purposes, this workaround is acceptable but non-idiomatic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/sqldb_memoize_test.go` at line 142, The test contains a non-idiomatic unused-import workaround `_ = db.ErrNoMoreRows`; remove the workaround by deleting the import of the `db` package if it's not used elsewhere in test/e2e/sqldb_memoize_test.go, or if `db` is intentionally kept for documentation, replace the assignment with a comment explaining the reasons and ensure no references to `db.ErrNoMoreRows` remain; locate the import and the `_ = db.ErrNoMoreRows` line and either remove the import and the assignment or add a clear comment and use a real reference to `db` in the test to justify keeping the import.

util/memo/db/queries_test.go (1)
127-136: Consider increasing sleep duration to reduce potential test flakiness.

The 5ms sleep may not provide sufficient timing separation on slower CI runners or under load. PostgreSQL's timestamp precision is microseconds, but the actual DB round-trip latency could exceed 5ms in some cases, causing the `After` assertion to fail sporadically.

💡 Suggested improvement
```diff
 // Wait a moment so the second load gets a later timestamp.
-time.Sleep(5 * time.Millisecond)
+time.Sleep(50 * time.Millisecond)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@util/memo/db/queries_test.go` around lines 127 - 136, The 5ms pause in the test may be too short and causes flakiness; update the sleep in the test (the time.Sleep call used before calling q.Load to get `second`) to a larger duration (e.g., 50–100ms) or replace it with a small retry/wait loop that polls until `second.LastHitAt` is strictly after `first.LastHitAt`; target the time.Sleep invocation and the subsequent calls to q.Load/LastHitAt in this test so the assertion assert.True(second.LastHitAt.After(first.LastHitAt), ...) reliably passes.
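If a longer fixed sleep still proves flaky, a polling loop removes the timing assumption entirely. A sketch only, meant to drop into the existing test file (it assumes the file's `testing`, `time`, `context`, and `require` imports, and the `Queries`/`CacheRecord` shapes are guesses):

```go
// waitForLaterHit reloads the entry until its LastHitAt advances strictly
// past prev, then returns it; it fails the test if that never happens
// before the deadline. Each Load is assumed to bump last_hit_at, so the
// loop terminates on the first read with a later timestamp.
func waitForLaterHit(ctx context.Context, t *testing.T, q *Queries, ns, cacheName, key string, prev time.Time) *CacheRecord {
	t.Helper()
	deadline := time.Now().Add(2 * time.Second)
	for {
		rec, err := q.Load(ctx, ns, cacheName, key)
		require.NoError(t, err)
		if rec != nil && rec.LastHitAt.After(prev) {
			return rec
		}
		if time.Now().After(deadline) {
			t.Fatal("last_hit_at never advanced past the first load")
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```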
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/config.go`:
- Around line 360-372: The doc comment for MemoizationConfig.CacheTTL is
incorrect: current garbage collector logic in memoizationCacheGarbageCollector
treats TTL==0 as "use default (90 days)" and only negative values disable GC;
update the comment on the CacheTTL field (type TTL) to state that 0 means "use
default 90 days" and that a negative value disables GC (or alternatively, if you
prefer behavior from the comment, change memoizationCacheGarbageCollector to
treat 0 as disable and adjust tests). Reference MemoizationConfig, CacheTTL (TTL
type) and memoizationCacheGarbageCollector when making the change.
In `@docs/memoization.md`:
- Around line 57-58: Update the docs text for the cacheTTL behavior to match the
controller: state that cacheTTL defaults to "2160h" (90 days), that setting it
to "0" will use the default TTL, and that only negative values disable automatic
pruning; reference the cacheTTL field name so readers can correlate it with the
controller implementation.
In `@util/memo/db/config.go`:
- Around line 31-36: ConfigFromConfig may dereference a nil cfg via
cfg.SkipMigration; update ConfigFromConfig to guard against nil by checking if
cfg == nil and using a sensible default (e.g., false) for SkipMigration while
still calling TableName(cfg) (which already handles nil). Ensure you reference
ConfigFromConfig and the cfg.SkipMigration access when adding the nil-check and
setting the default SkipMigration value.
In `@util/memo/db/migrate.go`:
- Around line 11-25: The migrate function concatenates tableName into raw SQL
(sqldb.AnsiSQLChange) which risks SQL injection and creates a likely-redundant
index; update migrate so tableName is validated and/or safely quoted before
being used (e.g., allow only a strict regex like ^[A-Za-z0-9_]+$ and then use a
safe identifier-quoting helper or sqldb-provided quoting function if available)
and remove the redundant imemo_namespace_cache index change (the primary key
already covers (namespace, cache_name)); reference the migrate function, the
tableName parameter, sqldb.AnsiSQLChange invocations, and versionTable when
making these changes.
In `@workflow/controller/controller.go`:
- Around line 727-771: In memoizationCacheGarbageCollector, treat cache TTL == 0
as "disable GC" (log and return) instead of falling back to defaultCacheTTL;
preserve the existing negative-TTL disable behavior and only use defaultCacheTTL
when the configured TTL is omitted/unspecified, then ensure cfg :=
memodb.ConfigFromConfig(memoCfg) and queries := memodb.NewQueries(cfg.TableName)
are created once outside the for loop (so queries.Prune is called on the same
queries object each tick) — adjust the ttl logic around variables
defaultCacheTTL and ttl and move the cfg/queries setup outside the loop in the
memoizationCacheGarbageCollector function.
---
Nitpick comments:
In `@docs/memoization.md`:
- Line 106: Reword the sentence about ConfigMap memoization to be more concise:
replace the long sentence with a shorter one that states that using memoization
with the ConfigMap backend requires adding the create and update verbs to the
configmaps resource on the appropriate role (argo-cluster-role for cluster
installs, argo-role for namespace installs), and note that this is unnecessary
for the SQL backend; update the text around the symbols `configmaps`, `create`,
`update`, `argo-cluster-role`, and `argo-role` accordingly.
In `@test/e2e/sqldb_memoize_test.go`:
- Line 142: The test contains a non-idiomatic unused-import workaround `_ =
db.ErrNoMoreRows`; remove the workaround by deleting the import of the `db`
package if it's not used elsewhere in test/e2e/sqldb_memoize_test.go, or if `db`
is intentionally kept for documentation, replace the assignment with a comment
explaining the reasons and ensure no references to `db.ErrNoMoreRows` remain;
locate the import and the `_ = db.ErrNoMoreRows` line and either remove the
import and the assignment or add a clear comment and use a real reference to
`db` in the test to justify keeping the import.
In `@util/memo/db/queries_test.go`:
- Around line 127-136: The 5ms pause in the test may be too short and causes
flakiness; update the sleep in the test (the time.Sleep call used before calling
q.Load to get `second`) to a larger duration (e.g., 50–100ms) or replace it with
a small retry/wait loop that polls until `second.LastHitAt` is strictly after
`first.LastHitAt`; target the time.Sleep invocation and the subsequent calls to
q.Load/LastHitAt in this test so the assertion
assert.True(second.LastHitAt.After(first.LastHitAt), ...) reliably passes.
In `@workflow/controller/cache/cache.go`:
- Around line 78-81: The comment above the cache type constant is stale: update
the comment for the const block (where Type and ConfigMapCache are defined) to
reflect that both ConfigMapCache and SQL/database-backed caching are supported
now instead of saying "Only config maps are currently supported for caching";
edit the comment to accurately describe the supported cache types (e.g., config
maps and SQL/database caches) so it matches the implementation.
📒 Files selected for processing (17)
- .features/pending/sql-memoization-cache.md
- .spelling
- config/config.go
- docs/environment-variables.md
- docs/memoization.md
- docs/workflow-controller-configmap.md
- manifests/components/postgres/overlays/workflow-controller-configmap.yaml
- manifests/quick-start-postgres.yaml
- test/e2e/sqldb_memoize_test.go
- util/memo/db/config.go
- util/memo/db/migrate.go
- util/memo/db/queries.go
- util/memo/db/queries_test.go
- workflow/controller/cache/cache.go
- workflow/controller/cache/sqldb_cache.go
- workflow/controller/config.go
- workflow/controller/controller.go
Claude Code review

Found 18 issues:
- argo-workflows/docs/memoization.md, lines 41 to 55 in 0d9402b
- argo-workflows/util/memo/db/config.go, lines 23 to 36 in 0d9402b
- argo-workflows/workflow/controller/controller.go, lines 748 to 756 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 75 to 88 in 0d9402b
- argo-workflows/workflow/controller/config.go, lines 80 to 96 in 0d9402b
- argo-workflows/util/memo/db/config.go, lines 58 to 75 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 88 to 119 in 0d9402b
- argo-workflows/util/memo/db/migrate.go, lines 11 to 26 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 36 to 73 in 0d9402b
- argo-workflows/workflow/controller/cache/cache.go, lines 52 to 74 in 0d9402b
- argo-workflows/workflow/controller/cache/cache.go, lines 76 to 124 in 0d9402b
- argo-workflows/.features/pending/sql-memoization-cache.md, lines 1 to 10 in 0d9402b
- argo-workflows/test/e2e/sqldb_memoize_test.go, lines 140 to 144 in 0d9402b
- argo-workflows/docs/environment-variables.md, lines 27 to 31 in 0d9402b
- argo-workflows/workflow/controller/controller.go, lines 738 to 748 in 0d9402b
- argo-workflows/docs/memoization.md, lines 56 to 60 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 87 to 119 in 0d9402b
- argo-workflows/util/memo/db/queries.go, lines 75 to 88 in 0d9402b
- argo-workflows/util/memo/db/config.go, lines 38 to 56 in 0d9402b
- argo-workflows/util/memo/db/migrate.go, lines 13 to 25 in 0d9402b

🤖 Generated with Claude Code
All 18 issues from the review have been addressed in 93b5b19:
Force-pushed from c9f9395 to c2d4d16

Force-pushed from e09b1c8 to 66573a9
Some important design questions that we should probably discuss:
We've already got that here, haven't we? Well, you can use the same one, or a different one, depending upon how you configure it.
You appear to have implemented multiple features here. It's big enough and AI enough as is without automatic key generation - please don't do that in this PR.
It's way bigger than I'd expect. We already have maxAge, which isn't perfect. Again, split this into a separate feature if you'd like TTL. It should apply to any backend equally and can be reasoned about separately.
Force-pushed from 60c89c4 to bb0f4fb
Add a SQL database-backed memoization cache alongside the existing ConfigMap-based implementation. This includes the database query and migration layer, controller wiring, auto-derived cache keys based on inputs and executor fingerprinting, per-entry expiration handling, and workflow-namespace scoping for cache entries. The branch also includes the supporting artifact checksum and ETag plumbing, generated API/docs updates, and follow-up fixes for error handling, nil-safety, and CI/documentation issues discovered during review.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
```yaml
passwordSecret:
  name: argo-postgres-config
  key: password
memoization:
```
Added memoization to quick-start-postgres in order to run the e2e sqldb_memoize_test.go test. Let me know if you prefer for that to happen in a dedicated overlay, e.g. quick-start-postgres-with-caching or something to that effect.
I'd like to e2e test this on both postgres and mysql. Happy to leave it configured here though, but needs duplicating for mysql.
```go
}

func (s *SQLDBMemoizeSuite) TestSQLDBMemoize() {
	if s.Config.Memoization == nil {
```
Can we drop this - always test the feature. Dangerous to have the only test for a feature accidentally stop running.
Manual validation evidence

Validate ConfigMap

I deployed AWF on my local orbstack-backed K8s cluster with the changes in this PR. Set up port-forwarding:

```shell
kubectl -n argo port-forward svc/argo-server 2746:2746
```

I ran the following e2e test:

```shell
GOFLAGS='-mod=mod' go test ./test/e2e -tags corefunctional -run 'TestFunctionalSuite/TestStepLevelMemoize$' -count=1 -v
```

It succeeded, and I verified that a CM was created.

Validate SQL-backed approach

Now that I've verified the default, CM-based approach works as expected, I moved on to the SQL-based approach. I enabled it with:

```shell
kubectl apply --server-side --force-conflicts \
  -f manifests/cluster-install-no-crds/workflow-controller-rbac/workflow-controller-clusterrole.yaml
make install PROFILE=postgres
kubectl -n argo scale deploy/workflow-controller deploy/argo-server --replicas=1
kubectl -n argo set env deploy/argo-server CI_ONLY_DISABLE_ARTIFACT_SERVER_CHECKS=true
kubectl -n argo rollout status deploy/postgres --timeout=300s
kubectl -n argo rollout status deploy/workflow-controller --timeout=300s
kubectl -n argo rollout status deploy/argo-server --timeout=300s
```

I validated that the SQL memoization was enabled in the config by running:

```shell
kubectl -n argo get cm workflow-controller-configmap -o jsonpath='{.data.memoization}'; echo
kubectl -n argo logs deploy/workflow-controller --tail=100 | egrep 'Memoization database configuration enabled|Running memoization db migration'
```

I port-forwarded the postgres service to local:

```shell
kubectl -n argo port-forward svc/postgres 5432:5432 --address localhost
```

I then kicked off the SQL-backed e2e test:

```shell
GOFLAGS='-mod=mod' go test ./test/e2e -tags sqldbmemoize -run '^TestSQLDBMemoizeSuite$' -count=1 -v
```

It succeeded. I ran the following command to verify that the cache key is stored in the postgres database:

```shell
kubectl -n argo exec svc/postgres -- \
  psql -U postgres -d postgres \
  -c "SELECT namespace, cache_name, cache_key, created_at, expires_at FROM cache_entries ORDER BY created_at DESC LIMIT 20;"
```
/retest
Summary
Add an opt-in SQL database backend for memoization cache storage while keeping the existing ConfigMap backend as the default.
When `workflow-controller-configmap` includes `memoization:`, memoization entries are stored in PostgreSQL or MySQL instead of ConfigMaps. When SQL is configured but currently unavailable, the controller now treats memoization as unavailable for that access (cache miss / skip save) instead of silently falling back to ConfigMaps, and retries backend initialization on later accesses.

Motivation
ConfigMap-backed memoization is simple, but it inherits Kubernetes object size limits and is awkward for larger or longer-lived caches. This change adds a database-backed path that supports larger payloads, controller-managed expiry/pruning, and namespace-scoped cache records without changing the workflow-side `memoize` API.

Modifications
Controller config and backend lifecycle
- Adds a `memoization:` controller config with the standard `DBConfig`, `tableName`, and `skipMigration` fields (sketched below)
- Retries backend initialization on later `getMemoizationCache()` calls if the database was unavailable during config load
- Returns a `nil` cache when SQL is configured but unavailable so callers log and treat the operation as a miss / skip-save instead of writing to ConfigMaps
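Taken together, those fields suggest a config shape roughly like the following Go sketch. Only the three field names come from this description; the struct name, JSON tags, and the embedded placement of the shared `DBConfig` type are assumptions.

```go
// MemoizationConfig is a hypothetical rendering of the memoization: block.
type MemoizationConfig struct {
	// DBConfig carries the standard database connection settings
	// (host, port, database, credentials) shared with other SQL features.
	DBConfig `json:",inline"`
	// TableName overrides the default cache table name.
	TableName string `json:"tableName,omitempty"`
	// SkipMigration disables automatic schema migration at startup.
	SkipMigration bool `json:"skipMigration,omitempty"`
}
```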
SQL storage model

- New `util/memo/db` migration and query layer
- Single table (default `cache_entries`) with: `namespace`, `cache_name`, `cache_key`, `node_id`, `outputs`, `created_at`, `expires_at` (approximate DDL sketched below)
- `(namespace, cache_name, cache_key)` as the primary key
- `expires_at` index for pruning
- Table names validated against `^[A-Za-z0-9_]+$`
- Postgres upsert (`ON CONFLICT`) and MySQL upsert (`ON DUPLICATE KEY UPDATE`)
- Resets `created_at` and `expires_at` when an existing cache entry is replaced
- Per-table migration versioning so a custom `tableName` does not reuse another table's schema history
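The columns and keys above imply DDL along these lines. This is a sketch only: column types, sizes, and the index name are assumptions, and the actual migration is dialect-aware rather than a single literal statement.

```go
// createCacheEntries approximates the schema described above.
const createCacheEntries = `
CREATE TABLE IF NOT EXISTS cache_entries (
    namespace  VARCHAR(253) NOT NULL,
    cache_name VARCHAR(253) NOT NULL,
    cache_key  VARCHAR(253) NOT NULL,
    node_id    VARCHAR(253) NOT NULL,
    outputs    TEXT         NOT NULL,
    created_at TIMESTAMP    NOT NULL,
    expires_at TIMESTAMP    NOT NULL,
    PRIMARY KEY (namespace, cache_name, cache_key)
);
CREATE INDEX idx_cache_entries_expires_at ON cache_entries (expires_at);`
```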
Cache routing and TTL behavior

- ConfigMap caching remains the default when `memoization:` is omitted
- Keeps `memoize.cache.configMap.name` as the logical cache name for both backends; it becomes `cache_name` in the database
- Extends `MemoizationCache.Save()` to accept `maxAge`
- Entry TTL comes from `memoize.maxAge`, or `DEFAULT_MAX_AGE`, or a 30-day default for the SQL backend (see the sketch below)
- SQL entries record `expires_at` at write time
- A GC goroutine runs every `MEMO_CACHE_GC_PERIOD` (default 24h) and prunes rows where `expires_at < now`
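The TTL resolution reads as first-match-wins; a sketch follows. The function and variable names are assumptions, and the real code may parse the duration strings differently than `time.ParseDuration` does.

```go
import (
	"os"
	"time"
)

// resolveEntryTTL picks the SQL entry TTL: the node's memoize.maxAge if set,
// else the DEFAULT_MAX_AGE env var, else a 30-day fallback.
func resolveEntryTTL(memoizeMaxAge string) time.Duration {
	if memoizeMaxAge != "" {
		if d, err := time.ParseDuration(memoizeMaxAge); err == nil {
			return d
		}
	}
	if v := os.Getenv("DEFAULT_MAX_AGE"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 30 * 24 * time.Hour
}

// expires_at is then computed at write time: time.Now().Add(ttl).
```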
Workflow execution paths

- Threads `maxAge` through `operator.go`, `dag.go`, `steps.go`, and `taskset.go`
- Invalid `maxAge` or DB errors still surface as node/controller errors

Mermaid diagrams
Backend selection / routing
```mermaid
flowchart TD
    A[Workflow node uses memoize] --> B[get memoization cache]
    B --> C{memoization config set}
    C -->|no| D[use ConfigMap cache]
    C -->|yes| E[ensure SQL backend]
    E --> F{SQL backend ready}
    F -->|yes| G[use SQL cache]
    F -->|no| H[return nil cache]
    D --> I[load or save via ConfigMap]
    G --> J[load or save via SQL]
    H --> K[treat lookup as cache miss]
    H --> L[skip cache save]
```
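In Go terms, that routing might look roughly like this. All identifiers here are assumptions drawn from the flowchart, not the PR's actual code.

```go
// getMemoizationCache routes between backends per the flowchart: ConfigMap
// when no memoization: config exists, SQL when the backend is ready, and a
// nil cache (miss / skip-save) when SQL is configured but unavailable.
func (c *Controller) getMemoizationCache(ns, name string) MemoizationCache {
	if c.config.Memoization == nil {
		return c.cacheFactory.GetConfigMapCache(ns, name)
	}
	if err := c.ensureMemoizationBackend(); err != nil {
		// SQL configured but unavailable: callers treat the lookup as a
		// miss and skip saving, rather than falling back to ConfigMaps.
		c.log.WithError(err).Warn("SQL memoization backend unavailable")
		return nil
	}
	return c.cacheFactory.GetSQLCache(ns, name)
}
```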
ConfigMap path

```mermaid
sequenceDiagram
    participant N as Workflow node
    participant C as WorkflowController
    participant F as cacheFactory
    participant M as ConfigMapCache
    participant K as Kubernetes API
    N->>C: getMemoizationCache(ns, cacheName)
    C->>F: GetCache(ConfigMapCache, ns, cacheName)
    F-->>C: ConfigMapCache
    alt load
        C->>M: Load(key)
        M->>K: GET ConfigMap
        K-->>M: ConfigMap / NotFound
        alt entry exists
            M->>K: UPDATE lastHitTimestamp
            M-->>C: Entry
        else miss
            M-->>C: nil
        end
    else save
        Note over M: maxAge is ignored for ConfigMap storage
        C->>M: Save(key, nodeID, outputs, maxAge)
        M->>K: GET or CREATE ConfigMap
        M->>K: UPDATE entry JSON
        M-->>C: ok / error
    end
```
SQL path

```mermaid
sequenceDiagram
    participant N as Workflow node
    participant C as WorkflowController
    participant F as cacheFactory
    participant S as sqlDBCache
    participant D as SQL database
    participant G as GC ticker
    N->>C: getMemoizationCache(ns, cacheName)
    C->>C: ensureMemoizationBackend()
    alt backend not ready yet
        C->>D: connect and run migration
        D-->>C: ready or unavailable
    end
    alt SQL backend ready
        C->>F: GetCache(ns, cacheName)
        F-->>C: sqlDBCache
        alt load
            C->>S: Load(key)
            S->>D: select active entry
            D-->>S: row or no row
            S-->>C: entry or nil
        else save
            C->>S: Save(key, nodeID, outputs, maxAge)
            S->>D: upsert row with created_at and expires_at
            D-->>S: ok
            S-->>C: ok or error
        end
        G->>D: delete expired rows on each GC tick
    else SQL configured but unavailable
        C-->>N: nil cache
        C-->>N: caller treats it as miss or skip-save
    end
```
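The save step's upsert differs per dialect, as the storage-model bullets note. Roughly, it might read as below; the placeholders, column order, and update list are assumptions, not the PR's literal queries.

```go
// Dialect-specific upserts implied by the description. A replaced entry
// gets fresh created_at/expires_at values, matching the reset behavior
// described under the storage model.
const upsertPostgres = `
INSERT INTO cache_entries (namespace, cache_name, cache_key, node_id, outputs, created_at, expires_at)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT (namespace, cache_name, cache_key) DO UPDATE SET
    node_id = EXCLUDED.node_id, outputs = EXCLUDED.outputs,
    created_at = EXCLUDED.created_at, expires_at = EXCLUDED.expires_at`

const upsertMySQL = `
INSERT INTO cache_entries (namespace, cache_name, cache_key, node_id, outputs, created_at, expires_at)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    node_id = VALUES(node_id), outputs = VALUES(outputs),
    created_at = VALUES(created_at), expires_at = VALUES(expires_at)`
```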
Current semantics

- Behavior is unchanged when `memoization:` is not configured
- `memoize.key` and `memoize.cache.configMap.name` remain the workflow-side inputs
- With the SQL backend, `cache.configMap.name` becomes a logical cache group name; no ConfigMap is created
- Even when `memoize.maxAge` is omitted, SQL entries still get a default TTL (`DEFAULT_MAX_AGE` or 30 days)

Verification
- Unit tests for `util/memo/db/queries.go` (Postgres and MySQL save/load/upsert/expiry/prune)
- Unit tests for `workflow/controller/cache/sqldb_cache.go`
- Tests for `workflow/controller/config.go` retry behavior

Documentation
- Updated `docs/memoization.md` with SQL backend usage, TTL behavior, and env vars
- Updated `docs/environment-variables.md` for `DEFAULT_MAX_AGE` and `MEMO_CACHE_GC_PERIOD`
- Updated `docs/workflow-controller-configmap.md` with `MemoizationConfig`

AI
Claude Code was used in earlier preparation of this PR. GitHub Copilot CLI was used to update this PR description.