feat: SQL database-backed memoization cache #15938

Open
droctothorpe wants to merge 14 commits into argoproj:main from droctothorpe:main

Conversation

@droctothorpe
Contributor

@droctothorpe droctothorpe commented Apr 15, 2026

Summary

Add an opt-in SQL database backend for memoization cache storage while keeping the existing ConfigMap backend as the default.

When workflow-controller-configmap includes memoization:, memoization entries are stored in PostgreSQL or MySQL instead of ConfigMaps. When SQL is configured but currently unavailable, the controller now treats memoization as unavailable for that access (cache miss / skip save) instead of silently falling back to ConfigMaps, and retries backend initialization on later accesses.
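For illustration only, a sketch of what such a controller-configmap entry might look like; the connection fields follow the standard Argo `DBConfig` shape, and the `tableName` placement and values here are placeholders rather than verified excerpts from this PR:

```yaml
# Hypothetical sketch of a workflow-controller-configmap memoization entry.
# Connection fields follow the standard DBConfig shape; values are placeholders.
data:
  memoization: |
    postgresql:
      host: postgres
      port: 5432
      database: postgres
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
    tableName: cache_entries
```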

Motivation

ConfigMap-backed memoization is simple, but it inherits Kubernetes object size limits and is awkward for larger or longer-lived caches. This change adds a database-backed path that supports larger payloads, controller-managed expiry/pruning, and namespace-scoped cache records without changing the workflow-side memoize API.

Modifications

Controller config and backend lifecycle

  • Add memoization: controller config with the standard DBConfig, tableName, and skipMigration fields
  • Initialize and migrate the memoization backend from controller config
  • Retry backend initialization on later getMemoizationCache() calls if the database was unavailable during config load
  • Clear or close SQL state when memoization config is removed or migration/connection setup fails
  • Return nil cache when SQL is configured but unavailable so callers log and treat the operation as a miss / skip-save instead of writing to ConfigMaps
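The nil-cache contract in the last bullet can be sketched in isolation; `MemoizationCache` and `lookup` here are illustrative stand-ins, not the PR's actual types:

```go
package main

import "fmt"

// MemoizationCache is a minimal stand-in for the controller's cache interface
// (hypothetical; the real interface lives under workflow/controller/cache).
type MemoizationCache interface {
	Load(key string) (outputs string, hit bool)
}

// lookup shows the guard described above: a nil cache (SQL configured but
// unavailable) is treated as a miss rather than panicking or falling back
// to ConfigMaps. Saves are skipped the same way.
func lookup(c MemoizationCache, key string) (string, bool) {
	if c == nil {
		return "", false
	}
	return c.Load(key)
}

func main() {
	out, hit := lookup(nil, "some-key")
	fmt.Printf("outputs=%q hit=%v\n", out, hit) // outputs="" hit=false
}
```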

SQL storage model

  • Add util/memo/db migration and query layer
  • Create the configured memoization table (default: cache_entries) with:
    • namespace
    • cache_name
    • cache_key
    • node_id
    • outputs
    • created_at
    • expires_at
  • Use (namespace, cache_name, cache_key) as the primary key
  • Add an expires_at index for pruning
  • Validate table names against ^[A-Za-z0-9_]+$
  • Use parameterized queries for all runtime values
  • Support Postgres upsert (ON CONFLICT) and MySQL upsert (ON DUPLICATE KEY UPDATE)
  • Refresh created_at and expires_at when an existing cache entry is replaced
  • Use table-specific migration bookkeeping so changing the memoization tableName does not reuse another table's schema history
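The identifier validation mentioned above can be sketched with a plain regexp check, since table names cannot be passed as query parameters; `checkTableName` is an illustrative helper name, not the PR's:

```go
package main

import (
	"fmt"
	"regexp"
)

// validTableName enforces the rule stated above: table names are restricted to
// ASCII letters, digits, and underscores, so the name can safely appear in DDL
// that cannot be parameterized.
var validTableName = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// checkTableName is a hypothetical helper illustrating the validation.
func checkTableName(name string) error {
	if !validTableName.MatchString(name) {
		return fmt.Errorf("invalid memoization table name %q", name)
	}
	return nil
}

func main() {
	fmt.Println(checkTableName("cache_entries") == nil)            // true
	fmt.Println(checkTableName(`x"; drop table users; --`) == nil) // false
}
```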

Cache routing and TTL behavior

  • Keep ConfigMap caching as the default when memoization: is omitted
  • Keep memoize.cache.configMap.name as the logical cache name for both backends
  • For SQL, no ConfigMap is created; that name is stored as cache_name in the database
  • Extend MemoizationCache.Save() to accept maxAge
  • Resolve TTL from memoize.maxAge, or DEFAULT_MAX_AGE, or a 30-day default for the SQL backend
  • Store SQL expiry as expires_at at write time
  • Run SQL cache GC on MEMO_CACHE_GC_PERIOD (default 24h) and prune rows where expires_at < now

Workflow execution paths

  • Wire the new save/load behavior through operator.go, dag.go, steps.go, and taskset.go
  • Keep load/save error handling explicit: malformed maxAge or DB errors still surface as node/controller errors
  • Guard memoization call sites so an unavailable SQL backend does not panic the controller

Mermaid diagrams

Backend selection / routing

```mermaid
flowchart TD
  A[Workflow node uses memoize] --> B[get memoization cache]
  B --> C{memoization config set}
  C -->|no| D[use ConfigMap cache]
  C -->|yes| E[ensure SQL backend]
  E --> F{SQL backend ready}
  F -->|yes| G[use SQL cache]
  F -->|no| H[return nil cache]
  D --> I[load or save via ConfigMap]
  G --> J[load or save via SQL]
  H --> K[treat lookup as cache miss]
  H --> L[skip cache save]
```

ConfigMap path

```mermaid
sequenceDiagram
  participant N as Workflow node
  participant C as WorkflowController
  participant F as cacheFactory
  participant M as ConfigMapCache
  participant K as Kubernetes API

  N->>C: getMemoizationCache(ns, cacheName)
  C->>F: GetCache(ConfigMapCache, ns, cacheName)
  F-->>C: ConfigMapCache

  alt load
    C->>M: Load(key)
    M->>K: GET ConfigMap
    K-->>M: ConfigMap / NotFound
    alt entry exists
      M->>K: UPDATE lastHitTimestamp
      M-->>C: Entry
    else miss
      M-->>C: nil
    end
  else save
    Note over M: maxAge is ignored for ConfigMap storage
    C->>M: Save(key, nodeID, outputs, maxAge)
    M->>K: GET or CREATE ConfigMap
    M->>K: UPDATE entry JSON
    M-->>C: ok / error
  end
```

SQL path

```mermaid
sequenceDiagram
  participant N as Workflow node
  participant C as WorkflowController
  participant F as cacheFactory
  participant S as sqlDBCache
  participant D as SQL database
  participant G as GC ticker

  N->>C: getMemoizationCache(ns, cacheName)
  C->>C: ensureMemoizationBackend()
  alt backend not ready yet
    C->>D: connect and run migration
    D-->>C: ready or unavailable
  end

  alt SQL backend ready
    C->>F: GetCache(ns, cacheName)
    F-->>C: sqlDBCache

    alt load
      C->>S: Load(key)
      S->>D: select active entry
      D-->>S: row or no row
      S-->>C: entry or nil
    else save
      C->>S: Save(key, nodeID, outputs, maxAge)
      S->>D: upsert row with created_at and expires_at
      D-->>S: ok
      S-->>C: ok or error
    end

    G->>D: delete expired rows on each GC tick
  else SQL configured but unavailable
    C-->>N: nil cache
    C-->>N: caller treats it as miss or skip-save
  end
```

Current semantics

  • ConfigMap remains the default backend when memoization: is not configured
  • memoize.key and memoize.cache.configMap.name remain the workflow-side inputs
  • On the SQL backend, cache.configMap.name becomes a logical cache group name; no ConfigMap is created
  • SQL stores expiry at write time. If memoize.maxAge is omitted, SQL entries still get a default TTL (DEFAULT_MAX_AGE or 30 days)

Verification

  • Added or updated unit coverage for:
    • util/memo/db/queries.go (Postgres and MySQL save/load/upsert/expiry/prune)
    • workflow/controller/cache/sqldb_cache.go
    • workflow/controller/config.go retry behavior
    • cache factory namespace scoping
  • Added corefunctional e2e coverage for SQL memoization:
    • first execution saves to the DB
    • second execution hits the cache
    • no ConfigMap is created when SQL memoization is enabled

Documentation

  • Update docs/memoization.md with SQL backend usage, TTL behavior, and env vars
  • Update docs/environment-variables.md for DEFAULT_MAX_AGE and MEMO_CACHE_GC_PERIOD
  • Update docs/workflow-controller-configmap.md with MemoizationConfig
  • Add memoization examples to the Postgres controller manifests

AI

Claude Code was used in earlier preparation of this PR. GitHub Copilot CLI was used to update this PR description.

@droctothorpe droctothorpe force-pushed the main branch 6 times, most recently from 9022071 to 7674c7f on April 15, 2026 at 23:35
@Joibel
Member

Joibel commented Apr 16, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

Introduces SQL database-backed memoization caching for PostgreSQL/MySQL as an alternative to Kubernetes ConfigMaps. Adds configuration structures, database migrations, query operations, cache factory integration, controller initialization, garbage collection scheduling, documentation, manifest examples, and end-to-end tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Feature Documentation**<br>`.features/pending/sql-memoization-cache.md`, `docs/memoization.md`, `docs/workflow-controller-configmap.md`, `docs/environment-variables.md` | Added documentation describing the SQL backend memoization cache option, removal of the ConfigMap 1 MB size limit, cacheTTL control, GC scheduling via MEMO_CACHE_GC_PERIOD, and updated RBAC requirements to specify ConfigMap-specific verbs. |
| **Configuration**<br>`config/config.go`, `.spelling` | Extended Config struct with Memoization field; introduced MemoizationConfig type embedding DBConfig with TableName, SkipMigration, and CacheTTL fields; updated spellcheck dictionary. |
| **Manifest Files**<br>`manifests/components/postgres/overlays/workflow-controller-configmap.yaml`, `manifests/quick-start-postgres.yaml` | Added memoization configuration blocks specifying PostgreSQL connection parameters, table names, and Kubernetes secret references for credentials in ConfigMap and quick-start manifests. |
| **Memoization Database Module**<br>`util/memo/db/config.go`, `util/memo/db/migrate.go`, `util/memo/db/queries.go`, `util/memo/db/queries_test.go` | New database abstraction layer with config helpers, schema migration logic (table creation, index management), and CRUD query operations (Load, Save, Prune) for memoization cache entries; includes integration tests using a throwaway PostgreSQL container. |
| **Cache Factory & SQL Backend**<br>`workflow/controller/cache/cache.go`, `workflow/controller/cache/sqldb_cache.go` | Extended Factory interface with SetSessionProxy method; added sqlDBCache type implementing a memoization cache backed by a SQL database with key validation, record marshaling/unmarshaling, and timestamp handling. |
| **Controller Integration**<br>`workflow/controller/config.go`, `workflow/controller/controller.go` | Initialized memoization database session proxy and schema migrations in updateConfig; added memoizationCacheGarbageCollector goroutine to periodically prune expired entries based on MEMO_CACHE_GC_PERIOD and configured TTL. |
| **End-to-End Testing**<br>`test/e2e/sqldb_memoize_test.go` | New e2e test suite validating SQL-backed memoization behavior: workflow submission with cache hits/saves, database persistence verification via direct DB query, and assertion that no ConfigMap fallback occurs. |

Sequence Diagrams

```mermaid
sequenceDiagram
    participant Workflow as Workflow Node
    participant Controller as Controller
    participant Cache as Cache Factory
    participant DB as SQL Database

    Workflow->>Controller: Execute node with memoize
    Controller->>Cache: GetCache(name)
    alt Database configured
        Cache->>DB: Load(namespace, cacheName, key)
        DB-->>Cache: CacheRecord or nil
        Cache-->>Controller: Entry or nil
    else No database
        Cache->>Cache: Return ConfigMap cache
    end

    alt Cache hit
        Controller-->>Workflow: Return memoized output
    else Cache miss
        Workflow->>Controller: Execute and produce output
        Controller->>Cache: Save(namespace, cacheName, key, outputs)
        Cache->>DB: Save record with outputs
        DB-->>Cache: Confirmed
    end
```
```mermaid
sequenceDiagram
    participant GC as GC Goroutine
    participant Cache as Queries
    participant DB as SQL Database

    loop Every MEMO_CACHE_GC_PERIOD
        GC->>Cache: Prune(maxAge=cacheTTL)
        Cache->>DB: DELETE WHERE last_hit_at < now - maxAge
        DB-->>Cache: Row count deleted
        Cache-->>GC: Rows pruned
        GC->>GC: Log result
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.53%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding a SQL database-backed memoization cache feature, which is the central addition across the codebase. |
| Description check | ✅ Passed | The PR description comprehensively documents the feature, including motivation, modifications, architecture diagrams, verification, and documentation updates. |



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
workflow/controller/cache/cache.go (1)

78-81: Stale comment for cache type constant.

The comment "Only config maps are currently supported for caching" is no longer accurate now that SQL database caching is also supported.

📝 Suggested fix
```diff
 const (
-	// Only config maps are currently supported for caching
+	// ConfigMapCache is a cache type identifier used as a key prefix in the cache map.
+	// Despite the name, when a session proxy is configured, SQL-backed caching is used instead.
 	ConfigMapCache Type = "ConfigMapCache"
 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflow/controller/cache/cache.go` around lines 78 - 81, The comment above
the cache type constant is stale: update the comment for the const block (where
Type and ConfigMapCache are defined) to reflect that both ConfigMapCache and
SQL/database-backed caching are supported now instead of saying "Only config
maps are currently supported for caching"; edit the comment to accurately
describe the supported cache types (e.g., config maps and SQL/database caches)
so it matches the implementation.
docs/memoization.md (1)

106-106: Minor style improvement.

Consider a more concise phrasing.

📝 Suggested fix
```diff
-    In order to use memoization with the ConfigMap backend it is necessary to add the verbs `create` and `update` to the `configmaps` resource for the appropriate (cluster) roles.
+    To use memoization with the ConfigMap backend, add the verbs `create` and `update` to the `configmaps` resource for the appropriate (cluster) roles.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/memoization.md` at line 106, Reword the sentence about ConfigMap
memoization to be more concise: replace the long sentence with a shorter one
that states that using memoization with the ConfigMap backend requires adding
the create and update verbs to the configmaps resource on the appropriate role
(argo-cluster-role for cluster installs, argo-role for namespace installs), and
note that this is unnecessary for the SQL backend; update the text around the
symbols `configmaps`, `create`, `update`, `argo-cluster-role`, and `argo-role`
accordingly.
test/e2e/sqldb_memoize_test.go (1)

142-142: Consider removing the unused import workaround.

The _ = db.ErrNoMoreRows statement exists solely to satisfy the import checker. If the db package isn't otherwise needed, consider removing the import entirely. If it's needed for clarity or documentation purposes, this workaround is acceptable but non-idiomatic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/sqldb_memoize_test.go` at line 142, The test contains a
non-idiomatic unused-import workaround `_ = db.ErrNoMoreRows`; remove the
workaround by deleting the import of the `db` package if it's not used elsewhere
in test/e2e/sqldb_memoize_test.go, or if `db` is intentionally kept for
documentation, replace the assignment with a comment explaining the reasons and
ensure no references to `db.ErrNoMoreRows` remain; locate the import and the `_
= db.ErrNoMoreRows` line and either remove the import and the assignment or add
a clear comment and use a real reference to `db` in the test to justify keeping
the import.
util/memo/db/queries_test.go (1)

127-136: Consider increasing sleep duration to reduce potential test flakiness.

The 5ms sleep may not provide sufficient timing separation on slower CI runners or under load. PostgreSQL's timestamp precision is microseconds, but the actual DB round-trip latency could exceed 5ms in some cases, causing the After assertion to fail sporadically.

💡 Suggested improvement
```diff
 	// Wait a moment so the second load gets a later timestamp.
-	time.Sleep(5 * time.Millisecond)
+	time.Sleep(50 * time.Millisecond)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@util/memo/db/queries_test.go` around lines 127 - 136, The 5ms pause in the
test may be too short and causes flakiness; update the sleep in the test (the
time.Sleep call used before calling q.Load to get `second`) to a larger duration
(e.g., 50–100ms) or replace it with a small retry/wait loop that polls until
`second.LastHitAt` is strictly after `first.LastHitAt`; target the time.Sleep
invocation and the subsequent calls to q.Load/LastHitAt in this test so the
assertion assert.True(second.LastHitAt.After(first.LastHitAt), ...) reliably
passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/config.go`:
- Around line 360-372: The doc comment for MemoizationConfig.CacheTTL is
incorrect: current garbage collector logic in memoizationCacheGarbageCollector
treats TTL==0 as "use default (90 days)" and only negative values disable GC;
update the comment on the CacheTTL field (type TTL) to state that 0 means "use
default 90 days" and that a negative value disables GC (or alternatively, if you
prefer behavior from the comment, change memoizationCacheGarbageCollector to
treat 0 as disable and adjust tests). Reference MemoizationConfig, CacheTTL (TTL
type) and memoizationCacheGarbageCollector when making the change.

In `@docs/memoization.md`:
- Around line 57-58: Update the docs text for the cacheTTL behavior to match the
controller: state that cacheTTL defaults to "2160h" (90 days), that setting it
to "0" will use the default TTL, and that only negative values disable automatic
pruning; reference the cacheTTL field name so readers can correlate it with the
controller implementation.

In `@util/memo/db/config.go`:
- Around line 31-36: ConfigFromConfig may dereference a nil cfg via
cfg.SkipMigration; update ConfigFromConfig to guard against nil by checking if
cfg == nil and using a sensible default (e.g., false) for SkipMigration while
still calling TableName(cfg) (which already handles nil). Ensure you reference
ConfigFromConfig and the cfg.SkipMigration access when adding the nil-check and
setting the default SkipMigration value.

In `@util/memo/db/migrate.go`:
- Around line 11-25: The migrate function concatenates tableName into raw SQL
(sqldb.AnsiSQLChange) which risks SQL injection and creates a likely-redundant
index; update migrate so tableName is validated and/or safely quoted before
being used (e.g., allow only a strict regex like ^[A-Za-z0-9_]+$ and then use a
safe identifier-quoting helper or sqldb-provided quoting function if available)
and remove the redundant imemo_namespace_cache index change (the primary key
already covers (namespace, cache_name)); reference the migrate function, the
tableName parameter, sqldb.AnsiSQLChange invocations, and versionTable when
making these changes.

In `@workflow/controller/controller.go`:
- Around line 727-771: In memoizationCacheGarbageCollector, treat cache TTL == 0
as "disable GC" (log and return) instead of falling back to defaultCacheTTL;
preserve the existing negative-TTL disable behavior and only use defaultCacheTTL
when the configured TTL is omitted/unspecified, then ensure cfg :=
memodb.ConfigFromConfig(memoCfg) and queries := memodb.NewQueries(cfg.TableName)
are created once outside the for loop (so queries.Prune is called on the same
queries object each tick) — adjust the ttl logic around variables
defaultCacheTTL and ttl and move the cfg/queries setup outside the loop in the
memoizationCacheGarbageCollector function.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2dfc6cab-ef64-4e7d-88b4-524b575c2deb

📥 Commits

Reviewing files that changed from the base of the PR and between 51ea54b and 0d9402b.

📒 Files selected for processing (17)
  • .features/pending/sql-memoization-cache.md
  • .spelling
  • config/config.go
  • docs/environment-variables.md
  • docs/memoization.md
  • docs/workflow-controller-configmap.md
  • manifests/components/postgres/overlays/workflow-controller-configmap.yaml
  • manifests/quick-start-postgres.yaml
  • test/e2e/sqldb_memoize_test.go
  • util/memo/db/config.go
  • util/memo/db/migrate.go
  • util/memo/db/queries.go
  • util/memo/db/queries_test.go
  • workflow/controller/cache/cache.go
  • workflow/controller/cache/sqldb_cache.go
  • workflow/controller/config.go
  • workflow/controller/controller.go

@Joibel
Copy link
Copy Markdown
Member

Joibel commented Apr 16, 2026

Claude Code review

Found 18 issues:

  1. `tableName` in every doc example and overlay manifest is nested inside `postgresql:`/`mysql:` blocks, but the code reads only the top-level `MemoizationConfig.TableName`. The nested value is silently ignored; it only appears to work because the value matches the default.

```yaml
data:
  memoization: |
    postgresql:
      host: postgres
      port: 5432
      database: postgres
      tableName: memoization_cache
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
    cacheTTL: "2160h"
```

```go
func TableName(cfg *config.MemoizationConfig) string {
	if cfg == nil || cfg.TableName == "" {
		return defaultTableName
	}
	return cfg.TableName
}

func ConfigFromConfig(cfg *config.MemoizationConfig) Config {
	return Config{
		TableName:     TableName(cfg),
		SkipMigration: cfg.SkipMigration,
	}
}
```

  2. `MEMO_CACHE_GC_PERIOD=0s` causes `time.NewTicker(0)` to panic. The sibling ConfigMap GC explicitly guards with `if cacheGCPeriod != 0`; this new goroutine does not.

```go
periodicity := env.LookupEnvDurationOr(ctx, "MEMO_CACHE_GC_PERIOD", 24*time.Hour)
logger.WithFields(logging.Fields{"ttl": ttl, "periodicity": periodicity}).Info(ctx, "Starting memoization cache GC")
ticker := time.NewTicker(periodicity)
defer ticker.Stop()

cfg := memodb.ConfigFromConfig(memoCfg)
queries := memodb.NewQueries(cfg.TableName)
```

  3. `Prune` calls `sp.Session().SQL().DeleteFrom(...).Exec()` directly, bypassing `sp.With()`; the `SessionProxy.Session()` godoc explicitly warns this path loses reconnect-on-error. Running on a 24h ticker, the connection is very likely to have gone idle, so failures then persist until controller restart. `Load`/`Save` correctly use `TxWith`.

```go
// Prune deletes cache entries whose last_hit_at is older than maxAge. It is called periodically
// by the controller to bound the size of the memoization_cache table.
func (q *Queries) Prune(ctx context.Context, sp *sqldb.SessionProxy, maxAge time.Duration) (int64, error) {
	cutoff := time.Now().Add(-maxAge)
	result, err := sp.Session().SQL().
		DeleteFrom(q.tableName).
		Where("last_hit_at < ?", cutoff).
		Exec()
	if err != nil {
		return 0, err
	}
	n, err := result.RowsAffected()
	return n, err
}
```

  4. When the memoization block is removed from the ConfigMap at runtime, the `else` branch only logs; it never calls `cacheFactory.SetSessionProxy(nil, "")`, so SQL routing continues until controller restart. Also, a failed initial DB connection leaves `memoSessionProxy=nil` and the `if wfc.memoSessionProxy == nil` guard never retries.

```go
if memoCfg := wfc.Config.Memoization; memoCfg != nil {
	logger.Info(ctx, "Memoization database configuration enabled")
	if wfc.memoSessionProxy == nil {
		sp := memodb.SessionProxyFromConfig(ctx, wfc.kubeclientset, wfc.namespace, memoCfg)
		wfc.memoSessionProxy = sp
	}
	if wfc.memoSessionProxy != nil {
		cfg := memodb.ConfigFromConfig(memoCfg)
		memodb.Migrate(ctx, wfc.memoSessionProxy, cfg)
		wfc.cacheFactory.SetSessionProxy(wfc.memoSessionProxy, cfg.TableName)
	}
} else {
	logger.Info(ctx, "Memoization database configuration disabled; using ConfigMap-based caching")
}
```

  5. `memodb.Migrate` logs migration failure as a warning but returns no error, and the caller still proceeds to `SetSessionProxy`. Compare `util/sync/db/config.go`, which clears the SessionProxy on failure, and persistence `initDB`, which is fatal. Subsequent `Load`/`Save` calls will fail silently on every memoized node.

```go
// Migrate runs database migrations for the memoization cache table. It is a no-op when
// cfg.SkipMigration is true.
func Migrate(ctx context.Context, sessionProxy *sqldb.SessionProxy, cfg Config) {
	if sessionProxy == nil {
		return
	}
	logger := logging.RequireLoggerFromContext(ctx)
	if cfg.SkipMigration {
		logger.Info(ctx, "Memoization db migration skipped")
		return
	}
	logger.Info(ctx, "Running memoization db migration")
	if err := migrate(ctx, sessionProxy.Session(), sessionProxy.DBType(), cfg.TableName); err != nil {
		logger.WithError(err).Warn(ctx, "memoization db migration failed; database-backed memoization will not work")
	} else {
		logger.Info(ctx, "Memoization db migration complete")
	}
}
```

  6. Outputs are stored as raw text via `json.Marshal`. Postgres rejects `\u0000` null bytes in text columns; the PR "fix: search and replace null bytes for postgres only. Fixes #13711" (#14030) fixed exactly this class of bug in `persist/sqldb/workflow_archive.go`, but the same null-byte handling is not applied here. Memoized step outputs can contain null-byte escapes.

```go
func (q *Queries) Save(ctx context.Context, sp *sqldb.SessionProxy, namespace, cacheName, key, nodeID string, outputs *wfv1.Outputs) error {
	outputsJSON, err := json.Marshal(outputs)
	if err != nil {
		return fmt.Errorf("unable to marshal memoization outputs: %w", err)
	}
	now := time.Now()
	return sp.TxWith(ctx, func(txProxy *sqldb.SessionProxy) error {
		sess := txProxy.Session()
		_, err := sess.SQL().
			DeleteFrom(q.tableName).
			Where(db.Cond{
				"namespace":  namespace,
				"cache_name": cacheName,
				"key":        key,
			}).
			Exec()
		if err != nil {
			return err
		}
		_, err = sess.Collection(q.tableName).Insert(&CacheRecord{
			Namespace: namespace,
			CacheName: cacheName,
			Key:       key,
			NodeID:    nodeID,
			Outputs:   string(outputsJSON),
			CreatedAt: now,
			LastHitAt: now,
		})
		return err
	}, nil)
}
```

  7. The `outputs` text column is capped at 65,535 bytes on MySQL, contradicting the headline claim of replacing the 1 MB ConfigMap limit. `persist/sqldb/migrate.go` uses `ByType` with MySQL-specific JSON/LONGTEXT conversion for exactly this reason; this migration does not.

```go
func migrate(ctx context.Context, session db.Session, dbType sqldb.DBType, tableName string) error {
	return sqldb.Migrate(ctx, session, dbType, versionTable, []sqldb.Change{
		sqldb.AnsiSQLChange(`create table if not exists ` + tableName + ` (
			namespace varchar(256) not null,
			cache_name varchar(256) not null,
			key varchar(256) not null,
			node_id text not null,
			outputs text not null,
			created_at timestamp not null,
			last_hit_at timestamp not null,
			primary key (namespace, cache_name, key)
		)`),
		sqldb.AnsiSQLChange(`create index if not exists imemo_namespace_cache on ` + tableName + ` (namespace, cache_name)`),
		sqldb.AnsiSQLChange(`create index if not exists imemo_last_hit_at on ` + tableName + ` (last_hit_at)`),
	})
}
```

  8. Every `Load` (cache hit) opens a transaction and issues a SELECT plus an UPDATE to refresh `last_hit_at`. With the stated scale motivation ("thousands of memoized nodes"), every cache read becomes a write with transaction overhead; consider coalescing or debouncing `last_hit_at` updates.

```go
// Load retrieves the outputs for the given key and updates last_hit_at. Returns nil when the entry
// does not exist.
func (q *Queries) Load(ctx context.Context, sp *sqldb.SessionProxy, namespace, cacheName, key string) (*CacheRecord, error) {
	var record *CacheRecord
	err := sp.TxWith(ctx, func(txProxy *sqldb.SessionProxy) error {
		sess := txProxy.Session()
		var r CacheRecord
		err := sess.Collection(q.tableName).
			Find(db.Cond{
				"namespace":  namespace,
				"cache_name": cacheName,
				"key":        key,
			}).
			One(&r)
		if errors.Is(err, db.ErrNoMoreRows) {
			return nil
		}
		if err != nil {
			return err
		}
		r.LastHitAt = time.Now()
		_, err = sess.SQL().
			Update(q.tableName).
			Set("last_hit_at", r.LastHitAt).
			Where(db.Cond{
				"namespace":  namespace,
				"cache_name": cacheName,
				"key":        key,
			}).
			Exec()
		if err != nil {
			return err
		}
		record = &r
		return nil
	}, nil)
	return record, err
}
```

  9. `cf.namespace` is the controller's install namespace (from `NewCacheFactory(kubeclient, ns)`), not the workflow's namespace. Every row's `namespace` column has the same value, so the column is non-partitioning and the `(namespace, cache_name, key)` primary key misleads about multi-tenant isolation.

```go
type cacheFactory struct {
	caches       map[string]MemoizationCache
	kubeclient   kubernetes.Interface
	namespace    string
	lock         sync.RWMutex
	sessionProxy *sqldb.SessionProxy
	tableName    string
}

type Factory interface {
	GetCache(ct Type, name string) MemoizationCache
	// SetSessionProxy configures the factory to use database-backed caching. Calling this clears
	// any previously created cache instances so they are recreated against the new backend.
	SetSessionProxy(sp *sqldb.SessionProxy, tableName string)
}

func NewCacheFactory(ki kubernetes.Interface, ns string) Factory {
	return &cacheFactory{
		caches:     make(map[string]MemoizationCache),
		kubeclient: ki,
		namespace:  ns,
	}
}
```

  1. Stale comment // Only config maps are currently supported for caching directly above ConfigMapCache. After this PR, GetCache(ConfigMapCache, name) routes to SQL when sessionProxy != nil; the Type enum is functionally dead and the comment is false.

type Type string

const (
	// Only config maps are currently supported for caching
	ConfigMapCache Type = "ConfigMapCache"
)

// SetSessionProxy configures the factory to use a database session for new caches, clearing any
// previously cached instances.
func (cf *cacheFactory) SetSessionProxy(sp *sqldb.SessionProxy, tableName string) {
	cf.lock.Lock()
	defer cf.lock.Unlock()
	cf.sessionProxy = sp
	cf.tableName = tableName
	cf.caches = make(map[string]MemoizationCache)
}

// Returns a cache if it exists and creates it otherwise
func (cf *cacheFactory) GetCache(ct Type, name string) MemoizationCache {
	cf.lock.RLock()
	idx := string(ct) + "." + name
	if c := cf.caches[idx]; c != nil {
		cf.lock.RUnlock()
		return c
	}
	cf.lock.RUnlock()
	cf.lock.Lock()
	defer cf.lock.Unlock()
	if c := cf.caches[idx]; c != nil {
		return c
	}
	switch ct {
	case ConfigMapCache:
		var c MemoizationCache
		if cf.sessionProxy != nil {
			c = newSQLDBCache(cf.namespace, name, cf.sessionProxy, cf.tableName)
		} else {
			c = NewConfigMapCache(cf.namespace, cf.kubeclient, name)
		}
		cf.caches[idx] = c
		return c
	default:
		return nil
	}
}

  1. .features/pending/sql-memoization-cache.md has Issues: 15938 — the PR number itself. hack/featuregen generates a link to /issues/15938 which GitHub resolves to the PR, not a feature-request issue. Convention in .features/released/** is to reference the originating issue.

Description: Add SQL database-backed memoization cache as an alternative to ConfigMaps.
Authors: [droctothorpe](https://github.com/droctothorpe)
Component: General
Issues: 15938
Memoization can now store cache entries in a PostgreSQL or MySQL database instead of Kubernetes ConfigMaps.
The SQL backend removes the 1 MB ConfigMap size limit and persists cache entries across cluster restarts.
ConfigMaps remain the default; opt in by adding a `memoization` block to the `workflow-controller-configmap`.
A `cacheTTL` field controls how long unaccessed entries are retained (default: `2160h` / 90 days).
No changes to workflow specs are required.

  1. _ = db.ErrNoMoreRows // keep import used is dead code; db from upper/db/v4 is otherwise unused in the e2e test and the import can simply be removed.

	s.NotEmpty(outputs)
	_ = db.ErrNoMoreRows // keep import used
}

  1. MEMO_CACHE_GC_PERIOD is out of alphabetical order in the Controller env-var table (placed between CACHE_GC_AFTER_NOT_HIT_DURATION and CRON_SYNC_PERIOD instead of after MAX_OPERATION_TIME).

| `CACHE_GC_PERIOD` | `time.Duration` | `0s` | How often to perform memoization cache GC, which is disabled by default and can be enabled by providing a non-zero duration. |
| `CACHE_GC_AFTER_NOT_HIT_DURATION` | `time.Duration` | `30s` | When a memoization cache has not been hit after this duration, it will be deleted. |
| `MEMO_CACHE_GC_PERIOD` | `time.Duration` | `24h` | How often the SQL-backed memoization cache garbage collector runs to prune entries that have exceeded their TTL. |
| `CRON_SYNC_PERIOD` | `time.Duration` | `10s` | How often to sync cron workflows. |
| `DEFAULT_REQUEUE_TIME` | `time.Duration` | `10s` | The re-queue time for the rate limiter of the workflow queue. |

  1. Docs say `cacheTTL: "0"` disables GC, but code treats ttl == 0 as "use default 90 days" — only ttl < 0 disables. Users following the documented path get 90-day pruning instead of disabled GC.

	}
	const defaultCacheTTL = 90 * 24 * time.Hour // 3 months
	ttl := time.Duration(memoCfg.CacheTTL)
	if ttl == 0 {
		ttl = defaultCacheTTL
	}
	if ttl < 0 {
		logger.Info(ctx, "Memoization cache TTL is negative - cache GC disabled")
		return
	}

The `cacheTTL` field controls how long cache entries that have not been accessed are retained.
It defaults to `"2160h"` (90 days). Set to `"0"` to disable automatic pruning.
!!! Note

  1. Save does DELETE-then-INSERT inside a transaction instead of UPSERT (ON CONFLICT DO UPDATE / ON DUPLICATE KEY UPDATE). Two concurrent Saves to the same (ns, cache_name, key) can both DELETE (seeing empty snapshot) and both INSERT, producing a PK-violation on the second.

	return n, err
}

func (q *Queries) Save(ctx context.Context, sp *sqldb.SessionProxy, namespace, cacheName, key, nodeID string, outputs *wfv1.Outputs) error {
	outputsJSON, err := json.Marshal(outputs)
	if err != nil {
		return fmt.Errorf("unable to marshal memoization outputs: %w", err)
	}
	now := time.Now()
	return sp.TxWith(ctx, func(txProxy *sqldb.SessionProxy) error {
		sess := txProxy.Session()
		_, err := sess.SQL().
			DeleteFrom(q.tableName).
			Where(db.Cond{
				"namespace":  namespace,
				"cache_name": cacheName,
				"key":        key,
			}).
			Exec()
		if err != nil {
			return err
		}
		_, err = sess.Collection(q.tableName).Insert(&CacheRecord{
			Namespace: namespace,
			CacheName: cacheName,
			Key:       key,
			NodeID:    nodeID,
			Outputs:   string(outputsJSON),
			CreatedAt: now,
			LastHitAt: now,
		})
		return err
	}, nil)
}
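A dialect-aware upsert closes the DELETE+INSERT window entirely: the primary key `(namespace, cache_name, key)` arbitrates concurrent writers at the database level. A sketch of the statement construction only — the helper and its dialect strings are hypothetical, not the PR's code:

```go
package main

import "fmt"

// upsertSQL builds an atomic insert-or-update statement for the cache
// table. The composite primary key resolves conflicts, so two
// concurrent Saves can never both observe an empty snapshot.
func upsertSQL(dialect, table string) string {
	switch dialect {
	case "postgres":
		return fmt.Sprintf(`insert into %s (namespace, cache_name, "key", node_id, outputs, created_at, last_hit_at)
values (?, ?, ?, ?, ?, ?, ?)
on conflict (namespace, cache_name, "key")
do update set node_id = excluded.node_id, outputs = excluded.outputs, last_hit_at = excluded.last_hit_at`, table)
	case "mysql":
		return fmt.Sprintf("insert into %s (namespace, cache_name, `key`, node_id, outputs, created_at, last_hit_at)\nvalues (?, ?, ?, ?, ?, ?, ?)\non duplicate key update node_id = values(node_id), outputs = values(outputs), last_hit_at = values(last_hit_at)", table)
	}
	return ""
}

func main() {
	fmt.Println(upsertSQL("postgres", "cache_entries"))
}
```

Note that `created_at` is deliberately left out of the update clause, so the original creation timestamp survives a re-save.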

  1. Prune accepts ctx context.Context but never propagates it — .Exec() is called without context binding, so controller shutdown cannot cancel a long DELETE. Load/Save plumb ctx via TxWith.

// Prune deletes cache entries whose last_hit_at is older than maxAge. It is called periodically
// by the controller to bound the size of the memoization_cache table.
func (q *Queries) Prune(ctx context.Context, sp *sqldb.SessionProxy, maxAge time.Duration) (int64, error) {
	cutoff := time.Now().Add(-maxAge)
	result, err := sp.Session().SQL().
		DeleteFrom(q.tableName).
		Where("last_hit_at < ?", cutoff).
		Exec()
	if err != nil {
		return 0, err
	}
	n, err := result.RowsAffected()
	return n, err
}

  1. SessionProxyFromConfig omits sqldb.ConfigureDBSession, so MemoizationConfig.DBConfig.ConnectionPool is silently ignored — connection pool settings have no effect. Persistence calls ConfigureDBSession in workflow/controller/config.go for exactly this.

// SessionProxyFromConfig creates a SessionProxy from a MemoizationConfig, returning nil and logging
// an error if the connection cannot be established. Callers that receive nil should fall back to
// ConfigMap-based caching.
func SessionProxyFromConfig(ctx context.Context, kubectlConfig kubernetes.Interface, namespace string, cfg *config.MemoizationConfig) *sqldb.SessionProxy {
	if cfg == nil {
		return nil
	}
	sessionProxy, err := sqldb.NewSessionProxy(ctx, sqldb.SessionProxyConfig{
		KubectlConfig: kubectlConfig,
		Namespace:     namespace,
		DBConfig:      cfg.DBConfig,
	})
	if err != nil {
		log := logging.RequireLoggerFromContext(ctx)
		log.WithError(err).Error(ctx, "unable to create memoization database connection")
		return nil
	}
	return sessionProxy
}

  1. Column named key (a MySQL reserved word) and raw-SQL test queries using WHERE key = ? without backticks will fail on MySQL.

sqldb.AnsiSQLChange(`create table if not exists ` + tableName + ` (
    namespace varchar(256) not null,
    cache_name varchar(256) not null,
    key varchar(256) not null,
    node_id text not null,
    outputs text not null,
    created_at timestamp not null,
    last_hit_at timestamp not null,
    primary key (namespace, cache_name, key)
)`),
sqldb.AnsiSQLChange(`create index if not exists imemo_namespace_cache on ` + tableName + ` (namespace, cache_name)`),
sqldb.AnsiSQLChange(`create index if not exists imemo_last_hit_at on ` + tableName + ` (last_hit_at)`),
})
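Since `key` is reserved in MySQL, portable DDL and DML need per-dialect identifier quoting. A small hypothetical helper illustrating the two quoting styles (not the PR's code — MySQL uses backticks, ANSI SQL and Postgres use double quotes):

```go
package main

import "fmt"

// quoteIdent wraps a column or table name in the dialect's identifier
// quotes so reserved words like "key" parse on both backends.
func quoteIdent(dialect, ident string) string {
	switch dialect {
	case "mysql":
		return "`" + ident + "`" // MySQL identifier quoting
	default:
		return `"` + ident + `"` // ANSI / Postgres identifier quoting
	}
}

func main() {
	fmt.Printf("select * from cache_entries where %s = ?\n", quoteIdent("mysql", "key"))
	fmt.Printf("select * from cache_entries where %s = ?\n", quoteIdent("postgres", "key"))
}
```

An alternative that avoids the problem at the schema level is simply renaming the column (e.g. `cache_key`), which the manual-validation output later in this thread suggests is the direction the PR eventually took.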

🤖 Generated with Claude Code


@droctothorpe (Contributor Author) commented Apr 17, 2026

All 18 issues from the review have been addressed in 93b5b19:

| # | Issue | Solution |
|---|-------|----------|
| 1 | `tableName` nested inside `postgresql:`/`mysql:` blocks is silently ignored; code reads top-level `MemoizationConfig.TableName` | Moved `tableName` to the top-level `memoization:` block in docs, overlay manifest, and quick-start manifest |
| 2 | `MEMO_CACHE_GC_PERIOD=0s` causes a `time.NewTicker(0)` panic | Added a `periodicity <= 0` guard before `NewTicker` |
| 3 | `Prune` calls `sp.Session()` directly, bypassing reconnect-on-error | Changed `Prune` to use `sp.With(ctx, ...)` |
| 4 | Removing the `memoization` config block at runtime doesn't clear SQL routing | Added an else branch to call `SetSessionProxy(nil, "")` and nil out `memoSessionProxy` |
| 5 | `Migrate` swallows errors; caller proceeds to `SetSessionProxy` | Changed `Migrate` to return an error; caller falls back to ConfigMap on failure |
| 6 | Outputs stored as raw text without null-byte handling break Postgres | Added `postgresNullReplacement` handling in `Save`/`Load` (same pattern as `workflow_archive.go`) |
| 7 | MySQL `text` column is capped at 64 KB | Used `sqldb.ByType` to split the migration: `text` for Postgres, `longtext` for MySQL |
| 8 | Every `Load` opens a transaction with SELECT + UPDATE for `last_hit_at` | Debounced: `last_hit_at` is only updated if older than 1 hour; reads are now transaction-free |
| 9 | `namespace` column is always the controller install namespace, not the workflow namespace | Added a clarifying comment on the `namespace` field documenting this behavior |
| 10 | Stale comment "Only config maps are currently supported for caching" | Updated the comment to reflect SQL-backed caching support |
| 11 | `.features/pending/` file has `Issues: 15938` (the PR number, not a feature-request issue) | Changed to `PRs: 15938` |
| 12 | `_ = db.ErrNoMoreRows` dead code keeping an unused import alive | Removed the import and the workaround line |
| 13 | `MEMO_CACHE_GC_PERIOD` out of alphabetical order in the env-var docs table | Moved it to after `MAX_OPERATION_TIME` |
| 14 | Docs say `cacheTTL: "0"` disables GC, but code treats `0` as the default 90 days | Fixed all docs: `"0"` uses the default; a negative value disables GC |
| 15 | `Save` DELETE+INSERT race: concurrent Saves can both DELETE then both INSERT, causing a PK violation | Replaced with `ON CONFLICT DO UPDATE` (Postgres) / `ON DUPLICATE KEY UPDATE` (MySQL) |
| 16 | `Prune` doesn't propagate context; shutdown can't cancel a long DELETE | Now uses `sp.With(ctx, ...)`, which propagates context |
| 17 | `SessionProxyFromConfig` omits `ConfigureDBSession`; connection pool settings are silently ignored | Added a `sqldb.ConfigureDBSession` call after creating the session proxy |
| 18 | Column `key` is a MySQL reserved word; raw SQL queries without quoting fail on MySQL | Used backtick-quoting for MySQL, double-quoting for Postgres in migration DDL and upsert SQL |

@droctothorpe droctothorpe force-pushed the main branch 4 times, most recently from c9f9395 to c2d4d16 Compare April 19, 2026 04:56
@droctothorpe droctothorpe force-pushed the main branch 2 times, most recently from e09b1c8 to 66573a9 Compare April 19, 2026 11:47
@droctothorpe (Contributor Author) commented Apr 19, 2026

Some important design questions that we should probably discuss:

  • We see significant pressure on the pod used for workflow archives as workflow quantity and complexity increases. Does it make sense to provide a standalone database for the cache to decouple scaling considerations and reduce the blast radius?
  • How do you feel about the blank memoize: {} interface? The goal is to make it as turnkey as possible for end users ofc, but maybe there's a better way to design the UX.
  • How do you feel about 3 months as the default TTL for cache entries?

Comment thread docs/workflow-controller-configmap.md Outdated
@Joibel (Member) commented Apr 20, 2026

* We see significant pressure on the pod used for workflow archives as workflow quantity and complexity increases. Does it make sense to provide a standalone database for the cache to decouple scaling considerations and reduce the blast radius?

We've already got that here, haven't we? Well, you can use the same one, or a different one, depending upon how you configure it.

* How do you feel about the blank `memoize: {}` interface? The goal is to make it as turnkey as possible for end users ofc, but maybe there's a better way to design the UX.

You appear to have implemented multiple features here. It's big enough and AI enough as is without automatic key generation - please don't do that in this PR.

* How do you feel about 3 months as the default TTL for cache entries?

It's way bigger than I'd expect. We already have maxAge, which isn't perfect. Again, split this into a separate feature if you'd like TTL. It should apply to any backend equally and can be reasoned about separately.

@droctothorpe droctothorpe force-pushed the main branch 3 times, most recently from 60c89c4 to bb0f4fb Compare April 23, 2026 15:30
droctothorpe added a commit to droctothorpe/argo-workflows that referenced this pull request Apr 24, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
droctothorpe added a commit to droctothorpe/argo-workflows that referenced this pull request Apr 26, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
droctothorpe added a commit to droctothorpe/argo-workflows that referenced this pull request Apr 27, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
droctothorpe and others added 11 commits May 13, 2026 21:49
Add a SQL database-backed memoization cache alongside the existing
ConfigMap-based implementation. This includes the database query and
migration layer, controller wiring, auto-derived cache keys based on
inputs and executor fingerprinting, per-entry expiration handling, and
workflow-namespace scoping for cache entries.

The branch also includes the supporting artifact checksum and ETag
plumbing, generated API/docs updates, and follow-up fixes for error
handling, nil-safety, and CI/documentation issues discovered during
review.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
passwordSecret:
  name: argo-postgres-config
  key: password
memoization: |
@droctothorpe (Contributor Author):
Added memoization to quick-start-postgres in order to run the e2e sqldb_memoize_test.go test. Let me know if you prefer for that to happen in a dedicated overlay, e.g. quick-start-postgres-with-caching or something to that effect.

Member:
I'd like to e2e test this on both postgres and mysql. Happy to leave it configured here though, but needs duplicating for mysql.

Comment thread test/e2e/sqldb_memoize_test.go Outdated
}

func (s *SQLDBMemoizeSuite) TestSQLDBMemoize() {
if s.Config.Memoization == nil {
Member:
Can we drop this - always test the feature. Dangerous to have the only test for a feature accidentally stop running.

@droctothorpe (Contributor Author) commented:

Manual validation evidence

Validate ConfigMap

I deployed AWF on my local orbstack-backed K8s cluster with the changes in this PR.

Setup port-forwarding:

kubectl -n argo port-forward svc/argo-server 2746:2746

I added the create and update verbs to the configmaps resource in the argo-cluster-role ClusterRole.

I ran the following e2e test:

GOFLAGS='-mod=mod' go test ./test/e2e -tags corefunctional -run 'TestFunctionalSuite/TestStepLevelMemoize$' -count=1 -v

It succeeded:

...
=== RUN   TestFunctionalSuite
=== RUN   TestFunctionalSuite/TestStepLevelMemoize
Submitting workflow  steps-memoize-
Waiting up to 1m0s for workflow with field selector 'metadata.name=steps-memoize-cxvds' and label selector 'workflows.argoproj.io/test'
 ? steps-memoize-cxvds Workflow 0s      

 ● steps-memoize-cxvds   Workflow  0s      
 └ ● steps-memoize-cxvds Steps     0s      
 └ ◷ hello1              Steps     0s      
 └ ● [0]                 StepGroup 0s      
 └ ◷ cache               Pod       0s      
 └ ● [0]                 StepGroup 0s      

 ✔ steps-memoize-cxvds   Workflow  12s     
 └ ✔ steps-memoize-cxvds Steps     12s     
 └ ✔ hello1              Steps     12s     
 └ ✔ [0]                 StepGroup 12s     
 └ ✔ cache               Pod       7s      
 └ ✔ [0]                 StepGroup 12s     
 └ ✔ hello2a             Steps     0s      
 └ ✔ [1]                 StepGroup 0s      

Condition "to be Succeeded" met after 11s
Checking expectation steps-memoize-cxvds
steps-memoize-cxvds : Succeeded 
=== PASS: FunctionalSuite/TestStepLevelMemoize
--- PASS: TestFunctionalSuite (12.24s)
    --- PASS: TestFunctionalSuite/TestStepLevelMemoize (11.95s)
PASS
ok      github.com/argoproj/argo-workflows/v4/test/e2e  14.556s

I verified that a CM was created:

Name:         my-config-memo-step
Namespace:    argo
Labels:       workflows.argoproj.io/configmap-type=Cache
Annotations:  <none>

Data
====
hello1:
----
{"nodeID":"steps-memoize-cxvds-1782520801","outputs":{"parameters":[{"name":"output","value":"hello1"}]},"creationTimestamp":"2026-05-15T02:07:24Z","lastHitTimestamp":"2026-05-15T02:07:24Z"}


BinaryData
====

Events:  <none>

Validate SQL-backed approach

Now that I've verified the default, CM-based approach works as expected, I moved on to the SQL-based approach.

I enabled the SQL-backed approach:

kubectl apply --server-side --force-conflicts \
-f manifests/cluster-install-no-crds/workflow-controller-rbac/workflow-controller-clusterrole.yaml

make install PROFILE=postgres

kubectl -n argo scale deploy/workflow-controller deploy/argo-server --replicas=1
kubectl -n argo set env deploy/argo-server CI_ONLY_DISABLE_ARTIFACT_SERVER_CHECKS=true

kubectl -n argo rollout status deploy/postgres --timeout=300s
kubectl -n argo rollout status deploy/workflow-controller --timeout=300s
kubectl -n argo rollout status deploy/argo-server --timeout=300s

I validated that the sql memoization was enabled in the config by running:

kubectl -n argo get cm workflow-controller-configmap -o jsonpath='{.data.memoization}'; echo
kubectl -n argo logs deploy/workflow-controller --tail=100 | egrep 'Memoization database configuration enabled|Running memoization db migration'

The output was:

❯ kubectl -n argo get cm workflow-controller-configmap -o jsonpath='{.data.memoization}'; echo

tableName: cache_entries
postgresql:
  host: postgres
  port: 5432
  database: postgres
  userNameSecret:
    name: argo-postgres-config
    key: username
  passwordSecret:
    name: argo-postgres-config
    key: password


❯ kubectl -n argo logs deploy/workflow-controller --tail=100 | egrep 'Memoization database configuration enabled|Running memoization db migration'
time=2026-05-15T02:19:06.924Z level=INFO msg="Memoization database configuration enabled" component=configmap_watcher
time=2026-05-15T02:19:06.924Z level=INFO msg="Running memoization db migration" component=configmap_watcher

I port-forwarded the postgres service to local:

kubectl -n argo port-forward svc/postgres 5432:5432 --address localhost

I then kicked off the same e2e workflow as before:

GOFLAGS='-mod=mod' go test ./test/e2e -tags sqldbmemoize -run '^TestSQLDBMemoizeSuite$' -count=1 -v

It succeeded:

❯ argo-workflows main* ⇡ GOFLAGS='-mod=mod' go test ./test/e2e -tags sqldbmemoize -run '^TestSQLDBMemoizeSuite$' -count=1 -v
=== RUN   TestSQLDBMemoizeSuite
=== RUN   TestSQLDBMemoizeSuite/TestSQLDBMemoize
Submitting workflow  sqldb-memoize-
Waiting up to 1m0s for workflow with field selector 'metadata.name=sqldb-memoize-cjrfk' and label selector 'workflows.argoproj.io/test'
 ? sqldb-memoize-cjrfk Workflow 0s      

 ● sqldb-memoize-cjrfk   Workflow  0s      
 └ ● sqldb-memoize-cjrfk Steps     0s      
 └ ◷ run1                Pod       0s      
 └ ● [0]                 StepGroup 0s      

 ✔ sqldb-memoize-cjrfk   Workflow  10s     
 └ ✔ sqldb-memoize-cjrfk Steps     10s     
 └ ✔ run1                Pod       6s      
 └ ✔ [0]                 StepGroup 10s     
 └ ✔ [1]                 StepGroup 0s      
 └ ✔ run2                Pod       0s      

 ✔ sqldb-memoize-cjrfk   Workflow  10s     
 └ ✔ sqldb-memoize-cjrfk Steps     10s     
 └ ✔ run1                Pod       6s      
 └ ✔ [0]                 StepGroup 10s     
 └ ✔ [1]                 StepGroup 0s      
 └ ✔ run2                Pod       0s      

Condition "to be Succeeded" met after 10s
Checking expectation sqldb-memoize-cjrfk
sqldb-memoize-cjrfk : Succeeded 
=== PASS: SQLDBMemoizeSuite/TestSQLDBMemoize
--- PASS: TestSQLDBMemoizeSuite (12.18s)
    --- PASS: TestSQLDBMemoizeSuite/TestSQLDBMemoize (12.02s)
PASS
ok      github.com/argoproj/argo-workflows/v4/test/e2e  12.817s

I ran the following command to verify that the cache key is stored in the postgres database:

kubectl -n argo exec svc/postgres -- \
   psql -U postgres -d postgres \
   -c "SELECT namespace, cache_name, cache_key, created_at, expires_at FROM cache_entries ORDER BY created_at DESC LIMIT 20;"

Here's the output:

 namespace |    cache_name    |            cache_key            |        created_at         |        expires_at         
-----------+------------------+---------------------------------+---------------------------+---------------------------
 argo      | sqldb-memo-cache | hello-sqldb-1778812065286276000 | 2026-05-15 02:27:55.93877 | 2026-05-15 02:37:55.93877
(1 row)

@droctothorpe (Contributor Author) commented:

/retest
