
fix(runners): wire job_retry.lambda_memory_size and lambda_timeout #5120

Open

oscarbc96 wants to merge 1 commit into github-aws-runners:main from oscarbc96:fix/job-retry-lambda-memory-and-timeout

Conversation

oscarbc96 commented May 10, 2026

Description

var.job_retry in both modules/multi-runner/variables.tf and modules/runners/variables.tf declares lambda_memory_size and lambda_timeout as documented configuration fields, but local.job_retry in modules/runners/job-retry.tf never copies either field into the config map passed to the inner job-retry / lambda sub-modules. The inner lambda module then falls back to its own defaults (memory_size = 256, timeout = 60), so user-supplied values are silently dropped: tofu plan shows no diff and the running Lambda keeps its defaults.

The fix is a two-line addition to the local.job_retry map (see the sketch below). It mirrors the pattern modules/runners/ssm-housekeeper.tf already uses for local.ssm_housekeeper.lambda_memory_size / local.ssm_housekeeper.lambda_timeout; that Lambda correctly threads the values through.
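
For concreteness, here is a sketch of the change in modules/runners/job-retry.tf. The surrounding keys are assumptions based on the description above, not the file's exact contents:

locals {
  job_retry = {
    # … existing keys (enable, queue config, etc.) — placeholders here …

    # The two-line addition: copy the user-supplied values through
    lambda_memory_size = var.job_retry.lambda_memory_size
    lambda_timeout     = var.job_retry.lambda_timeout
  }
}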

Motivation

Discovered in production: I pinned lambda_memory_size = 512 in multi_runner_config[*].runner_config.job_retry after observing the job-retry Lambdas at 87% memory utilisation (223 MB peak on the 256 MB default), and tofu plan reported "No changes". Tracing the wiring confirmed the value never reaches the resource.

Reproduction

module "runners" {
  source  = "github-aws-runners/github-runner/aws//modules/multi-runner"
  version = "7.6.0"
  # …
  multi_runner_config = {
    "example" = {
      matcherConfig = { … }
      runner_config = merge(local.default_config, {
        # … other config …
        job_retry = {
          enable             = true
          lambda_memory_size = 512  # ← silently ignored before this PR
          lambda_timeout     = 60   # ← silently ignored before this PR
        }
      })
    }
  }
}

After this fix, tofu plan shows the expected memory_size: 256 -> 512 change on the job-retry Lambda.
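
An illustrative excerpt of that plan output; the resource address and name here are assumptions, and the exact path depends on the module internals:

  # module.runners.module.job_retry[0].aws_lambda_function.this will be updated in-place
  ~ resource "aws_lambda_function" "this" {
      ~ memory_size = 256 -> 512
        # (other attributes unchanged)
    }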

Verification

  • tofu fmt clean.
  • The variable type definition in both modules/runners/variables.tf and modules/multi-runner/variables.tf already declares lambda_memory_size = optional(number, 256) and lambda_timeout = optional(number, 30), so the public surface is unchanged (see the sketch after this list).
  • The inner modules/runners/modules/lambda accepts memory_size and timeout on its lambda input object (with the same defaults), so when the wiring is restored the values flow through naturally.
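
For reference, the relevant slice of the var.job_retry object type, abridged to the fields this PR touches (the enable default shown is an assumption):

variable "job_retry" {
  type = object({
    enable             = optional(bool, false)
    lambda_memory_size = optional(number, 256)
    lambda_timeout     = optional(number, 30)
    # … other fields elided …
  })
}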

Behaviour when not set

memory_size stays at 256 either way: the variable default matches the inner module's previous fallback. For timeout, the variable declares a default of 30 while the inner module previously fell back to 60, so once the value is wired through, deployments that never set job_retry.lambda_timeout will see the effective timeout move from 60 to 30.

The job_retry variable on both the multi-runner and runners modules
declares lambda_memory_size and lambda_timeout, but the
local.job_retry map in modules/runners/job-retry.tf never copied
either field into the config passed to the inner job-retry / lambda
sub-modules. The inner lambda module fell back to its defaults
(memory_size = 256, timeout = 60), so user-supplied values were
silently dropped.

Mirrors the pattern already used by ssm-housekeeper.tf
(local.ssm_housekeeper.lambda_memory_size /
local.ssm_housekeeper.lambda_timeout) — the ssm-housekeeper Lambda
correctly threads the values through; the job-retry one didn't.

Observed in production: a deployment pinned to lambda_memory_size = 512
in multi_runner_config[*].runner_config.job_retry produced no plan
diff because the value never reached the resource. The job-retry
Lambdas were OOM-adjacent at 87% memory utilisation (223 MB peak on
the 256 MB default) on a fleet of three runners.
@oscarbc96 oscarbc96 requested a review from a team as a code owner May 10, 2026 10:43
