[Bug] HeadNode checks all compute nodes when adding queue, causing timeout with running jobs #7203

@almightychang

Required Info:

  • AWS ParallelCluster version: 3.14.0
  • Cluster name: pcluster-prod
  • Region: us-east-2
  • Output of pcluster describe-cluster command:
{
  "clusterName": "pcluster-prod",
  "version": "3.14.0",
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Bug description and how to reproduce:

When adding a new SLURM queue to an existing ParallelCluster cluster, the cluster update fails with a HeadNodeWaitCondition timeout after 35 minutes. The root cause is that the HeadNode readiness check validates cluster_config_version for all existing compute nodes, not just the nodes in the newly added queue.

Root Cause

The HeadNode runs /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py, which calls check_deployed_config_version(). This function queries every running compute and login node in the cluster:

for instance_ids in list_cluster_instance_ids_iterator(
    cluster_name=cluster_name,
    node_type=["Compute", "LoginNode"],  # ← Checks ALL nodes
    instance_state=["running"],
    region=region,
):

Problem: When adding a new queue:

  • New config version is generated (e.g., iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr)
  • Existing nodes in other queues still have old version (e.g., x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX)
  • These nodes have running jobs (18+ hours of compute) and were never updated
  • QueueUpdateStrategy: TERMINATE does NOT terminate these nodes, because it applies only to modified queues and not to queue additions
  • Result: Version mismatch → readiness check fails → timeout → rollback
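
To make the failure mode above concrete, here is a minimal sketch of how a cluster-wide version comparison ends up flagging untouched nodes. It is not the actual check_cluster_ready.py code: the function name classify_records is hypothetical, and the meaning of the "missing" and "incomplete" buckets is an assumption inferred from the labels in the error output.

```python
from typing import Dict, List, Optional, Tuple


def classify_records(
    running_instance_ids: List[str],
    deployed: Dict[str, Optional[str]],
    expected_version: str,
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Bucket every running instance by the config version it reports.

    Assumed semantics (not confirmed by the source):
      missing    -> no record for the instance at all
      incomplete -> a record exists but carries no version
      wrong      -> a version is recorded but differs from the expected one
    """
    missing: List[str] = []
    incomplete: List[str] = []
    wrong: List[Tuple[str, str]] = []
    for instance_id in running_instance_ids:
        if instance_id not in deployed:
            missing.append(instance_id)
            continue
        version = deployed[instance_id]
        if version is None:
            incomplete.append(instance_id)
        elif version != expected_version:
            wrong.append((instance_id, version))
    return missing, incomplete, wrong


if __name__ == "__main__":
    old = "x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX"
    new = "iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"
    # A node in an untouched queue still reports the old version.
    print(classify_records(["i-0cb03846fa80df86f"], {"i-0cb03846fa80df86f": old}, new))
```

Because the comparison runs over every running instance, a node whose queue was never modified still lands in the "wrong" bucket as soon as the global version changes.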

Steps to Reproduce

  1. Create a cluster with existing queues that have running compute nodes:
SlurmQueues:
  - Name: existing-queue
    ComputeResources:
      - Name: compute-resource
        InstanceType: p5e.48xlarge
        MinCount: 2
        MaxCount: 10

  2. Submit long-running jobs (12+ hours) to existing queue

  3. While jobs are running, add a NEW queue to config.yaml:

SlurmQueues:
  # ... existing queues unchanged ...
  - Name: new-queue  # ← NEW QUEUE
    CapacityType: ONDEMAND
    ComputeResources:
      - Name: new-compute
        InstanceType: g6e.12xlarge
        MinCount: 0
        MaxCount: 4

  4. Run pcluster update-cluster --cluster-name pcluster-prod --cluster-configuration config.yaml

  5. Observe:

    • CloudFormation ComputeFleet update succeeds (new queue created)
    • HeadNode readiness check fails with:
      CheckFailedError: Check failed due to the following erroneous records:
        * wrong records (7): [
            ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
            ...
          ]
      
    • After 35 minutes: HeadNodeWaitCondition timeout
    • CloudFormation rolls back to previous version

Expected Behavior

When adding a new queue to an existing cluster:

  1. New queue should be added without affecting existing queues
  2. Existing compute nodes should NOT need to be updated
  3. Config version check should either:
    • Only check nodes in modified queues, OR
    • Skip version check when only adding queues

Actual Behavior

  • HeadNode checks config_version for all compute nodes regardless of which queues changed
  • Adding a queue changes the global config version
  • Existing nodes retain old config version since they weren't updated
  • Readiness check fails → timeout → rollback

Impact

  • Cannot add new queues to a cluster with running jobs
  • Must wait for all jobs to complete or manually terminate them
  • Defeats the purpose of QueueUpdateStrategy: DRAIN, which is meant to allow updates without interrupting jobs
  • Prevents elastic capacity scaling during production workloads

Evidence

From HeadNode /var/log/chef-client.log:

File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
  raise CheckFailedError(
common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
  * missing records (0): []
  * incomplete records (0): []
  * wrong records (7): [
      ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0636cf08ef148ccc9', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-06c7a230de8c7e2a1', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0ed7b4f23f5e71954', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-03bd69018b354ad0e', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-05c4f92e4e3ed0732', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0ce324d164ac31a1c', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX')
    ]
[2026-01-21T07:45:32+00:00] INFO: Retrying execution of execute[Check cluster readiness], 7 attempts left
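
The instance IDs and stale version in the traceback above can be extracted with a short script instead of being copied by hand. This is a sketch only: parse_wrong_records and stale_versions are hypothetical helper names, and the regular expression assumes records are printed as ('instance-id', 'version') tuples exactly as in this log excerpt.

```python
import re
from typing import List, Set, Tuple

# Matches tuples such as ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX').
RECORD_RE = re.compile(r"\('(i-[0-9a-f]+)',\s*'([^']+)'\)")


def parse_wrong_records(log_text: str) -> List[Tuple[str, str]]:
    """Return (instance_id, deployed_version) pairs found in the log text."""
    return RECORD_RE.findall(log_text)


def stale_versions(records: List[Tuple[str, str]]) -> Set[str]:
    """Return the distinct config versions reported by the flagged instances."""
    return {version for _, version in records}


if __name__ == "__main__":
    # Path taken from the issue; reading it requires access to the HeadNode.
    with open("/var/log/chef-client.log", encoding="utf-8") as handle:
        records = parse_wrong_records(handle.read())
    # The check is retried, so the same instance may appear more than once.
    for instance_id in sorted({instance_id for instance_id, _ in records}):
        print(instance_id)
    print("stale versions:", stale_versions(records))
```

If stale_versions returns more than one value, the nodes are not all on the same old configuration, which is worth investigating before applying any manual fix.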

Affected Nodes (7 nodes across 4 different queues, all with running jobs):

  • Queue alinlab-gpu-2c: 1 node, 2 jobs
  • Queue rlwrld-gpu: 2 nodes, 2 jobs
  • Queue rlwrld-cpu: 1 node, 1 job
  • Queue alinlab-gpu-2b: 3 nodes, 3 jobs
  • Total: 8 running jobs, longest running 18.5 hours

None of these queues were modified - only a new queue was added.

Workaround

Manually update cluster_config_version in DynamoDB for existing nodes:

# 1. Get old and new config versions from HeadNode logs
OLD_VERSION="x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX"
NEW_VERSION="iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"

# 2. List affected instance IDs from error logs
INSTANCE_IDS=(
  i-0cb03846fa80df86f
  i-0636cf08ef148ccc9
  i-06c7a230de8c7e2a1
  i-0ed7b4f23f5e71954
  i-03bd69018b354ad0e
  i-05c4f92e4e3ed0732
  i-0ce324d164ac31a1c
)

# 3. Update config_version in DynamoDB
for instance_id in "${INSTANCE_IDS[@]}"; do
  aws dynamodb update-item \
    --table-name parallelcluster-pcluster-prod \
    --region us-east-2 \
    --key "{\"Id\":{\"S\":\"CLUSTER_CONFIG.$instance_id\"}}" \
    --update-expression "SET #data.#cv = :newver, #data.#time = :time" \
    --expression-attribute-names '{"#data":"Data","#cv":"cluster_config_version","#time":"lastUpdateTime"}' \
    --expression-attribute-values "{\":newver\":{\"S\":\"$NEW_VERSION\"},\":time\":{\"S\":\"$(date -u '+%Y-%m-%d %H:%M:%S UTC')\"}}"
done

# 4. Wait for HeadNode to complete readiness check (~2 minutes)

Result: Cluster update completed successfully, jobs continued running without interruption.

Safety: This workaround is safe when:

  • Only adding new queues (no modifications to existing queues)
  • No CustomActions or IAM policy changes affecting existing nodes
  • Essentially updating metadata only, not actual node configuration
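
One way to make the manual update more defensive is a conditional write, so that an item is only rewritten if it still holds the expected old version. The sketch below builds the keyword arguments for a DynamoDB UpdateItem call with that condition added. The helper name build_update_kwargs is hypothetical, the key format and attribute names are copied from the shell workaround above, and the ConditionExpression is an addition that was not part of the tested workaround.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_update_kwargs(
    table_name: str,
    instance_id: str,
    old_version: str,
    new_version: str,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build UpdateItem arguments that only apply if the old version is still stored."""
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M:%S UTC")
    return {
        "TableName": table_name,
        "Key": {"Id": {"S": f"CLUSTER_CONFIG.{instance_id}"}},
        "UpdateExpression": "SET #data.#cv = :newver, #data.#time = :time",
        # Added guard (assumption): refuse to overwrite an unexpected version.
        "ConditionExpression": "#data.#cv = :oldver",
        "ExpressionAttributeNames": {
            "#data": "Data",
            "#cv": "cluster_config_version",
            "#time": "lastUpdateTime",
        },
        "ExpressionAttributeValues": {
            ":newver": {"S": new_version},
            ":oldver": {"S": old_version},
            ":time": {"S": timestamp},
        },
    }


if __name__ == "__main__":
    kwargs = build_update_kwargs(
        "parallelcluster-pcluster-prod",
        "i-0cb03846fa80df86f",
        "x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX",
        "iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr",
    )
    print(kwargs)
    # To apply it (requires boto3 and credentials):
    #   boto3.client("dynamodb", region_name="us-east-2").update_item(**kwargs)
```

With the condition in place, DynamoDB rejects the write if the stored version is no longer the old one (for example because the node was replaced in the meantime), instead of silently overwriting it.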

Proposed Fix

Option A: Modify check_deployed_config_version() to only check nodes in modified queues

from typing import List, Optional


def check_deployed_config_version(
    cluster_name: str,
    table_name: str,
    expected_config_version: str,
    region: str,
    modified_queues: Optional[List[str]] = None,  # ← NEW PARAMETER
):
    if modified_queues:
        # Only check nodes in the specified queues,
        # filtering by tag:parallelcluster:queue-name
        ...
    else:
        # Check all nodes (backward compatible)
        ...

Option B: Skip version check when only adding queues (no modifications)

def check_deployed_config_version(..., change_type: Optional[str] = None):
    if change_type == 'QUEUE_ADDITION':
        logger.info("Skipping config version check for queue addition")
        return
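
Option B depends on being able to tell that an update only adds queues. A rough sketch of that detection, comparing the SlurmQueues lists of the old and new configuration, could look like the following. The function name is_queue_addition_only is hypothetical, and the sketch looks only at the queue list; a real implementation would also need to confirm that nothing else in the configuration changed.

```python
from typing import Any, Dict, List


def is_queue_addition_only(
    old_queues: List[Dict[str, Any]],
    new_queues: List[Dict[str, Any]],
) -> bool:
    """Return True when new queues were added and every old queue is unchanged."""
    old_by_name = {queue["Name"]: queue for queue in old_queues}
    new_by_name = {queue["Name"]: queue for queue in new_queues}

    added = new_by_name.keys() - old_by_name.keys()
    # A removed queue yields None here, which never equals the old definition.
    existing_untouched = all(
        new_by_name.get(name) == queue for name, queue in old_by_name.items()
    )
    return bool(added) and existing_untouched


if __name__ == "__main__":
    existing = {
        "Name": "existing-queue",
        "ComputeResources": [
            {"Name": "compute-resource", "InstanceType": "p5e.48xlarge",
             "MinCount": 2, "MaxCount": 10},
        ],
    }
    added_queue = {
        "Name": "new-queue",
        "CapacityType": "ONDEMAND",
        "ComputeResources": [
            {"Name": "new-compute", "InstanceType": "g6e.12xlarge",
             "MinCount": 0, "MaxCount": 4},
        ],
    }
    print(is_queue_addition_only([existing], [existing, added_queue]))
```

The comparison is deliberately strict: any edit to an existing queue, or any removal, makes the function return False so that the full version check still runs.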

Option C: Add configuration option to skip readiness check

DevSettings:
  SkipConfigVersionCheckOnQueueAddition: true

Infrastructure Available

ParallelCluster already tags instances with queue names:

# cli/src/pcluster/constants.py
PCLUSTER_QUEUE_NAME_TAG = "parallelcluster:queue-name"

The tag exists but check_cluster_ready.py doesn't use it for filtering.
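
As an illustration of how that tag could drive the filtering proposed in Option A, the sketch below builds an EC2 describe-instances filter list that is optionally narrowed to specific queues. build_instance_filters is a hypothetical helper, not existing ParallelCluster code, and the cluster-name and node-type tag keys are assumptions about how instances are tagged; only the queue-name tag is confirmed above.

```python
from typing import Dict, List, Optional, Sequence

PCLUSTER_QUEUE_NAME_TAG = "parallelcluster:queue-name"
# Assumed tag keys (not quoted in this issue):
PCLUSTER_CLUSTER_NAME_TAG = "parallelcluster:cluster-name"
PCLUSTER_NODE_TYPE_TAG = "parallelcluster:node-type"


def build_instance_filters(
    cluster_name: str,
    node_types: Sequence[str],
    modified_queues: Optional[Sequence[str]] = None,
) -> List[Dict[str, object]]:
    """Build EC2 filters for running cluster instances, optionally limited to some queues."""
    filters: List[Dict[str, object]] = [
        {"Name": f"tag:{PCLUSTER_CLUSTER_NAME_TAG}", "Values": [cluster_name]},
        {"Name": f"tag:{PCLUSTER_NODE_TYPE_TAG}", "Values": list(node_types)},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
    if modified_queues:
        # Narrow the check to nodes that belong to the modified queues only.
        filters.append(
            {"Name": f"tag:{PCLUSTER_QUEUE_NAME_TAG}", "Values": list(modified_queues)}
        )
    return filters


if __name__ == "__main__":
    filters = build_instance_filters("pcluster-prod", ["Compute"], ["new-queue"])
    print(filters)
    # With boto3 installed and credentials configured:
    #   ec2 = boto3.client("ec2", region_name="us-east-2")
    #   pages = ec2.get_paginator("describe_instances").paginate(Filters=filters)
```

Login nodes may not carry a queue-name tag, so a queue-scoped filter like this would probably need to be applied to compute nodes only, with login nodes checked separately.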

Environment Details

  • ParallelCluster Version: 3.14.0
  • Region: us-east-2
  • Scheduler: SLURM
  • QueueUpdateStrategy: TERMINATE (changing to DRAIN doesn't help)
  • Date: 2026-01-21

This appears to be a design issue where the readiness check assumes a monolithic cluster configuration rather than supporting incremental queue additions. A fix would enable truly elastic HPC clusters that can scale capacity without disrupting running workloads.
