[Bug] HeadNode checks all compute nodes when adding queue, causing timeout with running jobs #7203

**Required Info:**
 - AWS ParallelCluster version: **3.14.0**
 - Cluster name: **pcluster-prod**
 - Region: **us-east-2**
 - Output of `pcluster describe-cluster` command:
```json
{
  "clusterName": "pcluster-prod",
  "version": "3.14.0",
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}
```

**Bug description and how to reproduce:**

When adding a new SLURM queue to an existing ParallelCluster cluster, the cluster update fails with a `HeadNodeWaitCondition` timeout after 35 minutes. The root cause is that the HeadNode readiness check validates `cluster_config_version` for **all existing compute nodes**, not just nodes in the newly added queue.

### Root Cause

The HeadNode runs `/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py` which calls `check_deployed_config_version()`. This function queries **all compute/login nodes** in the cluster:

```python
for instance_ids in list_cluster_instance_ids_iterator(
    cluster_name=cluster_name,
    node_type=["Compute", "LoginNode"],  # ← Checks ALL nodes
    instance_state=["running"],
    region=region,
):
    ...  # per-batch version check (loop body elided in this excerpt)
```

**Problem**: When adding a new queue:
- New config version is generated (e.g., `iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr`)
- Existing nodes in **other queues** still have the old version (e.g., `x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX`)
- These nodes have running jobs (18+ hours of compute) and were never updated
- The `TERMINATE` strategy does NOT terminate nodes when only adding queues (it only applies to modified queues)
- Result: Version mismatch → readiness check fails → timeout → rollback
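
The failure mode above can be illustrated with a minimal, self-contained sketch. This is not the code in `check_cluster_ready.py`: the function name `classify_records`, the in-memory `records` mapping (standing in for the DynamoDB lookups), and the second instance ID are assumptions made for illustration. The three buckets mirror the categories printed in the error message.

```python
from typing import Dict, List, Optional, Tuple


def classify_records(
    instance_ids: List[str],
    records: Dict[str, Optional[str]],
    expected_version: str,
) -> Dict[str, List]:
    """Sort instances into missing / incomplete / wrong buckets.

    `records` maps instance ID -> deployed config version. An absent key
    models a missing record; a value of None models a record that exists
    but carries no version.
    """
    missing: List[str] = []
    incomplete: List[str] = []
    wrong: List[Tuple[str, str]] = []
    for instance_id in instance_ids:
        if instance_id not in records:
            missing.append(instance_id)
            continue
        version = records[instance_id]
        if version is None:
            incomplete.append(instance_id)
        elif version != expected_version:
            wrong.append((instance_id, version))
    return {"missing": missing, "incomplete": incomplete, "wrong": wrong}


OLD = "x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX"
NEW = "iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"

# One node from an untouched queue (still on OLD) and one hypothetical
# node launched after the update (already on NEW).
result = classify_records(
    ["i-0cb03846fa80df86f", "i-0000000000000000a"],
    {"i-0cb03846fa80df86f": OLD, "i-0000000000000000a": NEW},
    expected_version=NEW,
)
print(result["wrong"])
```

Because the expected version is global, every running node from an untouched queue lands in the `wrong` bucket, even though nothing about its queue changed.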

### Steps to Reproduce

1. Create a cluster with existing queues that have running compute nodes:
```yaml
SlurmQueues:
  - Name: existing-queue
    ComputeResources:
      - Name: compute-resource
        InstanceType: p5e.48xlarge
        MinCount: 2
        MaxCount: 10
```

2. Submit long-running jobs (12+ hours) to the existing queue

3. While jobs are running, add a NEW queue to config.yaml:
```yaml
SlurmQueues:
  # ... existing queues unchanged ...
  - Name: new-queue  # ← NEW QUEUE
    CapacityType: ONDEMAND
    ComputeResources:
      - Name: new-compute
        InstanceType: g6e.12xlarge
        MinCount: 0
        MaxCount: 4
```

4. Run `pcluster update-cluster --cluster-name pcluster-prod --cluster-configuration config.yaml`

5. Observe:
   - CloudFormation ComputeFleet update succeeds (new queue created)
   - HeadNode readiness check fails with:
     ```
     CheckFailedError: Check failed due to the following erroneous records:
       * wrong records (7): [
           ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
           ...
         ]
     ```
   - After 35 minutes: `HeadNodeWaitCondition` timeout
   - CloudFormation rolls back to previous version

### Expected Behavior

When adding a new queue to an existing cluster:
1. New queue should be added without affecting existing queues
2. Existing compute nodes should NOT need to be updated
3. Config version check should either:
   - Only check nodes in modified queues, OR
   - Skip version check when only adding queues

### Actual Behavior

- HeadNode checks `cluster_config_version` for **all compute nodes** regardless of which queues changed
- Adding a queue changes the global config version
- Existing nodes retain the old config version since they weren't updated
- Readiness check fails → timeout → rollback

### Impact

- **Cannot add new queues** to a cluster with running jobs
- Must wait for all jobs to complete or manually terminate them
- Defeats the purpose of `QueueUpdateStrategy: DRAIN` which is meant to allow updates without interrupting jobs
- Prevents elastic capacity scaling during production workloads

### Evidence

**From HeadNode `/var/log/chef-client.log`:**
```
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
  raise CheckFailedError(
common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
  * missing records (0): []
  * incomplete records (0): []
  * wrong records (7): [
      ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0636cf08ef148ccc9', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-06c7a230de8c7e2a1', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0ed7b4f23f5e71954', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-03bd69018b354ad0e', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-05c4f92e4e3ed0732', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0ce324d164ac31a1c', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX')
    ]
[2026-01-21T07:45:32+00:00] INFO: Retrying execution of execute[Check cluster readiness], 7 attempts left
```
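
To collect the affected instance IDs without copying them by hand, the `wrong records` tuples can be pulled out of the log with a regular expression. The following is a rough sketch, assuming the tuples always appear in the `('i-…', '<version>')` form shown above; the helper name `extract_wrong_records` is made up for this example. Pairs are de-duplicated because the check is retried and the same error is logged several times.

```python
import re
from typing import List, Tuple

# Matches tuples such as ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX')
RECORD_PATTERN = re.compile(r"\('(i-[0-9a-f]+)',\s*'([^']+)'\)")


def extract_wrong_records(log_text: str) -> List[Tuple[str, str]]:
    """Return unique (instance_id, deployed_version) pairs in first-seen order."""
    seen = set()
    pairs: List[Tuple[str, str]] = []
    for match in RECORD_PATTERN.finditer(log_text):
        pair = (match.group(1), match.group(2))
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


sample = """
  * wrong records (7): [
      ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0636cf08ef148ccc9', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
    ]
"""
for instance_id, version in extract_wrong_records(sample):
    print(instance_id, version)
```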

**Affected Nodes** (7 nodes across 4 different queues, all with running jobs):
- Queue `alinlab-gpu-2c`: 1 node, 2 jobs
- Queue `rlwrld-gpu`: 2 nodes, 2 jobs
- Queue `rlwrld-cpu`: 1 node, 1 job
- Queue `alinlab-gpu-2b`: 3 nodes, 3 jobs
- **Total**: 8 running jobs, longest running 18.5 hours

**None of these queues were modified** - only a new queue was added.

### Workaround

Manually update `cluster_config_version` in DynamoDB for existing nodes:

```bash
# 1. Get old and new config versions from HeadNode logs
OLD_VERSION="x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX"
NEW_VERSION="iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"

# 2. List affected instance IDs from error logs
INSTANCE_IDS=(
  i-0cb03846fa80df86f
  i-0636cf08ef148ccc9
  i-06c7a230de8c7e2a1
  i-0ed7b4f23f5e71954
  i-03bd69018b354ad0e
  i-05c4f92e4e3ed0732
  i-0ce324d164ac31a1c
)

# 3. Update config_version in DynamoDB
for instance_id in "${INSTANCE_IDS[@]}"; do
  aws dynamodb update-item \
    --table-name parallelcluster-pcluster-prod \
    --region us-east-2 \
    --key "{\"Id\":{\"S\":\"CLUSTER_CONFIG.$instance_id\"}}" \
    --update-expression "SET #data.#cv = :newver, #data.#time = :time" \
    --expression-attribute-names '{"#data":"Data","#cv":"cluster_config_version","#time":"lastUpdateTime"}' \
    --expression-attribute-values "{\":newver\":{\"S\":\"$NEW_VERSION\"},\":time\":{\"S\":\"$(date -u '+%Y-%m-%d %H:%M:%S UTC')\"}}"
done

# 4. Wait for HeadNode to complete readiness check (~2 minutes)
```
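
For anyone scripting this in Python rather than shell, the same update can be expressed as an `UpdateItem` request. The sketch below only builds the request arguments; the helper name `build_update_request` is hypothetical, and the key layout and attribute names are copied from the shell loop above. One deliberate difference, which is an addition and not part of the workaround as it was run: a `ConditionExpression` restricts the write to records that still hold the old version, so a record that has already changed is left alone.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def build_update_request(
    table_name: str, instance_id: str, old_version: str, new_version: str
) -> Dict[str, Any]:
    """Build keyword arguments for a DynamoDB UpdateItem call."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return {
        "TableName": table_name,
        "Key": {"Id": {"S": f"CLUSTER_CONFIG.{instance_id}"}},
        "UpdateExpression": "SET #data.#cv = :newver, #data.#time = :time",
        # Added guard (not in the shell loop): only touch records still on the old version.
        "ConditionExpression": "#data.#cv = :oldver",
        "ExpressionAttributeNames": {
            "#data": "Data",
            "#cv": "cluster_config_version",
            "#time": "lastUpdateTime",
        },
        "ExpressionAttributeValues": {
            ":newver": {"S": new_version},
            ":oldver": {"S": old_version},
            ":time": {"S": timestamp},
        },
    }


request = build_update_request(
    "parallelcluster-pcluster-prod",
    "i-0cb03846fa80df86f",
    old_version="x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX",
    new_version="iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr",
)
print(request["Key"])

# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("dynamodb", region_name="us-east-2").update_item(**request)
```

If the condition does not hold, DynamoDB rejects the write with a conditional check failure, which makes an unexpected record state visible instead of silently overwriting it.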

**Result**: Cluster update completed successfully, jobs continued running without interruption.

**Safety**: This workaround is safe when:
- Only adding new queues (no modifications to existing queues)
- No CustomActions or IAM policy changes affecting existing nodes
- The update therefore changes metadata only (the recorded version), not the configuration actually applied on the nodes

### Proposed Fix

**Option A**: Modify `check_deployed_config_version()` to only check nodes in modified queues

```python
from typing import List, Optional


def check_deployed_config_version(
    cluster_name: str,
    table_name: str,
    expected_config_version: str,
    region: str,
    modified_queues: Optional[List[str]] = None,  # ← NEW PARAMETER
):
    if modified_queues:
        # Only check nodes in the specified queues,
        # filtering by tag:parallelcluster:queue-name
        ...
    else:
        # Check all nodes (backward compatible)
        ...
```

**Option B**: Skip version check when only adding queues (no modifications)

```python
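# `logger` and `Optional` are assumed to be available in the module
def check_deployed_config_version(
    cluster_name: str, table_name: str, expected_config_version: str, region: str,
    change_type: Optional[str] = None,  # ← NEW PARAMETER
):
    if change_type == "QUEUE_ADDITION":
        logger.info("Skipping config version check for queue addition")
        return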
```
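
Option B depends on knowing that an update is a pure queue addition. One possible way to derive that, sketched here under assumptions, is to compare the `SlurmQueues` lists of the old and new configurations after they have been parsed into Python dictionaries. The function name `is_queue_addition_only` is hypothetical, and a real implementation would also need to confirm that nothing outside `SlurmQueues` changed.

```python
from typing import Any, Dict, List


def is_queue_addition_only(
    old_queues: List[Dict[str, Any]], new_queues: List[Dict[str, Any]]
) -> bool:
    """True if every old queue is present unchanged and at least one queue is new."""
    old_by_name = {queue["Name"]: queue for queue in old_queues}
    new_by_name = {queue["Name"]: queue for queue in new_queues}
    unchanged = all(
        new_by_name.get(name) == queue for name, queue in old_by_name.items()
    )
    added = set(new_by_name) - set(old_by_name)
    return unchanged and bool(added)


existing = {
    "Name": "existing-queue",
    "ComputeResources": [
        {"Name": "compute-resource", "InstanceType": "p5e.48xlarge",
         "MinCount": 2, "MaxCount": 10}
    ],
}
new = {
    "Name": "new-queue",
    "CapacityType": "ONDEMAND",
    "ComputeResources": [
        {"Name": "new-compute", "InstanceType": "g6e.12xlarge",
         "MinCount": 0, "MaxCount": 4}
    ],
}

print(is_queue_addition_only([existing], [existing, new]))  # True
modified = {**existing, "ComputeResources": []}
print(is_queue_addition_only([existing], [modified, new]))  # False
```

A removed or edited queue makes the function return `False`, so the version check would still run in those cases.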

**Option C**: Add configuration option to skip readiness check

```yaml
DevSettings:
  SkipConfigVersionCheckOnQueueAddition: true
```

### Infrastructure Available

ParallelCluster already tags instances with queue names:
```python
# cli/src/pcluster/constants.py
PCLUSTER_QUEUE_NAME_TAG = "parallelcluster:queue-name"
```

The tag exists but `check_cluster_ready.py` doesn't use it for filtering.
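
As a rough sketch of what tag-based filtering could look like, the snippet below builds EC2 `DescribeInstances` filters from a list of queue names and pages through the results. The helper names are hypothetical and this is not how `list_cluster_instance_ids_iterator` is implemented; a real change would also have to keep the query restricted to the cluster, as the existing iterator does through its `cluster_name` argument. The client is passed in so the example has no import-time dependency on boto3.

```python
from typing import Any, Dict, Iterable, Iterator, List

QUEUE_NAME_TAG = "parallelcluster:queue-name"  # value from cli/src/pcluster/constants.py


def build_queue_filters(queue_names: Iterable[str]) -> List[Dict[str, Any]]:
    """EC2 DescribeInstances filters for running instances in the given queues."""
    return [
        {"Name": f"tag:{QUEUE_NAME_TAG}", "Values": sorted(queue_names)},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]


def iter_instance_ids(ec2_client: Any, queue_names: Iterable[str]) -> Iterator[str]:
    """Yield instance IDs page by page; `ec2_client` is expected to be a boto3 EC2 client."""
    paginator = ec2_client.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=build_queue_filters(queue_names)):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                yield instance["InstanceId"]


print(build_queue_filters(["new-queue"]))
```

With boto3 installed and credentials configured, `iter_instance_ids(boto3.client("ec2", region_name="us-east-2"), ["new-queue"])` would yield only the running instances tagged with that queue.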

### Related Issues

- Similar issue reported in #7166 (ghost records during update)
- Related to #4286 (cluster update failure when adding queue, different root cause)

### Environment Details

- **ParallelCluster Version**: 3.14.0
- **Region**: us-east-2
- **Scheduler**: SLURM
- **QueueUpdateStrategy**: `TERMINATE` (changing to `DRAIN` doesn't help)
- **Date**: 2026-01-21

This appears to be a design issue where the readiness check assumes a monolithic cluster configuration rather than supporting incremental queue additions. A fix would enable truly elastic HPC clusters that can scale capacity without disrupting running workloads.

