
fix: use SSH-reachable management IP for all node SSH operations #324

Closed
remipcomaite wants to merge 5 commits into PegaProx:main from remipcomaite:fix/ssh-management-ip-resolution

Conversation

@remipcomaite

Problem

cluster/status returns the Corosync ring IP, which answers on port 8006
(Proxmox API) but is not reachable via SSH when the cluster network is on
a separate VLAN from the management network. This caused SSH shell, SMBIOS
deploy, custom scripts, and node hardening to connect to the wrong IP.

Fix

All SSH operations now resolve the management IP via _get_node_ip(), which:

  • detects the primary node's management interface (VLAN-aware)
  • scores candidate IPs against the same network/VLAN
  • probes the SSH port (not 8006) for reachability
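
For illustration, the SSH-port probe in the last bullet amounts to a short TCP connect with a timeout; a minimal sketch (the helper name quick_probe and the 2-second timeout are illustrative, not taken from the actual diff):

import socket

def quick_probe(ip: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout.

    Probing the SSH port matters because pveproxy answers on 8006 on every
    interface (including the Corosync network), so an 8006 reply proves
    nothing about SSH reachability.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False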

Changes

  • manager.py: _get_node_ip() probes SSH port instead of 8006; STEP 0
    (cluster/status quick path) moved after STEP 1 so primary_network is
    known before accepting a Corosync IP
  • nodes.py: get_node_ip_api (/nodes/<node>/ip, used by SSH shell modal)
  • nodes.py: deploy_smbios_autoconfig_all and run_custom_script
  • nodes.py: get_smbios_autoconfig status/deploy/remove/control
  • auth.py: get_cluster_creds_internal (WebSocket SSH fallback)

Testing

Verified on a cluster with dedicated Corosync VLAN (separate from management
network): SSH shell, SMBIOS deploy, and custom scripts now connect to the
correct management IP.

cluster/status returns the Corosync ring IP which answers on port 8006
but is not necessarily reachable via SSH from the PegaProx server
(separate management vs. cluster VLANs). All SSH operations now resolve
the correct management IP via _get_node_ip(), which scores interfaces,
validates against the primary node's management network, and probes the
SSH port for reachability.
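
To illustrate the network-validation part: candidate IPs inside the primary node's management subnet should outrank anything on a different VLAN, which Python's ipaddress module expresses directly. This is only a sketch of the idea; the real scoring in _get_node_ip() uses more signals, and the function name here is made up:

import ipaddress

def score_candidate(ip: str, mgmt_network: str) -> int:
    """Illustrative scoring: prefer addresses inside the management network."""
    addr = ipaddress.ip_address(ip)
    net = ipaddress.ip_network(mgmt_network, strict=False)
    if addr in net:
        return 100   # same subnet as the primary management interface
    if addr.is_private:
        return 20    # plausible, but only acceptable after an SSH probe
    return 0         # loopback / unrelated address

# A Corosync ring IP on a separate VLAN scores lower than a management IP
assert score_candidate("10.10.0.5", "192.168.1.0/24") < score_candidate("192.168.1.5", "192.168.1.0/24")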

Changes:
- manager.py: _get_node_ip() now probes SSH port (configurable via
  ssh_port, default 22) instead of port 8006; STEP 0 (cluster/status
  quick path) moved after STEP 1 so primary_network is known before
  accepting a Corosync IP
- nodes.py: get_node_ip_api (/nodes/<node>/ip, used by SSH shell modal)
  replaced 3-method manual resolution with _get_node_ip()
- nodes.py: deploy_smbios_autoconfig_all and run_custom_script replaced
  inline cluster/status IP lookup with _get_node_ip()
- nodes.py: get_smbios_autoconfig_status/deploy/remove/control replaced
  inline cluster/status IP lookup with _get_node_ip()
- auth.py: get_cluster_creds_internal (WebSocket SSH fallback) replaced
  cluster/status bulk IP lookup with per-node _get_node_ip()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@qodo-code-review

Review Summary by Qodo

Use SSH-reachable management IP for all node SSH operations

🐞 Bug fix


Walkthroughs

Description
• Replace cluster/status IP lookups with SSH-aware _get_node_ip() across all SSH operations
• Probe SSH port (configurable, default 22) instead of port 8006 for reachability validation
• Filter Corosync IPs against primary node's management network to avoid separate cluster VLANs
• Reorder IP resolution steps: detect management interface first, then validate cluster/status IPs
Diagram
flowchart LR
  A["SSH Operations<br/>shell, SMBIOS, scripts"] -->|"Previously: cluster/status<br/>or manual resolution"| B["Wrong IP<br/>Corosync/cluster VLAN"]
  A -->|"Now: _get_node_ip()"| C["Step 1: Detect<br/>primary mgmt interface"]
  C --> D["Step 0: Validate<br/>cluster/status IP"]
  D --> E["Step 2-5: Score & probe<br/>candidate IPs on SSH port"]
  D --> F["SSH-reachable<br/>management IP"]
  E --> F


File Changes

1. pegaprox/core/manager.py 🐞 Bug fix +52/-34

Reorder IP resolution and probe SSH port for reachability

• Modified _get_node_ip() to probe SSH port (configurable via ssh_port config) instead of port 8006
• Reordered resolution steps: moved STEP 1 (detect primary management interface) before STEP 0 (cluster/status quick path)
• Added management network validation for cluster/status IPs to filter out Corosync ring IPs on separate VLANs
• Updated all probe calls to use SSH port and added port information to log messages
• Cached primary interface detection per-cluster to avoid repeated API calls
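
The per-cluster caching mentioned in the last bullet can be as simple as memoizing the detection result on the manager instance; the attribute and method names below are hypothetical, not the ones in manager.py:

def _get_primary_interface(self):
    """Detect the primary node's management interface once and reuse it."""
    cached = getattr(self, "_primary_iface_cache", None)
    if cached is not None:
        return cached
    # Hypothetical helper standing in for the expensive API-driven detection
    iface = self._detect_primary_management_interface()
    self._primary_iface_cache = iface
    return iface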



2. pegaprox/api/nodes.py 🐞 Bug fix +31/-125

Consolidate node IP resolution to use _get_node_ip()

• get_node_ip_api(): Replaced 3-method manual IP resolution (cluster/status, network config, fallback) with a single _get_node_ip() call
• get_smbios_autoconfig_status(): Replaced inline cluster/status lookup with _get_node_ip()
• deploy_smbios_autoconfig(): Replaced inline cluster/status lookup with _get_node_ip()
• remove_smbios_autoconfig(): Replaced inline cluster/status lookup with _get_node_ip()
• control_smbios_autoconfig(): Replaced inline cluster/status lookup with _get_node_ip()
• deploy_smbios_autoconfig_all(): Removed pre-computed node_ips dict, now calls _get_node_ip() per-node
• run_custom_script(): Removed pre-computed node_ips dict, now calls _get_node_ip() per-node
• get_smbios_autoconfig_status_all(): Removed bulk cluster/status IP resolution



3. pegaprox/api/auth.py 🐞 Bug fix +15/-20

Use _get_node_ip() for WebSocket SSH fallback credentials

• get_cluster_creds_internal(): Replaced bulk cluster/status API call with per-node _get_node_ip() resolution
• Removed 3-method manual IP lookup (cluster/status, network config, fallback) in favor of centralized _get_node_ip()
• Added logging for successful node IP resolution and warnings for failed resolutions





@qodo-code-review

qodo-code-review Bot commented Apr 18, 2026

Code Review by Qodo




Action required

1. Cluster-creds uses missing get_nodes🐞 Bug ≡ Correctness
Description
get_cluster_creds_internal() calls mgr.get_nodes() for Proxmox clusters, but Proxmox codepaths
elsewhere fetch nodes via direct REST calls (not a manager method), so this will throw and cause
cluster-creds to return no per-node IPs (forcing SSH shell to fall back to cluster_host). This
breaks the PR’s goal because the WebSocket shell will often connect to the wrong node when node->IP
mapping is missing.
Code

pegaprox/api/auth.py[R897-905]

+            nodes = mgr.get_nodes() or []
+            for n in nodes:
+                node_name = n.get('node', n.get('name', ''))
+                if not node_name:
+                    continue
+                node_ip = mgr._get_node_ip(node_name)
+                if node_ip:
+                    node_ips[node_name] = node_ip
+                    node_ips[node_name.lower()] = node_ip
Evidence
auth.get_cluster_creds_internal() now depends on mgr.get_nodes(); Proxmox node listing elsewhere
does not use a manager method and explicitly special-cases only XCP-ng for get_nodes(), indicating
Proxmox managers may not implement it. The SSH WebSocket server relies on
/api/internal/cluster-creds for node->IP mapping and falls back to cluster_host when the mapping is
missing, which will defeat per-node IP resolution.

pegaprox/api/auth.py[892-915]
pegaprox/api/clusters.py[295-333]
pegaprox/api/nodes.py[1043-1056]
pegaprox/api/.ssh_ws_server.py[101-155]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_cluster_creds_internal()` calls `mgr.get_nodes()` for Proxmox clusters, but Proxmox node enumeration elsewhere uses direct REST (`/api2/json/nodes`) and only XCP-ng is treated as having `get_nodes()`. This likely raises an exception and causes the cluster-creds response to omit per-node IPs, forcing the SSH WebSocket to fall back to `cluster_host`.
### Issue Context
This endpoint is used by the standalone SSH WebSocket server to map a requested `node` to an SSH-reachable management IP.
### Fix Focus Areas
- pegaprox/api/auth.py[894-909]
- pegaprox/api/clusters.py[319-329]
### Proposed fix
For Proxmox clusters, replace `mgr.get_nodes()` usage with the same REST call used in `get_cluster_nodes()` (`GET https://{mgr.host}:8006/api2/json/nodes`), then iterate that response’s `data` items (using their `node` field). Optionally cache the node list on the manager (as `get_cluster_nodes()` already does) to avoid repeated API calls.
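
A rough sketch of that enumeration against the Proxmox REST API (authentication is assumed to come from the manager's existing session; the auth_headers attribute name is made up):

import requests

def list_proxmox_nodes(mgr) -> list:
    """Enumerate node names via GET /api2/json/nodes instead of mgr.get_nodes()."""
    url = f"https://{mgr.host}:8006/api2/json/nodes"
    resp = requests.get(url, headers=mgr.auth_headers, verify=False, timeout=10)
    resp.raise_for_status()
    return [item["node"] for item in resp.json().get("data", [])]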



2. Wrong-node SSH fallback🐞 Bug ≡ Correctness
Description
Node SSH operations now use node_ip = mgr._get_node_ip(node) or mgr.host, so when _get_node_ip()
returns None these endpoints will SSH into mgr.host (often the connected/primary node) and run
actions intended for a different node. This can silently deploy SMBIOS services or run custom
scripts on the wrong node and misreport success for the target node.
Code

pegaprox/api/nodes.py[R1548-1552]

for node in nodes_to_run:
-        node_ip = node_ips.get(node, mgr.config.host)
+        node_ip = mgr._get_node_ip(node) or mgr.host
    try:
        ssh = mgr._ssh_connect(node_ip)
        if not ssh:
Evidence
nodes.run_custom_script() falls back to mgr.host if _get_node_ip() returns None, and then
executes the script via SSH on that IP. _get_node_ip() explicitly returns None when it can’t
determine a reachable management IP, so the fallback can route operations to the wrong machine. The
same fallback pattern is used in other node-targeted SSH endpoints (e.g., deploy-all).

pegaprox/api/nodes.py[1548-1556]
pegaprox/api/nodes.py[1159-1168]
pegaprox/core/manager.py[6687-6691]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Several node-targeted SSH endpoints default to `mgr.host` when `_get_node_ip(node)` returns `None`. That can execute actions on the wrong node.
### Issue Context
`_get_node_ip()` already logs and returns `None` when it cannot determine a reachable management IP for the requested node.
### Fix Focus Areas
- pegaprox/api/nodes.py[847-855]
- pegaprox/api/nodes.py[894-903]
- pegaprox/api/nodes.py[946-955]
- pegaprox/api/nodes.py[997-1004]
- pegaprox/api/nodes.py[1159-1168]
- pegaprox/api/nodes.py[1548-1556]
- pegaprox/core/manager.py[6687-6691]
### Proposed fix
1. Replace `mgr._get_node_ip(node) or mgr.host` with:
- `node_ip = mgr._get_node_ip(node)`
- if `not node_ip`: return an error (or record per-node failure in bulk operations).
2. Only allow fallback to `mgr.host` when you *prove* the node is the local/connected node (e.g., using the existing local-node detection logic inside `_get_node_ip` or by checking a cluster/status `local=1` match for that node).
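
Sketched in endpoint terms (helper names and response shape are illustrative, not the project's actual code):

from flask import jsonify

def resolve_node_ip_or_error(mgr, node):
    """Single-node endpoints: fail with 502 instead of SSHing into mgr.host."""
    node_ip = mgr._get_node_ip(node)
    if node_ip:
        return node_ip, None
    return None, (jsonify({"error": f"no SSH-reachable IP resolved for {node}"}), 502)

def run_on_nodes(mgr, nodes_to_run, action):
    """Bulk endpoints: record a per-node failure and continue with the rest."""
    results = {}
    for node in nodes_to_run:
        node_ip = mgr._get_node_ip(node)
        if not node_ip:
            results[node] = {"success": False, "error": "no SSH-reachable IP resolved"}
            continue
        results[node] = action(node_ip)
    return results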




Remediation recommended

3. WebSocket SSH ignores ssh_port🐞 Bug ≡ Correctness
Description
_get_node_ip() now probes reachability on the configured ssh_port, but the standalone SSH
WebSocket server still connects with port 22 for both key and password auth. On clusters configured
with non-22 SSH ports, IP resolution may succeed but shell SSH connections will still fail.
Code

pegaprox/core/manager.py[R6426-6429]

+            # Use the configured SSH port for reachability probes -- port 8006
+            # (Proxmox API) is open on ALL interfaces including cluster/corosync
+            # network, which makes it useless for finding the SSH-reachable IP.
+            ssh_port = getattr(self.config, 'ssh_port', 22) or 22
Evidence
The manager now uses self.config.ssh_port for reachability decisions, but the WebSocket SSH server
hard-codes port=22 when opening the SSH connection. This creates a direct mismatch between
‘reachable’ selection and actual connection behavior when ssh_port != 22.

pegaprox/core/manager.py[6426-6429]
pegaprox/api/.ssh_ws_server.py[233-246]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The SSH WebSocket server always connects to port 22, ignoring cluster-configured `ssh_port`.
### Issue Context
`_get_node_ip()` now uses `config.ssh_port` for reachability probing, so the system is explicitly ssh-port-aware.
### Fix Focus Areas
- pegaprox/api/.ssh_ws_server.py[233-246]
- pegaprox/api/auth.py[919-922]
- pegaprox/core/manager.py[6426-6429]
### Proposed fix
Option A (preferred): extend `/api/internal/cluster-creds/<cluster_id>` to also return `ssh_port`, then have `.ssh_ws_server.py` use it in `ssh.connect(..., port=ssh_port, ...)`.
Option B: have `.ssh_ws_server.py` read `ssh_port` from the same config fallback file it already reads for `cluster_host` (if you don’t want to change the API response).
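
Option A in code, roughly (the credential field names are assumptions about the cluster-creds response shape, not confirmed by the PR):

import paramiko

def open_node_shell(creds: dict, node_ip: str) -> paramiko.SSHClient:
    """Connect on the cluster-configured SSH port instead of a hardcoded 22."""
    ssh_port = int(creds.get("ssh_port") or 22)
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(node_ip, port=ssh_port, username=creds["ssh_user"],
                   password=creds.get("ssh_password"), timeout=10)
    return client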



4. Cluster-creds can exceed WS timeout🐞 Bug ➹ Performance
Description
get_cluster_creds_internal() now calls _get_node_ip() sequentially for every node, and
_get_node_ip() may perform multiple 2s TCP probes per node, so the endpoint can easily exceed the
SSH WebSocket server’s 10s request timeout. This will degrade shell startup and force fallbacks
(cluster_host/manual IP) even when correct management IPs exist.
Code

pegaprox/core/manager.py[R6619-6628]

+            # STEP 3: Try candidates with connectivity probe (SSH port)
        # ================================================================
        for score, ip, iface, reason in candidates:
            if score < 20 or ip == primary_ip:
                continue
-                if _quick_probe(ip):
-                    self.logger.info(f"[NodeIP] {node_name} -> {ip} (score={score}, {reason}) reachable")
+                if _quick_probe(ip, port=ssh_port):
+                    self.logger.info(f"[NodeIP] {node_name} -> {ip} (score={score}, {reason}) reachable on :{ssh_port}")
                return ip
            else:
-                    self.logger.debug(f"[NodeIP] {node_name}: {ip} score={score} NOT reachable")
+                    self.logger.debug(f"[NodeIP] {node_name}: {ip} score={score} NOT reachable on :{ssh_port}")
Evidence
The WebSocket server requests cluster-creds with a hard 10s timeout. _get_node_ip() uses
_quick_probe(... timeout=2) and loops over candidates probing each; when performed sequentially
for multiple nodes in cluster-creds, worst-case latency can exceed 10s (especially when targets are
offline / filtered out / probes time out).

pegaprox/api/.ssh_ws_server.py[104-107]
pegaprox/api/auth.py[894-909]
pegaprox/core/manager.py[6408-6417]
pegaprox/core/manager.py[6619-6628]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`/api/internal/cluster-creds/<cluster_id>` resolves IPs for all nodes, but `_get_node_ip()` can be slow due to multiple probe attempts with 2s timeouts; this risks exceeding the WS server’s 10s request timeout.
### Issue Context
The WS server only needs the IP for a single requested node to open a shell.
### Fix Focus Areas
- pegaprox/api/auth.py[894-909]
- pegaprox/api/.ssh_ws_server.py[101-155]
- pegaprox/core/manager.py[6408-6417]
- pegaprox/core/manager.py[6619-6628]
### Proposed fix
Change the contract to resolve a single node at a time:
1. Add a node-scoped endpoint (or a query param) like `/api/internal/cluster-creds/<cluster_id>?node=<name>` that returns just `{host, node, ip, ssh_port}`.
2. Update `.ssh_ws_server.py` to request only the needed node.
3. Optionally keep the existing bulk endpoint for UIs that need full mapping, but implement caching and/or parallelization (e.g., `run_concurrent`) and a hard cap on total probe time.
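
A sketch of the node-scoped contract from steps 1-2 above (the route layout, field names, and the get_manager/list_node_names helpers are illustrative):

from flask import Blueprint, jsonify, request

internal = Blueprint("internal", __name__)

@internal.route("/api/internal/cluster-creds/<cluster_id>")
def cluster_creds(cluster_id):
    mgr = get_manager(cluster_id)  # assumed lookup helper
    ssh_port = getattr(mgr.config, "ssh_port", 22) or 22
    node = request.args.get("node")
    if node:
        # Single-node fast path: one _get_node_ip() call per shell open
        return jsonify({"host": mgr.host, "node": node,
                        "ip": mgr._get_node_ip(node), "ssh_port": ssh_port})
    # Bulk path kept for callers that need the full node -> IP mapping
    node_ips = {n: mgr._get_node_ip(n) for n in mgr.list_node_names()}
    return jsonify({"host": mgr.host, "node_ips": node_ips, "ssh_port": ssh_port})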







remipcomaite and others added 4 commits April 18, 2026 12:57
mgr.get_nodes() is not reliable for Proxmox clusters and raised
exceptions, causing get_cluster_creds_internal() to fall back to
cluster_host for all nodes (no per-node IP resolution).

Replace with the same REST call used by get_cluster_nodes()
(GET /api2/json/nodes), reusing _cached_nodes when already populated
to avoid redundant API calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When _get_node_ip() returns None, the previous `or mgr.host` fallback
could silently execute SSH operations on the wrong node.

- Single-node endpoints (smbios status/deploy/remove/control): return
  HTTP 502 with an explicit error message when the IP cannot be resolved
- Bulk endpoints (deploy-all, run_custom_script): record per-node
  failure and continue instead of connecting to mgr.host
- manager.py _get_node_ip(): last-resort host fallback now uses
  self.host (actual connected node, correct in failover) and requires
  node_name to match either the connected host or the configured primary
  before returning it, preventing wrong-node returns on multi-node clusters

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The SSH WebSocket server was hardcoding port=22 in both ssh.connect()
calls, ignoring the per-cluster ssh_port setting already used by
_get_node_ip() for reachability probing.

- auth.py: get_cluster_creds_internal() now includes ssh_port in its
  response (getattr(mgr.config, 'ssh_port', 22) or 22)
- .ssh_ws_server.py: cluster-creds is always fetched (even when the
  frontend pre-fetches the node IP) so ssh_port is always available;
  both ssh.connect() calls use ssh_port instead of the hardcoded 22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tency

_get_node_ip() can take up to N*2s when all SSH probes time out (2s each).
Resolving every node in the cluster on each shell open risks exceeding the
WebSocket server's 10s request timeout.

- auth.py: get_cluster_creds_internal() accepts ?node=<name>; when set,
  only that node is resolved via _get_node_ip() (single-node fast path).
  The bulk path (no query param) is kept for UIs needing full mapping.
- .ssh_ws_server.py: passes ?node={node} so the endpoint resolves only
  the requested node instead of the whole cluster.
- manager.py: STEP 3 probe loop capped at 3 candidates (sorted by score
  desc) to bound worst-case latency to 3*2s=6s instead of N*2s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MrMasterbay added a commit that referenced this pull request Apr 19, 2026
- manager._get_node_ip: reorder STEP 1 (detect primary mgmt iface) before
  STEP 0 (cluster/status quick path) so corosync ring IPs on separate VLANs
  are filtered against primary_network
- manager._get_node_ip: probe on ssh_port (not 8006 -- pveproxy listens on
  every bridge incl. corosync, so 8006 reachability doesn't prove SSH works)
- manager._get_node_ip: cap STEP 3 probes at 3 -> bounded 6s worst-case,
  stays under the WS server's 10s cluster-creds timeout
- manager._get_node_ip: tighten final host fallback (only when node_name
  matches the connected host, so multi-node clusters can't accidentally SSH
  into the wrong box)
- auth.get_cluster_creds_internal: add ?node=<name> single-node path so the
  SSH shell resolves just the node it needs, not every node
- auth.get_cluster_creds_internal: switch Proxmox path to _get_node_ip per
  node (with REST /api2/json/nodes enumeration, not mgr.get_nodes which is
  XCP-ng only) and add ssh_port to the response
- nodes.py: 7 endpoints (get_node_ip_api, 4x SMBIOS single-node, SMBIOS
  deploy-all, custom scripts, SMBIOS status-all) replace inline
  cluster/status hacks with _get_node_ip -- per-node errors surface instead
  of silently SSHing into mgr.host

Inspired by PR #324 by @remipcomaite; needed to be re-landed on top of the
overlapping v0.9.0.2 edits to manager.py. WS SSH shell still uses port 22
for the connect call itself; the ssh_port field is exposed for when that
changes.
@MrMasterbay
Contributor

Thanks a lot for digging into this @remipcomaite — the VLAN issue you describe is real, and your analysis of why cluster/status returns corosync IPs matched exactly what we were seeing on a couple of customer clusters.

I landed the core ideas directly on main in e59bbb5 instead of merging here, because v0.9.0.2 had shipped overlapping edits to manager._get_node_ip() so the PR branch had drifted into conflicts. What went in:

  • STEP 1 (detect primary mgmt iface) now runs before STEP 0 (cluster/status quick path), so corosync ring IPs are filtered against primary_network
  • reachability probes target ssh_port (not 8006 — pveproxy listens everywhere so 8006 proves nothing)
  • candidate probes capped at 3 so worst-case latency stays under the WS server's 10s timeout
  • cluster-creds gained ?node=<name> single-node resolution + ssh_port in the response
  • SMBIOS/custom-script endpoints now error cleanly instead of falling back to mgr.host when _get_node_ip() returns None
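
The 3-candidate probe cap mentioned above is essentially a sort-and-slice before probing; a sketch, with quick_probe standing in for the actual TCP reachability check (names are illustrative):

def pick_reachable_ip(candidates, ssh_port=22, max_probes=3):
    """candidates: iterable of (score, ip, iface, reason) tuples.

    Probing at most three candidates bounds the worst case to 3 * 2s = 6s,
    comfortably under the WS server's 10s cluster-creds timeout.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    for score, ip, iface, reason in ranked[:max_probes]:
        if quick_probe(ip, port=ssh_port):
            return ip
    return None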

The Qodo review on your PR called out four "bugs"; three of them (get_nodes, WS ignoring ssh_port, perf timeout) were actually already addressed inside your PR itself — the bot seems to have read an earlier revision. Sorry about the noise from that.

Closing this one in favor of the squashed commit. Thanks again for the work!

mkellermann97 added a commit that referenced this pull request Apr 19, 2026
- SSH reachability overhaul for corosync-VLAN setups (#324) — every
  node SSH op now resolves the real mgmt IP, probe is on ssh_port,
  single-node ?node= path in cluster-creds, 3-probe cap to stay under
  WS timeout, safer final host fallback. Adopted from remipcomaite
- Node Hardening PDF/PNG export — CIS / Lynis / STIG / PegaProx
  audit report with stats, per-source tables, optional verbose
  evidence section
- KSM Sharing visible in Node Summary, always shown (matches native
  PVE UI); backend normalizes shared to an int
- CVE Scanner: failed-scan nodes no longer show as green '0 CVEs' —
  grey dash with failed count instead
- CVE Scanner: breathing room in corporate layout (taller stat
  cards, bigger gap, label spacing)
- Rolling Update: reboot_timeout now exposed in the UI next to
  evacuation_timeout — useful for Ceph OSDs / slow-boot nodes (#328)
- Corporate dashboard: single-node clusters show 'Standalone'
  badge instead of red 'Quorum verloren' (quorum lost) (#326)
- Disk Create modal: native select dropdowns no longer dismiss the
  modal on click (#323)
- Translations: verbose audit output, KSM sharing, reboot timeout,
  several DE/EN/FR/ES/PT/KO keys