fix: use SSH-reachable management IP for all node SSH operations #324
remipcomaite wants to merge 5 commits into PegaProx:main
cluster/status returns the Corosync ring IP, which answers on port 8006 but is not necessarily reachable via SSH from the PegaProx server (separate management vs. cluster VLANs). All SSH operations now resolve the correct management IP via _get_node_ip(), which scores interfaces, validates against the primary node's management network, and probes the SSH port for reachability.
Changes:
- manager.py: _get_node_ip() now probes SSH port (configurable via ssh_port, default 22) instead of port 8006; STEP 0 (cluster/status quick path) moved after STEP 1 so primary_network is known before accepting a Corosync IP
- nodes.py: get_node_ip_api (/nodes/<node>/ip, used by SSH shell modal) replaced 3-method manual resolution with _get_node_ip()
- nodes.py: deploy_smbios_autoconfig_all and run_custom_script replaced inline cluster/status IP lookup with _get_node_ip()
- nodes.py: get_smbios_autoconfig_status/deploy/remove/control replaced inline cluster/status IP lookup with _get_node_ip()
- auth.py: get_cluster_creds_internal (WebSocket SSH fallback) replaced cluster/status bulk IP lookup with per-node _get_node_ip()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
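The reachability probe described in this commit can be illustrated with a plain TCP connect. This is a minimal sketch, not the code in manager.py: the function name `ssh_port_reachable` is hypothetical, and the 2-second default timeout is taken from the per-probe timeout mentioned elsewhere in this PR.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper: it only checks that something accepts connections
    on the port; it does not perform an SSH handshake.
    """
    try:
        # create_connection resolves the host and applies the timeout to connect()
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Probing the SSH port rather than 8006 matters because, as noted in this PR, pveproxy answers on 8006 on the Corosync network too, so 8006 reachability says nothing about SSH.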
Review Summary by Qodo
Use SSH-reachable management IP for all node SSH operations
Walkthroughs
Description
• Replace cluster/status IP lookups with SSH-aware _get_node_ip() across all SSH operations
• Probe SSH port (configurable, default 22) instead of port 8006 for reachability validation
• Filter Corosync IPs against primary node's management network to avoid separate cluster VLANs
• Reorder IP resolution steps: detect management interface first, then validate cluster/status IPs
Diagram
flowchart LR
A["SSH Operations<br/>shell, SMBIOS, scripts"] -->|"Previously: cluster/status<br/>or manual resolution"| B["Wrong IP<br/>Corosync/cluster VLAN"]
A -->|"Now: _get_node_ip()"| C["Step 1: Detect<br/>primary mgmt interface"]
C --> D["Step 0: Validate<br/>cluster/status IP"]
D --> E["Step 2-5: Score & probe<br/>candidate IPs on SSH port"]
D --> F["SSH-reachable<br/>management IP"]
E --> F
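The "validate cluster/status IP" step in the diagram amounts to a subnet membership check. The following is a sketch under assumptions: `on_primary_network` is a hypothetical name, and `primary_network` is assumed to be available as a CIDR string such as `10.0.10.0/24`; the real representation in manager.py may differ.

```python
import ipaddress


def on_primary_network(candidate_ip: str, primary_network: str) -> bool:
    """Return True if candidate_ip lies inside the management subnet.

    Hypothetical helper: `primary_network` is assumed to be a CIDR string.
    Unparseable input is treated as "not on the network".
    """
    try:
        ip = ipaddress.ip_address(candidate_ip)
        # strict=False accepts an interface address such as 10.0.10.5/24
        net = ipaddress.ip_network(primary_network, strict=False)
    except ValueError:
        return False
    return ip in net
```

A Corosync ring IP on a separate VLAN fails this check and is not accepted on the quick path, which is why the management interface has to be detected first.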
File Changes
1. pegaprox/core/manager.py
mgr.get_nodes() is not reliable for Proxmox clusters and raised exceptions, causing get_cluster_creds_internal() to fall back to cluster_host for all nodes (no per-node IP resolution). Replace with the same REST call used by get_cluster_nodes() (GET /api2/json/nodes), reusing _cached_nodes when already populated to avoid redundant API calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When _get_node_ip() returns None, the previous `or mgr.host` fallback could silently execute SSH operations on the wrong node.
- Single-node endpoints (smbios status/deploy/remove/control): return HTTP 502 with an explicit error message when the IP cannot be resolved
- Bulk endpoints (deploy-all, run_custom_script): record per-node failure and continue instead of connecting to mgr.host
- manager.py _get_node_ip(): last-resort host fallback now uses self.host (actual connected node, correct in failover) and requires node_name to match either the connected host or the configured primary before returning it, preventing wrong-node returns on multi-node clusters
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
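The two failure-handling patterns in this commit can be sketched without any web framework. Everything here is illustrative: `single_node_response`, `bulk_run`, the `resolve_ip` and `run_on` callables, and the response field names are hypothetical stand-ins, not the actual endpoint code in nodes.py.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Resolver = Callable[[str], Optional[str]]


def single_node_response(node: str, resolve_ip: Resolver) -> Tuple[dict, int]:
    """Single-node pattern: fail loudly with 502 instead of guessing a host."""
    ip = resolve_ip(node)
    if ip is None:
        return {"error": f"Could not resolve an SSH-reachable IP for node '{node}'"}, 502
    return {"node": node, "ip": ip}, 200


def bulk_run(nodes: Iterable[str], resolve_ip: Resolver,
             run_on: Callable[[str], str]) -> Dict[str, dict]:
    """Bulk pattern: record a per-node failure and keep going."""
    results: Dict[str, dict] = {}
    for node in nodes:
        ip = resolve_ip(node)
        if ip is None:
            # No fallback to the cluster host: that could hit the wrong node
            results[node] = {"success": False, "error": "IP could not be resolved"}
            continue
        results[node] = {"success": True, "output": run_on(ip)}
    return results
```

The point of both patterns is the same: an unresolved IP becomes a visible error rather than an SSH session on whatever host the manager happens to be connected to.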
The SSH WebSocket server was hardcoding port=22 in both ssh.connect() calls, ignoring the per-cluster ssh_port setting already used by _get_node_ip() for reachability probing.
- auth.py: get_cluster_creds_internal() now includes ssh_port in its response (getattr(mgr.config, 'ssh_port', 22) or 22)
- .ssh_ws_server.py: cluster-creds is always fetched (even when the frontend pre-fetches the node IP) so ssh_port is always available; both ssh.connect() calls use ssh_port instead of the hardcoded 22
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
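The `getattr(...) or 22` expression quoted in this commit handles two cases at once: a config object with no ssh_port attribute, and one where the attribute exists but is empty. A small sketch, with `effective_ssh_port` as a hypothetical wrapper name and `SimpleNamespace` standing in for the real config object:

```python
from types import SimpleNamespace


def effective_ssh_port(config) -> int:
    """Mirror of the expression from the commit message.

    getattr's default covers a missing attribute; the trailing `or 22`
    covers a present but falsy value such as None or 0.
    """
    return getattr(config, "ssh_port", 22) or 22


# Stand-in config objects for illustration only
print(effective_ssh_port(SimpleNamespace()))                # attribute missing
print(effective_ssh_port(SimpleNamespace(ssh_port=None)))   # attribute unset
print(effective_ssh_port(SimpleNamespace(ssh_port=2222)))   # custom port
```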
…tency
_get_node_ip() can take up to N*2s when all SSH probes time out (2s each).
Resolving every node in the cluster on each shell open risks exceeding the
WebSocket server's 10s request timeout.
- auth.py: get_cluster_creds_internal() accepts ?node=<name>; when set,
only that node is resolved via _get_node_ip() (single-node fast path).
The bulk path (no query param) is kept for UIs needing full mapping.
- .ssh_ws_server.py: passes ?node={node} so the endpoint resolves only
the requested node instead of the whole cluster.
- manager.py: STEP 3 probe loop capped at 3 candidates (sorted by score
desc) to bound worst-case latency to 3*2s=6s instead of N*2s.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
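The probe cap can be sketched as a sort followed by a slice. This is an illustration under assumptions, not the STEP 3 loop itself: `first_reachable`, the `(ip, score)` tuple shape, and the `probe` callable are hypothetical, while the cap of 3 and the 2-second timeout come from the commit message.

```python
from typing import Callable, Iterable, Optional, Tuple

MAX_PROBES = 3          # cap from the commit message
PROBE_TIMEOUT_S = 2.0   # per-probe timeout from the commit message


def first_reachable(candidates: Iterable[Tuple[str, int]],
                    probe: Callable[[str], bool],
                    max_probes: int = MAX_PROBES) -> Optional[str]:
    """Probe at most `max_probes` candidates, highest score first.

    Returns the first IP whose probe succeeds, or None if all probed
    candidates fail. Lower-ranked candidates are never probed.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for ip, _score in ranked[:max_probes]:
        if probe(ip):
            return ip
    return None


# Worst case: every probe times out
WORST_CASE_S = MAX_PROBES * PROBE_TIMEOUT_S  # 3 * 2s = 6s
```

The trade-off is that a reachable candidate ranked fourth or lower is never tried, so the cap relies on the scoring putting the management interface near the top.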
- manager._get_node_ip: reorder STEP 1 (detect primary mgmt iface) before STEP 0 (cluster/status quick path) so corosync ring IPs on separate VLANs are filtered against primary_network
- manager._get_node_ip: probe on ssh_port (not 8006 -- pveproxy listens on every bridge incl. corosync, so 8006 reachability doesn't prove SSH works)
- manager._get_node_ip: cap STEP 3 probes at 3 -> bounded 6s worst-case, stays under the WS server's 10s cluster-creds timeout
- manager._get_node_ip: tighten final host fallback (only when node_name matches the connected host, so multi-node clusters can't accidentally SSH into the wrong box)
- auth.get_cluster_creds_internal: add ?node=<name> single-node path so the SSH shell resolves just the node it needs, not every node
- auth.get_cluster_creds_internal: switch Proxmox path to _get_node_ip per node (with REST /api2/json/nodes enumeration, not mgr.get_nodes which is XCP-ng only) and add ssh_port to the response
- nodes.py: 7 endpoints (get_node_ip_api, 4x SMBIOS single-node, SMBIOS deploy-all, custom scripts, SMBIOS status-all) replace inline cluster/status hacks with _get_node_ip -- per-node errors surface instead of silently SSHing into mgr.host
Inspired by PR #324 by @remipcomaite; needed to be re-landed on top of the overlapping v0.9.0.2 edits to manager.py. WS SSH shell still uses port 22 for the connect call itself; the ssh_port field is exposed for when that changes.
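The `?node=<name>` single-node path can be sketched as two small pieces: the client adds the query parameter, and the server narrows the set of nodes it resolves. All names here are hypothetical (`cluster_creds_url`, `resolve_node_ips`, the example path), and the real endpoint returns credentials as well as IPs.

```python
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import urlencode


def cluster_creds_url(base_url: str, node: Optional[str] = None) -> str:
    """Client side: append ?node=<name> when a single node is wanted."""
    if node:
        return f"{base_url}?{urlencode({'node': node})}"
    return base_url


def resolve_node_ips(all_nodes: Iterable[str],
                     resolve_ip: Callable[[str], Optional[str]],
                     node: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Server side: resolve only the requested node, or all nodes if none given."""
    targets = [node] if node else list(all_nodes)
    return {name: resolve_ip(name) for name in targets}
```

With the single-node path, opening a shell costs one resolution (bounded by the probe cap) instead of one per cluster member, which is what keeps the request under the WebSocket server's timeout.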
Thanks a lot for digging into this @remipcomaite — the VLAN issue you describe is real, and your analysis of why cluster/status returns corosync IPs matched exactly what we were seeing on a couple of customer clusters. I landed the core ideas directly on main in e59bbb5 instead of merging here, because v0.9.0.2 had shipped overlapping edits to
The Qodo review on your PR called out four "bugs"; three of them (
Closing this one in favor of the squashed commit. Thanks again for the work!
- SSH reachability overhaul for corosync-VLAN setups (#324): every node SSH op now resolves the real mgmt IP, probe is on ssh_port, single-node ?node= path in cluster-creds, 3-probe cap to stay under WS timeout, safer final host fallback. Adopted from remipcomaite
- Node Hardening PDF/PNG export: CIS / Lynis / STIG / PegaProx audit report with stats, per-source tables, optional verbose evidence section
- KSM Sharing visible in Node Summary, always shown (matches native PVE UI); backend normalizes shared to an int
- CVE Scanner: failed-scan nodes no longer show as green '0 CVEs'; they show a grey dash with failed count instead
- CVE Scanner: breathing room in corporate layout (taller stat cards, bigger gap, label spacing)
- Rolling Update: reboot_timeout now exposed in the UI next to evacuation_timeout, useful for Ceph OSDs / slow-boot nodes (#328)
- Corporate dashboard: single-node clusters show 'Standalone' badge instead of red 'Quorum verloren' (German for "quorum lost") (#326)
- Disk Create modal: native select dropdowns no longer dismiss the modal on click (#323)
- Translations: verbose audit output, KSM sharing, reboot timeout, several DE/EN/FR/ES/PT/KO keys
Problem
cluster/status returns the Corosync ring IP, which answers on port 8006
(Proxmox API) but is not reachable via SSH when the cluster network is on
a separate VLAN from the management network. This caused SSH shell, SMBIOS
deploy, custom scripts, and node hardening to connect to the wrong IP.
Fix
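All SSH operations now resolve the management IP via
_get_node_ip(), which:
- scores interfaces
- validates against the primary node's management network
- probes the SSH port for reachability
Changes
- manager.py: _get_node_ip() probes SSH port instead of 8006; STEP 0 (cluster/status quick path) moved after STEP 1 so primary_network is known before accepting a Corosync IP
- nodes.py: get_node_ip_api (/nodes/<node>/ip, used by SSH shell modal) replaced 3-method manual resolution with _get_node_ip()
- nodes.py: deploy_smbios_autoconfig_all and run_custom_script replaced inline cluster/status IP lookup with _get_node_ip()
- nodes.py: get_smbios_autoconfig_status/deploy/remove/control replaced inline cluster/status IP lookup with _get_node_ip()
- auth.py: get_cluster_creds_internal (WebSocket SSH fallback) replaced cluster/status bulk IP lookup with per-node _get_node_ip()
Testing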
Verified on a cluster with dedicated Corosync VLAN (separate from management
network): SSH shell, SMBIOS deploy, and custom scripts now connect to the
correct management IP.