Ssh fix #949

Open
PraveenPenguin wants to merge 11 commits into open-power:master from PraveenPenguin:ssh_fix

@PraveenPenguin
Collaborator

No description provided.

@PraveenPenguin PraveenPenguin marked this pull request as draft April 28, 2026 16:12
@PraveenPenguin
Collaborator Author

@vrbagalkote, this is the initial patchset to run commands over SSH instead of the console. It also has a fallback mechanism for when the OS is not up (that is, when SSH is not working).

@PraveenPenguin PraveenPenguin force-pushed the ssh_fix branch 9 times, most recently from 1de7a4a to a34f6c6 Compare April 30, 2026 07:23
@PraveenPenguin
Collaborator Author

Logs: https://pastebin.com/nMfqRQpK

Add new SSH-based communication layer for op-test framework:
- OpTestSSHConnection: Direct SSH connection management
- OpTestCommandExecutor: Command execution with timeout handling
- OpTestConnectionManager: Unified connection management
- Update Exceptions.py with SSH exception classes
- Integrate into OpTestHost for seamless operation
- Add paramiko dependency for SSH support

This provides the foundation for SSH-first architecture,
enabling more reliable communication with HMC/LPAR systems.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>
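
The SSH-first idea with a console fallback, as described in this commit and in the comment above, can be sketched roughly as follows. This is only an illustration and not the code in this PR: `run_ssh_first`, `SSHUnavailable`, and the two callables are hypothetical names, and in the real framework the SSH callable would presumably wrap a paramiko-based connection.

```python
from typing import Callable, List


class SSHUnavailable(Exception):
    """Hypothetical error raised when the SSH path cannot be used."""


def run_ssh_first(command: str,
                  ssh_run: Callable[[str], List[str]],
                  console_run: Callable[[str], List[str]]) -> List[str]:
    """Try SSH first; fall back to the console only if SSH is unavailable."""
    try:
        return ssh_run(command)
    except SSHUnavailable:
        # The OS may be down or sshd not started yet: use the console path.
        return console_run(command)


# Usage with stand-in transports (assumptions, no real connection is made).
def fake_ssh_down(cmd: str) -> List[str]:
    raise SSHUnavailable("host not reachable")


def fake_console(cmd: str) -> List[str]:
    return ["console:" + cmd]


print(run_ssh_first("uname -r", fake_ssh_down, fake_console))
```

Only the dedicated "SSH unavailable" error triggers the fallback in this sketch, so a command that runs and fails over SSH is not silently retried on the console.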
Add migration script to help transition test files to new SSH architecture.
Includes automated test file migration capabilities.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>

Enable direct SSH command execution for HMC operations.
Implement SSH-first approach for system state management.
Make HMC console connection lazy-loaded only when needed.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>
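
A lazy-loaded console connection of the kind this commit mentions could look roughly like the sketch below. The class name `LazyConsole` and the `connect` factory are hypothetical and are not taken from OpTestHMC; the point is only that the connection is created on first access rather than at construction.

```python
from typing import Callable, Optional


class LazyConsole:
    """Creates the console connection on first access, not at construction."""

    def __init__(self, connect: Callable[[], object]) -> None:
        self._connect = connect                  # hypothetical connection factory
        self._console: Optional[object] = None   # nothing is opened yet

    @property
    def console(self) -> object:
        if self._console is None:                # first use: connect now
            self._console = self._connect()
        return self._console

    @property
    def connected(self) -> bool:
        return self._console is not None


# Usage: the factory is only called when .console is first read.
lazy = LazyConsole(connect=lambda: "console-handle")
print(lazy.connected)   # False
print(lazy.console)     # console-handle
print(lazy.connected)   # True
```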
Update OpTestSysinfo to use SSH for data collection.
Improve sysinfo output formatting and command output display.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>

Change SSH prompt from console-expect to ssh-expect for clarity.
Initialize SSH pty automatically when accessed.
Handle SSH connection failures during HMC initialization gracefully.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>

Update test files to use new SSH architecture for improved reliability.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>
@PraveenPenguin PraveenPenguin force-pushed the ssh_fix branch 5 times, most recently from 3b0baae to 0055d0e Compare May 14, 2026 15:22
@abdhaleegit
Collaborator

@PraveenPenguin was there an existing issue with the pexpect approach? Is there a reason to migrate to SSH?

@PraveenPenguin
Collaborator Author

> @PraveenPenguin was there an existing issue with the pexpect approach? Is there a reason to migrate to SSH?

Yes, this change is mainly intended to gradually reduce our dependency on pexpect. In this initial patch set, we are trying to limit console-based interactions wherever feasible and instead run commands over SSH, which should help in minimising potential breakages and improving overall stability.
To your question specifically, there isn’t any known inherent issue with pexpect at the moment. Over time, we’ve already incorporated several workarounds (such as timeouts and similar mechanisms) to make it reasonably reliable for our current needs.
That said, the motivation here is more about moving towards a cleaner and more maintainable approach by avoiding console dependencies wherever possible. Once this initial change is in place, we can thoughtfully extend the same pattern to other areas as well.

@SACHIN-BAPPALIGE

@PraveenPenguin
Basic tests worked, and other cases are in progress. One observation from testing: this PR introduces a new dependency (paramiko) which, on ppc64le architectures, may require native compilation when the matching Python development headers are missing. To make this more robust, could you please add a check for the paramiko package and, if it is not already installed, ensure that paramiko and its required dependency packages are installed as part of the setup?
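
A check of the kind requested here could be sketched as below. The helper names `missing_modules` and `pip_install_command` are hypothetical and not part of op-test; the sketch only detects missing importable modules and builds a pip command without running it. It does not cover system-level packages such as Python development headers.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def pip_install_command(packages: Iterable[str]) -> List[str]:
    """Build (but do not run) a pip command for the given packages."""
    return [sys.executable, "-m", "pip", "install", *packages]


# Usage: report what is missing and show the command a setup step could run.
missing = missing_modules(["paramiko"])
if missing:
    print("missing:", missing)
    print("would run:", " ".join(pip_install_command(missing)))
else:
    print("paramiko is available")
```

Building the command separately from running it keeps the decision to install (and with which privileges) in the hands of the setup script.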

@vrbagalkot

> @PraveenPenguin Basic tests worked, and other cases are in progress. One observation from testing: this PR introduces a new dependency (paramiko) which, on ppc64le architectures, may require native compilation when the matching Python development headers are missing. To make this more robust, could you please add a check for the paramiko package and, if it is not already installed, ensure that paramiko and its required dependency packages are installed as part of the setup?

@SACHIN-BAPPALIGE Can you please share the logs of the tests that were run? If they contain sensitive info, please share them via Slack.

@SACHIN-BAPPALIGE

@vrbagalkot Please find the attached logs for the Fadump case.
@PraveenPenguin The Fadump case passed, but logs were not captured after triggering the crash.

[ssh-expect]#sh -c 'echo c > /proc/sysrq-trigger'
2026-05-26 10:08:31,785:op-test.common.OpTestHost:run_command:INFO: 127.0.0.1-2026-05-25-16:19:28
2026-05-26 10:08:31,785:op-test.common.OpTestHost:run_command:INFO: 127.0.0.1-2026-05-26-09:56:15
2026-05-26 10:08:31,785:op-test.common.OpTestHost:run_command:INFO:Running command via SSH: ls /var/crash/127.0.0.1-2026-05-26-09:56:15/vmcore*
2026-05-26 10:08:32,011:op-test.common.OpTestHost:run_command:INFO: /var/crash/127.0.0.1-2026-05-26-09:56:15/vmcore
.
.
ok
Ran 1 test in 1006.429s
OK
2026-05-26 10:08:34,255:op-test::INFO:Exit with Result errors="0" and failures="0"
Fadump_only_SanityTest-Logs-2.txt

@PraveenPenguin PraveenPenguin force-pushed the ssh_fix branch 14 times, most recently from e37b442 to 4a1a59f Compare May 27, 2026 14:25
- Add 60s initial delay before first SSH attempt
- Increase per-attempt SSH timeout from 10s to 30s
- System needs more time to establish SSH after crash/reboot
- Improves reliability of SSH reconnection detection
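
The reconnection timing described in the list above (a 60s initial delay, then attempts with a 30s timeout each) can be sketched as follows. This is an illustration under assumptions, not the PR's code: `wait_for_ssh`, `port_open`, `retry_interval`, and `max_attempts` are hypothetical, and a successful TCP connect to port 22 is only a rough stand-in for a working SSH login.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int = 22, timeout: float = 30.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ssh(probe: Callable[[], bool],
                 initial_delay: float = 60.0,
                 retry_interval: float = 5.0,     # assumed value
                 max_attempts: int = 20,          # assumed value
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait initial_delay, then call probe until it succeeds or attempts run out."""
    sleep(initial_delay)                 # give the system time to crash and reboot
    for attempt in range(max_attempts):
        if probe():
            return True
        if attempt < max_attempts - 1:
            sleep(retry_interval)        # pause before the next attempt
    return False
```

A caller might use it as `wait_for_ssh(lambda: port_open("lpar.example.com"))`, where the hostname is a placeholder. Passing `sleep` as a parameter lets tests replace the real delays with a recorder.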

Fix console thread logging for multiple kdump tests:
When running a test suite with multiple kdump tests, console thread logs
(from 'echo c' crash trigger) were only captured for the first test case.
Subsequent tests showed no console output, making crash/reboot debugging
impossible.

Root cause: Console connections were not being properly cleaned up between
tests, causing subsequent tests to attempt reusing stale/closed console
connections.

Changes:
- OpTestKernelDump.py: Add explicit console deactivation before each test
  with 2-second delay to ensure HMC fully releases the console
- OpTestUtil.py: Add console cleanup at monitoring thread start to ensure
  fresh connections

This ensures each test in a suite gets a clean, active console connection
and all crash/reboot sequences are properly captured in logs.

Fix SSH vs Console usage in kdump tests:
Tests were incorrectly using console (console-expect prompt) instead of SSH
for pre-crash operations, causing unnecessary console connections and missing
proper SSH logging.

Changes:
- OpTestKernelDump.py: Use host_run_command() and run_command_direct() for
  crash trigger instead of pty.sendline() to avoid console connections
- OpTestInstallUtil.py: Use SSH direct command for reboot instead of console
  pty to avoid triggering console-expect prompt
- OpTestSSH.py: Enhanced logging in run_command_direct() to INFO level for
  better visibility of SSH command execution and output

This ensures all pre-crash operations use SSH (ssh-expect) and console is
only used for monitoring crash/reboot sequences, providing proper separation
of concerns and complete logging.

Signed-off-by: Praveen K Pandey <praveen@linux.vnet.ibm.com>
@PraveenPenguin PraveenPenguin force-pushed the ssh_fix branch 3 times, most recently from 6e382f7 to 3cbe0ef Compare May 29, 2026 19:29
- Fix console connection caching issue in OpTestHMC.connect()
  * Validate file descriptor before returning cached connection
  * Force reconnection if cached pty has invalid fd or is not alive
  * Prevents reusing broken pexpect Spawn objects across retries

- Fix missing HMC dumprestart command execution
  * Change from console-based to direct SSH command execution
  * Use cv_HMC.ssh.run_command_direct() instead of cv_HMC.run_command()
  * Bypasses broken console connection entirely
  * Adds proper logging to show command execution

- Add vmcore directory cleanup at test start
  * Remove old crash directories before each test run
  * Ensures clean test environment and proper test isolation
  * Handles both local and remote (net) dump locations

- Include previous console monitoring and SSH improvements
  * Console retry logic with proper cleanup
  * Direct SSH command execution for reliability
  * Thread synchronization improvements

These fixes resolve the child_fd=-1 console connection failures and
ensure HMC crash tests actually trigger kernel dumps.
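
The cached-connection check described in the first bullet of this commit could be sketched like this. It is not the code in `OpTestHMC.connect()`: `pty_is_usable` and `get_console` are hypothetical helpers, and the sketch relies only on the `child_fd` attribute and `isalive()` method that pexpect spawn objects expose.

```python
import os
from typing import Callable


def pty_is_usable(pty) -> bool:
    """True if the cached pexpect-style object has a valid fd and a live child."""
    if pty is None:
        return False
    fd = getattr(pty, "child_fd", -1)
    if fd is None or fd < 0:             # child_fd == -1 means the spawn was closed
        return False
    try:
        os.fstat(fd)                     # raises OSError if the fd is no longer open
    except OSError:
        return False
    return bool(pty.isalive())


def get_console(cached, reconnect: Callable[[], object]):
    """Return the cached connection if usable, otherwise force a reconnect."""
    return cached if pty_is_usable(cached) else reconnect()
```

With this shape, a retry that finds `child_fd == -1` creates a fresh connection instead of reusing the broken object.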

4 participants