Skip to content

Update ocdebug to ssh#31216

Open
kasturinarra wants to merge 1 commit into
openshift:mainfrom
kasturinarra:fix_failure
Open

Update ocdebug to ssh#31216
kasturinarra wants to merge 1 commit into
openshift:mainfrom
kasturinarra:fix_failure

Conversation

@kasturinarra
Copy link
Copy Markdown
Contributor

@kasturinarra kasturinarra commented May 26, 2026

Summary by CodeRabbit

  • Tests
    • Kernel panic recovery test now uses direct hypervisor-mediated SSH for post-crash verification, skips when SSH access is unavailable, establishes multi-hop SSH to target nodes, reads cluster state and configuration remotely, and strengthens checks for the etcd container and recovery manager logs to ensure automated recovery occurred.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Walkthrough

The kernel-panic recovery test now uses hypervisor-mediated two-hop SSH for post-crash verification. It builds SSH known_hosts, queries survived-node pacemaker/etcd metadata via SSH, and performs SSH-based Eventually assertions on the target node's containers and pacemaker logs.

Changes

SSH-based post-crash verification

Layer / File(s) Summary
Hypervisor SSH plumbing
test/extended/edge_topologies/tnf_recovery.go
Parses and validates hypervisor SSH config (private key readable), builds local known_hosts, determines survived/target internal IPs, and prepares per-node remote known_hosts for two-hop SSH.
Survived-node retrieval and target-node verification
test/extended/edge_topologies/tnf_recovery.go
Survived-node queries (pacemaker journal bump-amount, force-new-cluster-bump-amount from /var/lib/etcd/config.yaml, .maxRaftIndex from /var/lib/etcd/revision.json) are executed via hypervisor-mediated SSH; target-node checks (etcd podman running, etcd-previous container exists, pacemaker logs show pod.yaml recreation) are done via SSH inside Eventually assertions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Missing assertion messages at lines 458, 465, 536, 538, 512, 528. Grep at line 575 lacks exact matching (should use '^etcd-previous$'). Add failure messages to all Expect() and Eventually().ShouldNot() calls. Change line 575 grep pattern to '^etcd-previous$' for exact matching as noted in review comment.
Title check ❓ Inconclusive The title 'Update ocdebug to ssh' is partially related to the changeset but lacks specificity. It references the tool change (ocdebug to ssh) but doesn't clarify the context or the actual purpose of the change (kernel panic recovery verification in a test). Consider a more descriptive title such as 'Switch kernel panic recovery test to use SSH instead of ocdebug' to better communicate the change's purpose and scope.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test names (1 Describe, 10 It blocks) are stable with no dynamic information like pod names, timestamps, UUIDs, node names, IP addresses, or variable interpolation.
Microshift Test Compatibility ✅ Passed All tests are in a Describe block marked with [apigroup:config.openshift.io], which MicroShift CI automatically skips, properly protecting them from running on MicroShift.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Test is protected from SNO via SkipIfNotTopology(oc, v1.DualReplicaTopologyMode) in BeforeEach, which skips all tests on single-node clusters by checking infrastructure.Status.ControlPlaneTopology.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies test code only; custom check applies only to deployment manifests, operators, or controllers. File is test/extended/edge_topologies/tnf_recovery.go with no scheduling constraints.
Ote Binary Stdout Contract ✅ Passed All logging uses framework-approved mechanisms: framework.Logf, GinkgoWriter, and Ginkgo test output. No prohibited stdout writes (fmt.Print/Println, log.Print, klog) found in the modified code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed New test file contains no IPv4 hardcoding, IPv4-specific parsing, or external connectivity requirements. Node IPs obtained dynamically and used safely for SSH connections.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kasturinarra
Once this PR has been reviewed and has the lgtm label, please assign eggfoobar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 427-438: The test currently only checks that hypervisor config
text exists; add an explicit fast-fail validation after
exutil.GetHypervisorConfig() that verifies required SSH fields
(sshCfg.HypervisorIP, sshCfg.SSHUser, sshCfg.PrivateKeyPath) are non-empty and
that the PrivateKeyPath is readable (e.g., os.Stat or attempt to open) before
constructing core.SSHConfig and calling core.PrepareLocalKnownHostsFile; on
validation failure call g.Skip or o.Expect/fatal with a clear message so the
test fails fast instead of later inside Eventually loops.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 081ff095-daf5-4eea-9d2f-b98cd1acf06f

📥 Commits

Reviewing files that changed from the base of the PR and between a84c511 and 65f9f4a.

📒 Files selected for processing (1)
  • test/extended/edge_topologies/tnf_recovery.go

Comment thread test/extended/edge_topologies/tnf_recovery.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/extended/edge_topologies/tnf_recovery.go (1)

573-585: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing sudo for podman command.

Consistent with the previous comment about line 560, this podman ps command likely needs sudo when running over SSH as the core user to access system containers.

🐛 Proposed fix to add sudo
 		prevOutput, _, err := core.ExecuteRemoteSSHCommand(targetNodeIP,
-			"podman ps -a --format '{{.Names}}' | grep -m1 etcd-previous",
+			"sudo podman ps -a --format '{{.Names}}' | grep -m1 etcd-previous",
 			&hypervisorConfig, localKH, targetRemoteKH)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/edge_topologies/tnf_recovery.go` around lines 573 - 585, The
SSH command invoking podman in the Eventually block uses "podman ps -a ..."
without sudo; update the command string passed to core.ExecuteRemoteSSHCommand
(the call in the anonymous func used by o.Eventually) to prefix podman with sudo
(e.g., "sudo podman ps -a --format ... | grep -m1 etcd-previous") so the core
user can list system containers; keep the rest of the call and error handling
(prevOutput, targetNodeIP, hypervisorConfig, localKH, targetRemoteKH, and the
expectation message referencing targetNode.Name) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 517-519: The SSH reads of /var/lib/etcd in the tnf_recovery test
are missing sudo and can fail with permission denied; update the commands passed
to core.ExecuteRemoteSSHCommand (the grep that sets bumpAmountStr and the jq
that reads revision.json) to prefix them with sudo (and keep the existing sudo
on journalctl) so the reads run as root when executed on RHCOS; ensure you
update both command strings that reference /var/lib/etcd/config.yaml and
/var/lib/etcd/revision.json where survivedNodeIP is used.

---

Duplicate comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 573-585: The SSH command invoking podman in the Eventually block
uses "podman ps -a ..." without sudo; update the command string passed to
core.ExecuteRemoteSSHCommand (the call in the anonymous func used by
o.Eventually) to prefix podman with sudo (e.g., "sudo podman ps -a --format ...
| grep -m1 etcd-previous") so the core user can list system containers; keep the
rest of the call and error handling (prevOutput, targetNodeIP, hypervisorConfig,
localKH, targetRemoteKH, and the expectation message referencing
targetNode.Name) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ba461195-f6ea-478f-8ea5-06193e78151a

📥 Commits

Reviewing files that changed from the base of the PR and between 65f9f4a and af71706.

📒 Files selected for processing (1)
  • test/extended/edge_topologies/tnf_recovery.go

Comment thread test/extended/edge_topologies/tnf_recovery.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_recovery.go`:
- Around line 573-585: The grep used in the Eventually block can match
substrings causing flaky matches; update the command passed to
core.ExecuteRemoteSSHCommand to match the container name exactly (for example
use grep -x '^etcd-previous$' or an exact-name filter) so prevOutput will only
be "etcd-previous"; update the check around prevOutput and the error message in
that block (referencing core.ExecuteRemoteSSHCommand, targetNodeIP, prevOutput,
targetNode.Name) to rely on the exact-match output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 310c0f00-29a6-4623-a28a-65ac1f84b85f

📥 Commits

Reviewing files that changed from the base of the PR and between af71706 and 9ce4f00.

📒 Files selected for processing (1)
  • test/extended/edge_topologies/tnf_recovery.go

Comment thread test/extended/edge_topologies/tnf_recovery.go
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 26, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

@kasturinarra: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-two-node-fencing-recovery 9ce4f00 link false /test e2e-metal-ovn-two-node-fencing-recovery
ci/prow/e2e-aws-ovn-fips 9ce4f00 link true /test e2e-aws-ovn-fips

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kasturinarra
Copy link
Copy Markdown
Contributor Author

/test e2e-metal-ovn-two-node-fencing-recovery e2e-aws-ovn-fips

@kasturinarra
Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-ipv6-recovery

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-ipv6-recovery

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/79bf7300-59a2-11f1-92ea-10dc971041d4-0

@kasturinarra
Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

@kasturinarra: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ovn-two-node-fencing-recovery

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/52711730-59b2-11f1-84dc-0186a3c34230-0

@kasturinarra
Copy link
Copy Markdown
Contributor Author

/test e2e-metal-ovn-two-node-fencing-recovery

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant