Skip to content

test: add debugging for raid tests#625

Open
richm wants to merge 5 commits into
linux-system-roles:mainfrom
richm:test-catch-raid-errors
Open

test: add debugging for raid tests#625
richm wants to merge 5 commits into
linux-system-roles:mainfrom
richm:test-catch-raid-errors

Conversation

@richm

@richm richm commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

The raid tests are the ones which most often fail with flakes. Add
some debugging to try to find out why they fail.

Signed-off-by: Rich Megginson rmeggins@redhat.com

Summary by CodeRabbit

  • Tests
    • Added a debug task to capture detailed system/storage/RAID diagnostics when tests fail, ensuring required tools are present.
    • Wrapped RAID test sequences in structured blocks with rescue handling so failures trigger debug capture and then fail explicitly.
    • Reorganized RAID lifecycle and option tests (create/verify/idempotence/remove) for clearer ordering and more reliable, diagnosable outcomes.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a debug task that collects system and RAID diagnostics and updates four RAID test playbooks to wrap test sequences in Ansible block/rescue handlers that include the debug task and then fail the test on error.

Changes

Test failure diagnostics

Layer / File(s) Summary
Debug information collection task
tests/tasks/debug.yml
New Ansible task module that collects system debug information via a strict Bash script executing storage commands (lsblk, mdadm, dmsetup, lvm, udevadm), enumerates /dev/md* and runs mdadm --detail, outputs selected lvs/pvs fields and appends logs; task uses changed_when: false.
Apply debug rescue handlers to test playbooks
tests/tests_create_raid_pool_then_remove.yml, tests/tests_create_raid_volume_then_remove.yml, tests/tests_raid_pool_options.yml, tests/tests_raid_volume_options.yml
Four RAID test playbooks are wrapped in named Ansible blocks with rescue sections that include tests/tasks/debug.yml and then call fail: on error; task ordering around RAID create/verify/remove flows was adjusted accordingly.
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Description Format ⚠️ Warning PR description lacks required template sections. Missing "Enhancement:", "Reason:", and "Result:" headers as defined in .github/pull_request_template.md. Reformat the PR description to follow the template with Enhancement, Reason, Result, and optional Issue Tracker Tickets sections.
Description check ❓ Inconclusive The PR description explains the purpose and reason (RAID tests fail frequently) but doesn't follow the template structure with all required sections. Consider following the template with Enhancement, Reason, and Result sections for consistency with repository standards.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title follows the Conventional Commits format with type 'test' and a clear description about adding debugging for raid tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 10.16%. Comparing base (59fd1c6) to head (102d95e).
⚠️ Report is 147 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (59fd1c6) and HEAD (102d95e). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (59fd1c6) HEAD (102d95e)
sanity 1 0
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   16.54%   10.16%   -6.39%     
==========================================
  Files           2        9       +7     
  Lines         284     2056    +1772     
  Branches       79        0      -79     
==========================================
+ Hits           47      209     +162     
- Misses        237     1847    +1610     
Flag Coverage Δ
sanity ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/tasks/debug.yml`:
- Around line 2-41: The playbook's debug tasks can abort on undefined variables
or command failures; update the Play "Gather debug information" (shell) to avoid
set -e (remove the -e flag or add "set +e" after setup) or make each diagnostic
command non-fatal (append "|| true" or wrap commands so failures don't exit),
and keep changed_when: false; update the RAID-related when checks to guard
against undefined vars by using "is defined" or defaults (e.g., replace
conditions referencing storage_test_volume.type and storage_test_volume._device
with checks like "storage_test_volume is defined and storage_test_volume.type ==
'raid' and (storage_test_volume._device is defined and
storage_test_volume._device | length > 0)" or use
"storage_test_volume._raw_device | default('') | length > 0" if you intend to
check _raw_device), fix the mismatch by making the when inspect the same field
used by the command (use _raw_device when checking before calling mdadm with
_raw_device), and make the mdadm command tasks and the /dev/md/... detail task
non-fatal by adding ignore_errors: true or failed_when: false so diagnostics run
best-effort.

In `@tests/tests_create_raid_pool_then_remove.yml`:
- Around line 204-219: The task title says "Create a RAID1 lvm raid device" but
the storage_pools entry sets raid_level: raid0, causing misleading logs; update
either the task name or the storage_pools configuration so they match—e.g.,
change raid_level to raid1 in the storage_pools > volumes entry (or rename the
task to RAID0) and ensure the raid_disks and expected behavior in the
surrounding tasks reference the same RAID level (check the storage_pools,
volumes, raid_disks, and raid_level symbols).

In `@tests/tests_raid_volume_options.yml`:
- Around line 18-25: The task title "Create a RAID0 device mounted on \"{{
mount_location }}\"" is inconsistent with the vars where
storage_volumes[0].raid_level is "raid1"; update either the task name or the
raid_level so they match (e.g., change the task title to mention RAID1 or set
storage_volumes[0].raid_level to "raid0") and ensure the referenced variable
names (storage_volumes, name: test1, raid_level) remain correct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a316fac1-fac0-43da-b725-64cf05c866bf

📥 Commits

Reviewing files that changed from the base of the PR and between 4dc4779 and a5a51c0.

📒 Files selected for processing (5)
  • tests/tasks/debug.yml
  • tests/tests_create_raid_pool_then_remove.yml
  • tests/tests_create_raid_volume_then_remove.yml
  • tests/tests_raid_pool_options.yml
  • tests/tests_raid_volume_options.yml

Comment thread tests/tasks/debug.yml Outdated
Comment thread tests/tests_create_raid_pool_then_remove.yml Outdated
Comment thread tests/tests_raid_volume_options.yml Outdated
@richm richm force-pushed the test-catch-raid-errors branch 2 times, most recently from e793a16 to 7bd50a1 Compare June 4, 2026 14:03
@richm

richm commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

[citest]

@richm richm force-pushed the test-catch-raid-errors branch from 7bd50a1 to 463b0c8 Compare June 4, 2026 20:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/tasks/debug.yml`:
- Line 6: The diagnostic script currently uses `set -euxo pipefail`, causing
early exit when commands like `mdadm --detail "$dev"` or `pvs`/`lvs` fail; make
the task non-fatal so subsequent diagnostics still run by either adding
`failed_when: false` to this Ansible task in tests/tasks/debug.yml or by
suppressing failures for the specific commands (e.g., append `|| true` or
temporarily disable `-e` for `mdadm --detail "$dev"` and the `pvs`/`lvs`
invocations) so failures are tolerated and the task completes on a best-effort
basis.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9cd8f9ca-56d9-4efe-b62b-c7f3f7026c0b

📥 Commits

Reviewing files that changed from the base of the PR and between 7bd50a1 and 463b0c8.

📒 Files selected for processing (5)
  • tests/tasks/debug.yml
  • tests/tests_create_raid_pool_then_remove.yml
  • tests/tests_create_raid_volume_then_remove.yml
  • tests/tests_raid_pool_options.yml
  • tests/tests_raid_volume_options.yml
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/tests_create_raid_volume_then_remove.yml
  • tests/tests_raid_volume_options.yml
  • tests/tests_raid_pool_options.yml
  • tests/tests_create_raid_pool_then_remove.yml

Comment thread tests/tasks/debug.yml Outdated
@richm

richm commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

[citest]

4 similar comments
@richm

richm commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

[citest]

@richm

richm commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

[citest]

@richm

richm commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

[citest]

@richm

richm commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

[citest]

richm added 5 commits June 23, 2026 17:34
The raid tests are the ones which most often fail with flakes.  Add
some debugging to try to find out why they fail.

Signed-off-by: Rich Megginson <rmeggins@redhat.com>
@richm richm force-pushed the test-catch-raid-errors branch from e7c2341 to 102d95e Compare June 23, 2026 23:34
@vojtechtrefny

vojtechtrefny commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

This is a bug somewhere in blivet. The problem is we don't recognize the LUKS device is actually unlocked so we don't close it before trying to stop and remove the MD array. I am not sure why blivet doesn't know the LUKS is opened (in some cases), I have some theories but no fix yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants