
[Bug] Ceph Identification and Workflow (Rolling Update) #403

@proxforge

Description

Describe the bug

I have a 3-node PVE cluster without Ceph, but it is detected as having Ceph installed, and the rolling update then waits for Ceph tasks that can never complete, because Ceph is not installed.

The detection of whether Ceph is installed should be stricter or hardened so that it works reliably. There is another issue that is quite serious IF it is handled this way:

"[12:28:55] ⚠ Ceph still unknown after 120s — continuing, but verify cluster health after completion"

If the Ceph status is unknown, I personally would NOT update any further, as this can lead to downtime from losing Ceph quorum, or from only one copy of the data being available (Ceph always needs a minimum of 2 replicas to allow access to the data).
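The `conf_read_file` error in the log below suggests the probe ran the Ceph client on a node that has no `ceph.conf` at all. A stricter check could distinguish "Ceph absent" from "Ceph installed but unreachable" before deciding how to proceed. This is only a sketch of the requested behavior, not PegaProx's actual code; `ceph_state` and the config path check are assumptions:

```python
import json
import os
import subprocess

def ceph_state():
    """Classify Ceph as 'absent', 'unknown', or a health status string.

    Hypothetical helper: PegaProx's real detection logic is not shown
    in this report; this only sketches the stricter check requested.
    """
    # A PVE cluster without Ceph has no /etc/pve/ceph.conf, which matches
    # the ObjectNotFound('... conf_read_file') error in the log below.
    if not os.path.exists("/etc/pve/ceph.conf"):
        return "absent"
    try:
        out = subprocess.run(
            ["ceph", "status", "--format", "json"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    if out.returncode != 0:
        # Config exists but the cluster is unreachable: genuinely unknown,
        # which should halt the update rather than merely warn.
        return "unknown"
    return json.loads(out.stdout)["health"]["status"]  # e.g. "HEALTH_OK"
```

With a three-way result like this, a missing config file would never be reported as "Ceph unknown", and the 120 s wait loop would not run at all on a cluster without Ceph.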

Expected behavior

  • Correct identification of whether Ceph is used and active
  • No further updates or reboots if the Ceph status is not OK
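The second expectation amounts to a hard gate between nodes: proceed only when Ceph is either absent or fully healthy. A minimal sketch, assuming a state string like the one the log prints (`safe_to_continue` is a hypothetical name):

```python
def safe_to_continue(ceph_state: str) -> bool:
    """Gate for the rolling update, per the expected behavior above.

    Only proceed when Ceph is either not installed at all ('absent')
    or reports HEALTH_OK. 'unknown' and any WARN/ERR state must halt
    the run: rebooting another node in that situation risks losing
    Ceph quorum or dropping below the minimum of 2 data replicas.
    """
    return ceph_state in ("absent", "HEALTH_OK")
```

Under this rule the run in the log below would have stopped after training6 instead of moving on to training5 with Ceph still unknown.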

Environment

  • PegaProx Version: 0.9.9.3
  • Installation Method: website script
  • Behind Reverse Proxy? No
  • UI? Corporate

Steps to Reproduce

  1. Have a 3-node PVE cluster with no Ceph installed
  2. Exact reproduction unclear (perhaps the Ceph repo is enabled but Ceph is not installed?)
  3. Try to update using the corporate design
  4. See errors about Ceph not being OK, even though Ceph is not installed

Logs

"[12:22:30] Rolling update started
[12:22:30] Settings: skip_up_to_date=True, skip_evacuation=True, evacuation_timeout=1800s
[12:22:30] ⚠️ WARNING: VM evacuation disabled - VMs may be affected if update fails!
[12:22:30] Cluster state: 3/3 nodes online · quorum HELD
[12:22:30] Ceph: ✗ unknown · 0/0 OSDs up
[12:22:30] Ceph warnings: ceph probe failed: rc=1 Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
[12:22:30] === Processing training6 (1/3) ===
[12:22:30] Checking for available updates on training6...
[12:22:33] Found 10 updates available on training6
[12:22:33] Enabling maintenance mode on training6 (SKIP EVACUATION)
[12:22:34] ⚠️ Skipping VM evacuation - VMs remain on node
[12:22:34] Installing updates on training6
[12:22:34] Waiting for update task (timeout: 900s)...
[12:22:34] Update phase: init
[12:22:44] Update phase: apt_dist_upgrade
[12:23:44] Update phase: wait_online
[12:24:34] ✓ Updates installed
[12:24:34] Node training6 rebooting (timeout: 600s)...
[12:26:36] ⚠️ training6 did not go offline within 120s
[12:26:36] Waiting for training6 to come back online...
[12:26:36] ✓ training6 back online (0s)
[12:26:46] Disabling maintenance mode on training6
[12:26:46] → Ceph (if present): noout + norebalance cleared for training6
[12:26:47] Ceph after training6: ✗ unknown · 0/0 OSDs up
[12:26:47] Waiting up to 120s for Ceph to return to HEALTH_OK before next node…
[12:28:55] ⚠ Ceph still unknown after 120s — continuing, but verify cluster health after completion
[12:28:55] ✓ training6 updated successfully
[12:28:55] === Processing training5 (2/3) ===
[12:28:55] Checking for available updates on training5...
[12:28:58] Found 13 updates available on training5
[12:28:58] Enabling maintenance mode on training5 (SKIP EVACUATION)
[12:28:59] ⚠️ Skipping VM evacuation - VMs remain on node
[12:28:59] Installing updates on training5
[12:28:59] Waiting for update task (timeout: 900s)...
[12:28:59] Update phase: init
[12:29:09] Update phase: apt_dist_upgrade"

Checklist

  • I have searched existing issues to make sure this is not a duplicate
  • I am using the latest version of PegaProx

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
