Added pre-upgrade check for defect CSCwt69100 #385
Harinadh-Saladi
left a comment
I see you have attached only the full script run logs. Can you also attach the logs for your function alone? Please add logs for the PASS, FAIL, and NA scenarios.
Since Lovkesh has confirmed that lab-generated FAIL logs are not required (the issue cannot be recreated because the repro rate is very low), pls attach the CU log as FAIL evidence.
| Due to [CSCwt69100][68], a stale `dbgacEpgSummaryTask` object stuck in `processing` state with empty content can cause the policymgr process to crash on all APICs during an upgrade or process restart. | ||
| Impact: | ||
| Affected versions: version <= 6.1(5e) or version <= 6.2(1g). |
Pls remove the affected versions; there is no need to mention them here.
| Affected versions: version <= 6.1(5e) or version <= 6.2(1g). | ||
| When upgrading Apic from versions prior to 6.0(4c) to versions 6.0(4c) or later, if there is a misconfiguration in the inband management policies (mgmtRsInBStNode) with invalid values, the re-processing triggered by [CSCwh80837][67] will expose the underlying [CSCwd40071][68] defect. This results in continuous policyelem core dumps and switch reboot if Switch are running impacted version of [CSCwd40071][68]. | ||
| The check queries for `dbgacEpgSummaryTask` objects with `operSt="processing"` and `startTs` older than 24 hours. Such objects are considered stale and unexpected. If found, delete them before proceeding with the upgrade to prevent policymgr from crashing on restart. |
No need to describe the check and what it does. Just provide an appropriate recommended action.
| The [CSCwd40071][68] defect affects versions 5.2(5c) and later with a fix available in 6.0(1g). However, the issue will only be triggered during Apic upgrades crossing 6.0(4c) due to [CSCwh80837][67]. | ||
| [0]: https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script | ||
| [68]: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt69100 |
Pls remove this line and add it at the end, and also correct the number from 68 to 67.
Reference [68] (CSCwt69100), which sat next to [0], was already removed. It is now placed at the end as [70] (after [69]). The number is 70 rather than 67 because, after rebasing onto v4.2.0-dev, references 67, 68, and 69 are already used by other checks in the base branch.
| The svccoreCtrlr and svccoreNode objects represent core files related to Apic and Leaf/Spines process respectively. | ||
| Due to [CSCws84232][67], the APIC GUI may become unresponsive after login, with dashboards stuck in a continuous “Loading…”state. | ||
| Due to [CSCws84232][67], the APIC GUI may become unresponsive after login, with dashboards stuck in a continuous "Loading…"state. |
Pls revert this if it's done by mistake.
| [0]: https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script | ||
| [70]: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt69100 |
Pls remove it from here as it needs to be added at the end after [69]
| @pytest.mark.parametrize( | ||
| "tversion, icurl_outputs, expected_result, expected_data", |
Pls add a test case for a missing tversion.
| # Case 1: Target version 6.2(2a) is beyond both affected ranges (6.1(5e) and 6.2(1g)). | ||
| # The target binary has the fix so version gate fails. Expected: NA without any API calls. | ||
| ( | ||
| "6.2(2a)", |
Pls change the version 6.2(2a); it doesn't exist. Update it to an existing CCO version.
| ], | ||
| ), | ||
| ], | ||
| ) |
Pls add test cases for the following:
- Stale task exists for exactly 24 hrs
- Combination of a stale task older than 24 hrs (25 hrs) and a task younger than 24 hrs (e.g. 23 hrs 59 mins)
Added two new test cases: one for exactly 24h (PASS) and one combo of 25h + 23h59m (FAIL_O with only the 25h task reported).
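The boundary behavior discussed in this thread can be illustrated with a minimal sketch. It assumes the strict `task_dt < threshold` comparison visible in the diff excerpt, so a task that is exactly 24 hours old is not reported. The function name `find_stale_tasks`, the placeholder DNs, and the ISO 8601 timestamp format are assumptions made for illustration and are not taken from the PR.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # threshold taken from the review discussion


def find_stale_tasks(tasks, now):
    """Return [dn, startTs] rows for tasks strictly older than 24 hours.

    `tasks` is a list of (dn, start_ts) string pairs. The ISO 8601
    timestamp format is an assumption for this sketch.
    """
    threshold = now - STALE_AFTER
    rows = []
    for dn, start_ts in tasks:
        try:
            task_dt = datetime.fromisoformat(start_ts)
        except ValueError:
            continue  # skip timestamps that cannot be parsed
        if task_dt < threshold:  # strict: exactly 24h old is not stale
            rows.append([dn, start_ts])
    return rows


# Hypothetical sample data; the DNs are placeholders, not real APIC DNs.
now = datetime(2025, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
tasks = [
    ("task-exact-24h", (now - timedelta(hours=24)).isoformat()),
    ("task-25h", (now - timedelta(hours=25)).isoformat()),
    ("task-23h59m", (now - timedelta(hours=23, minutes=59)).isoformat()),
    ("task-bad-ts", "not-a-timestamp"),
]
stale = find_stale_tasks(tasks, now)
print(stale)
```

With this sample data only the 25-hour task is returned, which matches the two scenarios described in the reply: exactly 24h yields no findings, and the 25h + 23h59m combination reports only the 25h task.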
| except ValueError: | ||
| continue | ||
| if task_dt < threshold: | ||
| data.append([dn, start_ts]) |
I don't see node_id in the output. Pls add it so we know on which node the issue is encountered.
node_id is not available in the object's attributes or DN. The DN is already unique enough to identify and delete the specific object.
Harinadh-Saladi
left a comment
Address the comments
| Due to [CSCwt69100][70], a stale `dbgacEpgSummaryTask` object stuck in `processing` state with empty content can cause the policymgr process to crash on all APICs during an upgrade or process restart. | ||
| Delete any stale `dbgacEpgSummaryTask` objects before proceeding to prevent policymgr from crashing on restart. |
Pls update the recommended action with 'upgrade'.
| result = PASS | ||
| headers = ["DN", "Start Time"] | ||
| data = [] | ||
| recommended_action = "Delete the listed stale dbgacEpgSummaryTask objects to prevent policymgr crash." |
Pls update recommended action appropriately.
| Due to [CSCwt69100][70], a stale `dbgacEpgSummaryTask` object stuck in `processing` state with empty content can cause the policymgr process to crash on all APICs during an upgrade or process restart. | ||
| Delete any stale `dbgacEpgSummaryTask` objects before proceeding with the upgrade to prevent policymgr from crashing on restart. |
Pls remove "restart" here. We will see the issue through either an upgrade or a process restart. Pls don't include both together.
Updated the recommended action: removed "on restart" and kept only "upgrade".
Pls add pytest logs for your check as you added additional test cases.
Attached the pytest logs for the check.
| result = PASS | ||
| headers = ["DN", "Start Time"] | ||
| data = [] | ||
| recommended_action = "Delete the listed stale dbgacEpgSummaryTask objects before the upgrade to prevent policymgr from crashing." |
Is the user allowed to perform the deletion? Does the delete need root access? If yes, please advise the CU to follow the workaround, and share the workaround steps in the bug.
| rogue_ep_coop_exception_mac_check, | ||
| n9k_c9408_model_lem_count_check, | ||
| inband_management_policy_misconfig_check, | ||
| stale_epg_summary_task_check, |
The name should be stale_dbgacEpgSummaryTask_check.
| @check_wrapper(check_title="Stale dbgacEpgSummaryTask Objects") | ||
| def stale_epg_summary_task_check(tversion, **kwargs): |
Change the name to stale_dbgacEpgSummaryTask_check.
Summary:
- This PR adds a new validation check: Stale dbgacEpgSummaryTask Objects.
- The check detects stale `dbgacEpgSummaryTask` objects stuck in `processing` state that can cause policymgr to crash on all APICs during upgrade (CSCwt69100).

What Changed:
- `stale_epg_summary_task_check` in `aci-preupgrade-validation-script.py`
- `docs/docs/validations.md`
- `tests/checks/stale_epg_summary_task_check/`

Check Behavior:
- `N/A` if target version is not in affected range (<= 6.1(5e) or <= 6.2(1g))
- `PASS` if no `dbgacEpgSummaryTask` objects found in `processing` state
- `PASS` if objects found but `startTs` is within 24 hours
- `FAIL_O` if any `dbgacEpgSummaryTask` object has `startTs` older than 24 hours

Test Results:
Pytest Logs: Full Run:
CSCwt69100_Pytest_FullRun_Logs.txt
Pytest Logs: stale_epg_summary_task_check (8 test cases)
CSCwt69100_pytest.txt
Full run (tversion 6.1(5e) and 6.2(2a)):
CSCwt69100_FullRun_logs.txt
Script Run Logs (PASS / NA / FAIL):
CSCwt69100_PASS:FAIL:NA_logs.txt
Note: Added FAIL log from CU as lab repro is not feasible.
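
The Check Behavior list in the PR description can be read as a small decision procedure. The sketch below is one possible rendering of it, not the code from the PR: the function name `evaluate`, the dictionary shape of each task, the placeholder DNs, and the ISO 8601 timestamp format are all assumptions. The version gate is passed in as a plain boolean because the exact version comparison used by the script is not shown in this thread.

```python
from datetime import datetime, timedelta, timezone


def evaluate(target_affected, tasks, now):
    """Map the four documented outcomes to a (result, rows) pair.

    target_affected: True if the target version is in the affected range
                     (how that is computed is outside this sketch).
    tasks: list of dicts with hypothetical keys 'dn', 'operSt', 'startTs'.
    """
    if not target_affected:
        return "N/A", []  # version gate: nothing else is inspected

    threshold = now - timedelta(hours=24)
    rows = []
    for task in tasks:
        if task.get("operSt") != "processing":
            continue  # only tasks stuck in 'processing' are of interest
        try:
            started = datetime.fromisoformat(task["startTs"])
        except (KeyError, ValueError):
            continue  # missing or unparsable timestamp: ignore the task
        if started < threshold:
            rows.append([task.get("dn", ""), task["startTs"]])

    return ("FAIL_O" if rows else "PASS"), rows


# Hypothetical sample data; the DNs are placeholders.
now = datetime(2025, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
old = {"dn": "example/task-old", "operSt": "processing",
       "startTs": (now - timedelta(hours=25)).isoformat()}
fresh = {"dn": "example/task-fresh", "operSt": "processing",
         "startTs": (now - timedelta(hours=1)).isoformat()}

print(evaluate(False, [old], now))
print(evaluate(True, [], now))
print(evaluate(True, [fresh], now))
print(evaluate(True, [old, fresh], now))
```

Keeping the version gate first mirrors the comment in the test diff that an unaffected target should yield NA without any API calls.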