Added pre-upgrade check for defect CSCwt69100 #385
5ba0fb4 to a66b043
Harinadh-Saladi
left a comment
I see you have attached only the full script run logs. Can you also attach the logs for your function alone? Please add logs for the PASS, FAIL, and NA scenarios.
Since Lovkesh has confirmed that lab FAIL logs are not required (the issue cannot be recreated because the repro rate is very low), please attach the CU log as FAIL evidence.
| Due to [CSCwt69100][68], a stale `dbgacEpgSummaryTask` object stuck in `processing` state with empty content can cause the policymgr process to crash on all APICs during an upgrade or process restart. | ||
| Impact: | ||
| Affected versions: version <= 6.1(5e) or version <= 6.2(1g). |
Please remove the affected versions; there is no need to mention them here.
| Affected versions: version <= 6.1(5e) or version <= 6.2(1g). | ||
| When upgrading Apic from versions prior to 6.0(4c) to versions 6.0(4c) or later, if there is a misconfiguration in the inband management policies (mgmtRsInBStNode) with invalid values, the re-processing triggered by [CSCwh80837][67] will expose the underlying [CSCwd40071][68] defect. This results in continuous policyelem core dumps and switch reboot if Switch are running impacted version of [CSCwd40071][68]. | ||
| The check queries for `dbgacEpgSummaryTask` objects with `operSt="processing"` and `startTs` older than 24 hours. Such objects are considered stale and unexpected. If found, delete them before proceeding with the upgrade to prevent policymgr from crashing on restart. |
There is no need to describe the check and what it does. Just provide the appropriate recommended action.
| The [CSCwd40071][68] defect affects versions 5.2(5c) and later with a fix available in 6.0(1g). However, the issue will only be triggered during Apic upgrades crossing 6.0(4c) due to [CSCwh80837][67]. | ||
| [0]: https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script | ||
| [68]: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt69100 |
Please remove this line, add it at the end, and correct the reference number from 68 to 67.
Reference [68] (CSCwt69100), which was placed next to [0], has already been removed. It is now at the end as [70], after [69]. The number is 70 rather than 67 because, after rebasing onto v4.2.0-dev, references 67, 68, and 69 are already used by other checks in the base branch.
| The svccoreCtrlr and svccoreNode objects represent core files related to Apic and Leaf/Spines process respectively. | ||
| Due to [CSCws84232][67], the APIC GUI may become unresponsive after login, with dashboards stuck in a continuous “Loading…”state. | ||
| Due to [CSCws84232][67], the APIC GUI may become unresponsive after login, with dashboards stuck in a continuous "Loading…"state. |
Please revert this if it was changed by mistake.
| [0]: https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script | ||
| [70]: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt69100 |
Please remove it from here; it needs to be added at the end, after [69].
| @pytest.mark.parametrize( | ||
| "tversion, icurl_outputs, expected_result, expected_data", |
Please add a test case for a missing tversion.
| # Case 1: Target version 6.2(2a) is beyond both affected ranges (6.1(5e) and 6.2(1g)). | ||
| # The target binary has the fix so version gate fails. Expected: NA without any API calls. | ||
| ( | ||
| "6.2(2a)", |
Please change version 6.2(2a); it doesn't exist. Use an existing CCO version instead.
| ], | ||
| ), | ||
| ], | ||
| ) |
Please add test cases for the following:
- A stale object that has existed for exactly 24 hours
- A combination of one object older than 24 hours (e.g. 25 hours) and one younger than 24 hours (e.g. 23 hours 59 minutes)
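The requested boundary cases could be laid out roughly as below. This is only a sketch, not the PR's test code: `is_stale`, `BOUNDARY_CASES`, `stale_ids`, and the fixed `NOW` clock are hypothetical names, and the strict comparison is assumed from the `task_dt < threshold` line in the diff, which means an object that is exactly 24 hours old would not be reported.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # 24-hour limit taken from the check description


def is_stale(start_dt, now):
    """Hypothetical helper: True only if the task started strictly more than 24h before `now`."""
    return start_dt < now - STALE_AFTER


# Fixed clock (assumption) so the boundary cases are repeatable.
NOW = datetime(2025, 1, 2, 12, 0, 0, tzinfo=timezone.utc)

# (case id, age of the task, expected "stale" verdict)
BOUNDARY_CASES = [
    ("exactly_24h", timedelta(hours=24), False),
    ("25h", timedelta(hours=25), True),
    ("23h59m", timedelta(hours=23, minutes=59), False),
]


def stale_ids(ages, now=NOW):
    """Return the indexes of the ages that count as stale (used for the combo case)."""
    return [i for i, age in enumerate(ages) if is_stale(now - age, now)]
```

The `BOUNDARY_CASES` tuples are shaped so they could be passed to `pytest.mark.parametrize`; whether the exactly-24-hours case should instead be flagged is a decision for the check's author.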
| except ValueError: | ||
| continue | ||
| if task_dt < threshold: | ||
| data.append([dn, start_ts]) |
I don't see node_id in the output. Please add it so we know on which node the issue was encountered.
node_id is not available in the object's attributes or DN. The DN is already unique enough to identify and delete the specific object.
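For context, the filtering loop discussed in this thread can be sketched as follows. It is a minimal illustration under stated assumptions, not the code in the PR: `find_stale_tasks` is a hypothetical name, `startTs` is assumed to be an ISO 8601 string that `datetime.fromisoformat` accepts, timestamps without an offset are assumed to be UTC, and the input is assumed to use the usual `{class: {"attributes": {...}}}` shape of a class query response.

```python
from datetime import datetime, timedelta, timezone


def find_stale_tasks(tasks, now, max_age=timedelta(hours=24)):
    """Return [dn, startTs] rows for tasks that started more than `max_age` before `now`."""
    threshold = now - max_age
    data = []
    for task in tasks:
        attrs = task["dbgacEpgSummaryTask"]["attributes"]
        dn, start_ts = attrs["dn"], attrs["startTs"]
        try:
            task_dt = datetime.fromisoformat(start_ts)
        except ValueError:
            # Unparseable timestamp: skip the object, as the diff does.
            continue
        if task_dt.tzinfo is None:
            # Assumption: timestamps without an offset are treated as UTC.
            task_dt = task_dt.replace(tzinfo=timezone.utc)
        if task_dt < threshold:
            # Only the DN and start time are reported; no node_id is available.
            data.append([dn, start_ts])
    return data
```

Each output row carries only the DN and the start timestamp, which matches the reply above that the DN alone identifies the object to delete.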
Summary:
- This PR adds a new validation check: Stale dbgacEpgSummaryTask Objects.
- The check detects stale `dbgacEpgSummaryTask` objects stuck in `processing` state that can cause policymgr to crash on all APICs during upgrade (CSCwt69100).

What Changed:
- `stale_epg_summary_task_check` in `aci-preupgrade-validation-script.py`
- `docs/docs/validations.md`
- `tests/checks/stale_epg_summary_task_check/`

Check Behavior:
- `N/A` if target version is not in affected range (<= 6.1(5e) or <= 6.2(1g))
- `PASS` if no `dbgacEpgSummaryTask` objects found in `processing` state
- `PASS` if objects found but `startTs` is within 24 hours
- `FAIL_O` if any `dbgacEpgSummaryTask` object has `startTs` older than 24 hours

Test Results:
Pytest:
CSCwt69100_Pytest_FullRun_Logs.txt
Full run (tversion 6.1(5e) and 6.2(2a)):
CSCwt69100_FullRun_logs.txt
Script Run Logs (PASS / NA / FAIL):
CSCwt69100_PASS:FAIL:NA_logs.txt
Note: Added FAIL log from CU as lab repro is not feasible.