Added pre-upgrade check for defect CSCwt69100 #385

Open
Priyanka-Patil14 wants to merge 2 commits into datacenter:v4.2.0-dev from Priyanka-Patil14:CSCwt69100-fix-clean

Conversation

@Priyanka-Patil14 Priyanka-Patil14 commented May 15, 2026

Summary:

  • This PR adds a new validation check: Stale dbgacEpgSummaryTask Objects.
  • The check detects stale dbgacEpgSummaryTask objects stuck in the processing state, which can cause policymgr to crash on all APICs during an upgrade (CSCwt69100).

What Changed:

  • Added stale_epg_summary_task_check in aci-preupgrade-validation-script.py
  • Added validation documentation in docs/docs/validations.md
  • Added dedicated unit tests and test data under:
    tests/checks/stale_epg_summary_task_check/

Check Behavior:

  • Returns MANUAL if target version is missing
  • Returns N/A if target version is not in affected range (<= 6.1(5e) or <= 6.2(1g))
  • Returns PASS if no dbgacEpgSummaryTask objects found in processing state
  • Returns PASS if objects found but startTs is within 24 hours
  • Returns FAIL_O if any dbgacEpgSummaryTask object has startTs older than 24 hours
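
The decision order in the list above can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: `classify`, `target_is_affected`, and the `hypothetical/...` DNs are made-up names, the version gate is stubbed as a callable because the exact comparison is not shown here, and the strict "older than 24 hours" comparison is assumed from the wording of the last bullet.

```python
from datetime import datetime, timedelta

MANUAL, NA, PASS, FAIL_O = "MANUAL", "N/A", "PASS", "FAIL_O"
STALE_AFTER = timedelta(hours=24)


def classify(tversion, target_is_affected, tasks, now):
    """Return (result, data) following the bullet order above.

    tversion: target version string, or None when it was not supplied
    target_is_affected: hypothetical callable(version) -> bool (version gate)
    tasks: (dn, start_datetime) pairs for objects in the processing state
    now: current time as a datetime
    """
    if not tversion:
        return MANUAL, []
    if not target_is_affected(tversion):
        return NA, []
    threshold = now - STALE_AFTER
    # Only objects strictly older than the threshold count as stale.
    stale = [[dn, start.isoformat()] for dn, start in tasks if start < threshold]
    if stale:
        return FAIL_O, stale
    return PASS, []


now = datetime(2026, 5, 15, 12, 0, 0)
tasks = [
    ("hypothetical/dn-a", now - timedelta(hours=30)),  # stale
    ("hypothetical/dn-b", now - timedelta(hours=2)),   # recent
]
result, data = classify("6.1(5e)", lambda v: True, tasks, now)
```

Passing the current time and the version gate in as arguments keeps the sketch deterministic, which also makes the 24-hour boundary easy to exercise in unit tests.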

Test Results:

Note: Added the FAIL log from the CU, as a lab repro is not feasible.

@Priyanka-Patil14 Priyanka-Patil14 changed the base branch from v4.1.0-dev to v4.2.0-dev May 25, 2026 04:53
Contributor

@Harinadh-Saladi Harinadh-Saladi left a comment

I see you have attached only full-script run logs. Can you also attach logs for your function alone? Please add logs for the PASS, FAIL, and N/A scenarios.

Lovkesh has confirmed that a reproduced FAIL log is not required (the issue cannot be recreated reliably, since the repro rate is very low), so please attach the CU log as FAIL evidence.

Comment thread docs/docs/validations.md Outdated
Due to [CSCwt69100][68], a stale `dbgacEpgSummaryTask` object stuck in `processing` state with empty content can cause the policymgr process to crash on all APICs during an upgrade or process restart.

Impact:
Affected versions: version <= 6.1(5e) or version <= 6.2(1g).
Contributor

Please remove the affected versions; there is no need to mention them here.

Comment thread docs/docs/validations.md Outdated
Affected versions: version <= 6.1(5e) or version <= 6.2(1g).

When upgrading Apic from versions prior to 6.0(4c) to versions 6.0(4c) or later, if there is a misconfiguration in the inband management policies (mgmtRsInBStNode) with invalid values, the re-processing triggered by [CSCwh80837][67] will expose the underlying [CSCwd40071][68] defect. This results in continuous policyelem core dumps and switch reboot if Switch are running impacted version of [CSCwd40071][68].
The check queries for `dbgacEpgSummaryTask` objects with `operSt="processing"` and `startTs` older than 24 hours. Such objects are considered stale and unexpected. If found, delete them before proceeding with the upgrade to prevent policymgr from crashing on restart.
Contributor

No need to describe the check and what it does. Just provide the appropriate recommended action.

Comment thread docs/docs/validations.md Outdated
The [CSCwd40071][68] defect affects versions 5.2(5c) and later with a fix available in 6.0(1g). However, the issue will only be triggered during Apic upgrades crossing 6.0(4c) due to [CSCwh80837][67].

[0]: https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script
[68]: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt69100
Contributor

Please remove this line, add it at the end, and correct the number from 68 to 67.

Author

The reference [68] for CSCwt69100, which sat next to [0], has already been removed. It is now placed at the end as [70] (after [69]). The number is 70 rather than 67 because, after rebasing onto v4.2.0-dev, references 67, 68, and 69 are already used by other checks in the base branch.

Comment thread docs/docs/validations.md Outdated
The svccoreCtrlr and svccoreNode objects represent core files related to Apic and Leaf/Spines process respectively.

Due to [CSCws84232][67], the APIC GUI may become unresponsive after login, with dashboards stuck in a continuous Loading…state.
Due to [CSCws84232][67], the APIC GUI may become unresponsive after login, with dashboards stuck in a continuous "Loading…"state.
Contributor

Please revert this if it was done by mistake.

Comment thread docs/docs/validations.md Outdated


[0]: https://github.com/datacenter/ACI-Pre-Upgrade-Validation-Script
[70]: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt69100
Contributor

Please remove it from here; it needs to be added at the end, after [69].



@pytest.mark.parametrize(
"tversion, icurl_outputs, expected_result, expected_data",
Contributor

Please add a test case for a missing tversion.

# Case 1: Target version 6.2(2a) is beyond both affected ranges (6.1(5e) and 6.2(1g)).
# The target binary has the fix so version gate fails. Expected: NA without any API calls.
(
"6.2(2a)",
Contributor

Please change the version 6.2(2a); it doesn't exist. Use a version that exists on CCO.

],
),
],
)
Contributor

Please add test cases for the following:
  • A stale object that has existed for exactly 24 hours
  • A combination of one object older than 24 hours (e.g. 25 hours) and one younger than 24 hours (e.g. 23 hours 59 minutes)
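
The boundary cases requested here could look roughly like the sketch below. It is only an illustration with made-up names (`stale_dns`, `NOW`, the `hypothetical/...` DNs) and a fixed reference time; it assumes the strict `task_dt < threshold` comparison from the fragment under review, so an object that is exactly 24 hours old is not flagged.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=24)
NOW = datetime(2026, 5, 15, 12, 0, 0)  # fixed "current time" for the example


def stale_dns(tasks, now=NOW):
    """Return DNs whose start time is strictly older than MAX_AGE."""
    threshold = now - MAX_AGE
    return [dn for dn, start in tasks if start < threshold]


# Boundary: exactly 24 hours old is NOT stale under a strict comparison.
exactly_24h = [("hypothetical/dn-24h", NOW - timedelta(hours=24))]

# Combination: only the 25-hour-old object should be reported.
combo = [
    ("hypothetical/dn-25h", NOW - timedelta(hours=25)),
    ("hypothetical/dn-23h59m", NOW - timedelta(hours=23, minutes=59)),
]

boundary_result = stale_dns(exactly_24h)
combo_result = stale_dns(combo)
```

If the intent is for the exact 24-hour case to fail, the comparison would need to be `<=` instead. Either way, real test fixtures need timestamps computed relative to the test run or a mocked clock, since fixed `startTs` values drift away from the boundary over time.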

    except ValueError:
        continue
    if task_dt < threshold:
        data.append([dn, start_ts])
Contributor

I don't see node_id in the output. Please add it so we know on which node the issue was encountered.

Author

node_id is not available in the object's attributes or DN. The DN is already unique enough to identify and delete the specific object.
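
For context on the fragment discussed in this thread, the surrounding loop might look something like the sketch below, which reports each stale object by its DN and `startTs` only. This is a guess at the shape, not the PR's code: `collect_stale` and `TS_FORMAT` are hypothetical, the timestamp format is an assumption, and the example DNs are placeholders. The `%z` directive accepts offsets with a colon only on Python 3.7 or later.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout, e.g. "2026-05-13T12:00:00.000+00:00".
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def collect_stale(imdata, now, max_age=timedelta(hours=24)):
    """Collect [dn, startTs] rows for objects older than max_age."""
    threshold = now - max_age
    data = []
    for mo in imdata:
        attrs = mo["dbgacEpgSummaryTask"]["attributes"]
        dn, start_ts = attrs["dn"], attrs["startTs"]
        try:
            task_dt = datetime.strptime(start_ts, TS_FORMAT)
        except ValueError:
            continue  # skip entries whose timestamp cannot be parsed
        if task_dt < threshold:
            data.append([dn, start_ts])
    return data


now = datetime(2026, 5, 15, 12, 0, 0, tzinfo=timezone.utc)
imdata = [
    {"dbgacEpgSummaryTask": {"attributes": {
        "dn": "hypothetical/dn-old", "startTs": "2026-05-13T12:00:00.000+00:00"}}},
    {"dbgacEpgSummaryTask": {"attributes": {
        "dn": "hypothetical/dn-new", "startTs": "2026-05-15T11:00:00.000+00:00"}}},
    {"dbgacEpgSummaryTask": {"attributes": {
        "dn": "hypothetical/dn-bad", "startTs": "not-a-timestamp"}}},
]
rows = collect_stale(imdata, now)
```

Silently skipping unparsable timestamps mirrors the `except ValueError: continue` in the fragment; whether such entries should instead be surfaced to the user is a separate design question.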
