feat: add eks_scale_node_group playbook action (#223) #2089

Open
gerardrecinto wants to merge 4 commits into robusta-dev:master from gerardrecinto:feat/eks-scale-node-group-action

@gerardrecinto

Summary

Closes #223

Adds eks_scale_node_group — a playbook action that increases the maxSize of an EKS managed node group. Intended as a remediation step when the cluster autoscaler is blocked because the node group has reached its configured maximum.

New files:

  • playbooks/robusta_playbooks/aws_node_group_actions.py — action + params model
  • tests/test_aws_node_group_actions.py — 6 pytest unit tests

How it works

The action calls eks:DescribeNodegroup to read the current scaling config, then calls eks:UpdateNodegroupConfig to raise maxSize. minSize and desiredSize are left unchanged.
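The describe-then-update flow can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `raise_max_size` is a hypothetical helper name, and `eks` is assumed to be a boto3 EKS client (or any object exposing the same two methods).

```python
from typing import Any, Dict


def raise_max_size(eks: Any, cluster: str, nodegroup: str, new_max: int) -> Dict[str, int]:
    """Raise maxSize on a managed node group, keeping minSize and desiredSize."""
    # DescribeNodegroup returns the current scaling config under nodegroup.scalingConfig.
    current = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)
    scaling = current["nodegroup"]["scalingConfig"]

    # Resend min and desired unchanged so that only maxSize differs.
    updated = {
        "minSize": scaling["minSize"],
        "maxSize": new_max,
        "desiredSize": scaling["desiredSize"],
    }
    eks.update_nodegroup_config(
        clusterName=cluster,
        nodegroupName=nodegroup,
        scalingConfig=updated,
    )
    return updated
```

In real use the client would typically come from `boto3.client("eks", region_name="us-east-1")`; the sketch takes it as an argument so it can be exercised with a fake.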

Params:

| Param | Type | Description |
| --- | --- | --- |
| cluster_name | str | EKS cluster name |
| region | str | AWS region (e.g. us-east-1) |
| node_group_name | str | Managed node group name |
| new_max_size | int | New maxSize; must exceed the current value |
| aws_access_key_id | str (optional) | Falls back to instance role / environment |
| aws_secret_access_key | str (optional) | Falls back to instance role / environment |
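One way to read the optional credential params: explicit keys are passed to the client only when both are present, and otherwise boto3's default credential chain (environment, instance role) applies. The sketch below assumes that behavior; `EksNodeGroupParamsSketch` and `eks_client_kwargs` are hypothetical names, and a plain dataclass stands in for whatever base class the PR's params model actually uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class EksNodeGroupParamsSketch:
    # Hypothetical stand-in for the PR's params model; fields mirror the params listed above.
    cluster_name: str
    region: str
    node_group_name: str
    new_max_size: int
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None


def eks_client_kwargs(params: EksNodeGroupParamsSketch) -> Dict[str, Any]:
    """Build keyword arguments for boto3.client("eks", **kwargs)."""
    kwargs: Dict[str, Any] = {"region_name": params.region}
    # Pass explicit keys only when both are set; otherwise boto3 resolves
    # credentials from the environment or the instance role on its own.
    if params.aws_access_key_id and params.aws_secret_access_key:
        kwargs["aws_access_key_id"] = params.aws_access_key_id
        kwargs["aws_secret_access_key"] = params.aws_secret_access_key
    return kwargs
```

The resulting dict would be used as `boto3.client("eks", **eks_client_kwargs(params))`.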

Example playbook config:

triggers:
  - on_prometheus_alert:
      alert_name: KubeNodeNotReady
actions:
  - eks_scale_node_group:
      cluster_name: my-cluster
      region: us-east-1
      node_group_name: workers
      new_max_size: 10

Required IAM permissions: eks:DescribeNodegroup, eks:UpdateNodegroupConfig
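For reference, a policy document granting just these two actions might look like the output of the sketch below. The helper name `eks_scale_policy` is hypothetical, and the `"*"` resource is a placeholder assumption; in practice it would probably be narrowed to the ARN of the specific node group.

```python
import json
from typing import Any, Dict


def eks_scale_policy(resource: str = "*") -> Dict[str, Any]:
    """Return an IAM policy document allowing the two EKS calls the action makes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "eks:DescribeNodegroup",
                    "eks:UpdateNodegroupConfig",
                ],
                # Placeholder: scope this to the target node group where possible.
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(eks_scale_policy(), indent=2))
```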

Test plan

  • test_scale_up_succeeds: verifies update_nodegroup_config called with correct args and finding emitted
  • test_no_op_when_new_max_not_larger: new_max_size < current max → no update, enrichment message returned
  • test_no_op_when_new_max_is_equal: new_max_size == current max → no update
  • test_raises_on_describe_failure: ClientError on describe → ActionException raised, update never called
  • test_raises_on_update_failure: ClientError on update → ActionException raised
  • test_boto_client_uses_explicit_credentials: explicit key/secret passed through to boto3
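The no-op and success cases above can be tested without AWS access by mocking the client. The following is a sketch of that approach under stated assumptions: `scale_if_larger` is a hypothetical stand-in for the action's guard logic, not the function in the PR, and `make_client` is an illustrative helper.

```python
from unittest.mock import MagicMock


def scale_if_larger(eks, cluster, nodegroup, new_max):
    """Hypothetical stand-in for the action's guard; returns True if an update was sent."""
    described = eks.describe_nodegroup(clusterName=cluster, nodegroupName=nodegroup)
    scaling = described["nodegroup"]["scalingConfig"]
    if new_max <= scaling["maxSize"]:
        return False  # nothing to do: requested max does not exceed the current one
    eks.update_nodegroup_config(
        clusterName=cluster,
        nodegroupName=nodegroup,
        scalingConfig={**scaling, "maxSize": new_max},
    )
    return True


def make_client(current_max):
    """Build a mock EKS client whose describe call reports the given maxSize."""
    client = MagicMock()
    client.describe_nodegroup.return_value = {
        "nodegroup": {
            "scalingConfig": {"minSize": 1, "maxSize": current_max, "desiredSize": 2}
        }
    }
    return client


def test_no_op_when_new_max_is_equal():
    client = make_client(current_max=10)
    assert scale_if_larger(client, "my-cluster", "workers", 10) is False
    client.update_nodegroup_config.assert_not_called()


def test_scale_up_sends_update():
    client = make_client(current_max=5)
    assert scale_if_larger(client, "my-cluster", "workers", 10) is True
    client.update_nodegroup_config.assert_called_once_with(
        clusterName="my-cluster",
        nodegroupName="workers",
        scalingConfig={"minSize": 1, "maxSize": 10, "desiredSize": 2},
    )
```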

Adds a new playbook action that increases the maxSize of an EKS managed
node group via boto3. Designed as a remediation step when the cluster
autoscaler cannot provision nodes because the node group has reached its
configured maximum.

- EksNodeGroupParams: cluster_name, region, node_group_name, new_max_size,
  optional explicit AWS credentials (falls back to instance role/env)
- Guards against no-op updates (new_max_size <= current maxSize)
- Raises ActionException on AWS ClientError for describe or update calls
- Preserves existing minSize and desiredSize during the update
- Adds 6 pytest unit tests covering success, no-op, and error paths

Resolves robusta-dev#223
@CLAassistant

CLAassistant commented May 26, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai

coderabbitai Bot commented May 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ae5916db-c6a1-4d9c-be5a-5b7beda07f9b

📥 Commits

Reviewing files that changed from the base of the PR and between b107301 and a0adae6.

📒 Files selected for processing (1)
  • tests/test_aws_node_group_actions.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_aws_node_group_actions.py

Walkthrough

This PR adds a new EKS remediation action eks_scale_node_group that scales up managed node groups in Amazon EKS clusters. It defines the action parameters, implements the scaling logic with conditional updates, and includes comprehensive test coverage for success cases, no-op conditions, error handling, and credential wiring.

Changes

EKS Node Group Scaling

| Layer / File(s) | Summary |
| --- | --- |
| Action contract and core implementation: playbooks/robusta_playbooks/aws_node_group_actions.py | EksNodeGroupParams defines cluster name, region, node group name, new max size, and optional AWS credentials. The eks_scale_node_group action creates an EKS client, fetches current scaling configuration, returns early with enrichment if new max is not larger than current, otherwise updates the node group config while preserving min and desired sizes, logs the change, creates a finding, and wraps AWS client errors in ActionException. |
| Test fixtures, helpers, and validation cases: tests/test_aws_node_group_actions.py | Test module with constants, error factory, and mock event fixture. Tests validate successful scaling updates and finding creation, no-op behavior when new max is less than or equal to current max, ActionException raised on describe and update failures, and correct boto3 client instantiation with explicit AWS credentials. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a new EKS node group scaling playbook action. |
| Description check | ✅ Passed | The description is well-organized and explains the purpose, parameters, implementation details, and test plan, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | The PR fully implements the objective from issue #223 by adding a new playbook action for EKS managed node groups to increase their maximum size. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: the action implementation, parameters model, and comprehensive unit tests directly support the objective to increase EKS node pool size limits. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
playbooks/robusta_playbooks/aws_node_group_actions.py (1)

61-65: ⚡ Quick win

Preserve original exception context when re-raising.

Use explicit exception chaining in both except ClientError as e blocks (raise ... from e) so the root AWS error remains visible in tracebacks.

Proposed patch
     except ClientError as e:
         raise ActionException(
             ErrorCodes.ACTION_UNEXPECTED_ERROR,
             f"Failed to describe node group '{params.node_group_name}' "
             f"in cluster '{params.cluster_name}': {e}",
-        )
+        ) from e
@@
     except ClientError as e:
         raise ActionException(
             ErrorCodes.ACTION_UNEXPECTED_ERROR,
             f"Failed to update node group '{params.node_group_name}': {e}",
-        )
+        ) from e

Also applies to: 94-97
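To illustrate what `from e` changes: without it Python still records the original error as implicit context (`__context__`), but the traceback reads as if a second error happened while handling the first. With `from e`, the original is stored in `__cause__` and reported as the direct cause. The sketch below uses stand-in exception classes (hypothetical names) so it runs without botocore or Robusta installed.

```python
class ClientErrorSketch(Exception):
    """Stand-in for botocore.exceptions.ClientError."""


class ActionExceptionSketch(Exception):
    """Stand-in for Robusta's ActionException."""


def describe_or_raise(describe):
    """Call describe() and wrap a client error, keeping it as the explicit cause."""
    try:
        return describe()
    except ClientErrorSketch as e:
        # "from e" sets __cause__, so the traceback shows the AWS error as
        # "the direct cause of the following exception".
        raise ActionExceptionSketch(f"Failed to describe node group: {e}") from e


def failing_describe():
    raise ClientErrorSketch("AccessDeniedException")


if __name__ == "__main__":
    try:
        describe_or_raise(failing_describe)
    except ActionExceptionSketch as exc:
        print(type(exc.__cause__).__name__)
```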

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@playbooks/robusta_playbooks/aws_node_group_actions.py` around lines 61 - 65,
The except ClientError as e handlers that raise ActionException should preserve
the original exception context by using explicit exception chaining; locate the
raise statements that construct ActionException with
ErrorCodes.ACTION_UNEXPECTED_ERROR (the blocks referencing
params.node_group_name and params.cluster_name and the later similar block at
lines 94-97) and change the re-raise to use "raise ActionException(... ) from e"
so the original AWS ClientError is retained in the traceback.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dd5c2ac7-227d-4783-8f36-07774d6f876d

📥 Commits

Reviewing files that changed from the base of the PR and between f945b48 and a12181d.

📒 Files selected for processing (2)
  • playbooks/robusta_playbooks/aws_node_group_actions.py
  • tests/test_aws_node_group_actions.py

Development

Successfully merging this pull request may close these issues.

New playbook action to increase node pool size limits

2 participants