config agent: reduce full config check frequency from 5s to 1m and compare hashes instead #3028

Draft
nikw9944 wants to merge 4 commits into main from nikw/config-agent-cpu

nikw9944 commented Feb 17, 2026

Summary

  • Instead of applying config to the device every 5s, apply it only when the config hash changes, or when it has been 60s since it was last applied.
  • Implement hash-based config polling to reduce network overhead by 99%+ when configs are unchanged
  • See comment below for call sequence before and after this change (new sequence has been added to controller's README.md)
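
As a rough check on the bandwidth figure above, here is a back-of-the-envelope sketch in Go. It uses only the payload sizes quoted in this PR (~50KB full config, 64-byte hash) and ignores gRPC and TCP framing, so the numbers are estimates. The function names are made up for illustration. Under these assumptions, hash-only polls save roughly 99.9% per minute, and the saving is closer to 91.5% once one forced full fetch per minute is counted.

```go
package main

import "fmt"

// Payload sizes taken from the PR description. Framing overhead is
// ignored, so these are rough estimates only.
const (
	fullConfigBytes = 50 * 1024 // "~50KB" full config
	hashBytes       = 64        // hash-only response
	pollsPerMinute  = 12        // one poll every 5s
)

// bytesPerMinuteBefore models the old loop: a full config on every poll.
func bytesPerMinuteBefore() int {
	return pollsPerMinute * fullConfigBytes
}

// bytesPerMinuteAfter models the new loop with an unchanged config:
// a hash on every poll plus forcedFetches full configs per minute.
func bytesPerMinuteAfter(forcedFetches int) int {
	return pollsPerMinute*hashBytes + forcedFetches*fullConfigBytes
}

// reductionPercent returns how much smaller after is than before.
func reductionPercent(before, after int) float64 {
	return 100 * (1 - float64(after)/float64(before))
}

func main() {
	before := bytesPerMinuteBefore()
	fmt.Printf("hash polls only:       %.1f%%\n", reductionPercent(before, bytesPerMinuteAfter(0)))
	fmt.Printf("with one forced fetch: %.1f%%\n", reductionPercent(before, bytesPerMinuteAfter(1)))
}
```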

Changes

Controller

  • Refactor GetConfig by extracting config generation into reusable helper functions (commit 1)
  • Add GetConfigHash gRPC endpoint that returns only the SHA256 hash (64 bytes) instead of the full config (~50KB)
  • Add controller_grpc_getconfighash_requests_total metric
  • Add architecture documentation with sequence diagram
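
For reference, a minimal sketch of the hashing step. It assumes the hash is sent as a hex string, which would explain the 64-byte size (a raw SHA256 digest is 32 bytes). The helper name `configHash` is hypothetical and is not taken from the controller code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash returns the SHA256 digest of a rendered config as a
// lowercase hex string. The raw digest is 32 bytes, so the hex form is
// always 64 characters regardless of config size.
// Hex encoding is an assumption; the PR only states "64 bytes".
func configHash(config string) string {
	sum := sha256.Sum256([]byte(config))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := configHash("interface Tunnel500\n   description example\n")
	b := configHash("interface Tunnel501\n   description example\n")
	fmt.Println(len(a), a != b) // prints "64 true"
}
```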

Agent

  • Refactor main loop
  • Implement a simple caching scheme for device config and config hash
  • Replace 5-second full config polling with hash polling, fetch full config only on changes
  • Always re-apply the full config once the cache timeout (default 60s) expires, even if the hash is unchanged

Testing Verification

  • Unit test updates
  • No functionality has changed so e2e tests should run as-is

nikw9944 linked an issue Feb 17, 2026 that may be closed by this pull request
nikw9944 changed the title from "agent: reduce network and CPU usage by reducing full config check frequency from 5s to 1m and comparing config hashes instead" to "agent: reduce full config check frequency from 5s to 1m and compare config hashes instead" Feb 17, 2026
nikw9944 self-assigned this Feb 20, 2026
nikw9944 force-pushed the nikw/config-agent-cpu branch from 7260b33 to e81a21e February 20, 2026 19:41

nikw9944 commented Feb 21, 2026

BEFORE: Simple polling every 5 seconds

┌─────────┐                 ┌────────────┐                 ┌────────────┐                  ┌─────────┐
│  Agent  │                 │ Controller │                 │ Controller │                  │   EOS   │
│  main() │                 │ GetConfig()│                 │  Config    │                  │ Device  │
│         │                 │   (gRPC)   │                 │  Generator │                  │         │
└────┬────┘                 └─────┬──────┘                 └─────┬──────┘                  └────┬────┘
     │                            │                              │                              │
     │ Every 5s:                  │                              │                              │
     │                            │                              │                              │
     │ GetBgpNeighbors()          │                              │                              │
     ├─────────────────────────────────────────────────────────────────────────────────────────►│
     │◄─────────────────────────────────────────────────────────────────────────────────────────┤
     │ [peer IPs]                 │                              │                              │
     │                            │                              │                              │
     │ GetConfigFromServer()      │                              │                              │
     ├───────────────────────────►│                              │                              │
     │                            │ processConfigRequest()       │                              │
     │                            ├─────────────────────────────►│                              │
     │                            │                              │ generateConfig()             │
     │                            │                              │  • deduplicateTunnels()      │
     │                            │                              │  • renderConfig()            │
     │                            │                              │    (~50KB config text)       │
     │                            │◄─────────────────────────────┤                              │
     │                            │ [config string]              │                              │
     │◄───────────────────────────┤                              │                              │
     │ ConfigResponse             │                              │                              │
     │ {config: "..."}            │                              │                              │
     │                            │                              │                              │
     │ AddConfigToDevice(config)  │                              │                              │
     ├─────────────────────────────────────────────────────────────────────────────────────────►│
     │◄─────────────────────────────────────────────────────────────────────────────────────────┤
     │ [config applied]           │                              │                              │
     │                            │                              │                              │
     │ sleep(5s)                  │                              │                              │
     │ goto top                   │                              │                              │
     │                            │                              │                              │

AFTER: Hash-based polling (5s hash check, 1m forced full config fetch)

┌─────────┐                 ┌────────────┐                 ┌────────────┐                  ┌─────────┐
│  Agent  │                 │ Controller │                 │ Controller │                  │   EOS   │
│  main() │                 │GetConfigHash                 │  Config    │                  │ Device  │
│         │                 │ GetConfig()│                 │  Generator │                  │         │
└────┬────┘                 └─────┬──────┘                 └─────┬──────┘                  └────┬────┘
     │                            │                              │                              │
     │ Every 5s:                  │                              │                              │
     │                            │                              │                              │
     │ GetBgpNeighbors()          │                              │                              │
     ├─────────────────────────────────────────────────────────────────────────────────────────►│
     │◄─────────────────────────────────────────────────────────────────────────────────────────┤
     │ [peer IPs]                 │                              │                              │
     │                            │                              │                              │
     │ Decision: should fetch?    │                              │                              │
     │  • First run (no hash)?    │                              │                              │
     │  • 1m since last apply?    │                              │                              │
     │  • Hash changed?           │                              │                              │
     │                            │                              │                              │
     │ GetConfigHashFromServer()  │                              │                              │
     ├───────────────────────────►│                              │                              │
     │                            │ processConfigRequest()       │                              │
     │                            ├─────────────────────────────►│                              │
     │                            │                              │ generateConfig()             │
     │                            │                              │  • deduplicateTunnels()      │
     │                            │                              │  • renderConfig()            │
     │                            │                              │ SHA256(config)               │
     │                            │◄─────────────────────────────┤                              │
     │                            │ [hash only]                  │                              │
     │◄───────────────────────────┤                              │                              │
     │ ConfigHashResponse         │                              │                              │
     │ {hash: "abc123..."}        │                              │                              │
     │ (64 bytes)                 │                              │                              │
     │                            │                              │                              │
     │ Compare: hash != lastHash? │                              │                              │
     │                            │                              │                              │
     ├─── if YES (or first run or 1m timeout):                                                  │
     │                            │                              │                              │
     │    fetchConfigFromController()                            │                              │
     │    ├─► GetConfigFromServer()                              │                              │
     │    │   ──────────────────► │                              │                              │
     │    │                       │ processConfigRequest()       │                              │
     │    │                       ├─────────────────────────────►│                              │
     │    │                       │                              │ generateConfig()             │
     │    │                       │                              │  • deduplicateTunnels()      │
     │    │                       │                              │  • renderConfig()            │
     │    │                       │                              │    (entire config text)      │
     │    │                       │◄─────────────────────────────┤                              │
     │    │   ◄──────────────────│ [config string]               │                              │
     │    │   ConfigResponse      │                              │                              │
     │    │   {config: "..."}     │                              │                              │
     │    │                       │                              │                              │
     │    ├─► computeChecksum(config)                            │                              │
     │    │   [local SHA256]      │                              │                              │
     │    │                       │                              │                              │
     │    └─► return config+hash  │                              │                              │
     │                            │                              │                              │
     │    applyConfig()           │                              │                              │
     │    └─► AddConfigToDevice(config)                          │                              │
     │        ─────────────────────────────────────────────────────────────────────────────────►│
     │        ◄─────────────────────────────────────────────────────────────────────────────────┤
     │        [config applied]    │                              │                              │
     │                            │                              │                              │
     │    lastChecksum = hash     │                              │                              │
     │    lastApplyTime = now     │                              │                              │
     │                            │                              │                              │
     ├─── else: skip this cycle (hash unchanged, no work needed) │                              │
     │                            │                              │                              │
     │ sleep(5s)                  │                              │                              │
     │ goto top                   │                              │                              │
     │                            │                              │                              │

nikw9944 force-pushed the nikw/config-agent-cpu branch 5 times, most recently from 6fa1ee4 to 2ab7f72 February 24, 2026 21:03

nikw9944 commented Feb 27, 2026

Here's a before and after comparison of total CPU usage by Arista EOS ConfigAgent when running all tests in parallel.

Before change:

     1400 +---------------------------------------------------------------------------------------------------------+
          |                 +               A+                 +                 +                +                 |
          |                                 *                                                                       |
     1200 |-+                              **                                                                     +-|
          |                                * *                                                                      |
          |                               *  *                                                                      |
          |                               A  *                                                                      |
     1000 |-+                             *  *                                                                    +-|
          |                               *   *                                                                     |
          |                               *   *                                                                     |
      800 |-+                            *    A        A                                                          +-|
          |                              *     A       *                                                            |
CPU %     |                              *      *A    * *                                                           |
      600 |-+                            *        *   * A                                                         +-|
          |                              *        *   *  *                                                          |
          |                              *         A *    A  *A                *A*AA*A*A*A*AA*A*A*A*AA*A*A*A*A      |
      400 |-+                           *           *A     *A  *AA*A*A*A*AA*A*A                                   +-|
          |                             *                                                                           |
          |                             A                                                                           |
          |                            *                                                                            |
      200 |-+                         A                                                                           +-|
          |                           *                                                                             |
          |                 +        *       +                 +                 +                +                 |
        0 +---------------------------------------------------------------------------------------------------------+
          0                100              200               300               400              500               600
                                                       Elapsed (seconds)

After change:

     1400 +---------------------------------------------------------------------------------------------------------+
          |                 +                +     A           +                 +                +                 |
          |                                        **                                                               |
     1200 |-+                           A         * *                                                             +-|
          |                             *         *  *                                                              |
          |                            * *        *  A                                                              |
          |                            * *        *   *                                                             |
     1000 |-+                          *  *      *    *                                                           +-|
          |                           *   A      A     *                                                            |
          |                           A   *      *     A                                                            |
      800 |-+                         *    *    *       A                                                         +-|
          |                           *    *    *        *A                                                         |
CPU %     |                           *     **A *          *                                                        |
      600 |-+                         *     A * *           A                                                     +-|
          |                          *         *             *A                                                     |
          |                          *         A               *A *A                                                |
      400 |-+                        *                           A  *A*A                                          +-|
          |                          *                                  *AA*A*A                                     |
          |                        A*A                                         *A*AA*A*A*A                          |
          |                       *                                                       *A*AA*A*A*A*AA*A*A        |
      200 |-+                    A                                                                                +-|
          |                     *                                                                                   |
          |                 +   *            +                 +                 +                +                 |
        0 +---------------------------------------------------------------------------------------------------------+
          0                100              200               300               400              500               600
                                                       Elapsed (seconds)

Note that the peak usage is about the same, but the steady-state usage (after all tests have run and the config agents are just polling) drops from about 400% (4 cores) to about 200% (2 cores) across 29 EOS containers.

nikw9944 force-pushed the nikw/config-agent-cpu branch 2 times, most recently from b80e2f9 to c8454d9 February 27, 2026 21:10
nikw9944 changed the title from "agent: reduce full config check frequency from 5s to 1m and compare config hashes instead" to "config agent: reduce full config check frequency from 5s to 1m and compare config hashes instead" Feb 27, 2026
nikw9944 changed the title from "config agent: reduce full config check frequency from 5s to 1m and compare config hashes instead" to "config agent: reduce full config check frequency from 5s to 1m and compare hashes instead" Feb 27, 2026
nikw9944 marked this pull request as ready for review February 27, 2026 21:13
nikw9944 requested a review from packethog February 27, 2026 21:13

// Find unknown BGP peers that need to be removed
peerFound := func(peer net.IP) bool {
for _, tun := range device.Tunnels {
Contributor

The old code looped over deviceForRender.Tunnels which looks like it was a post-dedupe slice of tunnels. Should we still be doing that here?

https://github.com/malbeclabs/doublezero/pull/3028/changes#diff-f5e715ed1377ac408a4ecde2c60cbcb98327512c7d1496114ef143cb27e7afafL725

}

// Response containing only the config hash
message ConfigHashResponse {
Contributor

Why make a separate RPC for this as opposed to just checking the hash in ConfigResponse and throwing away the config payload if the last hash is the same? Other than bytes on the wire, doesn't it simplify things?

Contributor Author

Nope, just bytes on the wire, but when we had packet loss from Singapore to the controller it may have helped.

CHANGELOG.md Outdated
- CLI
- Remove restriction for a single tunnel per user; now a user can have a unicast and multicast tunnel concurrently (but can only be a publisher _or_ a subscriber) ([2728](https://github.com/malbeclabs/doublezero/pull/2728))
- Device agents
- Reduce config agent network and CPU usage by checking config checksums every 5 seconds, and reducing full config check frquency to 1m
Contributor

s/frquency/frequency

}

// GetConfigHash returns only the hash of the configuration for change detection
func (c *Controller) GetConfigHash(ctx context.Context, req *pb.ConfigRequest) (*pb.ConfigHashResponse, error) {
Contributor

Can you add tests for this?

).Inc()

hash := sha256.Sum256([]byte(configStr))
getConfigDuration.Observe(float64(time.Since(reqStart).Seconds()))
Contributor

This is re-using the same histogram as the GetConfig RPC. We should create a new histogram for this RPC endpoint because otherwise it will skew the latency profile downward.

nikw9944 added 4 commits March 2, 2026 20:12
…ture docs

Extract config generation logic into reusable functions:
- generateConfig() - renders device config with deduplication
- processConfigRequest() - validates request and finds unknown BGP peers

This refactoring prepares for adding GetConfigHash endpoint that will
share the same config generation logic.

Also add architecture documentation with sequence diagram showing
agent-controller communication flow.

…ange detection

Add new GetConfigHash RPC that returns only the SHA256 hash of the
device configuration (64 bytes) instead of the full config (~50KB).

This enables agents to efficiently check for config changes without
transferring the full configuration on every poll.

Changes:
- Add GetConfigHash RPC to controller.proto
- Implement GetConfigHash() handler that reuses processConfigRequest()
- Add controller_grpc_getconfighash_requests_total metric
- Regenerate protobuf code

…meout

Replace aggressive 5-second full config polling with hash-based change
detection. The agent now:
- Checks config hash every 5 seconds (64 bytes)
- Only fetches and applies full config when hash changes
- Forces full config check after timeout (default 60s) as safety net

This dramatically reduces:
- Network bandwidth (99%+ when config unchanged)
- EOS device load (no config application when unchanged)
- Agent CPU (hash computed only when fetching new config)

Add --config-cache-timeout-in-seconds flag to control the forced full
config check interval.

Refactor main loop:
- Split pollControllerAndConfigureDevice into focused functions
- Add computeChecksum() helper for SHA256 hashing
- Add fetchConfigFromController() to get config and compute hash
- Add applyConfig() to apply config to EOS device
- Rename variables: cachedConfigHash, configCacheTime, configCacheTimeout

Add GetConfigHashFromServer() client function to call new gRPC endpoint.
nikw9944 force-pushed the nikw/config-agent-cpu branch from c8454d9 to ab051f9 March 2, 2026 20:31
nikw9944 marked this pull request as draft March 2, 2026 21:40

Development

Successfully merging this pull request may close these issues.

Reduce config agent resource consumption
