config agent: reduce full config check frequency from 5s to 1m and compare hashes instead #3028
BEFORE: Simple polling every 5 seconds
AFTER: Hash-based polling (5s hash check, 5m full config fetch)
Here's a before-and-after comparison of total CPU usage by the Arista EOS ConfigAgent when running all tests in parallel.
Before change: (graph)
After change: (graph)
Note that peak usage is about the same, but steady-state usage (after all tests have run and the config agents are just polling) drops from about 400% (4 cores) to about 200% (2 cores) across 29 EOS containers.
// Find unknown BGP peers that need to be removed
peerFound := func(peer net.IP) bool {
    for _, tun := range device.Tunnels {
The old code looped over deviceForRender.Tunnels, which looks like it was a post-dedupe slice of tunnels. Should we still be doing that here?
}

// Response containing only the config hash
message ConfigHashResponse {
Why make a separate RPC for this as opposed to just checking the hash in ConfigResponse and throwing away the config payload if the last hash is the same? Other than bytes on the wire, doesn't it simplify things?
Nope, just bytes on the wire, but when we had packet loss from Singapore to the controller it may have helped.
CHANGELOG.md
- CLI
  - Remove restriction for a single tunnel per user; now a user can have a unicast and multicast tunnel concurrently (but can only be a publisher _or_ a subscriber) ([2728](https://github.com/malbeclabs/doublezero/pull/2728))
- Device agents
  - Reduce config agent network and CPU usage by checking config checksums every 5 seconds, and reducing full config check frequency to 1m
}

// GetConfigHash returns only the hash of the configuration for change detection
func (c *Controller) GetConfigHash(ctx context.Context, req *pb.ConfigRequest) (*pb.ConfigHashResponse, error) {
Can you add tests for this?
).Inc()

hash := sha256.Sum256([]byte(configStr))
getConfigDuration.Observe(float64(time.Since(reqStart).Seconds()))
This reuses the same histogram as the GetConfig RPC. We should create a new histogram for this RPC endpoint; otherwise the faster hash requests will skew the GetConfig latency profile downward.
…ture docs

Extract config generation logic into reusable functions:
- generateConfig() - renders device config with deduplication
- processConfigRequest() - validates request and finds unknown BGP peers

This refactoring prepares for adding GetConfigHash endpoint that will share the same config generation logic.

Also add architecture documentation with sequence diagram showing agent-controller communication flow.
…ange detection

Add new GetConfigHash RPC that returns only the SHA256 hash of the device configuration (64 bytes) instead of the full config (~50KB). This enables agents to efficiently check for config changes without transferring the full configuration on every poll.

Changes:
- Add GetConfigHash RPC to controller.proto
- Implement GetConfigHash() handler that reuses processConfigRequest()
- Add controller_grpc_getconfighash_requests_total metric
- Regenerate protobuf code
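As a rough illustration of the hashing step described in this commit, here is a minimal sketch of producing a 64-character digest from a rendered config string. The helper name `configHash` and the sample config strings are assumptions made for illustration; the PR's actual handler and wire encoding may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash is a hypothetical helper (not taken from the PR). It returns
// the hex-encoded SHA-256 digest of a rendered config, which is always
// 64 characters long regardless of the config size.
func configHash(config string) string {
	sum := sha256.Sum256([]byte(config))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative config strings only.
	a := configHash("hostname dz1\n")
	b := configHash("hostname dz2\n")

	// The digest length is fixed, and different configs yield different digests.
	fmt.Println(len(a), a != b) // prints: 64 true
}
```

Because the digest size is constant, the cost of a poll no longer depends on how large the device config is.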
…meout

Replace aggressive 5-second full config polling with hash-based change detection. The agent now:
- Checks config hash every 5 seconds (64 bytes)
- Only fetches and applies full config when hash changes
- Forces full config check after timeout (default 60s) as safety net

This dramatically reduces:
- Network bandwidth (99%+ when config unchanged)
- EOS device load (no config application when unchanged)
- Agent CPU (hash computed only when fetching new config)

Add --config-cache-timeout-in-seconds flag to control the forced full config check interval.

Refactor main loop:
- Split pollControllerAndConfigureDevice into focused functions
- Add computeChecksum() helper for SHA256 hashing
- Add fetchConfigFromController() to get config and compute hash
- Add applyConfig() to apply config to EOS device
- Rename variables: cachedConfigHash, configCacheTime, configCacheTimeout

Add GetConfigHashFromServer() client function to call new gRPC endpoint.
Summary
Changes
Controller
Agent
Testing Verification