[fix](dns-cache) evict hostnames that fail to resolve repeatedly #63363
Open
zhaorongsheng wants to merge 2 commits into
DNSCache currently never evicts an entry once it has been inserted.
When a backend host is permanently dropped from the cluster and its
DNS record is removed, every other BE in the cluster keeps:
1. logging `failed to get ip from host: <removed>` every 60s,
   indefinitely (from DNSCache::_refresh_cache -> hostname_to_ipv4);
2. handing the stale cached IP back to brpc through
   BrpcClientCache / ClientCache, which then keeps emitting
   `Fail to wait EPOLLOUT ... Connection timed out` at the brpc
   socket layer.
This commit adds a consecutive-failure counter to DNSCache. When the
counter reaches a configurable threshold, the entry is removed from
the cache so that callers no longer get a stale IP and the refresh
thread stops logging about it. WARNING logs for the same host are
also throttled to avoid flooding be.WARNING.
New BE configs (both mutable):
- dns_cache_max_consecutive_failures (default 30): evict after this
  many consecutive resolution failures. Set <= 0 to disable
  (legacy behavior).
- dns_cache_log_every_n_failures (default 60): throttle the
  `use cached ip` warning. Set <= 1 to log every failure
  (legacy behavior).
Successful resolution clears the failure counter, so transient DNS
hiccups don't accumulate.
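
The counter logic described above can be sketched as follows. This is a minimal illustration and not the code from this PR: `FailureCountingCache`, its `Resolver` callback, and the `put`/`get`/`refresh` methods are hypothetical stand-ins for DNSCache and hostname_to_ipv4, and locking is omitted. It assumes C++17.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for DNSCache; names and layout are assumptions.
class FailureCountingCache {
public:
    // Returns the resolved IP, or std::nullopt when resolution fails.
    using Resolver = std::function<std::optional<std::string>(const std::string&)>;

    FailureCountingCache(Resolver resolver, int max_consecutive_failures)
            : _resolver(std::move(resolver)), _max_failures(max_consecutive_failures) {}

    void put(const std::string& host, const std::string& ip) { _entries[host] = Entry{ip, 0}; }

    std::optional<std::string> get(const std::string& host) const {
        auto it = _entries.find(host);
        if (it == _entries.end()) return std::nullopt;
        return it->second.ip;
    }

    // One refresh pass over every cached hostname.
    void refresh() {
        for (auto it = _entries.begin(); it != _entries.end();) {
            std::optional<std::string> ip = _resolver(it->first);
            if (ip) {
                it->second.ip = *ip;
                it->second.failures = 0;  // success clears the counter
                ++it;
            } else if (_max_failures > 0 && ++it->second.failures >= _max_failures) {
                it = _entries.erase(it);  // threshold reached: evict the entry
            } else {
                ++it;  // keep serving the cached IP (threshold <= 0 disables eviction)
            }
        }
    }

private:
    struct Entry {
        std::string ip;
        int failures = 0;
    };

    Resolver _resolver;
    int _max_failures;
    std::unordered_map<std::string, Entry> _entries;
};
```

With a threshold of 3, two failed refreshes followed by one success leave the entry in place with its counter back at zero; only three failures in a row remove it.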
Issue: apache#63358
Signed-off-by: zhaorongsheng <zhaorongsheng@corp.netease.com>
Proposed changes
Issue Number: close #63358
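DNSCache currently never evicts an entry once it has been inserted. When a backend host is permanently dropped from the cluster and its DNS record is removed, every other BE in the cluster keeps:
1. logging `failed to get ip from host: <removed>` every 60s, indefinitely (from DNSCache::_refresh_cache -> hostname_to_ipv4);
2. handing the stale cached IP back to brpc through BrpcClientCache / ClientCache, which then keeps emitting `Fail to wait EPOLLOUT ... Connection timed out` at the brpc socket layer.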
This PR adds a simple consecutive-failure counter to DNSCache. When the counter reaches a configurable threshold, the entry is removed from the cache so that callers no longer get a stale IP and the refresh thread stops logging about it. WARNING logs for the same host are also throttled to avoid flooding be.WARNING.
Configs introduced
Name: dns_cache_max_consecutive_failures
Type: mInt32
Default: 30
Behavior: Evict a hostname after this many consecutive resolution failures. At the default 60 s refresh interval, that means ~30 minutes of grace. Set <= 0 to disable eviction (legacy behavior).
────────────────────────────────────────
Name: dns_cache_log_every_n_failures
Type: mInt32
Default: 60
Behavior: Throttle the `Failed to resolve ... use cached ip` warning to once per N failures per hostname. Set <= 1 to log every failure (legacy behavior).
Both are mutable so operators can tune without restarting BE.
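
The arithmetic behind the two settings can be sketched as below. Both helpers are hypothetical and not taken from this PR. In particular, the choice of which failure in each group of N gets logged (here the first one) is an assumption, since the description only says "once per N failures".

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: should the failure_count-th consecutive failure
// (counting from 1) emit a warning?
inline bool should_log_failure(int failure_count, int log_every_n) {
    if (log_every_n <= 1) return true;        // legacy behavior: log every failure
    return failure_count % log_every_n == 1;  // assumed policy: 1st, (N+1)th, (2N+1)th, ...
}

// Hypothetical helper: seconds a failing hostname stays cached before eviction.
// Returns std::nullopt when eviction is disabled (max_failures <= 0).
inline std::optional<std::int64_t> eviction_grace_seconds(int max_failures,
                                                          int refresh_interval_seconds) {
    if (max_failures <= 0) return std::nullopt;
    return static_cast<std::int64_t>(max_failures) * refresh_interval_seconds;
}
```

At the defaults, `eviction_grace_seconds(30, 60)` gives 1800 seconds, which matches the roughly 30 minutes of grace quoted above.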
Backward compatibility
Setting dns_cache_max_consecutive_failures = 0 and dns_cache_log_every_n_failures = 1 reproduces the pre-PR behavior exactly. Successful resolution clears the failure counter, so transient DNS hiccups don't accumulate across hours. No public API or wire format changes.