[fix](dns-cache) evict hostnames that fail to resolve repeatedly #63363

Open
zhaorongsheng wants to merge 2 commits into apache:master from zhaorongsheng:fix-dnscache-evict

@zhaorongsheng
Contributor

Proposed changes

Issue Number: close #63358

DNSCache currently never evicts an entry once it has been inserted. When a backend host is permanently dropped from the cluster and its DNS record is removed, every other BE in the cluster keeps:

  1. logging `failed to get ip from host: <removed>` every 60 s, indefinitely (from `DNSCache::_refresh_cache` -> `hostname_to_ipv4`);
  2. handing the stale cached IP back to brpc through `BrpcClientCache` / `ClientCache`, which then keeps emitting `Fail to wait EPOLLOUT ... Connection timed out` at the brpc socket layer.

This PR adds a simple consecutive-failure counter to DNSCache. When the counter reaches a configurable threshold, the entry is removed from the cache so that callers no longer get a stale IP and the refresh thread stops logging about it. WARNING logs for the same host are also throttled to avoid flooding be.WARNING.
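
The counter-and-evict behavior described above can be sketched roughly as follows. This is not the PR's actual code: `DnsCacheSketch`, its injected `Resolver`, and the method names are hypothetical, and locking and the background thread are omitted. Only the rules come from the description: count consecutive failures, reset on success, evict at the threshold, and treat a threshold `<= 0` as "never evict".

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for DNSCache; names are illustrative, not the PR's.
class DnsCacheSketch {
    struct Entry {
        std::string ip;
        int failures = 0; // consecutive resolution failures
    };

public:
    // Returns the resolved IP, or std::nullopt when resolution fails.
    using Resolver = std::function<std::optional<std::string>(const std::string&)>;

    DnsCacheSketch(Resolver resolver, int max_consecutive_failures)
            : _resolver(std::move(resolver)), _max_failures(max_consecutive_failures) {}

    // Insert or overwrite an entry, as a first successful lookup would.
    void put(const std::string& host, const std::string& ip) {
        _entries[host] = Entry{ip, 0};
    }

    // Cached IP for host, or std::nullopt if absent or already evicted.
    std::optional<std::string> get(const std::string& host) const {
        auto it = _entries.find(host);
        if (it == _entries.end()) return std::nullopt;
        return it->second.ip;
    }

    // One pass of the periodic refresh over every cached hostname.
    void refresh() {
        for (auto it = _entries.begin(); it != _entries.end();) {
            std::optional<std::string> ip = _resolver(it->first);
            if (ip) {
                it->second.ip = *ip;
                it->second.failures = 0; // success clears the counter
                ++it;
                continue;
            }
            ++it->second.failures;
            // A threshold <= 0 disables eviction (legacy behavior).
            if (_max_failures > 0 && it->second.failures >= _max_failures) {
                it = _entries.erase(it); // erase returns the next valid iterator
            } else {
                ++it; // keep serving the stale IP for now
            }
        }
    }

private:
    Resolver _resolver;
    int _max_failures;
    std::unordered_map<std::string, Entry> _entries;
};
```

Injecting the resolver keeps the sketch testable without real DNS; the real cache would call its own resolution helper and guard the map with a lock.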

Configs introduced

| Name | Type | Default | Behavior |
|------|------|---------|----------|
| `dns_cache_max_consecutive_failures` | `mInt32` | 30 | Evict a hostname after this many consecutive resolution failures. At the default 60 s refresh interval, that means ~30 minutes of grace. Set <= 0 to disable eviction (legacy behavior). |
| `dns_cache_log_every_n_failures` | `mInt32` | 60 | Throttle the `Failed to resolve ... use cached ip` warning to once per N failures per hostname. Set <= 1 to log every failure (legacy behavior). |

Both are mutable so operators can tune without restarting BE.
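
To make the two knobs concrete, here is a small sketch of the throttling predicate and the grace-period arithmetic. Both helper names are hypothetical, and the exact throttling rule is an assumption: the description only says "once per N failures", so this version logs the first failure in a streak and then every N-th one after it. The PR may count differently.

```cpp
#include <cstdint>

// Hypothetical helper. consecutive_failures is 1-based: 1 for the first
// failure in a streak. Assumed policy: log failure 1, N+1, 2N+1, ...
inline bool should_log_failure(int consecutive_failures, int log_every_n) {
    if (log_every_n <= 1) return true; // <= 1 means log every failure (legacy)
    return consecutive_failures % log_every_n == 1;
}

// Approximate time a dead hostname stays cached before eviction, in seconds.
// Returns 0 when eviction is disabled (threshold <= 0).
inline std::int64_t eviction_grace_seconds(int max_consecutive_failures,
                                           int refresh_interval_s) {
    if (max_consecutive_failures <= 0) return 0;
    return static_cast<std::int64_t>(max_consecutive_failures) * refresh_interval_s;
}
```

With the defaults, `eviction_grace_seconds(30, 60)` is 1800 s, matching the ~30 minutes quoted above. Note that at the defaults the eviction threshold (30) is below the logging interval (60), so how many warnings appear before eviction depends on where in the streak the implementation chooses to log.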

Backward compatibility

Setting `dns_cache_max_consecutive_failures = 0` and `dns_cache_log_every_n_failures = 1` exactly reproduces the pre-PR behavior. A successful resolution clears the failure counter, so transient DNS hiccups don't accumulate across hours. There are no public API or wire format changes.

Further comments

  • I deliberately did not touch the FE-side DNSCache.java. If the same fix is wanted on FE, happy to send a follow-up PR.
  • The eviction threshold default (30) is conservative; please push back if you'd prefer a smaller or larger default. Operators with very flaky DNS can tune it via the mutable config without restarting BE.

Checklist

DNSCache currently never evicts an entry once it has been inserted.
When a backend host is permanently dropped from the cluster and its
DNS record is removed, every other BE in the cluster keeps:

  1. logging `failed to get ip from host: <removed>` every 60s,
     indefinitely (from DNSCache::_refresh_cache -> hostname_to_ipv4);
  2. handing the stale cached IP back to brpc through
     BrpcClientCache / ClientCache, which then keeps emitting
     `Fail to wait EPOLLOUT ... Connection timed out` at the brpc
     socket layer.

This commit adds a consecutive-failure counter to DNSCache. When the
counter reaches a configurable threshold, the entry is removed from
the cache so that callers no longer get a stale IP and the refresh
thread stops logging about it. WARNING logs for the same host are
also throttled to avoid flooding be.WARNING.

New BE configs (both mutable):
  - dns_cache_max_consecutive_failures (default 30): evict after this
    many consecutive resolution failures. Set <= 0 to disable
    (legacy behavior).
  - dns_cache_log_every_n_failures (default 60): throttle the
    `use cached ip` warning. Set <= 1 to log every failure
    (legacy behavior).

Successful resolution clears the failure counter, so transient DNS
hiccups don't accumulate.

Issue: apache#63358
Signed-off-by: zhaorongsheng <zhaorongsheng@corp.netease.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zhaorongsheng
Contributor Author

/review

Signed-off-by: zhaorongsheng <zhaorongsheng@corp.netease.com>
@zhaorongsheng
Contributor Author

/review

Development

Successfully merging this pull request may close these issues.

[Bug] DNSCache never evicts unresolvable hostnames after a BE is dropped, causing be.WARNING flood and persistent brpc EPOLLOUT timeout

2 participants