Enable five-second TCP keep-alive by default #550

rschmitt · 2025-08-16T23:57:20Z

This setting prevents connections from going idle for longer than five seconds. Overriding the operating system defaults seems to be a best practice:

https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-reusing-connections.html

This prevents a failure mode where the network silently drops the connection, resulting in a stale connection in the connection pool that the client has no way of knowing is stale:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

Connections that are dropped in this way create a significant risk of a subsequent request failure, and the exception thrown in this case may or may not be safely retriable. Since TCP keep-alive is extremely cheap, and the failures caused by not using it are quite opaque, I think it makes sense to enable it by default. Traditionally this was not possible in a pure Java client, but the Java ecosystem now offers support for this option pretty much everywhere.

michael-o

It is not upto us to decide whether people want keep alive by default. I was once analyzing such issues in a GitLab runner behind AWS NAT and luckily no one did fiddle with the defaults. I was rather quickly able to identify the problem.

No default value is ideal. I'd leave as-is.

rschmitt · 2025-08-18T15:56:05Z

I was rather quickly able to identify the problem.

What was the problem?

michael-o · 2025-08-19T08:41:20Z

I was rather quickly able to identify the problem.

What was the problem?

We have migrated to a new GitLab instance on AWS which I expected to be technically identical. My jobs with a connection pool in py-requests started failing. After fiddling for two days I managed to figure out that all connections which have been idle for more than 5 min were broken upon acquiring. The operator enabled SO_KEEPALIVE on the host of the container runner and I reconfigured the pool to use those defaults. Problem was solved. The AWS NAT has an idle limit of 5 min of outbound connections.

Without the default we would have never known about this and wouldn't been able to help others as well.

rschmitt · 2025-08-20T17:34:47Z

@michael-o This is the exact problem that this PR is trying to prevent. The client's defaults should be preventing problems, not creating opportunities for manually debugging problems. If users see failing requests with the Apache client and not with some other client, they're going to conclude that the Apache client is broken, not that their client is misconfigured.

michael-o · 2025-08-20T18:25:24Z

@michael-o This is the exact problem that this PR is trying to prevent. The client's defaults should be preventing problems, not creating opportunities for manually debugging problems. If users see failing requests with the Apache client and not with some other client, they're going to conclude that the Apache client is broken, not that their client is misconfigured.

I concur. The change is deep inside. You are healing a symptom, not the cause for just one client and the entire set. Ideally, people know their setup, but on the contrary some will fail some not and it will be even harder to get a clear picture why only parts are working. If it wouldn't have been for the debugging the service owner and operator wouldn't know about and wouldn't be able to assist others.

rhernandez35 · 2025-08-20T21:06:21Z

What's wrong with healing symptoms? You're saying it's a desirable property that the client is not resilient in the face of other misconfigurations. I think that's plainly false: it should be resilient by default. The Apache client is not a mechanism for discovering systemic errors, it's an HTTP client and I want it to be a reliable one.

michael-o · 2025-08-20T21:11:22Z

What's wrong with healing symptoms? You're saying it's a desirable property that the client is not resilient in the face of other misconfigurations. I think that's plainly false: it should be resilient by default. The Apache client is not a mechanism for discovering systemic errors, it's an HTTP client and I want it to be a reliable one.

Healing symptoms will make you suffer on other ends in the long run and will stay forever.

rschmitt · 2025-08-23T17:47:20Z

@ok2c What are your thoughts on this proposal?

ok2c · 2025-08-24T08:48:50Z

@rschmitt @michael-o I honestly do not have enough experience with complex networks to have a strong opinion. I personally find the proposed defaults reasonable. We resisted the idea of using finite timeouts by default for a long time because it was not "pure" from the JRE perspective. Now with the hindsight that purist approach was wrong. It helps to have reasonable defaults. What we should do is tell the users to customize those based on their specific application requirements. I would also propose this change-set be moved to client and applied by the client builders at the client level and not globally at the core level.

rschmitt · 2025-08-25T17:28:41Z

OK, I suppose I can move these defaults into the client builder. The only other default being set in these classes is the socket timeout.

ok2c · 2025-08-25T19:11:05Z

OK, I suppose I can move these defaults into the client builder. The only other default being set in these classes is the socket timeout.

@rschmitt Finite socket timeout is something no one should object these days. TCP keep-alive appears to be more contentious

This setting prevents connections from going idle for longer than five seconds. Overriding the operating system defaults seems to be a best practice: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-reusing-connections.html This prevents a failure mode where the network silently drops the connection, resulting in a stale connection in the connection pool that the client has no way of _knowing_ is stale: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout Connections that are dropped in this way create a significant risk of a subsequent request failure, and the exception thrown in this case may or may not be safely retriable. Since TCP keep-alive is extremely cheap, and the failures caused by not using it are quite opaque, I think it makes sense to enable it by default. Traditionally this was not possible in a pure Java client, but the Java ecosystem now offers support for this option pretty much everywhere.

rschmitt · 2025-09-16T18:04:41Z

I set this aside for a bit and I'm just coming back to it now. I was looking into this suggestion:

I would also propose this change-set be moved to client and applied by the client builders at the client level and not globally at the core level.

I'm not so sure this is a good idea now. We'd have to implement two new implicit defaults. HttpAsyncClientBuilder would need a new default IOReactorConfig, and PoolingHttpClientConnectionManagerBuilder would need a new default Resolver<HttpRoute, SocketConfig>. My concern is that a user introducing customization of the IO reactor config or socket config would do so by customizing the default instances in core (IOReactorConfig.DEFAULT and SocketConfig.DEFAULT), and in so doing would inadvertently shut off TCP keep-alive, which may introduce other problems unrelated to the intended change.

I have a high degree of confidence that the settings in this PR are good defaults. Since opening this PR, these settings have been deployed to hundreds of thousands of machines without incident, and our runtime checks catch the remaining situations where these settings aren't supported (e.g. JDK8 on Windows, or Windows Server 2016).

ok2c

I have no objections.

michael-o requested changes Aug 18, 2025

View reviewed changes

ok2c force-pushed the master branch 4 times, most recently from 8fb1d8e to bf7d524 Compare September 13, 2025 16:38

rschmitt force-pushed the default branch from 358e0dc to 2bc0ba2 Compare September 16, 2025 17:45

ok2c approved these changes Sep 17, 2025

View reviewed changes

rschmitt merged commit 6f7ad40 into apache:master Sep 20, 2025
10 checks passed

rschmitt deleted the default branch September 20, 2025 19:24

Enable five-second TCP keep-alive by default #550

Enable five-second TCP keep-alive by default #550

Uh oh!

Conversation

rschmitt commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

michael-o left a comment

Choose a reason for hiding this comment

Uh oh!

rschmitt commented Aug 18, 2025

Uh oh!

michael-o commented Aug 19, 2025

Uh oh!

rschmitt commented Aug 20, 2025

Uh oh!

michael-o commented Aug 20, 2025

Uh oh!

rhernandez35 commented Aug 20, 2025

Uh oh!

michael-o commented Aug 20, 2025

Uh oh!

rschmitt commented Aug 23, 2025

Uh oh!

ok2c commented Aug 24, 2025

Uh oh!

rschmitt commented Aug 25, 2025

Uh oh!

ok2c commented Aug 25, 2025

Uh oh!

rschmitt commented Sep 16, 2025

Uh oh!

ok2c left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rschmitt commented Aug 16, 2025 •

edited

Loading