Skip to content

Conversation

@rschmitt
Copy link
Contributor

@rschmitt rschmitt commented Aug 16, 2025

This setting prevents connections from going idle for longer than five seconds. Overriding the operating system defaults seems to be a best practice:

https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-reusing-connections.html

This prevents a failure mode where the network silently drops the connection, resulting in a stale connection in the connection pool that the client has no way of knowing is stale:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

Connections that are dropped in this way create a significant risk of a subsequent request failure, and the exception thrown in this case may or may not be safely retriable. Since TCP keep-alive is extremely cheap, and the failures caused by not using it are quite opaque, I think it makes sense to enable it by default. Traditionally this was not possible in a pure Java client, but the Java ecosystem now offers support for this option pretty much everywhere.

Copy link
Member

@michael-o michael-o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not upto us to decide whether people want keep alive by default. I was once analyzing such issues in a GitLab runner behind AWS NAT and luckily no one did fiddle with the defaults. I was rather quickly able to identify the problem.

No default value is ideal. I'd leave as-is.

@rschmitt
Copy link
Contributor Author

I was rather quickly able to identify the problem.

What was the problem?

@michael-o
Copy link
Member

I was rather quickly able to identify the problem.

What was the problem?

We have migrated to a new GitLab instance on AWS which I expected to be technically identical. My jobs with a connection pool in py-requests started failing. After fiddling for two days I managed to figure out that all connections which have been idle for more than 5 min were broken upon acquiring. The operator enabled SO_KEEPALIVE on the host of the container runner and I reconfigured the pool to use those defaults. Problem was solved. The AWS NAT has an idle limit of 5 min of outbound connections.

Without the default we would have never known about this and wouldn't been able to help others as well.

@rschmitt
Copy link
Contributor Author

@michael-o This is the exact problem that this PR is trying to prevent. The client's defaults should be preventing problems, not creating opportunities for manually debugging problems. If users see failing requests with the Apache client and not with some other client, they're going to conclude that the Apache client is broken, not that their client is misconfigured.

@michael-o
Copy link
Member

@michael-o This is the exact problem that this PR is trying to prevent. The client's defaults should be preventing problems, not creating opportunities for manually debugging problems. If users see failing requests with the Apache client and not with some other client, they're going to conclude that the Apache client is broken, not that their client is misconfigured.

I concur. The change is deep inside. You are healing a symptom, not the cause for just one client and the entire set. Ideally, people know their setup, but on the contrary some will fail some not and it will be even harder to get a clear picture why only parts are working. If it wouldn't have been for the debugging the service owner and operator wouldn't know about and wouldn't be able to assist others.

@rhernandez35
Copy link
Contributor

What's wrong with healing symptoms? You're saying it's a desirable property that the client is not resilient in the face of other misconfigurations. I think that's plainly false: it should be resilient by default. The Apache client is not a mechanism for discovering systemic errors, it's an HTTP client and I want it to be a reliable one.

@michael-o
Copy link
Member

What's wrong with healing symptoms? You're saying it's a desirable property that the client is not resilient in the face of other misconfigurations. I think that's plainly false: it should be resilient by default. The Apache client is not a mechanism for discovering systemic errors, it's an HTTP client and I want it to be a reliable one.

Healing symptoms will make you suffer on other ends in the long run and will stay forever.

@rschmitt
Copy link
Contributor Author

@ok2c What are your thoughts on this proposal?

@ok2c
Copy link
Member

ok2c commented Aug 24, 2025

@rschmitt @michael-o I honestly do not have enough experience with complex networks to have a strong opinion. I personally find the proposed defaults reasonable. We resisted the idea of using finite timeouts by default for a long time because it was not "pure" from the JRE perspective. Now with the hindsight that purist approach was wrong. It helps to have reasonable defaults. What we should do is tell the users to customize those based on their specific application requirements. I would also propose this change-set be moved to client and applied by the client builders at the client level and not globally at the core level.

@rschmitt
Copy link
Contributor Author

OK, I suppose I can move these defaults into the client builder. The only other default being set in these classes is the socket timeout.

@ok2c
Copy link
Member

ok2c commented Aug 25, 2025

OK, I suppose I can move these defaults into the client builder. The only other default being set in these classes is the socket timeout.

@rschmitt Finite socket timeout is something no one should object these days. TCP keep-alive appears to be more contentious

@ok2c ok2c force-pushed the master branch 4 times, most recently from 8fb1d8e to bf7d524 Compare September 13, 2025 16:38
This setting prevents connections from going idle for longer than five
seconds. Overriding the operating system defaults seems to be a best
practice:

https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-reusing-connections.html

This prevents a failure mode where the network silently drops the
connection, resulting in a stale connection in the connection pool that
the client has no way of _knowing_ is stale:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

Connections that are dropped in this way create a significant risk of a
subsequent request failure, and the exception thrown in this case may or
may not be safely retriable. Since TCP keep-alive is extremely cheap,
and the failures caused by not using it are quite opaque, I think it
makes sense to enable it by default. Traditionally this was not possible
in a pure Java client, but the Java ecosystem now offers support for
this option pretty much everywhere.
@rschmitt
Copy link
Contributor Author

I set this aside for a bit and I'm just coming back to it now. I was looking into this suggestion:

I would also propose this change-set be moved to client and applied by the client builders at the client level and not globally at the core level.

I'm not so sure this is a good idea now. We'd have to implement two new implicit defaults. HttpAsyncClientBuilder would need a new default IOReactorConfig, and PoolingHttpClientConnectionManagerBuilder would need a new default Resolver<HttpRoute, SocketConfig>. My concern is that a user introducing customization of the IO reactor config or socket config would do so by customizing the default instances in core (IOReactorConfig.DEFAULT and SocketConfig.DEFAULT), and in so doing would inadvertently shut off TCP keep-alive, which may introduce other problems unrelated to the intended change.

I have a high degree of confidence that the settings in this PR are good defaults. Since opening this PR, these settings have been deployed to hundreds of thousands of machines without incident, and our runtime checks catch the remaining situations where these settings aren't supported (e.g. JDK8 on Windows, or Windows Server 2016).

Copy link
Member

@ok2c ok2c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objections.

@rschmitt rschmitt merged commit 6f7ad40 into apache:master Sep 20, 2025
10 checks passed
@rschmitt rschmitt deleted the default branch September 20, 2025 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants