-
Notifications
You must be signed in to change notification settings - Fork 355
Enable five-second TCP keep-alive by default #550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
michael-o
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not upto us to decide whether people want keep alive by default. I was once analyzing such issues in a GitLab runner behind AWS NAT and luckily no one did fiddle with the defaults. I was rather quickly able to identify the problem.
No default value is ideal. I'd leave as-is.
What was the problem? |
We have migrated to a new GitLab instance on AWS which I expected to be technically identical. My jobs with a connection pool in py-requests started failing. After fiddling for two days I managed to figure out that all connections which have been idle for more than 5 min were broken upon acquiring. The operator enabled SO_KEEPALIVE on the host of the container runner and I reconfigured the pool to use those defaults. Problem was solved. The AWS NAT has an idle limit of 5 min of outbound connections. Without the default we would have never known about this and wouldn't been able to help others as well. |
|
@michael-o This is the exact problem that this PR is trying to prevent. The client's defaults should be preventing problems, not creating opportunities for manually debugging problems. If users see failing requests with the Apache client and not with some other client, they're going to conclude that the Apache client is broken, not that their client is misconfigured. |
I concur. The change is deep inside. You are healing a symptom, not the cause for just one client and the entire set. Ideally, people know their setup, but on the contrary some will fail some not and it will be even harder to get a clear picture why only parts are working. If it wouldn't have been for the debugging the service owner and operator wouldn't know about and wouldn't be able to assist others. |
|
What's wrong with healing symptoms? You're saying it's a desirable property that the client is not resilient in the face of other misconfigurations. I think that's plainly false: it should be resilient by default. The Apache client is not a mechanism for discovering systemic errors, it's an HTTP client and I want it to be a reliable one. |
Healing symptoms will make you suffer on other ends in the long run and will stay forever. |
|
@ok2c What are your thoughts on this proposal? |
|
@rschmitt @michael-o I honestly do not have enough experience with complex networks to have a strong opinion. I personally find the proposed defaults reasonable. We resisted the idea of using finite timeouts by default for a long time because it was not "pure" from the JRE perspective. Now with the hindsight that purist approach was wrong. It helps to have reasonable defaults. What we should do is tell the users to customize those based on their specific application requirements. I would also propose this change-set be moved to client and applied by the client builders at the client level and not globally at the core level. |
|
OK, I suppose I can move these defaults into the client builder. The only other default being set in these classes is the socket timeout. |
@rschmitt Finite socket timeout is something no one should object these days. TCP keep-alive appears to be more contentious |
8fb1d8e to
bf7d524
Compare
This setting prevents connections from going idle for longer than five seconds. Overriding the operating system defaults seems to be a best practice: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-reusing-connections.html This prevents a failure mode where the network silently drops the connection, resulting in a stale connection in the connection pool that the client has no way of _knowing_ is stale: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout Connections that are dropped in this way create a significant risk of a subsequent request failure, and the exception thrown in this case may or may not be safely retriable. Since TCP keep-alive is extremely cheap, and the failures caused by not using it are quite opaque, I think it makes sense to enable it by default. Traditionally this was not possible in a pure Java client, but the Java ecosystem now offers support for this option pretty much everywhere.
|
I set this aside for a bit and I'm just coming back to it now. I was looking into this suggestion:
I'm not so sure this is a good idea now. We'd have to implement two new implicit defaults. I have a high degree of confidence that the settings in this PR are good defaults. Since opening this PR, these settings have been deployed to hundreds of thousands of machines without incident, and our runtime checks catch the remaining situations where these settings aren't supported (e.g. JDK8 on Windows, or Windows Server 2016). |
ok2c
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no objections.
This setting prevents connections from going idle for longer than five seconds. Overriding the operating system defaults seems to be a best practice:
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-reusing-connections.html
This prevents a failure mode where the network silently drops the connection, resulting in a stale connection in the connection pool that the client has no way of knowing is stale:
https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
Connections that are dropped in this way create a significant risk of a subsequent request failure, and the exception thrown in this case may or may not be safely retriable. Since TCP keep-alive is extremely cheap, and the failures caused by not using it are quite opaque, I think it makes sense to enable it by default. Traditionally this was not possible in a pure Java client, but the Java ecosystem now offers support for this option pretty much everywhere.