I was testing to see what happens if the Spark master pod is killed.
What happens is:
1. A new Master pod is created by Kube (good).
2. The existing Worker pods notice that the old Master pod is gone and try to reconnect (good).
3. The existing Worker pods keep trying to reconnect using the old Master pod IP (bad, since the new pod has a new IP).
4. The Worker pods eventually exhaust their reconnection attempts and exit after a long period of time (not great).
5. As the Worker pods slowly do this, they are restarted by Kube and are able to connect to the new Master pod (good).
Is there a way to handle steps 3 & 4 in a better way, so that a failure of the Master pod doesn't render the Spark cluster inoperable for a long period?
Or is this not an oshinko problem, but a problem with
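For what it's worth, one mitigation I can imagine (just a guess on my part, not something I know oshinko does today) is to front the Master with a Kubernetes Service so the Workers point at a stable DNS name rather than a pod IP. A minimal sketch, where the `spark-master` name, the labels, and port 7077 are assumptions that would need to match the actual Master pod:

```yaml
# Hypothetical Service giving the Spark Master a stable address.
# The selector labels and port are assumptions; adjust to match the
# labels actually set on the Master pod in this deployment.
apiVersion: v1
kind: Service
metadata:
  name: spark-master
spec:
  selector:
    component: spark-master
  ports:
    - name: spark
      port: 7077
      targetPort: 7077
```

Workers would then be started against `spark://spark-master:7077` instead of `spark://<pod-ip>:7077`; whether the Worker re-resolves that name on each reconnect attempt is, I think, the crux of steps 3 & 4.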