This repository was archived by the owner on Apr 27, 2022. It is now read-only.

If master pod is destroyed and recreated, it takes ages for worker pods to timeout #84

@bleggett

I was testing what happens when the Spark master pod is killed.

What I observed:

  1. A new Master pod is created by Kube (good)
  2. The existing Worker pods notice that the old Master pod is gone and try to reconnect (good)
  3. The existing worker pods keep trying to reconnect using the old Master pod IP (bad, the new pod has a new IP)
  4. The worker pods eventually exhaust their reconnection attempts and exit after a long period of time (not great)
  5. As the worker pods slowly do this, they are restarted by Kube and are able to connect to the new Master pod (good).

Is there a way to handle steps 3 & 4 in a better way, so that a failure of the Master pod doesn't render the Spark cluster inoperable for a long period?

Or is this not an oshinko problem, but a problem with one of these:

  • Spark
  • openshift-spark

Info appreciated!
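For reference, here is a minimal sketch of the kind of mitigation I had in mind for step 3: front the master with a stable Kubernetes Service so workers are given a DNS name rather than a pod IP. This only helps if the workers actually re-resolve that name when they reconnect, which is part of what I'm asking. The service name `spark-master` and the `app=spark-master` selector below are placeholders, not whatever oshinko actually creates.

```python
# Sketch only: uses the official "kubernetes" Python client to give the Spark
# master a stable Service name, so workers can be pointed at
# spark://spark-master:7077 instead of a literal pod IP.
from kubernetes import client, config


def create_master_service(namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
    core = client.CoreV1Api()

    service = client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=client.V1ObjectMeta(name="spark-master"),
        spec=client.V1ServiceSpec(
            # Placeholder selector: must match the labels on the master pod.
            selector={"app": "spark-master"},
            ports=[client.V1ServicePort(port=7077, target_port=7077)],
        ),
    )
    core.create_namespaced_service(namespace=namespace, body=service)


if __name__ == "__main__":
    create_master_service()
```

With such a Service in place, workers would be started against `spark://spark-master:7077`, so a recreated master pod keeps the same address from the workers' point of view, assuming the worker does a fresh DNS lookup on each reconnect attempt rather than reusing the cached IP.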
