This repository was archived by the owner on Apr 27, 2022. It is now read-only.

If master pod is destroyed and recreated, it takes ages for worker pods to timeout #84

@bleggett

I was testing what happens when the Spark master pod is killed.

What I observed:

  1. A new Master pod is created by Kube (good)
  2. The existing Worker pods notice that the old Master pod is gone and try to reconnect (good)
  3. The existing worker pods keep trying to reconnect using the old Master pod IP (bad, the new pod has a new IP)
  4. The worker pods eventually exhaust their reconnection attempts and exit after a long period of time (not great)
  5. As the worker pods slowly do this, they are restarted by Kube and are able to connect to the new Master pod (good).

Is there a way to handle steps 3 & 4 in a better way, so that a failure of the Master pod doesn't render the Spark cluster inoperable for a long period?

Or is this not an oshinko problem, but a problem with one of these:

  • Spark
  • openshift-spark

Info appreciated!
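For reference, here is a minimal sketch of the kind of mitigation I had in mind for step 3: front the master with a stable Kubernetes Service so workers are given a DNS name rather than a pod IP. This only helps if the workers actually re-resolve that name when they reconnect, which is part of what I'm asking. The service name `spark-master` and the `app=spark-master` selector below are placeholders, not whatever oshinko actually creates.

```python
# Sketch only: uses the official "kubernetes" Python client to give the Spark
# master a stable Service name, so workers can be pointed at
# spark://spark-master:7077 instead of a literal pod IP.
from kubernetes import client, config


def create_master_service(namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
    core = client.CoreV1Api()

    service = client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=client.V1ObjectMeta(name="spark-master"),
        spec=client.V1ServiceSpec(
            # Placeholder selector: must match the labels on the master pod.
            selector={"app": "spark-master"},
            ports=[client.V1ServicePort(port=7077, target_port=7077)],
        ),
    )
    core.create_namespaced_service(namespace=namespace, body=service)


if __name__ == "__main__":
    create_master_service()
```

With such a Service in place, workers would be started against `spark://spark-master:7077`, so a recreated master pod keeps the same address from the workers' point of view, assuming the worker does a fresh DNS lookup on each reconnect attempt rather than reusing the cached IP.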
