Exclude decomissioning nodes when opening new shards, using gRPC stream #6166

Open
ncoiffier-celonis wants to merge 6 commits into quickwit-oss:main from ncoiffier-celonis:fix-ingestion-gap-when-decomissioning-node-gRPC

@ncoiffier-celonis

Description

Attempt to fix #6158

This PR:

  • enriches the ControlPlaneModel to maintain a list of decommissioning indexers
  • uses a gRPC stream to notify the control plane when an indexer starts decommissioning
  • filters out decommissioning nodes when opening new shards, rebalancing, or scaling up shards

This is an alternative approach to #6165 that uses a gRPC stream instead of chitchat to propagate the decommissioning status to the control plane.
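
To make the filtering step concrete, here is a minimal sketch of the idea, not the actual code in this PR. The struct below is a simplified, hypothetical stand-in for the real ControlPlaneModel. The field and method names (`decommissioning_indexers`, `mark_decommissioning`, `remove_node`, `eligible_indexers`) are assumptions made for illustration, and node IDs are plain strings.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the control-plane model.
/// The real ControlPlaneModel holds much more state than this.
#[derive(Default)]
struct ControlPlaneModel {
    /// IDs of indexers that have announced they are decommissioning (assumed field).
    decommissioning_indexers: HashSet<String>,
}

impl ControlPlaneModel {
    /// Records that a node has started decommissioning, for example after
    /// the control plane receives a status update from that node.
    fn mark_decommissioning(&mut self, node_id: &str) {
        self.decommissioning_indexers.insert(node_id.to_string());
    }

    /// Forgets a node once it has left the cluster.
    fn remove_node(&mut self, node_id: &str) {
        self.decommissioning_indexers.remove(node_id);
    }

    /// Returns the candidates that may still receive new shards,
    /// preserving the order of the input slice.
    fn eligible_indexers<'a>(&self, candidates: &'a [String]) -> Vec<&'a str> {
        candidates
            .iter()
            .map(String::as_str)
            .filter(|node_id| !self.decommissioning_indexers.contains(*node_id))
            .collect()
    }
}
```

In this sketch, the same `eligible_indexers` filter would be applied wherever shards get placed (opening, rebalancing, scaling up), so a decommissioning node is never chosen as a target.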

Any feedback is welcome!!

How was this PR tested?

In addition to the unit and integration tests, I ran it against a local cluster with 2 indexers and observed that the number of errors reported in #6158 dropped from a few hundred to zero.

Other considerations

I also considered these 3 approaches:

  • re-using the indexer state (i.e. READY/NOT_READY, by adding a DRAINING state), but an indexer needs to stay ready to successfully complete the decommissioning process
  • using the shard status itself in the decommissioning routine, but the changes were much more "spaghetti", and I couldn't quite get them working
  • using chitchat to propagate the ingester status: see "Exclude decomissioning nodes when opening new shards, using chitchat" (#6165)

If we want to de-risk this change, we could put it behind a feature flag or config property.

@guilload
Member

Hello @ncoiffier-celonis, I was also working on something similar on the branch guilload/ingester-status, but got stalled. Wanna take a look at that and come up with an approach that would combine the best of the 3 candidates that we have right now?

@ncoiffier-celonis
Author

@guilload Thank you for the answer. I'll look into your branch and come back to you if anything is unclear or if I have questions.

Development

Successfully merging this pull request may close these issues.

Indexer graceful shutdown causes ingestion gap and 500 errors "no shards available"

2 participants