Skip to content

Retry logic doesn't match datastax recommendation #128

@therapy-lf

Description

@therapy-lf

As it says in https://docs.datastax.com/en/developer/java-driver/3.2/manual/retries/#retries-and-idempotence

  • retrying in onReadTimeout is always safe, since by definition this error indicates that the query was a read, which didn’t mutate any data;
  • similarly, onUnavailable is safe: the coordinator is telling us that it didn’t find enough replicas, so we know that it didn’t try to apply the query.
  • onWriteTimeout is not safe: some replicas failed to reply to the coordinator in time, but they might still have applied the mutation;
  • onRequestError is not safe either: the query might have been applied before the error occurred. In particular, an OperationTimedOutException could be caused by a network issue that prevented a successful response to come back to the client.

But looking into a code I see that it won't retry onUnavailable but in the same time it will on onWriteTimeout which is not safe:
https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L727-L764
https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/policies/retry/simple.lua#L38-L48

Also, it is not pretty clear where those timeouts come https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L752

So, I have a few questions:

  1. Could it be possible that retry logic is broken or I'm wrong?
  2. Is it safe to retry request when those unknown timeouts occur and from where they might come?

Currently, we switched off the retry mechanism by setting retry_on_timeout to false and max_retries to one.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions