
Retry recursive message delivery on timeout#5

Open
tewinget wants to merge 37 commits into session-foundation:dev from tewinget:swarm-sync

Conversation


@tewinget (Collaborator) commented Mar 4, 2026

Builds on this Oxen PR, original description:


This adds a new thread that blocks on a queue of requests that failed to propagate to other nodes in the swarm. This happens when a recursive swarm request fails after the initial node has received the request. Those failures are stored on the node that initiated the request and retried periodically. Retries use an exponential back-off to avoid spamming, and requests are pruned once they exceed a specific age.

Most requests may be re-submitted well past their initial timestamp as long as they are signed. There are a few exceptions to this, such as delete-all and expire-all. These requests are still kept around and retried, but once a retried request returns with an error, the retry is dropped immediately. This applies to all requests, so the retry queue is always actively managed and pruned.
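As a rough sketch of the retry policy described above (a hypothetical illustration with invented names and placeholder constants, not the storage server's actual code):

```cpp
#include <chrono>

using namespace std::chrono;

// Exponential back-off: double the delay after each failed attempt,
// starting from a base delay and capped to avoid unbounded waits.
// The base, cap, and max-age values here are placeholders.
seconds retry_delay(int attempt, seconds base = 10s, seconds cap = 600s) {
    seconds d = base;
    for (int i = 0; i < attempt && d < cap; i++)
        d *= 2;
    return d < cap ? d : cap;
}

// A request is pruned once it exceeds a maximum age, regardless of
// how many retries it has left.
bool should_prune(system_clock::time_point created,
                  system_clock::time_point now,
                  hours max_age = 24h) {
    return now - created > max_age;
}
```

The actual delay constants and maximum age are not given in this description; the sketch only shows the shape of the policy (doubling delays, hard age cutoff).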

For infinite retries we now serialise some state to the SQL DB. We introduce a new runtime_state table that stores two BT-encoded blobs: one containing the composition of the swarms, the other the list of retryable requests.

Storing the swarms lets the storage server remember which swarm it is in and detect changes to its swarm across restarts. Since it can now detect when its swarm has changed, it can do a DB dump to new members that joined the swarm while it was offline. This should improve the "correctness" of messages held by new members in the swarm in lieu of active syncing.

Storing the retryable requests lets the retries persist across restarts. This also helps improve the "correctness" of messages in the swarm in lieu of active syncing.


Retryable messages have been changed to use their own table structure and be queryable, rather than being serialized into what was effectively a singleton database entry. This also means they aren't kept in RAM at all times. Their retry timing has been simplified (but can still change if desired) to an initial retry after 5 seconds, and every minute after that if the request continues timing out.
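The simplified schedule described above could be expressed as follows (an illustrative sketch with an invented name, not the actual code):

```cpp
#include <chrono>

// First retry 5 seconds after the initial failure; after that, once
// per minute for as long as the request keeps timing out.
std::chrono::seconds next_retry_delay(int prior_attempts) {
    using namespace std::chrono_literals;
    return prior_attempts == 0 ? 5s : 60s;
}
```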

Also included here is the addition of a swarm space cache for messages, so when sending bulk messages to other swarm members we can query just the messages for that swarm (and in batches, for that matter) rather than loading every message from the database into RAM.

Doy-lee added 30 commits March 3, 2026 19:17
We don't do anything in particular in the setter or getter that might
warrant having those functions (like holding a lock), so we can
remove the indirection and access the member directly.
…lper function

Keep the SNDataReady serialisation code compartmentalised into one
function that handles both a read and write of the data structure.
This is preliminary work to prep for persisting the swarm member's state
to disk, allowing it to resume from the last known state. Currently,
when a storage server starts up, it assumes that it's joining a new
swarm (or that new members are joining its swarm) and does a full
message dump to those members, instead of knowing whether a swarm
member is new or already known and then choosing a synchronisation
method better suited to that scenario.
Restoring the swarm state means that if the node is an active
service node, it'll remember which swarm it was in so when the storage
server gets restarted, it can correctly detect if a new SN is joining
their swarm instead of assuming all the nodes have newly joined their
swarm and consequently dump their entire SQL db to them.
When a node initiates a recursive swarm request, the initial node awaits
the response from all other nodes before returning to the client.
Child swarm nodes that fail to receive the request are stored in
a retryable request queue to be re-attempted later.

This queue is flushed every 3s by piggybacking onto the swarm member
check function that is periodically invoked by OMQ.
The comment states that only timed-out requests are retried. This is
correct, as an error response with an error code and text indicates a
situation where the recipient node might not be in a valid state, or
might never accept the request, in which case the safe default is not
to retry to that node.

Since all possible error states are known, it is possible in future to
handle them specifically per command. But for now, a sane default is
to only allow retries to nodes that were offline or that we failed to
communicate with.
This was originally 10s and was mistakenly changed to 30s. We've
reverted all the changes to the swarm member checks, so this should go
back to its original value, as retries are handled in a separate
subsystem instead of being intertwined with the member checks.
Move serialise result struct for retryable requests into impl file as
it's only used locally.
The `new_swarm_member` flag disambiguates between nodes that need a DB
dump and nodes that don't well enough that we no longer need the
intermediate contact-details-ready state.

Also remove unused member var.
Not sure why: it's used in the lambda itself, but it does not need to
be captured, unlike the other two constexpr variables that are captured
and used. Alas, CI is complaining about it and treating the warning as
an error, so this commit removes it.
Doy-lee and others added 7 commits March 3, 2026 19:17
Instead of keeping retryable requests in memory and storing them in the
database as a serialized monolith, now they're stored in a queryable way
(and thus don't need to be kept in memory at all times either, which is
good in the event that there are a lot).

Changed the retry timing to be simpler as well; if something times out
on a 5 second timeout, we don't need to retry super frequently.
Messages can now be queried based on swarm space, and so by finding the
swarm space boundaries for a swarm id we can query all messages for that
swarm (as opposed to getting *all* messages into RAM and sorting from
there).  The swarm space is stored as two 32-bit integers rather than as
a single 64-bit integer, as sqlite does not support 64-bit unsigned
integers and thus queries wouldn't work properly.  This is an
unfortunate side effect of using such a large type for swarm space/ids.

This commit also caches our swarm id so we can know what it was between
restarts.  Previously all swarms' makeups were cached, but this did not
seem particularly useful -- it couldn't be queried as it was stored as a
serialized blob, and it appears the only useful value there is our
current swarm id.
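The 32-bit split described above amounts to something like the following (a sketch with invented names): because SQLite's INTEGER type is signed 64-bit, an unsigned 64-bit swarm-space value is stored as its high and low 32-bit halves, whose pairwise ordering matches the unsigned 64-bit ordering.

```cpp
#include <cstdint>
#include <utility>

// Split an unsigned 64-bit swarm-space value into (high, low) 32-bit
// halves; each half fits comfortably in SQLite's signed 64-bit
// INTEGER, and comparing (high, low) pairs lexicographically
// preserves the unsigned 64-bit order.
std::pair<uint32_t, uint32_t> split_u64(uint64_t v) {
    return {static_cast<uint32_t>(v >> 32),
            static_cast<uint32_t>(v & 0xffffffffu)};
}

uint64_t join_u64(uint32_t hi, uint32_t lo) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

A range query for one swarm's messages can then compare (high, low) column pairs against the swarm-space boundaries rather than a single 64-bit column.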