@jagerman jagerman commented Jan 7, 2026

Using OMQ caused scalability problems: we had to run several OxenMQ instances and load balance across them because of the overhead of repeatedly polling for incoming sockets across thousands of file descriptors.

This switches the code to use QUIC instead of OMQ, and along the way fixes some other issues and makes other improvements:

  • Bumps to the libpqxx 8.x RC, which eliminates much of the code that was used to convert values (conversion is now built into libpqxx).
  • Initial loading of subscribers from the database is now considerably faster, and no longer crashes:
    • Row data is now loaded in parallel using multiple threads; only the final transfer into data structures (where deduplication happens, and thus must be single-threaded) runs in the main thread. This improves startup time by roughly 3-4x.
    • Fixed a crash caused by a race condition: the loading code threw an exception about expired entries if an entry expired between the pre-load database cleanup and the load itself.
  • Adds a workaround that does a full scan on SN composition changes rather than taking the "optimized" path, because that optimized path is many times slower.
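The parallel-load-then-single-threaded-merge shape described above can be sketched roughly as follows; the names and the use of plain integers are illustrative only (the real code loads database rows), so treat this as a sketch of the pattern, not the actual implementation:

```cpp
#include <algorithm>
#include <thread>
#include <unordered_set>
#include <vector>

// Each worker fills its own private vector (no locking needed); the
// caller then does the single-threaded merge pass where dedup happens.
std::vector<int> load_rows_parallel(const std::vector<int>& raw, size_t n_threads) {
    std::vector<std::vector<int>> partial(n_threads);
    std::vector<std::thread> workers;
    size_t chunk = (raw.size() + n_threads - 1) / n_threads;
    for (size_t t = 0; t < n_threads; t++) {
        workers.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(raw.size(), begin + chunk);
            for (size_t i = begin; i < end; i++)
                partial[t].push_back(raw[i]);  // stand-in for parsing one row
        });
    }
    for (auto& w : workers) w.join();

    // Final transfer with deduplication, single-threaded in the caller:
    std::unordered_set<int> seen;
    std::vector<int> result;
    for (auto& p : partial)
        for (int v : p)
            if (seen.insert(v).second)
                result.push_back(v);
    return result;
}
```

Because each worker writes only to its own slot of `partial`, no synchronization is needed until the join; only the dedup pass has to be serial.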

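The expired-entry race fix amounts to tolerating rows that expired after the pre-load cleanup, rather than throwing. A minimal sketch of that idea (the `Entry` type and `load` function are hypothetical, not the project's actual code):

```cpp
#include <chrono>
#include <vector>

// Hypothetical row type: an expiry timestamp plus some payload.
struct Entry {
    std::chrono::steady_clock::time_point expiry;
    int id;
};

// An entry can expire between the pre-load database cleanup and the
// load itself, so the loader skips expired rows instead of treating
// them as an error.
std::vector<Entry> load(const std::vector<Entry>& rows) {
    auto now = std::chrono::steady_clock::now();
    std::vector<Entry> out;
    for (const auto& e : rows) {
        if (e.expiry <= now)
            continue;  // expired since cleanup: skip, don't throw
        out.push_back(e);
    }
    return out;
}
```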
This will allow subscribe/unsubscribe requests to be submitted over
Session Router in the future.
It was rather awkward to have to load the main SPNS via Python; this
rewrites it as a first-class `SPNS` binary instead, using an embedded
Python interpreter for the one small part that still relies on Python:
parsing the initial config and sharing it with the notifiers.

Also:
- removes the old firebase and older fcm notifiers (Firebase
  notification support is now C++ and lives in its own repo)
- installs the (new) binary, library, and pybind module rather than
  dumping them in the top-level project.  The binary + library are
  installed under CMAKE_INSTALL_PREFIX, and the Python module goes into
  the Python user site-packages location
  (`~/.local/lib/python3.x/site-packages/spns/core.cpython-....so`)
@jagerman jagerman changed the title Libquic Refactor to use libquic instead of oxenmq for SN connections Jan 7, 2026
The wait was inside the loop, which meant the loop was blocked,
preventing the notifiers we were waiting for from actually being able
to register.

This fixes it by waiting outside the loop instead.
Some SNs seem to struggle with a single request for 5000; reducing it
ought to make it a little easier on the network without being *too* slow
to get up and running with initial subs.
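Fetching the initial subs in smaller pages rather than one large request can be sketched like this; the `fetch(offset, limit)` callable is a stand-in assumption, since the actual request format and transport are the project's own:

```cpp
#include <functional>
#include <vector>

// Pull everything in pages of page_size, stopping on a short or empty
// page, instead of issuing one oversized request.
std::vector<int> fetch_all(std::function<std::vector<int>(size_t, size_t)> fetch,
                           size_t page_size) {
    std::vector<int> all;
    for (size_t off = 0;; off += page_size) {
        auto page = fetch(off, page_size);
        if (page.empty())
            break;
        all.insert(all.end(), page.begin(), page.end());
        if (page.size() < page_size)
            break;  // short page: nothing more to fetch
    }
    return all;
}
```

Smaller pages trade a few extra round trips for lighter per-request load on each SN.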