IcingaDB: better config and state update queueing by yhabteab · Pull Request #10619 · Icinga/icinga2

yhabteab · 2025-10-31T12:16:55Z

This pull request introduces a new runtime changes queue to IcingaDB, along with several enhancements to the RedisConnection class. These changes aim to improve the memory footprint and number of duplicate (and thus superfluous) Redis queries. The problem of duplicate queries has been a long-standing issue in IcingaDB, and some hacky workarounds have been implemented in the past to mitigate it. This PR takes a more systematic approach as Julian described in #10186 to address the root cause. I will try to summarize the key changes below:

Changes Queue

A new config and state update queue has been introduced to IcingaDB, which allows to batch any config or state related runtime updates for a given object in an efficient manner. This new queue works as outlined below:

First of all, this new way of queueing is only used for a specific set of updates, which include but are not limited to:

Object creation and deletion
Object modification (e.g., updating attributes)
Relationship updates (e.g., adding or removing links between objects)
State related updates (e.g., state changes, acknowledgements, etc.)
Dependency config and state updates

This new queue does not cover any history related writes, those types of events follow a different path and are not affected by this change. The focus here is solely on runtime object changes that affect the normal non-historical operation of IcingaDB. Consequently, history and heartbeat related writes use their own dedicated Redis connection and do not interfere with any of the changes described here.

Now, here is how this queue operates:

When an object is modified, instead of immediately writing the changes to Redis, the object pointer is pushed onto the queue with a corresponding flag indicating the type of change required. As long as the object remains in the queue, any subsequent Redis write requests concerning that object are merged into the existing queued dirty bits. This means that no matter how many times e.g., a OnStateChange is triggered for a given object, only a single write operation will be performed when it is finally popped from the queue. Do note that an object can have multiple dirty bits set, so if both its attributes and state are modified while in the queue, a state and config update will be sent when it is processed.

The consumer of this queue is a new background worker that pops objects from the queue and performs the necessary Redis write operations. This worker doesn't immediately process objects as they are enqueued; instead, it waits for a short period (currently set to 300ms) to allow for more changes to accumulate and be merged. After this wait period, the worker serializes the queued objects according to their dirty bits and sends the appropriate Redis commands. Though, there's also another restriction in place: when the used RedisConnection reaches a certain number of pending commands (currently set to 128), the worker won't dequeue any more objects from the queue until the pending commands drop below that threshold. This ensures that we don't unnecessarily waste memory by serializing too many objects in advance, if the Redis server isn't able to keep up.

To accommodate this new changes queue, quite a number of existing code has been refactored, so that we no longer perform immediate writes to Redis. Additionally, the RedisConnection class has been enhanced to support this new workflow.

RedisConnection Enhancements

Several enhancements have been made to the RedisConnection class to better support the changes queue and improve overall efficiency:

As a consequence of the changes queue, the Redis queries internal queue has significantly been simplified. As opposed to the previous version which used a complex data structure to correctly manage the query priorities, this version uses two separate std::queues to manage the pending queries. One queue is used for queries with high priority, while the other queue is used for low priority queries. High priority queries have always, well, higher priority than low priority queries, and can overtake them at any time. This new approach is much simpler and more efficient, as it eliminates the need for complex data structures and allows for straightforward management of query priorities.
And while I'm at it, I also took the chance to improve the WriteQueueItem type by replacing the previously used ridiculously verbose query types by a more compact std::variant based approach. This not only reduces memory usage but also makes clearer that each item represents exactly one of a defined set of query types and nothing else.

Now, IcingaDB is subscribed to the OnNextCheckChanged signal and not the dummy OnNextCheckUpdated signal anymore. Though, that dummy signal is still there since the IDO relies on it. The only behavioral change in IcingaDB as opposed to before is that the oldest pending Redis query is determined only on the primary Redis connection (the one used for history and heartbeats). If you guys think this is a problem, I can look into a way to have IcingaDB consider all connections when determining the oldest pending query.

resolves #10186

jschmidt-icinga

I didn't test extensively yet, but from a quick test everything seems to work as expected. I also can't speak much for the logic and performance implications of when to send which events to Redis since I barely touched that until now. I'll continue to look at this in the coming days and see if I can test this more thoroughly.

yhabteab · 2025-11-13T15:06:55Z

I'll continue to look at this in the coming days and see if I can test this more thoroughly.

At the moment the integration tests from the Icinga DB repository are the only way to stress test this PR thoroughly. I've been running them ever since the initial implementation and I have almost gone crazy due to a subtle race condition that only showed up when running those tests.

jschmidt-icinga · 2025-11-13T15:16:41Z

a subtle race condition that only showed up when running those tests.

Can you describe this in a bit more detail so I know what to look out for, i.e. the symptoms of the race condition?

yhabteab · 2025-11-13T15:23:26Z

Can you describe this in a bit more detail so I know what to look out for, i.e. the symptoms of the race condition?

Well, the obvious symptom is that the integration tests (specifically the Redundancy Group ones) will sporadically fail because either Icinga 2 didn't sent a delete command when it's supposed to, or deleted something that it shouldn't have. Generally speaking, if the tests don't succeed then there is a bug in here.

julianbrost

Don't consider this a full review, just what I noticed at first glance.

yhabteab · 2026-01-21T12:33:44Z

Rebased + cherry-picked commit dc9d40f + addressed some of the still open discusstions.

Previously, the checkable was locked while processing all the dependency registration stuff, so the worker thread should also do the same to avoid any potential race conditions.

…space

This commit restructures the queue items so that each one now has a method `GetQueueLookupKey()` that is used to derive which elements of the queue are considered to be equal. For this, there is a key extractor for the `multi_index_container` that takes the `variant` from the queue item, calls that method on it, and puts the result in a second variant type. The types in that variant type are automatically deduced from the return types of the individual methods.

Now, the individual `ProcessQueueItem` functions decide whether to acquire an `olock` or not instead of probing this from within the worker loop. This is way easier than having to deal with the potential out of order processing of items in the queue in both ways, i.e., we don't want to send delete events for objects while their created events haven't been processed yet and vice versa.

julianbrost · 2026-04-13T08:53:43Z

While testing, I noticed that there isn't really a way to observe how the new worker thread is performing (except seeing whether things eventually show up in Redis/Icinga Web).

Show a crude patch that adds what I wanted to know right now to IcingadbCheckTask

diff --git a/lib/icingadb/icingadb.hpp b/lib/icingadb/icingadb.hpp
index ebbaebb50..ddc87cfd6 100644
--- a/lib/icingadb/icingadb.hpp
+++ b/lib/icingadb/icingadb.hpp
@@ -484,6 +484,20 @@ private:
        void EnqueueDependencyChildRegistered(const DependencyGroup::Ptr& depGroup, const Checkable::Ptr& child);
        void EnqueueDependencyChildRemoved(const DependencyGroup::Ptr& depGroup, const std::vector<Dependency::Ptr>& dependencies, bool removeGroup);
        void EnqueueRelationsDeletion(const String& id, icingadb::task_queue::RelationsDeletionItem::RelationsKeySet relations);
+
+public:
+       std::pair<std::size_t, std::chrono::steady_clock::duration> GetPendingItemsStats()
+       {
+               std::unique_lock lock(m_PendingItemsMutex);
+
+               auto& seqView = m_PendingItems.get<1>();
+               std::chrono::steady_clock::duration maxAge{};
+               if (!seqView.empty()) {
+                       maxAge = std::chrono::steady_clock::now() - seqView.front().EnqueueTime;
+               }
+
+               return {m_PendingItems.size(), maxAge};
+       }
 };
 }
 
diff --git a/lib/icingadb/icingadbchecktask.cpp b/lib/icingadb/icingadbchecktask.cpp
index f65583116..9f21d7f15 100644
--- a/lib/icingadb/icingadbchecktask.cpp
+++ b/lib/icingadb/icingadbchecktask.cpp
@@ -183,6 +183,10 @@ void IcingadbCheckTask::ScriptFunc(const Checkable::Ptr& checkable, const CheckR
        std::ostringstream i2okmsgs, idbokmsgs, warnmsgs, critmsgs;
        Array::Ptr perfdata = new Array();
 
+       auto stats = conn->GetPendingItemsStats();
+       perfdata->Add(new PerfdataValue("icinga2_pending_items_count", double(stats.first)));
+       perfdata->Add(new PerfdataValue("icinga2_pending_items_age", std::chrono::duration_cast<std::chrono::duration<double>>(stats.second).count()));
+
        i2okmsgs << std::fixed << std::setprecision(3);
        idbokmsgs << std::fixed << std::setprecision(3);
        warnmsgs << std::fixed << std::setprecision(3);

That obviously needs a lot of cleanup, but should be enough what I need for testing right now. Nonetheless, it would be nice if there were some metrics that allow to verify the proper operation.

julianbrost · 2026-04-13T15:11:05Z

Apart from that previous comment, I now concluded my testing without any more interesting findings, so should be mostly fine now. In particular, I also tried putting it under load and Icinga DB had no issues with that (I just managed to overload other things like the IDO and JSON-RPC 😅).

One important use-case for having additional metrics would be that if there comes an Icinga DB related bug report, there is an easy way to check if that could be related to the changes done here. Thus, can you please add something appropriate please?

yhabteab · 2026-04-14T12:43:58Z

That obviously needs a lot of cleanup, but should be enough what I need for testing right now. Nonetheless, it would be nice if there were some metrics that allow to verify the proper operation.

Does this need to be part of the check? I would just add a periodic logging to the event loop with all the relevant metrics that can still be used without having to configure a check for it. See the history bulker for example, I would also just add a similar logging for it here.

julianbrost · 2026-04-14T13:06:49Z

Would that make things much easier? Or is "can still be used without having to configure a check for it" your main motivation for that suggestion?

The icingadb check command already has a few *_backlog metrics that are pretty similar, so I'd say this would be a good fit there.

However, that doesn't mean there shouldn't be any additional logging. A warning message in case this is way behind would be helpful as well.

yhabteab · 2026-04-15T07:51:19Z

A warning message in case this is way behind would be helpful as well.

Why warning? I've just added a normal (info) logging at regular interval (always every 5m unless the worker is lagging behind) and added the the two metrics to the checkcommand's output as well.

[2026-04-15 10:49:14 +0200] information/IcingaDB: Pending items queue size: 94, Redis queries: 0, oldest item age: 302ms
[2026-04-15 10:50:09 +0200] information/IcingaDB: Pending queries: 0 (Input: 219/s; Output: 219/s)
[2026-04-15 10:54:14 +0200] information/IcingaDB: Pending items queue size: 0, Redis queries: 0
[2026-04-15 10:55:09 +0200] information/IcingaDB: Pending queries: 0 (Input: 5/s; Output: 5/s)
[2026-04-15 10:59:14 +0200] information/IcingaDB: Pending items queue size: 1, Redis queries: 0, oldest item age: 305ms

julianbrost

A warning message in case this is way behind would be helpful as well.

Why warning? I've just added a normal (info) logging at regular interval (always every 5m unless the worker is lagging behind)

In that you've added a distinction based on whether the oldest item is due for more than 5 seconds. Isn't that because that usually shouldn't happen, so if it does, there's a good chance that something is wrong here? So wouldn't it make sense to also increase the log severity to warning in that case? So that if something would go wrong and you look at the logs, you have a chance of finding the cause by looking at critical/warning messages first, instead of having to know which information messages you need to look for.

Also, the naming "pending items" it pretty generic, something a bit more descriptive would be nice. Maybe include something like background update queue?

…aDB check

yhabteab · 2026-04-16T06:58:46Z

Sorry, just pushed something new again - I've now inlined the SendNextUpdate method since the it's used by the worker exclusively now. I could've inlined the UpdateState method too but it's somewaht a bit bigger and also contain a doc block, so I didn't want to remove that (and potentially start a long running discussion again). Also, I've removed some superfluous checks on the m_RconWorker pointer.

julianbrost

🎉

julianbrost · 2026-04-16T14:49:07Z

@jschmidt-icinga You still have an unresolved conversation (#10619 (comment)). I think it's long outdated, do you want to have a look at it (that conversation or the whole PR) nonetheless?

jschmidt-icinga · 2026-04-17T06:28:49Z

do you want to have a look at it (that conversation or the whole PR) nonetheless?

I haven't looked at it in a while, and I'm not that experienced in the Redis protocol, so a thorough review would take at least a few days up to a week and I don't think we have the time for that. Therefore I'd skip this one entirely, unless you want me to take a quick look at the code quality aspects only.

julianbrost · 2026-04-17T07:43:24Z

That was just meant as a friendly "your last chance step in before I merge it" reminder, so not a request from my side.

cla-bot Bot added the cla/signed label Oct 31, 2025

yhabteab marked this pull request as draft October 31, 2025 12:17

yhabteab added the area/icingadb New backend label Oct 31, 2025

yhabteab force-pushed the efficient-config-and-state-update-queue branch 3 times, most recently from 200080d to 2099e59 Compare October 31, 2025 14:47

yhabteab added this to the 2.16.0 milestone Oct 31, 2025

yhabteab force-pushed the efficient-config-and-state-update-queue branch from 2099e59 to b15a5fb Compare November 3, 2025 08:17

yhabteab requested a review from jschmidt-icinga November 3, 2025 11:20

yhabteab marked this pull request as ready for review November 3, 2025 11:20

jschmidt-icinga reviewed Nov 13, 2025

View reviewed changes

yhabteab force-pushed the efficient-config-and-state-update-queue branch from b15a5fb to c8274a0 Compare November 13, 2025 15:04

yhabteab requested a review from jschmidt-icinga November 13, 2025 15:23

julianbrost reviewed Nov 19, 2025

View reviewed changes

Comment thread lib/icingadb/icingadb.hpp Outdated

Comment thread lib/icingadb/icingadb-worker.cpp Outdated

Comment thread lib/icingadb/icingadb.hpp Outdated

Comment thread lib/icingadb/redisconnection.hpp Outdated

Comment thread lib/icinga/dependency-group.cpp Outdated

yhabteab force-pushed the efficient-config-and-state-update-queue branch from c8274a0 to 3247cb4 Compare November 20, 2025 12:33

julianbrost reviewed Nov 25, 2025

View reviewed changes

yhabteab force-pushed the efficient-config-and-state-update-queue branch 2 times, most recently from 1916e7a to ac841b3 Compare November 26, 2025 09:32

yhabteab requested a review from julianbrost November 26, 2025 09:35

julianbrost reviewed Nov 26, 2025

View reviewed changes

Comment thread lib/icingadb/icingadb.cpp Outdated

Comment thread lib/icingadb/icingadb-objects.cpp

Comment thread lib/icingadb/icingadb-objects.cpp Outdated

yhabteab force-pushed the efficient-config-and-state-update-queue branch from ac841b3 to a247654 Compare November 26, 2025 16:02

julianbrost reviewed Dec 3, 2025

View reviewed changes

julianbrost reviewed Dec 5, 2025

View reviewed changes

Comment thread lib/icingadb/icingadb-objects.cpp Outdated

Comment thread lib/icingadb/icingadb.hpp

julianbrost assigned julianbrost and yhabteab Jan 13, 2026

yhabteab force-pushed the efficient-config-and-state-update-queue branch from a247654 to 6695294 Compare January 21, 2026 12:31

julianbrost and others added 9 commits April 2, 2026 16:37

IcingaDB: use polymorphism for queue entries

9d5883d

Fix missing olock for dependency child registration

89d8c32

Previously, the checkable was locked while processing all the dependency registration stuff, so the worker thread should also do the same to avoid any potential race conditions.

IcingaDB: remove unused UpdateObjectAttrs method

b633d6b

Simplify IcingaDB::PendingItemsThreadProc() event loop

8375934

IcingaDB: put all queue related stuff into icingadb:task_queue name…

2048450

…space

Reduce min queue item age from 1000ms to 300ms

7d7159c

RedisConnection: simplify GetOldestPendingQueryTs function

267675e

yhabteab force-pushed the efficient-config-and-state-update-queue branch from 3c9c321 to 267675e Compare April 2, 2026 14:38

yhabteab force-pushed the efficient-config-and-state-update-queue branch from cc06eed to a3da6d7 Compare April 15, 2026 07:48

julianbrost reviewed Apr 15, 2026

View reviewed changes

Comment thread lib/icingadb/icingadb-worker.cpp Outdated

Comment thread lib/icingadb/icingadb-worker.cpp Outdated

yhabteab force-pushed the efficient-config-and-state-update-queue branch from a3da6d7 to 41a18c7 Compare April 15, 2026 14:41

julianbrost reviewed Apr 15, 2026

View reviewed changes

Comment thread lib/icingadb/icingadb-worker.cpp Outdated

Comment thread lib/icingadb/icingadb-worker.cpp Outdated

Comment thread lib/icingadb/icingadb-worker.cpp Outdated

Log pending items stats regularly & include them as perfdata in Icing…

99328ec

…aDB check

yhabteab force-pushed the efficient-config-and-state-update-queue branch from 41a18c7 to 99328ec Compare April 15, 2026 15:34

Inline SendNextUpdate & remove superfluous m_RconWorker checks

bfb0e7d

julianbrost approved these changes Apr 16, 2026

View reviewed changes

julianbrost merged commit 61c6c7f into master Apr 17, 2026
160 of 166 checks passed

This was referenced Apr 20, 2026

OTLPMetricsWriter: don't add queue stats as counter #10800

Merged

Overdue state doesn't honor set time periods #10082

Closed

Conversation

yhabteab commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes Queue

RedisConnection Enhancements

Uh oh!

jschmidt-icinga left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yhabteab commented Nov 13, 2025

Uh oh!

jschmidt-icinga commented Nov 13, 2025

Uh oh!

yhabteab commented Nov 13, 2025

Uh oh!

julianbrost left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yhabteab commented Jan 21, 2026

Uh oh!

julianbrost commented Apr 13, 2026

Uh oh!

julianbrost commented Apr 13, 2026

Uh oh!

yhabteab commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

julianbrost commented Apr 14, 2026

Uh oh!

yhabteab commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

julianbrost left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yhabteab commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

julianbrost left a comment

Choose a reason for hiding this comment

Uh oh!

yhabteab commented Oct 31, 2025 •

edited

Loading

yhabteab commented Apr 14, 2026 •

edited

Loading

yhabteab commented Apr 15, 2026 •

edited

Loading

yhabteab commented Apr 16, 2026 •

edited

Loading