Code conversion recipes, advanced patterns, and large-scale DAG construction guidance.
// Sequential
for (auto& d : data) { process(d); }
// Parallel (iterator-based)
#include <taskflow/algorithm/for_each.hpp>
tf::Executor executor;
tf::Taskflow taskflow;
taskflow.for_each(data.begin(), data.end(), [](int& d) { process(d); });
executor.run(taskflow).wait();
// Parallel (index-based, per-element dispatch)
taskflow.for_each_index(0, N, 2, [](int i) { process(i); }); // [0, N) step 2
Chunked variant — for_each_by_index: When you have a large index range and the
per-element work is relatively cheap, per-element dispatch creates scheduling overhead
that can dominate the useful work. for_each_by_index gives each task a subrange to
iterate in a tight loop, amortizing scheduling cost:
tf::IndexRange<size_t> range(0, vec.size(), 1);
taskflow.for_each_by_index(range, [&](tf::IndexRange<size_t> subrange) {
for (size_t i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
vec[i] = std::tan(vec[i]);
}
});
When to prefer which:
- `for_each_index` works well when per-element work is non-trivial (e.g., a complex computation per pixel, a function call with meaningful cost). The partitioner handles chunking internally.
- `for_each_by_index` is useful when you want explicit control over the inner loop, or when the per-element body is very lightweight and you want to ensure loop vectorization and cache-friendly access within each chunk.
Both accept an optional trailing partitioner argument (see docs/taskflow-performance.md).
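For example, the partitioner goes in the trailing position (a sketch, reusing N and process from above; tf::GuidedPartitioner is one of the built-in partitioners):
taskflow.for_each_index(0, N, 1,
  [](int i) { process(i); },
  tf::GuidedPartitioner()  // alternatives: StaticPartitioner, DynamicPartitioner, RandomPartitioner
);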
Iterator-based reduction:
#include <taskflow/algorithm/reduce.hpp>
int result = std::numeric_limits<int>::max();
taskflow.reduce(data.begin(), data.end(), result,
[](int& l, const auto& r) { return std::min(l, r); });
// transform_reduce: transform each element, then reduce
taskflow.transform_reduce(data.begin(), data.end(), result,
[](int l, int r) { return std::min(l, r); }, // reduce
[](const Data& d) { return d.transform(); }); // transform
Index-based reduction — reduce_by_index: When reducing over an index range (counting
primes, summing computed values, accumulating scores), reduce_by_index gives each worker
a subrange to accumulate into a thread-local partial result, then merges partials at the
end. This eliminates shared-state contention entirely:
size_t count = 0;
taskflow.reduce_by_index(
tf::IndexRange<size_t>(0, N, 1),
count,
[](tf::IndexRange<size_t> subrange, std::optional<size_t> running_total) {
size_t local = running_total ? *running_total : 0;
for (size_t i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
local += expensive_check(i);
}
return local;
},
std::plus<size_t>{},
tf::DynamicPartitioner{chunk_size} // optional partitioner
);
The local operator receives a subrange and an optional running total (empty on the first
call to a given worker). It returns the accumulated value for that chunk. The global
operator (std::plus here) merges partial results across workers.
Why this matters for performance: A common alternative is for_each_index with
std::atomic::fetch_add per element — this works correctly but causes all threads to
contend on a single cache line. At high core counts the contention dominates and the
computation effectively serializes. reduce_by_index avoids this entirely by keeping
accumulation thread-local until the final merge.
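For contrast, the contended alternative described above looks like this (a sketch, reusing N and expensive_check from the example — correct, but every increment touches the same cache line):
std::atomic<size_t> count{0};
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  count.fetch_add(expensive_check(i), std::memory_order_relaxed); // all workers contend here
});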
When to prefer which:
- `reduce`/`transform_reduce` — when you already have iterators to a container
- `reduce_by_index` — when reducing over a computed index range, especially with irregular per-element cost. Combine with `DynamicPartitioner` for workloads where some indices are much more expensive than others.
#include <taskflow/algorithm/sort.hpp>
taskflow.sort(strings.begin(), strings.end());
#include <taskflow/algorithm/scan.hpp>
taskflow.inclusive_scan(input.begin(), input.end(), output.begin(), std::plus<int>{});
// exclusive: taskflow.exclusive_scan(first, last, d_first, init, op)
#include <taskflow/algorithm/pipeline.hpp>
const size_t num_lines = 4;
std::array<size_t, num_lines> buffer;
tf::Pipeline pl(num_lines,
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
if (pf.token() == 5) { pf.stop(); return; }
buffer[pf.line()] = pf.token();
}},
tf::Pipe{tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
buffer[pf.line()] += 1;
}},
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
buffer[pf.line()] += 1;
}}
);
taskflow.composed_of(pl);
executor.run(taskflow).wait();
SERIAL stages: one token at a time. PARALLEL stages: concurrent. num_lines = max parallelism.
Taskflow provides three pipeline variants, each suited to different situations:
The example above uses tf::Pipeline, where the number and types of stages are fixed at
compile time via variadic template parameters. Each tf::Pipe stage receives a tf::Pipeflow&
and manages data externally (e.g., via a shared buffer indexed by pf.line()).
This is the simplest option when the pipeline structure is known ahead of time.
When data naturally flows from one stage to the next with different types at each stage,
DataPipeline provides compile-time type safety. Each stage declares its input and output
types, and Taskflow manages the data passing internally:
#include <taskflow/algorithm/data_pipeline.hpp>
tf::DataPipeline pl(num_lines,
tf::make_data_pipe<tf::Pipeflow&, int>(tf::PipeType::SERIAL,
[&](tf::Pipeflow& pf) -> int {
if (pf.token() == N) { pf.stop(); return 0; }
return static_cast<int>(pf.token());
}),
tf::make_data_pipe<int, std::string>(tf::PipeType::PARALLEL,
[](int input) -> std::string {
return std::to_string(input); // receives int, produces string
}),
tf::make_data_pipe<std::string, void>(tf::PipeType::SERIAL,
[&](std::string input) {
results.push_back(std::move(input)); // consumes string
})
);
taskflow.composed_of(pl);
executor.run(taskflow).wait();
Key differences from Pipeline:
- Each stage callable receives the previous stage's output as its first argument (not `Pipeflow&`)
- Stages can optionally take `tf::Pipeflow&` as a second argument for token/line queries
- The first stage receives only `tf::Pipeflow&` (it produces data, doesn't consume)
- Data is stored internally in a per-line variant buffer — no external buffer needed
- The first stage must be `SERIAL`
DataPipeline is a good fit when stages transform data through a typed chain (e.g.,
read → parse → transform → write). Use regular Pipeline when stages share complex
external state or when the data types are uniform.
When the number of pipeline stages is determined at runtime (e.g., configurable filter
chains, user-defined processing pipelines), ScalablePipeline accepts pipes via an
iterator range:
#include <taskflow/algorithm/pipeline.hpp>
using pipe_t = tf::Pipe<std::function<void(tf::Pipeflow&)>>;
std::vector<pipe_t> pipes;
pipes.emplace_back(tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
if (pf.token() == N) { pf.stop(); return; }
buffer[pf.line()] = pf.token();
});
for (size_t i = 0; i < num_filters; i++) {
pipes.emplace_back(tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
buffer[pf.line()] = process(buffer[pf.line()]);
});
}
tf::ScalablePipeline spl(num_lines, pipes.begin(), pipes.end());
taskflow.composed_of(spl);
executor.run(taskflow).wait();
Key properties:
- Pipes are provided as an iterator range `[first, last)` pointing to `tf::Pipe` objects
- The pipe vector must remain valid for the pipeline's lifetime
- Can be reconfigured between runs with `spl.reset(new_first, new_last)` or `spl.reset(new_num_lines, new_first, new_last)` (see the sketch after this list)
- Move-only (not copyable)
- The first stage must be `SERIAL`
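A sketch of reconfiguring between runs, reusing pipes, spl, taskflow, buffer, and executor from the example above (the extra filter is illustrative):
executor.run(taskflow).wait();
pipes.emplace_back(tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
  buffer[pf.line()] = extra_filter(buffer[pf.line()]);  // hypothetical additional stage
});
spl.reset(pipes.begin(), pipes.end());  // pipeline now runs with one more stage
executor.run(taskflow).wait();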
ScalablePipeline is the right choice when pipeline structure varies at runtime. When
the structure is fixed, Pipeline or DataPipeline give better compile-time guarantees
and may allow the compiler to inline stage callables.
Taskflow provides two mechanisms for spawning work dynamically at runtime. They serve different purposes and have different performance characteristics.
For recursive patterns where a function forks child work and waits for results (fibonacci,
merge sort, tree traversal, recursive integration), TaskGroup provides lightweight
cooperative waiting:
#include <taskflow/taskflow.hpp>
// Recursive body — called from within worker context
size_t fibonacci(size_t n, tf::Executor& executor) {
if (n < 2) return n;
size_t a, b;
tf::TaskGroup tg = executor.task_group();
tg.silent_async([n, &a, &executor]() { // fork one child as async task
a = fibonacci(n - 1, executor);
});
b = fibonacci(n - 2, executor); // run other child inline (saves a task)
tg.corun(); // cooperative wait
return a + b;
}
// Entry point — must enter worker context first
size_t fibonacci_entry(tf::Executor& executor, size_t n) {
return executor.async([n, &executor]() {
return fibonacci(n, executor);
}).get();
}
Key properties:
- `corun()` is a cooperative wait — the calling worker steals and executes other tasks from the pool while waiting for its children, so no thread goes idle.
- `task_group()` can only be called from within a worker of the executor. The initial call must enter worker context via `executor.async(...)`.
- For binary recursion, fork one child and run the other inline — this halves the number of tasks created.
- For N-way fan-out (e.g., 10 children in a tree), `silent_async` all children and `corun()`:
auto tg = executor.task_group();
for (int i = 0; i < N; i++) {
  tg.silent_async([&, i]() { results[i] = recurse(i, executor); });
}
tg.corun();
When you need to build a dynamic sub-graph with explicit task handles, edges, and dependencies (not just fork-join), Subflows provide a full DAG-building API:
taskflow.emplace([](tf::Subflow& subflow) {
auto B1 = subflow.emplace([]() { /* subtask 1 */ });
auto B2 = subflow.emplace([]() { /* subtask 2 */ });
auto B3 = subflow.emplace([]() { /* subtask 3 */ });
B1.precede(B3);
B2.precede(B3);
});
Subflows materialize a full task graph with node allocations and edge wiring. They are automatically cleaned up on completion.
Choosing between them:
- TaskGroup is the better fit for recursive fork-join where children are independent and you just need to wait for them all. It's lighter weight — no graph nodes are allocated, and `corun()` keeps workers productive during the wait.
- Subflow is the better fit when you need to express dependencies between dynamically spawned tasks (e.g., B1→B3, B2→B3 above). It gives you `emplace()` + `precede()` to build arbitrary sub-DAGs.
- If you find yourself using `sf.join()` in a recursive function with no inter-task dependencies, `TaskGroup` + `corun()` is likely a better choice (see the sketch below).
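The pattern that last bullet refers to looks like this (a sketch; left, right, and work are illustrative):
taskflow.emplace([&](tf::Subflow& sf) {
  sf.emplace([&]() { left  = work(0); });
  sf.emplace([&]() { right = work(1); });
  sf.join();  // pure fork-join with no inter-task edges; a TaskGroup would be lighter here
});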
Return an int to select which successor runs (by precede() call order):
int counter = 0;
auto init = taskflow.emplace([&](){ counter = 0; });
auto body = taskflow.emplace([&](){ counter++; });
auto cond = taskflow.emplace([&]() -> int {
return counter < 5 ? 0 : 1; // 0 = loop back, 1 = exit
});
auto done = taskflow.emplace([&](){ /* finished */ });
init.precede(body);
body.precede(cond);
cond.precede(body); // index 0: loop back
cond.precede(done); // index 1: exit
Successor ordering matters. Return value 0 = first precede() target, 1 = second, etc.
Returning -1 (or any value outside [0, num_successors)) selects no successor, ending
that execution path.
When source code wraps a parallel computation in a for loop, a natural but suboptimal
translation is to clear and rebuild the graph each iteration:
// Suboptimal — rebuilds graph and re-launches executor N times
for (int j = 0; j < N; j++) {
taskflow.clear();
taskflow.for_each_index(0, size, 1, [&](int i) { compute(i); });
executor.run(taskflow).wait();
}
A condition task eliminates this overhead by cycling the graph internally:
// Better — one graph, one launch, N iterations via condition task
auto body = taskflow.for_each_index(0, size, 1,
[&](int i) { compute(i); }, tf::StaticPartitioner());
auto cond = taskflow.emplace([i = 0]() mutable {
return ++i == N ? -1 : 0; // -1 stops, 0 cycles back to body
});
body.precede(cond);
cond.precede(body);
executor.run(taskflow).wait();
The savings come from avoiding per-iteration graph construction and executor re-entry. This pattern is most valuable when the per-iteration work is fast relative to graph setup, or when N is large. For workloads where the graph structure changes between iterations, the loop + clear approach may be unavoidable.
For cases where the graph doesn't need internal looping logic, the executor provides convenience methods to repeat a taskflow externally:
// Run the taskflow exactly N times
executor.run_n(taskflow, 5).wait();
// Run until a predicate returns true
executor.run_until(taskflow, [&]() { return converged; }).wait();
Both accept an optional callback invoked after all runs complete:
executor.run_n(taskflow, 10, [](){ std::cout << "done\n"; }).wait();
These are simpler than a condition-task cycle when you don't need to interleave the repeat logic with other tasks in the graph. The condition-task approach is more flexible — it can incorporate state from the graph itself and coexist with other tasks and branches in the same DAG.
A standard condition task returns a single int to select one successor. A
multi-condition task returns tf::SmallVector<int> to activate multiple successors
simultaneously:
auto task = taskflow.emplace([]() -> tf::SmallVector<int> {
return {0, 2}; // fire the 1st and 3rd successors
});
task.precede(A, B, C); // A = index 0, B = index 1, C = index 2
// Both A and C will run; B will not
Taskflow automatically detects the return type — returning int creates a single-condition
task, returning tf::SmallVector<int> creates a multi-condition task. This is useful for
fork-like control flow where a decision node needs to trigger a subset of its successors
based on runtime conditions.
tf::Taskflow module_tf("module");
module_tf.emplace([]() { /* work */ });
tf::Taskflow main_tf("main");
auto setup = main_tf.emplace([]() { /* setup */ });
auto module = main_tf.composed_of(module_tf); // embed module
auto finish = main_tf.emplace([]() { /* finish */ });
setup.precede(module);
module.precede(finish);
The embedded taskflow must outlive the parent.
// Fire-and-forget
std::future<int> result = executor.async([]() { return 42; });
executor.silent_async([]() { /* no future returned */ });
executor.wait_for_all();
// Dependent async — ordering without a full taskflow
auto [A, fuA] = executor.dependent_async([]() { printf("A\n"); });
auto [B, fuB] = executor.dependent_async([]() { printf("B\n"); }, A);
auto [C, fuC] = executor.dependent_async([]() { printf("C\n"); }, A);
auto [D, fuD] = executor.dependent_async([]() { printf("D\n"); }, B, C);
fuD.get();
When code is already running inside an executor worker (e.g., inside a task lambda or
after executor.async()), calling executor.run(other_taskflow).wait() would block the
worker thread. executor.corun() provides a cooperative alternative:
taskflow.emplace([&](tf::Runtime& rt) {
tf::Taskflow other;
other.emplace([]() { /* work */ });
rt.executor().corun(other); // cooperative — worker steals while waiting
});
corun(target) accepts any object with a Graph& graph() method (including tf::Taskflow).
The calling worker participates in work-stealing while the target graph executes, preventing
thread starvation. It can only be called from within a worker context — calling it from
outside an executor (e.g., from main()) will throw an exception.
There is also executor.corun_until(predicate) for waiting on arbitrary conditions:
taskflow.emplace([&](tf::Runtime& rt) {
auto fu = std::async([]() { /* external work */ });
rt.executor().corun_until([&]() {
return fu.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
});
});
The worker keeps stealing and executing tasks from the pool while periodically checking the predicate, rather than blocking idle.
tf::Semaphore semaphore(1); // allow at most 1 concurrent task
auto A = taskflow.emplace([]() { /* critical section */ });
A.acquire(semaphore);
A.release(semaphore);
When algorithm patterns don't apply (simulations, iterative solvers, HPC), you build
manual task graphs with emplace() + precede(). At scale (100K+ tasks), this has
unique correctness and performance considerations.
Track which task last wrote to each memory location for correct precede() edges:
tf::Executor executor(num_workers);
tf::Taskflow taskflow;
// Flat vector for O(1) lookup — index by memory location ID
std::vector<tf::Task> last_writer(num_locations);
std::vector<bool> has_writer(num_locations, false);
for (/* each task */) {
int out_loc = /* output location */;
std::vector<int> in_locs = /* input locations */;
tf::Task task = taskflow.emplace([/* capture */]() { /* body */ });
// RAW: wait for producer of each input
for (int in : in_locs) {
if (has_writer[in]) last_writer[in].precede(task);
}
// WAW: wait for previous writer of same output
if (has_writer[out_loc]) last_writer[out_loc].precede(task);
last_writer[out_loc] = task;
has_writer[out_loc] = true;
}
executor.run(taskflow).wait();
Strongly prefer flat vectors over std::map/std::unordered_map for these lookups.
Tree-based containers add O(log n) pointer-chasing per lookup with cache misses at each
level. In benchmarks, std::map for slot tracking added ~44% overhead to DAG construction
vs flat-vector indexing. If your location IDs aren't contiguous integers, consider a
translation step to map them to dense indices rather than using a hash map directly.
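A sketch of such a translation pass (task_descs and its fields are illustrative); the hash map is consulted once per location here, in a preprocessing step, not in the hot construction loop:
std::unordered_map<long, int> to_dense;
int next_id = 0;
auto densify = [&](long raw) {
  auto [it, inserted] = to_dense.try_emplace(raw, next_id);
  if (inserted) next_id++;
  return it->second;
};
for (auto& td : task_descs) {              // rewrite descriptors to carry dense indices
  td.out_loc = densify(td.out_loc);
  for (auto& in : td.in_locs) in = densify(in);
}
std::vector<tf::Task> last_writer(next_id); // the construction loop now uses O(1) flat indexing
std::vector<bool> has_writer(next_id, false);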
WAR Dependencies — The Hidden Correctness Bug
The pattern above handles RAW and WAW but misses WAR (Write-After-Read). When tasks
share physical storage reused across iterations (circular buffers, double-buffering,
field % nb_fields indexing), a new writer can overwrite data that earlier tasks still need.
This produces silently wrong results — no crash, no data race detector warning.
Fix: Track readers since the last writer. Before a new write, add edges from all readers:
std::vector<std::vector<tf::Task>> tile_readers(num_locations);
// When creating a task that READS location `in`:
if (has_writer[in]) last_writer[in].precede(task); // RAW
tile_readers[in].push_back(task);
// When creating a task that WRITES location `out`:
for (auto &reader : tile_readers[out]) reader.precede(task); // WAR
tile_readers[out].clear();
if (has_writer[out]) last_writer[out].precede(task); // WAW
last_writer[out] = task;
has_writer[out] = true;
When you need WAR tracking: any time the same physical memory is written at different logical timesteps. Common: circular buffers, double-buffering, in-place iterative updates.
When you don't: every task writes to a unique location that no future task overwrites.
| OpenMP | Taskflow equivalent |
|---|---|
| `depend(in: x)` | RAW: `last_writer[x].precede(task)` |
| `depend(out: x)` | WAW + WAR: precede from last writer AND all readers of `x` |
| `depend(inout: x)` | Same as `out` — both reads and writes `x` |
OpenMP handles RAW, WAW, and WAR automatically. With Taskflow, you track all three explicitly using the last-writer + tile-readers pattern above.
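As an illustration, here is how a single OpenMP task with in and inout dependences maps onto the tracking structures above (a sketch; kernel is hypothetical, and a and x stand for dense location indices):
// OpenMP:
//   #pragma omp task depend(in: a) depend(inout: x)
//   { kernel(a, x); }
// Taskflow, using the last_writer / tile_readers pattern:
tf::Task t = taskflow.emplace([&]() { kernel(a, x); });
if (has_writer[a]) last_writer[a].precede(t);   // in: RAW on a
tile_readers[a].push_back(t);                   // record t as a reader of a
for (auto& r : tile_readers[x]) r.precede(t);   // inout: WAR on x
tile_readers[x].clear();
if (has_writer[x]) last_writer[x].precede(t);   // inout: WAW (covers the read too) on x
last_writer[x] = t;
has_writer[x] = true;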
A monolithic DAG (single tf::Taskflow, one executor.run().wait()) is generally faster
than batching because each run().wait() is a full barrier where all workers synchronize.
Monolithic lets the work-stealing scheduler see all tasks at once for better load balancing.
In benchmarks across 300K-5M tasks at 80 cores, monolithic was consistently 20-42% faster than batched approaches. However, monolithic memory usage scales linearly at roughly 0.5 GB per 1M tasks for DAG metadata. At 5M tasks, expect 2.5-3.5 GB.
Batched pattern (for memory-constrained environments):
tf::Executor executor(num_workers);
tf::Taskflow taskflow;
int batch_size = 32; // timesteps per batch
for (long y = 0; y < total_timesteps; y += batch_size) {
taskflow.clear();
std::fill(has_writer.begin(), has_writer.end(), false);
for (auto &v : tile_readers) v.clear();
long end = std::min(y + batch_size, total_timesteps);
for (long t = y; t < end; t++) {
// build tasks for this timestep, wire dependencies
}
executor.run(taskflow).wait();
}
Batch size 20-64 is a reasonable starting point. Larger batches give more cross-timestep parallelism; smaller batches use less memory. Profile to find the sweet spot for your workload.
Instead of storing all task handles (tasks[timestep][point], O(timesteps × width) memory),
use prev/curr vectors and swap each timestep:
std::vector<tf::Task> prev_tasks(max_width), curr_tasks(max_width);
std::vector<bool> prev_valid(max_width, false), curr_valid(max_width, false);
// after each timestep:
std::swap(prev_tasks, curr_tasks);
std::swap(prev_valid, curr_valid);
This reduces task handle memory from O(timesteps × width) to O(width). Note that tf::Task
is just a Node* pointer (8 bytes), so the full array at 5M tasks is only ~40MB — well
within L3 cache on server CPUs. Rolling windows help RSS but don't improve construction speed.
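A sketch of how the rolling window is used inside the per-timestep construction loop (the stencil-style neighbor set is illustrative):
for (long p = 0; p < width; p++) {
  tf::Task t = taskflow.emplace([/* capture */]() { /* body */ });
  // dependencies only reach back one timestep, so prev_tasks is all that's needed
  for (long q : {p - 1, p, p + 1}) {
    if (q >= 0 && q < width && prev_valid[q]) prev_tasks[q].precede(t);
  }
  curr_tasks[p] = t;
  curr_valid[p] = true;
}
// then swap prev/curr as shown above before building the next timestep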
At millions of tasks, what you capture in each lambda affects construction cost.
The std::function SBO threshold is only 16 bytes (libstdc++), so most captures
trigger a second heap allocation regardless. But capture size still matters — copying
240 bytes per task is measurably slower than 48 bytes due to memcpy cost and L1/L2 cache
pollution during single-threaded construction. Target ≤48 bytes per capture.
Best pattern — pointer+index into pre-allocated metadata:
// Pre-allocate metadata and input storage BEFORE the construction loop
struct TaskMeta {
const TaskGraph *graph;
char *output_ptr;
char *buf_base;
long timestep, point;
size_t out_bytes, scratch_bytes;
int n_inputs, input_offset; // offset into flat input array
};
std::vector<TaskMeta> all_meta;
all_meta.reserve(total_tasks); // pre-allocate
std::vector<long> all_input_slots;
all_input_slots.reserve(total_tasks * 4); // estimate
// In the construction loop:
int idx = (int)all_meta.size();
all_meta.push_back({&graph, out_ptr, base, t, p, ...});
// ... push input slots into all_input_slots ...
// Lambda captures only pointers + one index (~24 bytes, trivially copyable)
auto *meta_arr = all_meta.data();
auto *slots_arr = all_input_slots.data();
taskflow.emplace([meta_arr, slots_arr, idx]() {
const TaskMeta &m = meta_arr[idx]; // reconstruct everything from metadata
const char *in_ptrs[256]; // build input arrays ON THE STACK
for (int i = 0; i < m.n_inputs; i++)
in_ptrs[i] = m.buf_base + slots_arr[m.input_offset + i] * m.out_bytes;
// ... call execute_point ...
});
Anti-pattern — fat struct with embedded arrays (~200+ bytes per capture):
// DON'T: large fixed arrays bloat every capture
struct TaskCapture {
const char *input_ptrs[64]; // 512 bytes!
size_t input_sizes[64]; // 512 bytes!
// ... plus metadata fields
}; // 1KB+ per task × millions of tasks = slow construction + cache thrashing
Avoid per-task heap allocations:
- Capturing `std::vector` by value → heap alloc for its internal buffer
- Copying structs containing vectors → each vector's buffer is heap-allocated
- `std::shared_ptr` captures → control block is heap-allocated
Taskflow already does one mandatory heap allocation per emplace() (new Node). Each
additional allocation adds ~50-150ns. At 1M tasks with a struct containing 3 vectors
captured by value, that's 3M extra allocations on top of the mandatory 1M.
Rules of thumb:
- Target ≤48 bytes per lambda capture (3 pointers + 1 index + 1-2 scalars)
- Pre-allocate metadata and input arrays, then capture pointer+index into them
- Build input pointer arrays on the stack inside the lambda, not in the capture
- Capture pointers (`const T*`) or references, not copies of large objects
- The capture struct should ideally be trivially copyable — no heap-owning members
- Mark the lambda `mutable` if the called function needs non-const pointers from your struct (see the sketch below)
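A minimal sketch of the mutable case (Cap, buffer, p, and the non-const signature of execute_point are illustrative):
struct Cap { float* buf; long point; };
Cap cap{buffer.data(), p};
taskflow.emplace([cap]() mutable {
  execute_point(&cap);  // without `mutable`, &cap would be a pointer-to-const
});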
Large struct copies: Never copy objects containing heap-allocated members (like
std::vector) into per-task captures. Each copy triggers deep copies of all internal
buffers. Instead, capture by reference or pointer — safe as long as the data outlives
executor.run().wait():
// Prefer reference or pointer — zero cost:
const Config &cfg = configs[i];
taskflow.emplace([&cfg, ...]() { cfg.process(...); });
const Config *cfg_ptr = &configs[i];
taskflow.emplace([cfg_ptr, ...]() { cfg_ptr->process(...); });
Prefer raw pointers or value captures of plain structs over shared_ptr in task
construction loops. Each shared_ptr copy triggers an atomic reference count increment
during construction and decrement during destruction. At 500K tasks, that's 1M atomic
operations. Additionally, make_shared adds a second heap allocation per task on top
of Taskflow's mandatory new Node().
In benchmarks, shared_ptr per task added ~18% to DAG construction overhead vs raw
pointer captures. This is moderate — not catastrophic — but easily avoided:
// Prefer — raw pointer with externally managed lifetime:
TaskData data(/* ... */);
taskflow.emplace([&data]() { data.execute(); });
// Also good — value capture of plain struct:
struct Cap { int x; float* buf; size_t n; };
Cap cap = {42, buffer.data(), buffer.size()};
taskflow.emplace([cap]() { /* use cap */ });
Use shared_ptr when lifetime management genuinely requires it (e.g., dynamically-sized
per-task data). Don't use it simply to avoid thinking about lifetimes — executor.run().wait()
provides a natural lifetime boundary.
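A sketch of the legitimate case (build_payload and consume are illustrative), where each task owns a dynamically sized payload that must outlive the construction scope:
auto payload = std::make_shared<std::vector<double>>(build_payload(i));
taskflow.emplace([payload]() { consume(*payload); });  // ref count keeps the payload alive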
When processing multiple independent graphs, build them all into a single tf::Taskflow:
// Faster — all graphs in one DAG (1 barrier):
tf::Taskflow taskflow;
for (auto& graph : graphs) {
build_tasks(taskflow, graph);
}
executor.run(taskflow).wait();
// Slower — one DAG per graph (N barriers):
for (auto& graph : graphs) {
tf::Taskflow taskflow;
build_tasks(taskflow, graph);
executor.run(taskflow).wait();
}
Independent graphs have no edges between them, so the work-stealing scheduler automatically
interleaves their tasks. Each extra run().wait() is a full-barrier synchronization.
A runtime task provides direct executor access within a running task:
std::atomic<int> counter{0};
taskflow.emplace([&](tf::Runtime& rt) {
for (int i = 0; i < 100; i++) {
rt.silent_async([&]() { counter.fetch_add(1, std::memory_order_relaxed); });
}
rt.corun(); // block until all async tasks from this runtime finish
});