Code conversion recipes, advanced patterns, and large-scale DAG construction guidance.
// Sequential
for (auto& d : data) { process(d); }
// Parallel (iterator-based)
#include <taskflow/algorithm/for_each.hpp>
tf::Executor executor;
tf::Taskflow taskflow;
taskflow.for_each(data.begin(), data.end(), [](int& d) { process(d); });
executor.run(taskflow).wait();
// Parallel (index-based, per-element dispatch)
taskflow.for_each_index(0, N, 2, [](int i) { process(i); }); // [0, N) step 2
Chunked variant — for_each_by_index: When you have a large index range and the
per-element work is relatively cheap, per-element dispatch creates scheduling overhead
that can dominate the useful work. for_each_by_index gives each task a subrange to
iterate in a tight loop, amortizing scheduling cost:
tf::IndexRange<size_t> range(0, vec.size(), 1);
taskflow.for_each_by_index(range, [&](tf::IndexRange<size_t> subrange) {
for (size_t i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
vec[i] = std::tan(vec[i]);
}
});
When to prefer which:
- `for_each_index` works well when per-element work is non-trivial (e.g., a complex computation per pixel, a function call with meaningful cost). The partitioner handles chunking internally.
- `for_each_by_index` is useful when you want explicit control over the inner loop, or when the per-element body is very lightweight and you want to ensure loop vectorization and cache-friendly access within each chunk.
Both accept an optional trailing partitioner argument (see docs/taskflow-performance.md).
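For example, the partitioner goes in the trailing position (a sketch, reusing N and process from above; tf::GuidedPartitioner is one of the built-in partitioners):
taskflow.for_each_index(0, N, 1,
  [](int i) { process(i); },
  tf::GuidedPartitioner()  // alternatives: StaticPartitioner, DynamicPartitioner, RandomPartitioner
);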
Iterator-based reduction:
#include <taskflow/algorithm/reduce.hpp>
int result = std::numeric_limits<int>::max();
taskflow.reduce(data.begin(), data.end(), result,
[](int& l, const auto& r) { return std::min(l, r); });
// transform_reduce: transform each element, then reduce
taskflow.transform_reduce(data.begin(), data.end(), result,
[](int l, int r) { return std::min(l, r); }, // reduce
[](const Data& d) { return d.transform(); }); // transform
Index-based reduction — reduce_by_index: When reducing over an index range (counting
primes, summing computed values, accumulating scores), reduce_by_index gives each worker
a subrange to accumulate into a thread-local partial result, then merges partials at the
end. This eliminates shared-state contention entirely:
size_t count = 0;
taskflow.reduce_by_index(
tf::IndexRange<size_t>(0, N, 1),
count,
[](tf::IndexRange<size_t> subrange, std::optional<size_t> running_total) {
size_t local = running_total ? *running_total : 0;
for (size_t i = subrange.begin(); i < subrange.end(); i += subrange.step_size()) {
local += expensive_check(i);
}
return local;
},
std::plus<size_t>{},
tf::DynamicPartitioner{chunk_size} // optional partitioner
);
The local operator receives a subrange and an optional running total (empty on the first
call to a given worker). It returns the accumulated value for that chunk. The global
operator (std::plus here) merges partial results across workers.
Why this matters for performance: A common alternative is for_each_index with
std::atomic::fetch_add per element — this works correctly but causes all threads to
contend on a single cache line. At high core counts the contention dominates and the
computation effectively serializes. reduce_by_index avoids this entirely by keeping
accumulation thread-local until the final merge.
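For contrast, the contended alternative described above looks like this (a sketch, reusing N and expensive_check from the example — correct, but every increment touches the same cache line):
std::atomic<size_t> count{0};
taskflow.for_each_index(size_t{0}, N, size_t{1}, [&](size_t i) {
  count.fetch_add(expensive_check(i), std::memory_order_relaxed); // all workers contend here
});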
When to prefer which:
- `reduce`/`transform_reduce` — when you already have iterators to a container
- `reduce_by_index` — when reducing over a computed index range, especially with irregular per-element cost. Combine with `DynamicPartitioner` for workloads where some indices are much more expensive than others.
#include <taskflow/algorithm/sort.hpp>
taskflow.sort(strings.begin(), strings.end());
#include <taskflow/algorithm/scan.hpp>
taskflow.inclusive_scan(input.begin(), input.end(), output.begin(), std::plus<int>{});
// exclusive: taskflow.exclusive_scan(first, last, d_first, init, op)
#include <taskflow/algorithm/pipeline.hpp>
const size_t num_lines = 4;
std::array<size_t, num_lines> buffer;
tf::Pipeline pl(num_lines,
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
if (pf.token() == 5) { pf.stop(); return; }
buffer[pf.line()] = pf.token();
}},
tf::Pipe{tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
buffer[pf.line()] += 1;
}},
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
buffer[pf.line()] += 1;
}}
);
taskflow.composed_of(pl);
executor.run(taskflow).wait();
SERIAL stages: one token at a time. PARALLEL stages: concurrent. num_lines = max parallelism.
Taskflow provides three pipeline variants, each suited to different situations:
The example above uses tf::Pipeline, where the number and types of stages are fixed at
compile time via variadic template parameters. Each tf::Pipe stage receives a tf::Pipeflow&
and manages data externally (e.g., via a shared buffer indexed by pf.line()).
This is the simplest option when the pipeline structure is known ahead of time.
When data naturally flows from one stage to the next with different types at each stage,
DataPipeline provides compile-time type safety. Each stage declares its input and output
types, and Taskflow manages the data passing internally:
#include <taskflow/algorithm/data_pipeline.hpp>
tf::DataPipeline pl(num_lines,
tf::make_data_pipe<tf::Pipeflow&, int>(tf::PipeType::SERIAL,
[&](tf::Pipeflow& pf) -> int {
if (pf.token() == N) { pf.stop(); return 0; }
return static_cast<int>(pf.token());
}),
tf::make_data_pipe<int, std::string>(tf::PipeType::PARALLEL,
[](int input) -> std::string {
return std::to_string(input); // receives int, produces string
}),
tf::make_data_pipe<std::string, void>(tf::PipeType::SERIAL,
[&](std::string input) {
results.push_back(std::move(input)); // consumes string
})
);
taskflow.composed_of(pl);
executor.run(taskflow).wait();
Key differences from Pipeline:
- Each stage callable receives the previous stage's output as its first argument (not `Pipeflow&`)
- Stages can optionally take `tf::Pipeflow&` as a second argument for token/line queries
- The first stage receives only `tf::Pipeflow&` (it produces data, doesn't consume)
- Data is stored internally in a per-line variant buffer — no external buffer needed
- The first stage must be `SERIAL`
DataPipeline is a good fit when stages transform data through a typed chain (e.g.,
read → parse → transform → write). Use regular Pipeline when stages share complex
external state or when the data types are uniform.
When the number of pipeline stages is determined at runtime (e.g., configurable filter
chains, user-defined processing pipelines), ScalablePipeline accepts pipes via an
iterator range:
#include <taskflow/algorithm/pipeline.hpp>
using pipe_t = tf::Pipe<std::function<void(tf::Pipeflow&)>>;
std::vector<pipe_t> pipes;
pipes.emplace_back(tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
if (pf.token() == N) { pf.stop(); return; }
buffer[pf.line()] = pf.token();
});
for (size_t i = 0; i < num_filters; i++) {
pipes.emplace_back(tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
buffer[pf.line()] = process(buffer[pf.line()]);
});
}
tf::ScalablePipeline spl(num_lines, pipes.begin(), pipes.end());
taskflow.composed_of(spl);
executor.run(taskflow).wait();
Key properties:
- Pipes are provided as an iterator range `[first, last)` pointing to `tf::Pipe` objects
- The pipe vector must remain valid for the pipeline's lifetime
- Can be reconfigured between runs with `spl.reset(new_first, new_last)` or `spl.reset(new_num_lines, new_first, new_last)` (see the sketch after this list)
- Move-only (not copyable)
- The first stage must be `SERIAL`
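A sketch of reconfiguring between runs, reusing pipes, spl, taskflow, buffer, and executor from the example above (the extra filter is illustrative):
executor.run(taskflow).wait();
pipes.emplace_back(tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
  buffer[pf.line()] = extra_filter(buffer[pf.line()]);  // hypothetical additional stage
});
spl.reset(pipes.begin(), pipes.end());  // pipeline now runs with one more stage
executor.run(taskflow).wait();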
ScalablePipeline is the right choice when pipeline structure varies at runtime. When
the structure is fixed, Pipeline or DataPipeline give better compile-time guarantees
and may allow the compiler to inline stage callables.
Taskflow provides two mechanisms for spawning work dynamically at runtime. They serve different purposes and have different performance characteristics.
For recursive patterns where a function forks child work and waits for results (fibonacci,
merge sort, tree traversal, recursive integration), TaskGroup provides lightweight
cooperative waiting:
#include <taskflow/taskflow.hpp>
// Recursive body — called from within worker context
size_t fibonacci(size_t n, tf::Executor& executor) {
if (n < 2) return n;
size_t a, b;
tf::TaskGroup tg = executor.task_group();
tg.silent_async([n, &a, &executor]() { // fork one child as async task
a = fibonacci(n - 1, executor);
});
b = fibonacci(n - 2, executor); // run other child inline (saves a task)
tg.corun(); // cooperative wait
return a + b;
}
// Entry point — must enter worker context first
size_t fibonacci_entry(tf::Executor& executor, size_t n) {
return executor.async([n, &executor]() {
return fibonacci(n, executor);
}).get();
}
Key properties:
- `corun()` is a cooperative wait — the calling worker steals and executes other tasks from the pool while waiting for its children, so no thread goes idle.
- `task_group()` can only be called from within a worker of the executor. The initial call must enter worker context via `executor.async(...)`.
- For binary recursion, fork one child and run the other inline — this halves the number of tasks created.
- For N-way fan-out (e.g., 10 children in a tree), `silent_async` all children and `corun()`:
auto tg = executor.task_group();
for (int i = 0; i < N; i++) {
  tg.silent_async([&, i]() { results[i] = recurse(i, executor); });
}
tg.corun();
When you need to build a dynamic sub-graph with explicit task handles, edges, and dependencies (not just fork-join), Subflows provide a full DAG-building API:
taskflow.emplace([](tf::Subflow& subflow) {
auto B1 = subflow.emplace([]() { /* subtask 1 */ });
auto B2 = subflow.emplace([]() { /* subtask 2 */ });
auto B3 = subflow.emplace([]() { /* subtask 3 */ });
B1.precede(B3);
B2.precede(B3);
});
Subflows materialize a full task graph with node allocations and edge wiring. They are automatically cleaned up on completion.
Choosing between them:
- TaskGroup is the better fit for recursive fork-join where children are independent and you just need to wait for them all. It's lighter weight — no graph nodes are allocated, and `corun()` keeps workers productive during the wait.
- Subflow is the better fit when you need to express dependencies between dynamically spawned tasks (e.g., B1→B3, B2→B3 above). It gives you `emplace()` + `precede()` to build arbitrary sub-DAGs.
- If you find yourself using `sf.join()` in a recursive function with no inter-task dependencies, `TaskGroup` + `corun()` is likely a better choice (see the sketch below).
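The pattern that last bullet refers to looks like this (a sketch; left, right, and work are illustrative):
taskflow.emplace([&](tf::Subflow& sf) {
  sf.emplace([&]() { left  = work(0); });
  sf.emplace([&]() { right = work(1); });
  sf.join();  // pure fork-join with no inter-task edges; a TaskGroup would be lighter here
});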
Return an int to select which successor runs (by precede() call order):
int counter = 0;
auto init = taskflow.emplace([&](){ counter = 0; });
auto body = taskflow.emplace([&](){ counter++; });
auto cond = taskflow.emplace([&]() -> int {
return counter < 5 ? 0 : 1; // 0 = loop back, 1 = exit
});
auto done = taskflow.emplace([&](){ /* finished */ });
init.precede(body);
body.precede(cond);
cond.precede(body); // index 0: loop back
cond.precede(done); // index 1: exit
Successor ordering matters. Return value 0 = first precede() target, 1 = second, etc.
Returning -1 (or any value outside [0, num_successors)) selects no successor, ending
that execution path.
When source code wraps a parallel computation in a for loop, a natural but suboptimal
translation is to clear and rebuild the graph each iteration:
// Suboptimal — rebuilds graph and re-launches executor N times
for (int j = 0; j < N; j++) {
taskflow.clear();
taskflow.for_each_index(0, size, 1, [&](int i) { compute(i); });
executor.run(taskflow).wait();
}
A condition task eliminates this overhead by cycling the graph internally:
// Better — one graph, one launch, N iterations via condition task
auto body = taskflow.for_each_index(0, size, 1,
[&](int i) { compute(i); }, tf::StaticPartitioner());
auto cond = taskflow.emplace([i = 0]() mutable {
return ++i == N ? -1 : 0; // -1 stops, 0 cycles back to body
});
body.precede(cond);
cond.precede(body);
executor.run(taskflow).wait();
The savings come from avoiding per-iteration graph construction and executor re-entry. This pattern is most valuable when the per-iteration work is fast relative to graph setup, or when N is large. For workloads where the graph structure changes between iterations, the loop + clear approach may be unavoidable.
For cases where the graph doesn't need internal looping logic, the executor provides convenience methods to repeat a taskflow externally:
// Run the taskflow exactly N times
executor.run_n(taskflow, 5).wait();
// Run until a predicate returns true
executor.run_until(taskflow, [&]() { return converged; }).wait();
Both accept an optional callback invoked after all runs complete:
executor.run_n(taskflow, 10, [](){ std::cout << "done\n"; }).wait();
These are simpler than a condition-task cycle when you don't need to interleave the repeat logic with other tasks in the graph. The condition-task approach is more flexible — it can incorporate state from the graph itself and coexist with other tasks and branches in the same DAG.
A standard condition task returns a single int to select one successor. A
multi-condition task returns tf::SmallVector<int> to activate multiple successors
simultaneously:
auto task = taskflow.emplace([]() -> tf::SmallVector<int> {
return {0, 2}; // fire the 1st and 3rd successors
});
task.precede(A, B, C); // A = index 0, B = index 1, C = index 2
// Both A and C will run; B will not
Taskflow automatically detects the return type — returning int creates a single-condition
task, returning tf::SmallVector<int> creates a multi-condition task. This is useful for
fork-like control flow where a decision node needs to trigger a subset of its successors
based on runtime conditions.
tf::Taskflow module_tf("module");
module_tf.emplace([]() { /* work */ });
tf::Taskflow main_tf("main");
auto setup = main_tf.emplace([]() { /* setup */ });
auto module = main_tf.composed_of(module_tf); // embed module
auto finish = main_tf.emplace([]() { /* finish */ });
setup.precede(module);
module.precede(finish);
The embedded taskflow must outlive the parent.
// Fire-and-forget
std::future<int> result = executor.async([]() { return 42; });
executor.silent_async([]() { /* no future returned */ });
executor.wait_for_all();
// Dependent async — ordering without a full taskflow
auto [A, fuA] = executor.dependent_async([]() { printf("A\n"); });
auto [B, fuB] = executor.dependent_async([]() { printf("B\n"); }, A);
auto [C, fuC] = executor.dependent_async([]() { printf("C\n"); }, A);
auto [D, fuD] = executor.dependent_async([]() { printf("D\n"); }, B, C);
fuD.get();
When code is already running inside an executor worker (e.g., inside a task lambda or
after executor.async()), calling executor.run(other_taskflow).wait() would block the
worker thread. executor.corun() provides a cooperative alternative:
taskflow.emplace([&](tf::Runtime& rt) {
tf::Taskflow other;
other.emplace([]() { /* work */ });
rt.executor().corun(other); // cooperative — worker steals while waiting
});
corun(target) accepts any object with a Graph& graph() method (including tf::Taskflow).
The calling worker participates in work-stealing while the target graph executes, preventing
thread starvation. It can only be called from within a worker context — calling it from
outside an executor (e.g., from main()) will throw an exception.
There is also executor.corun_until(predicate) for waiting on arbitrary conditions:
taskflow.emplace([&](tf::Runtime& rt) {
auto fu = std::async([]() { /* external work */ });
rt.executor().corun_until([&]() {
return fu.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
});
});
The worker keeps stealing and executing tasks from the pool while periodically checking the predicate, rather than blocking idle.
tf::Semaphore semaphore(1); // allow at most 1 concurrent task
auto A = taskflow.emplace([]() { /* critical section */ });
A.acquire(semaphore);
A.release(semaphore);
When algorithm patterns don't apply (simulations, iterative solvers, HPC), you build
manual task graphs with emplace() + precede(). At scale (100K+ tasks), this has
unique correctness and performance considerations.
Track which task last wrote to each memory location for correct precede() edges:
tf::Executor executor(num_workers);
tf::Taskflow taskflow;
// Flat vector for O(1) lookup — index by memory location ID
std::vector<tf::Task> last_writer(num_locations);
std::vector<bool> has_writer(num_locations, false);
for (/* each task */) {
int out_loc = /* output location */;
std::vector<int> in_locs = /* input locations */;
tf::Task task = taskflow.emplace([/* capture */]() { /* body */ });
// RAW: wait for producer of each input
for (int in : in_locs) {
if (has_writer[in]) last_writer[in].precede(task);
}
// WAW: wait for previous writer of same output
if (has_writer[out_loc]) last_writer[out_loc].precede(task);
last_writer[out_loc] = task;
has_writer[out_loc] = true;
}
executor.run(taskflow).wait();
Strongly prefer flat vectors over std::map/std::unordered_map for these lookups.
Tree-based containers add O(log n) pointer-chasing per lookup with cache misses at each
level. In benchmarks, std::map for slot tracking added ~44% overhead to DAG construction
vs flat-vector indexing. If your location IDs aren't contiguous integers, consider a
translation step to map them to dense indices rather than using a hash map directly.
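A sketch of such a translation pass (task_descs and its fields are illustrative); the hash map is consulted once per location here, in a preprocessing step, not in the hot construction loop:
std::unordered_map<long, int> to_dense;
int next_id = 0;
auto densify = [&](long raw) {
  auto [it, inserted] = to_dense.try_emplace(raw, next_id);
  if (inserted) next_id++;
  return it->second;
};
for (auto& td : task_descs) {              // rewrite descriptors to carry dense indices
  td.out_loc = densify(td.out_loc);
  for (auto& in : td.in_locs) in = densify(in);
}
std::vector<tf::Task> last_writer(next_id); // the construction loop now uses O(1) flat indexing
std::vector<bool> has_writer(next_id, false);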
WAR Dependencies — The Hidden Correctness Bug
The pattern above handles RAW and WAW but misses WAR (Write-After-Read). When tasks
share physical storage reused across iterations (circular buffers, double-buffering,
field % nb_fields indexing), a new writer can overwrite data that earlier tasks still need.
This produces silently wrong results — no crash, no data race detector warning.
Fix: Track readers since the last writer. Before a new write, add edges from all readers:
std::vector<std::vector<tf::Task>> tile_readers(num_locations);
// When creating a task that READS location `in`:
if (has_writer[in]) last_writer[in].precede(task); // RAW
tile_readers[in].push_back(task);
// When creating a task that WRITES location `out`:
for (auto &reader : tile_readers[out]) reader.precede(task); // WAR
tile_readers[out].clear();
if (has_writer[out]) last_writer[out].precede(task); // WAW
last_writer[out] = task;
has_writer[out] = true;
When you need WAR tracking: any time the same physical memory is written at different logical timesteps. Common: circular buffers, double-buffering, in-place iterative updates.
When you don't: every task writes to a unique location that no future task overwrites.
| OpenMP | Taskflow equivalent |
|---|---|
| `depend(in: x)` | RAW: `last_writer[x].precede(task)` |
| `depend(out: x)` | WAW + WAR: precede from last writer AND all readers of `x` |
| `depend(inout: x)` | Same as `out` — both reads and writes `x` |
OpenMP handles RAW, WAW, and WAR automatically. With Taskflow, you track all three explicitly using the last-writer + tile-readers pattern above.
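As an illustration, here is how a single OpenMP task with in and inout dependences maps onto the tracking structures above (a sketch; kernel is hypothetical, and a and x stand for dense location indices):
// OpenMP:
//   #pragma omp task depend(in: a) depend(inout: x)
//   { kernel(a, x); }
// Taskflow, using the last_writer / tile_readers pattern:
tf::Task t = taskflow.emplace([&]() { kernel(a, x); });
if (has_writer[a]) last_writer[a].precede(t);   // in: RAW on a
tile_readers[a].push_back(t);                   // record t as a reader of a
for (auto& r : tile_readers[x]) r.precede(t);   // inout: WAR on x
tile_readers[x].clear();
if (has_writer[x]) last_writer[x].precede(t);   // inout: WAW (covers the read too) on x
last_writer[x] = t;
has_writer[x] = true;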
A monolithic DAG (single tf::Taskflow, one executor.run().wait()) is generally faster
than batching because each run().wait() is a full barrier where all workers synchronize.
Monolithic lets the work-stealing scheduler see all tasks at once for better load balancing.
In benchmarks across 300K-5M tasks at 80 cores, monolithic was consistently 20-42% faster than batched approaches. However, monolithic memory usage scales linearly at roughly 0.5 GB per 1M tasks for DAG metadata. At 5M tasks, expect 2.5-3.5 GB.
Batched pattern (for memory-constrained environments):
tf::Executor executor(num_workers);
tf::Taskflow taskflow;
int batch_size = 32; // timesteps per batch
for (long y = 0; y < total_timesteps; y += batch_size) {
taskflow.clear();
std::fill(has_writer.begin(), has_writer.end(), false);
for (auto &v : tile_readers) v.clear();
long end = std::min(y + batch_size, total_timesteps);
for (long t = y; t < end; t++) {
// build tasks for this timestep, wire dependencies
}
executor.run(taskflow).wait();
}
Batch size 20-64 is a reasonable starting point. Larger batches give more cross-timestep parallelism; smaller batches use less memory. Profile to find the sweet spot for your workload.
Instead of storing all task handles (tasks[timestep][point], O(timesteps × width) memory),
use prev/curr vectors and swap each timestep:
std::vector<tf::Task> prev_tasks(max_width), curr_tasks(max_width);
std::vector<bool> prev_valid(max_width, false), curr_valid(max_width, false);
// after each timestep:
std::swap(prev_tasks, curr_tasks);
std::swap(prev_valid, curr_valid);
This reduces task handle memory from O(timesteps × width) to O(width). Note that tf::Task
is just a Node* pointer (8 bytes), so the full array at 5M tasks is only ~40MB — well
within L3 cache on server CPUs. Rolling windows help RSS but don't improve construction speed.
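A sketch of how the rolling window is used inside the per-timestep construction loop (the stencil-style neighbor set is illustrative):
for (long p = 0; p < width; p++) {
  tf::Task t = taskflow.emplace([/* capture */]() { /* body */ });
  // dependencies only reach back one timestep, so prev_tasks is all that's needed
  for (long q : {p - 1, p, p + 1}) {
    if (q >= 0 && q < width && prev_valid[q]) prev_tasks[q].precede(t);
  }
  curr_tasks[p] = t;
  curr_valid[p] = true;
}
// then swap prev/curr as shown above before building the next timestep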
At millions of tasks, what you capture in each lambda affects construction cost.
The std::function SBO threshold is only 16 bytes (libstdc++), so most captures
trigger a second heap allocation regardless. But capture size still matters — copying
240 bytes per task is measurably slower than 48 bytes due to memcpy cost and L1/L2 cache
pollution during single-threaded construction. Target ≤48 bytes per capture.
Best pattern — pointer+index into pre-allocated metadata:
// Pre-allocate metadata and input storage BEFORE the construction loop
struct TaskMeta {
const TaskGraph *graph;
char *output_ptr;
char *buf_base;
long timestep, point;
size_t out_bytes, scratch_bytes;
int n_inputs, input_offset; // offset into flat input array
};
std::vector<TaskMeta> all_meta;
all_meta.reserve(total_tasks); // pre-allocate
std::vector<long> all_input_slots;
all_input_slots.reserve(total_tasks * 4); // estimate
// In the construction loop:
int idx = (int)all_meta.size();
all_meta.push_back({&graph, out_ptr, base, t, p, ...});
// ... push input slots into all_input_slots ...
// Lambda captures only pointers + one index (~24 bytes, trivially copyable)
auto *meta_arr = all_meta.data();
auto *slots_arr = all_input_slots.data();
taskflow.emplace([meta_arr, slots_arr, idx]() {
const TaskMeta &m = meta_arr[idx]; // reconstruct everything from metadata
const char *in_ptrs[256]; // build input arrays ON THE STACK
for (int i = 0; i < m.n_inputs; i++)
in_ptrs[i] = m.buf_base + slots_arr[m.input_offset + i] * m.out_bytes;
// ... call execute_point ...
});
Anti-pattern — fat struct with embedded arrays (~200+ bytes per capture):
// DON'T: large fixed arrays bloat every capture
struct TaskCapture {
const char *input_ptrs[64]; // 512 bytes!
size_t input_sizes[64]; // 512 bytes!
// ... plus metadata fields
}; // 1KB+ per task × millions of tasks = slow construction + cache thrashing
Avoid per-task heap allocations:
- Capturing `std::vector` by value → heap alloc for its internal buffer
- Copying structs containing vectors → each vector's buffer is heap-allocated
- `std::shared_ptr` captures → control block is heap-allocated
Taskflow already does one mandatory heap allocation per emplace() (new Node). Each
additional allocation adds ~50-150ns. At 1M tasks with a struct containing 3 vectors
captured by value, that's 3M extra allocations on top of the mandatory 1M.
Rules of thumb:
- Target ≤48 bytes per lambda capture (3 pointers + 1 index + 1-2 scalars)
- Pre-allocate metadata and input arrays, then capture pointer+index into them
- Build input pointer arrays on the stack inside the lambda, not in the capture
- Capture pointers (`const T*`) or references, not copies of large objects
- The capture struct should ideally be trivially copyable — no heap-owning members
- Mark the lambda `mutable` if the called function needs non-const pointers from your struct (see the sketch below)
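A minimal sketch of the mutable case (Cap, buffer, p, and the non-const signature of execute_point are illustrative):
struct Cap { float* buf; long point; };
Cap cap{buffer.data(), p};
taskflow.emplace([cap]() mutable {
  execute_point(&cap);  // without `mutable`, &cap would be a pointer-to-const
});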
Large struct copies: Never copy objects containing heap-allocated members (like
std::vector) into per-task captures. Each copy triggers deep copies of all internal
buffers. Instead, capture by reference or pointer — safe as long as the data outlives
executor.run().wait():
// Prefer reference or pointer — zero cost:
const Config &cfg = configs[i];
taskflow.emplace([&cfg, ...]() { cfg.process(...); });
const Config *cfg_ptr = &configs[i];
taskflow.emplace([cfg_ptr, ...]() { cfg_ptr->process(...); });
Prefer raw pointers or value captures of plain structs over shared_ptr in task
construction loops. Each shared_ptr copy triggers an atomic reference count increment
during construction and decrement during destruction. At 500K tasks, that's 1M atomic
operations. Additionally, make_shared adds a second heap allocation per task on top
of Taskflow's mandatory new Node().
In benchmarks, shared_ptr per task added ~18% to DAG construction overhead vs raw
pointer captures. This is moderate — not catastrophic — but easily avoided:
// Prefer — raw pointer with externally managed lifetime:
TaskData data(/* ... */);
taskflow.emplace([&data]() { data.execute(); });
// Also good — value capture of plain struct:
struct Cap { int x; float* buf; size_t n; };
Cap cap = {42, buffer.data(), buffer.size()};
taskflow.emplace([cap]() { /* use cap */ });
Use shared_ptr when lifetime management genuinely requires it (e.g., dynamically-sized
per-task data). Don't use it simply to avoid thinking about lifetimes — executor.run().wait()
provides a natural lifetime boundary.
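A sketch of the legitimate case (build_payload and consume are illustrative), where each task owns a dynamically sized payload that must outlive the construction scope:
auto payload = std::make_shared<std::vector<double>>(build_payload(i));
taskflow.emplace([payload]() { consume(*payload); });  // ref count keeps the payload alive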
When processing multiple independent graphs, build them all into a single tf::Taskflow:
// Faster — all graphs in one DAG (1 barrier):
tf::Taskflow taskflow;
for (auto& graph : graphs) {
build_tasks(taskflow, graph);
}
executor.run(taskflow).wait();
// Slower — one DAG per graph (N barriers):
for (auto& graph : graphs) {
tf::Taskflow taskflow;
build_tasks(taskflow, graph);
executor.run(taskflow).wait();
}
Independent graphs have no edges between them, so the work-stealing scheduler automatically
interleaves their tasks. Each extra run().wait() is a full-barrier synchronization.
A runtime task provides direct executor access within a running task:
std::atomic<int> counter{0};
taskflow.emplace([&](tf::Runtime& rt) {
for (int i = 0; i < 100; i++) {
rt.silent_async([&]() { counter.fetch_add(1, std::memory_order_relaxed); });
}
rt.corun(); // block until all async tasks from this runtime finish
});