perf(leiden): replace per-row GrB_Col_extract with CSR walk #406

gkorland wants to merge 4 commits into
Conversation
Address Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops. Optimizations (algorithm semantics, tie-breaking, self-loop handling, and the modularity formula are all unchanged):

* Once per outer aggregation level, materialize A_cur into portable CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64 + counting-sort scatter (sketched below). The Phase 1 / Phase 2 inner loops now walk Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract + GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node on every iteration of both phases.
* Compute k_arr[i] from CSR row sums; remove the per-level GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64 loop.
* Build S_mat via a single GrB_Matrix_build_FP64 instead of n_cur GrB_Matrix_setElement_FP64 calls + an implicit GrB_Matrix_wait. Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused for the build inputs.
* Build the output GrB_Vector via GrB_Vector_build_INT64 instead of n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph fallback).
* CSR/tuple buffers grow on demand and are released through LG_FREE_WORK on every error path.

Asymptotic per-level cost in the local-move and refinement phases drops from O(n_cur * iters * GrB-call-overhead) to O(nnz * iters), with no GraphBLAS overhead in the hot loop.

test_leiden still passes (Q = 0.4188 on karate, well above the 0.37 acceptance threshold).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
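A minimal sketch of the per-level CSR materialization described in the first bullet (illustrative only: `materialize_csr` is a hypothetical helper name, and the PR uses a separate n-sized cursor array where this sketch reuses `Ap` as the write cursor):

```c
#include <string.h>
#include "GraphBLAS.h"

// Hypothetical helper sketching the bulk extract + counting-sort scatter:
// one GrB_Matrix_extractTuples_FP64 call per level, an O(nnz) scatter into
// row-major CSR arrays, then degrees k_arr[i] as CSR row sums.
// All buffers are caller-allocated: Ap has n_cur+1 slots, the rest nnz slots.
static GrB_Info materialize_csr
(
    GrB_Matrix A_cur, GrB_Index n_cur,
    GrB_Index *Ap, GrB_Index *Aj, double *Ax,           // CSR outputs
    GrB_Index *I_tup, GrB_Index *J_tup, double *X_tup,  // tuple scratch
    double *k_arr                                       // weighted degrees
)
{
    GrB_Index nnz ;
    GrB_Info info = GrB_Matrix_nvals (&nnz, A_cur) ;
    if (info != GrB_SUCCESS) return (info) ;
    info = GrB_Matrix_extractTuples_FP64 (I_tup, J_tup, X_tup, &nnz, A_cur) ;
    if (info != GrB_SUCCESS) return (info) ;

    // counting sort: histogram row sizes into Ap[1..n_cur], prefix-sum to
    // get row pointers, then scatter each tuple into its row
    memset (Ap, 0, (n_cur + 1) * sizeof (GrB_Index)) ;
    for (GrB_Index e = 0 ; e < nnz ; e++) Ap [I_tup [e] + 1]++ ;
    for (GrB_Index i = 0 ; i < n_cur ; i++) Ap [i+1] += Ap [i] ;
    for (GrB_Index e = 0 ; e < nnz ; e++)
    {
        GrB_Index p = Ap [I_tup [e]]++ ;   // Ap[i] doubles as a write cursor
        Aj [p] = J_tup [e] ;
        Ax [p] = X_tup [e] ;
    }
    // the scatter advanced Ap[i] to the old Ap[i+1]; shift back
    for (GrB_Index i = n_cur ; i > 0 ; i--) Ap [i] = Ap [i-1] ;
    Ap [0] = 0 ;

    // degrees from CSR row sums (no GrB_Matrix_reduce_Monoid needed)
    for (GrB_Index i = 0 ; i < n_cur ; i++)
    {
        double k = 0 ;
        for (GrB_Index p = Ap [i] ; p < Ap [i+1] ; p++) k += Ax [p] ;
        k_arr [i] = k ;
    }
    return (GrB_SUCCESS) ;
}
```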
GomezGab left a comment
From what I can tell, the algorithm doesn't exploit GraphBLAS for speed/parallelism on the most intensive parts. I don't yet understand Leiden enough to give advice as to how to effectively exploit GraphBLAS in these places.
Address review feedback from @GomezGab on PR GraphBLAS#406:

* Replace GrB_Matrix_extractTuples_FP64 + counting-sort scatter with GxB_unload_Matrix_into_Container, giving direct pointer access to the matrix's internal CSR arrays (Ap/Aj/Ax) with no copy (sketched below). Force sparse + row-major + non-iso + 64-bit indices via GrB_Matrix_set_INT32 hints + GrB_wait(GrB_MATERIALIZE), then unload through the container's p/i/x vectors with GxB_Vector_unload. Reload the container before the Phase-3 mxm so A_cur is once again a usable GraphBLAS matrix; null our raw-array pointers after reload to avoid a double-free.
* Compute degrees via GrB_Matrix_reduce_Monoid with GrB_PLUS_FP64 accum on a pre-zero-filled k_vec (handles isolated rows), then GxB_Vector_unload to obtain k_arr. k_vec is recreated at size n_cur each outer level.
* Use GxB_Matrix_build_Scalar with a shared GrB_Scalar (= 1.0) for the S_mat indicator matrix; reuse iota for row indices and reinterpret-cast c_ref (int64_t *) directly as GrB_Index * (values are non-negative community labels), removing the per-level value array and the temporary c_ref copy.
* Use GxB_Vector_load with move semantics to hand the o_comm buffer directly to the output vector (with GxB_FULL sparsity), avoiding the final GrB_Vector_build_INT64 copy.
* Duplicate G->A as GrB_FP64 once at start (after the m == 0 early return) so A_cur is always owned, FP64-typed, and may be unloaded. This is required because G->A may be any type (BOOL on pattern-only Matrix Market inputs, INT*, FP32, ...); typecasting once removes the need for per-edge type handling in the inner loops.

test_leiden on karate.mtx: Q = 0.418803, identical to baseline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
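A condensed sketch of the container round-trip described in the first bullet (assumes SuiteSparse:GraphBLAS ≥ v10 and an existing FP64 matrix `A_cur`; error checking and the non-iso / 64-bit-index hints are elided, so this is not the PR's exact code):

```c
#include "GraphBLAS.h"

GxB_Container con = NULL ;
GrB_Index *Ap = NULL, *Aj = NULL ;
double *Ax = NULL ;
GrB_Type tp, ti, tx ;
uint64_t np, ni, nx, sp, si, sx ;
int hp, hi, hx ;

// hint sparse + row-major so the unloaded arrays are CSR, then materialize
GrB_Matrix_set_INT32 (A_cur, GxB_SPARSE, GxB_SPARSITY_CONTROL) ;
GrB_Matrix_set_INT32 (A_cur, GrB_ROWMAJOR, GrB_STORAGE_ORIENTATION_HINT) ;
GrB_Matrix_wait (A_cur, GrB_MATERIALIZE) ;

// O(1) unload: move A_cur's internals into the container, then move the raw
// arrays out of the container's p/i/x vectors (no copies are made)
GxB_Container_new (&con) ;
GxB_unload_Matrix_into_Container (A_cur, con, NULL) ;
GxB_Vector_unload (con->p, (void **) &Ap, &tp, &np, &sp, &hp, NULL) ;
GxB_Vector_unload (con->i, (void **) &Aj, &ti, &ni, &si, &hi, NULL) ;
GxB_Vector_unload (con->x, (void **) &Ax, &tx, &nx, &sx, &hx, NULL) ;

// ... Phase 1 / Phase 2 hot loops walk Aj[Ap[i]..Ap[i+1]) here ...

// hand everything back so A_cur is a valid matrix for the Phase-3 mxm
GxB_Vector_load (con->p, (void **) &Ap, tp, np, sp, hp, NULL) ;
GxB_Vector_load (con->i, (void **) &Aj, ti, ni, si, hi, NULL) ;
GxB_Vector_load (con->x, (void **) &Ax, tx, nx, sx, hx, NULL) ;
GxB_load_Matrix_from_Container (A_cur, con, NULL) ;
Ap = NULL ; Aj = NULL ; Ax = NULL ;   // owned by A_cur again: no double-free
GxB_Container_free (&con) ;
```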
The previous commit unconditionally used GxB_Container, GxB_Vector_load, GxB_Vector_unload, and GxB_load_Matrix_from_Container, which are only available in SuiteSparse:GraphBLAS v10.0.0 and newer. CI tests against both v9.0.0 and v10.2.0, so the v9 build failed with implicit-declaration errors.

Wrap the Container-based fast path in '#if LAGR_LEIDEN_USE_CONTAINER' (defined to 1 when GxB_IMPLEMENTATION >= GxB_VERSION(10,0,0), else 0; the guard is sketched below) and provide a v9 fallback that materializes CSR via GrB_Matrix_extractTuples_FP64 + counting-sort scatter (the same approach as commit 9a1daf0). Both code paths share the FP64 typecast of G->A into A_agg, the GrB_Matrix reduce for degrees, GxB_Matrix_build_Scalar for the indicator matrix S_mat (build_Scalar exists in v9), and the rest of the algorithm. The macro can be overridden with -DLAGR_LEIDEN_USE_CONTAINER=0/1 to exercise either path on the same SuiteSparse install (used to verify the fallback compiles on the v10 dev machine).

test_leiden on karate.mtx: Q = 0.418803 (unchanged from baseline).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
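The version gate itself can be this small (a sketch following the commit message; the `#ifndef` wrapper is what allows the `-D` override):

```c
#include "GraphBLAS.h"

// Compile the container fast path only on SuiteSparse:GraphBLAS >= 10.0.0;
// GxB_SUITESPARSE_GRAPHBLAS, GxB_IMPLEMENTATION, and GxB_VERSION are defined
// by SuiteSparse's GraphBLAS.h.
#ifndef LAGR_LEIDEN_USE_CONTAINER
    #if defined ( GxB_SUITESPARSE_GRAPHBLAS ) \
        && ( GxB_IMPLEMENTATION >= GxB_VERSION (10,0,0) )
        #define LAGR_LEIDEN_USE_CONTAINER 1
    #else
        #define LAGR_LEIDEN_USE_CONTAINER 0
    #endif
#endif

#if LAGR_LEIDEN_USE_CONTAINER
    // v10+ path: zero-copy GxB_Container unload/load
#else
    // v9 fallback: GrB_Matrix_extractTuples_FP64 + counting-sort scatter
#endif
```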
GomezGab left a comment
As I understand it, we are starting to move away from implementing vanilla GraphBLAS methods and are generally okay returning GrB_NOT_IMPLEMENTED where SuiteSparse is not available and it would be excessively inconvenient to implement with the vanilla methods (i.e., having to build the CSR arrays by hand). @DrTimothyAldenDavis please confirm this.
```c
Ap = (GrB_Index *) pv ;
Aj = (GrB_Index *) iv ;
Ax = (double *) xv ;
#else
```
The fallback should use GrB_unpack when possible, or simply refuse (GrB_NOT_IMPLEMENTED).
We don't need vanilla methods. The entire function can just return GrB_NOT_IMPLEMENTED. I would like to simplify LAGraph and remove all notion of the "vanilla" methods, and just explicitly rely on SuiteSparse GraphBLAS only. That's something for the LAGraph committee to decide, however.
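For reference, one shape the suggested unpack-based fallback could take on v9 (a sketch, not committed code; the GxB pack/unpack API has been available since SuiteSparse v5.1, and the names mirror the PR):

```c
#include "GraphBLAS.h"

GrB_Index *Ap = NULL, *Aj = NULL ;
double *Ax = NULL ;
GrB_Index Ap_size, Aj_size, Ax_size ;
bool iso ;

// zero-copy unpack of A_cur's CSR arrays; passing NULL for 'jumbled' asks
// GraphBLAS to sort the column indices within each row first
GxB_Matrix_unpack_CSR (A_cur, &Ap, &Aj, (void **) &Ax,
    &Ap_size, &Aj_size, &Ax_size, &iso, NULL, NULL) ;

// ... walk Aj[Ap[i]..Ap[i+1]) here; an iso matrix uses Ax[0] for all edges ...

// repack so A_cur is usable again; GraphBLAS retakes ownership of the arrays
GxB_Matrix_pack_CSR (A_cur, &Ap, &Aj, (void **) &Ax,
    Ap_size, Aj_size, Ax_size, iso, false, NULL) ;
```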
Summary
Addresses Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops; this PR removes it without changing algorithm semantics.
Changes
- Once per outer aggregation level, materialize `A_cur` into portable CSR arrays (`Ap`, `Aj`, `Ax`) via a single `GrB_Matrix_extractTuples_FP64` + counting-sort scatter. The Phase 1 / Phase 2 inner loops now walk `Aj[Ap[i]..Ap[i+1])` directly (see the sketch after this list) instead of calling `GrB_Col_extract` + `GrB_Vector_nvals` + `GrB_Vector_extractTuples_FP64` for every node × every inner iteration × both phases (the dominant cost).
- Compute `k_arr[i]` from CSR row sums, removing the per-level `GrB_Matrix_reduce_Monoid` + per-element `GrB_Vector_extractElement_FP64` loop.
- `GrB_Matrix_build_FP64` for `S_mat`: single-call construction instead of `n_cur` `GrB_Matrix_setElement_FP64` calls + an implicit `GrB_Matrix_wait`. Existing scratch buffers (`nbrs_j`, `dirty_list`, `nbrs_v`) are reused for the build inputs.
- `GrB_Vector_build_INT64` for the output: single-call construction instead of `n` `GrB_Vector_setElement_INT64` calls (also fixed in the empty-graph fallback).
- CSR/tuple buffers grow on demand and are released through `LG_FREE_WORK` on every error path.

Algorithm semantics (node iteration order, tie-breaking with `>` vs `>=`, self-loop skipping, modularity formula, Leiden parent-community refinement constraint) are preserved.
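For concreteness, the shape of the new inner scan (illustrative; `comm` is a stand-in name for the community-label array, not necessarily the code's own):

```c
// One CSR row walk replaces GrB_Col_extract + GrB_Vector_nvals +
// GrB_Vector_extractTuples_FP64 per node.
for (GrB_Index i = 0 ; i < n_cur ; i++)
{
    for (GrB_Index p = Ap [i] ; p < Ap [i+1] ; p++)
    {
        GrB_Index j = Aj [p] ;
        if (j == i) continue ;       // self-loops skipped, as before
        double w = Ax [p] ;
        // accumulate w into the tally for neighbor community comm[j], then
        // pick the best move by the unchanged modularity-gain rule
    }
}
```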
Testing

`test_leiden` still passes: Q ≈ 0.42 on Zachary's karate club matches published Louvain/Leiden benchmarks for that graph.
Memory / Performance Impact
- Asymptotic per-level cost in the hot loops drops from `O(n_cur · iters · GrB-call-overhead)` to `O(nnz · iters)`, with no GraphBLAS overhead in the hot loop.
- Scratch memory: `max(nnz)`-sized arrays (`Ap`, `Aj`, `Ax`, `I_tup`, `J_tup`, `X_tup`) plus an `n`-sized cursor and an `iota` index array. Buffers grow on demand and are reused across levels.
- Everything goes through the standard `GrB_*` API.

Related Issues
Follow-up to #401 (community: implement Leiden community detection).