perf(leiden): replace per-row GrB_Col_extract with CSR walk #406

gkorland wants to merge 4 commits into
Conversation
Address Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops. Optimizations (algorithm semantics, tie-breaking, self-loop handling, and the modularity formula are all unchanged):

* Once per outer aggregation level, materialize A_cur into portable CSR arrays (Ap, Aj, Ax) via a single GrB_Matrix_extractTuples_FP64 + counting-sort scatter (sketched below). The Phase 1 / Phase 2 inner loops now walk Aj[Ap[i]..Ap[i+1]) directly instead of calling GrB_Col_extract + GrB_Vector_nvals + GrB_Vector_extractTuples_FP64 for every node on every iteration of both phases.
* Compute k_arr[i] from CSR row sums; remove the per-level GrB_Matrix_reduce_Monoid + per-element GrB_Vector_extractElement_FP64 loop.
* Build S_mat via a single GrB_Matrix_build_FP64 instead of n_cur GrB_Matrix_setElement_FP64 calls + an implicit GrB_Matrix_wait. Existing scratch buffers (nbrs_j, dirty_list, nbrs_v) are reused for the build inputs.
* Build the output GrB_Vector via GrB_Vector_build_INT64 instead of n GrB_Vector_setElement_INT64 calls (also fixed in the empty-graph fallback).
* CSR/tuple buffers grow on demand and are released through LG_FREE_WORK on every error path.

Asymptotic per-level cost in the local-move and refinement phases drops from O(n_cur * iters * GrB-call-overhead) to O(nnz * iters), with no GraphBLAS overhead in the hot loop.

test_leiden still passes (Q = 0.4188 on karate, well above the 0.37 acceptance threshold).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
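A minimal sketch of the per-level CSR materialization described in the first bullet (illustrative only: `materialize_csr` is a hypothetical helper name, and the PR uses a separate n-sized cursor array where this sketch reuses `Ap` as the write cursor):

```c
#include <string.h>
#include "GraphBLAS.h"

// Hypothetical helper sketching the bulk extract + counting-sort scatter:
// one GrB_Matrix_extractTuples_FP64 call per level, an O(nnz) scatter into
// row-major CSR arrays, then degrees k_arr[i] as CSR row sums.
// All buffers are caller-allocated: Ap has n_cur+1 slots, the rest nnz slots.
static GrB_Info materialize_csr
(
    GrB_Matrix A_cur, GrB_Index n_cur,
    GrB_Index *Ap, GrB_Index *Aj, double *Ax,           // CSR outputs
    GrB_Index *I_tup, GrB_Index *J_tup, double *X_tup,  // tuple scratch
    double *k_arr                                       // weighted degrees
)
{
    GrB_Index nnz ;
    GrB_Info info = GrB_Matrix_nvals (&nnz, A_cur) ;
    if (info != GrB_SUCCESS) return (info) ;
    info = GrB_Matrix_extractTuples_FP64 (I_tup, J_tup, X_tup, &nnz, A_cur) ;
    if (info != GrB_SUCCESS) return (info) ;

    // counting sort: histogram row sizes into Ap[1..n_cur], prefix-sum to
    // get row pointers, then scatter each tuple into its row
    memset (Ap, 0, (n_cur + 1) * sizeof (GrB_Index)) ;
    for (GrB_Index e = 0 ; e < nnz ; e++) Ap [I_tup [e] + 1]++ ;
    for (GrB_Index i = 0 ; i < n_cur ; i++) Ap [i+1] += Ap [i] ;
    for (GrB_Index e = 0 ; e < nnz ; e++)
    {
        GrB_Index p = Ap [I_tup [e]]++ ;   // Ap[i] doubles as a write cursor
        Aj [p] = J_tup [e] ;
        Ax [p] = X_tup [e] ;
    }
    // the scatter advanced Ap[i] to the old Ap[i+1]; shift back
    for (GrB_Index i = n_cur ; i > 0 ; i--) Ap [i] = Ap [i-1] ;
    Ap [0] = 0 ;

    // degrees from CSR row sums (no GrB_Matrix_reduce_Monoid needed)
    for (GrB_Index i = 0 ; i < n_cur ; i++)
    {
        double k = 0 ;
        for (GrB_Index p = Ap [i] ; p < Ap [i+1] ; p++) k += Ax [p] ;
        k_arr [i] = k ;
    }
    return (GrB_SUCCESS) ;
}
```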
GomezGab left a comment
From what I can tell, the algorithm doesn't exploit GraphBLAS for speed/parallelism on the most intensive parts. I don't yet understand Leiden enough to give advice as to how to effectively exploit GraphBLAS in these places.
Address review feedback from @GomezGab on PR GraphBLAS#406:

* Replace GrB_Matrix_extractTuples_FP64 + counting-sort scatter with GxB_unload_Matrix_into_Container, giving direct pointer access to the matrix's internal CSR arrays (Ap/Aj/Ax) with no copy (sketched below). Force sparse + row-major + non-iso + 64-bit indices via GrB_Matrix_set_INT32 hints + GrB_wait(GrB_MATERIALIZE), then unload through the container's p/i/x vectors with GxB_Vector_unload. Reload the container before the Phase-3 mxm so A_cur is once again a usable GraphBLAS matrix; null our raw-array pointers after reload to avoid a double-free.
* Compute degrees via GrB_Matrix_reduce_Monoid with GrB_PLUS_FP64 accum on a pre-zero-filled k_vec (handles isolated rows), then GxB_Vector_unload to obtain k_arr. k_vec is recreated at size n_cur each outer level.
* Use GxB_Matrix_build_Scalar with a shared GrB_Scalar (= 1.0) for the S_mat indicator matrix; reuse iota for row indices and reinterpret-cast c_ref (int64_t *) directly as GrB_Index * (values are non-negative community labels), removing the per-level value array and the temporary c_ref copy.
* Use GxB_Vector_load with move semantics to hand the o_comm buffer directly to the output vector (with GxB_FULL sparsity), avoiding the final GrB_Vector_build_INT64 copy.
* Duplicate G->A as GrB_FP64 once at start (after the m == 0 early return) so A_cur is always owned, FP64-typed, and may be unloaded. This is required because G->A may be any type (BOOL on pattern-only Matrix Market inputs, INT*, FP32, ...); typecasting once removes the need for per-edge type handling in the inner loops.

test_leiden on karate.mtx: Q = 0.418803, identical to baseline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
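A condensed sketch of the container round-trip described in the first bullet (assumes SuiteSparse:GraphBLAS ≥ v10 and an existing FP64 matrix `A_cur`; error checking and the non-iso / 64-bit-index hints are elided, so this is not the PR's exact code):

```c
#include "GraphBLAS.h"

GxB_Container con = NULL ;
GrB_Index *Ap = NULL, *Aj = NULL ;
double *Ax = NULL ;
GrB_Type tp, ti, tx ;
uint64_t np, ni, nx, sp, si, sx ;
int hp, hi, hx ;

// hint sparse + row-major so the unloaded arrays are CSR, then materialize
GrB_Matrix_set_INT32 (A_cur, GxB_SPARSE, GxB_SPARSITY_CONTROL) ;
GrB_Matrix_set_INT32 (A_cur, GrB_ROWMAJOR, GrB_STORAGE_ORIENTATION_HINT) ;
GrB_Matrix_wait (A_cur, GrB_MATERIALIZE) ;

// O(1) unload: move A_cur's internals into the container, then move the raw
// arrays out of the container's p/i/x vectors (no copies are made)
GxB_Container_new (&con) ;
GxB_unload_Matrix_into_Container (A_cur, con, NULL) ;
GxB_Vector_unload (con->p, (void **) &Ap, &tp, &np, &sp, &hp, NULL) ;
GxB_Vector_unload (con->i, (void **) &Aj, &ti, &ni, &si, &hi, NULL) ;
GxB_Vector_unload (con->x, (void **) &Ax, &tx, &nx, &sx, &hx, NULL) ;

// ... Phase 1 / Phase 2 hot loops walk Aj[Ap[i]..Ap[i+1]) here ...

// hand everything back so A_cur is a valid matrix for the Phase-3 mxm
GxB_Vector_load (con->p, (void **) &Ap, tp, np, sp, hp, NULL) ;
GxB_Vector_load (con->i, (void **) &Aj, ti, ni, si, hi, NULL) ;
GxB_Vector_load (con->x, (void **) &Ax, tx, nx, sx, hx, NULL) ;
GxB_load_Matrix_from_Container (A_cur, con, NULL) ;
Ap = NULL ; Aj = NULL ; Ax = NULL ;   // owned by A_cur again: no double-free
GxB_Container_free (&con) ;
```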
The previous commit unconditionally used GxB_Container, GxB_Vector_load, GxB_Vector_unload, and GxB_load_Matrix_from_Container, which are only available in SuiteSparse:GraphBLAS v10.0.0 and newer. CI tests against both v9.0.0 and v10.2.0, so the v9 build failed with implicit-declaration errors.

Wrap the Container-based fast path in '#if LAGR_LEIDEN_USE_CONTAINER' (defined to 1 when GxB_IMPLEMENTATION >= GxB_VERSION(10,0,0), else 0; the guard is sketched below) and provide a v9 fallback that materializes CSR via GrB_Matrix_extractTuples_FP64 + counting-sort scatter (the same approach as commit 9a1daf0). Both code paths share the FP64 typecast of G->A into A_agg, the GrB_Matrix reduce for degrees, GxB_Matrix_build_Scalar for the indicator matrix S_mat (build_Scalar exists in v9), and the rest of the algorithm. The macro can be overridden with -DLAGR_LEIDEN_USE_CONTAINER=0/1 to exercise either path on the same SuiteSparse install (used to verify the fallback compiles on the v10 dev machine).

test_leiden on karate.mtx: Q = 0.418803 (unchanged from baseline).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
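The version gate itself can be this small (a sketch following the commit message; the `#ifndef` wrapper is what allows the `-D` override):

```c
#include "GraphBLAS.h"

// Compile the container fast path only on SuiteSparse:GraphBLAS >= 10.0.0;
// GxB_SUITESPARSE_GRAPHBLAS, GxB_IMPLEMENTATION, and GxB_VERSION are defined
// by SuiteSparse's GraphBLAS.h.
#ifndef LAGR_LEIDEN_USE_CONTAINER
    #if defined ( GxB_SUITESPARSE_GRAPHBLAS ) \
        && ( GxB_IMPLEMENTATION >= GxB_VERSION (10,0,0) )
        #define LAGR_LEIDEN_USE_CONTAINER 1
    #else
        #define LAGR_LEIDEN_USE_CONTAINER 0
    #endif
#endif

#if LAGR_LEIDEN_USE_CONTAINER
    // v10+ path: zero-copy GxB_Container unload/load
#else
    // v9 fallback: GrB_Matrix_extractTuples_FP64 + counting-sort scatter
#endif
```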
GomezGab left a comment
As I understand it, we are starting to move away from implementing vanilla GraphBLAS methods and are generally okay returning GrB_NOT_IMPLEMENTED where SuiteSparse is not available and it would be excessively inconvenient to implement with the vanilla methods (i.e., having to build the CSR arrays by hand). @DrTimothyAldenDavis please confirm this.
```c
Ap = (GrB_Index *) pv ;
Aj = (GrB_Index *) iv ;
Ax = (double *) xv ;
#else
```
The fallback should use GrB_unpack when possible, or simply refuse (GrB_NOT_IMPLEMENTED).
We don't need vanilla methods. The entire function can just return GrB_NOT_IMPLEMENTED. I would like to simplify LAGraph and remove all notion of the "vanilla" methods, and just explicitly rely on SuiteSparse GraphBLAS only. That's something for the LAGraph committee to decide, however.
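For reference, one shape the suggested unpack-based fallback could take on v9 (a sketch, not committed code; the GxB pack/unpack API has been available since SuiteSparse v5.1, and the names mirror the PR):

```c
#include "GraphBLAS.h"

GrB_Index *Ap = NULL, *Aj = NULL ;
double *Ax = NULL ;
GrB_Index Ap_size, Aj_size, Ax_size ;
bool iso ;

// zero-copy unpack of A_cur's CSR arrays; passing NULL for 'jumbled' asks
// GraphBLAS to sort the column indices within each row first
GxB_Matrix_unpack_CSR (A_cur, &Ap, &Aj, (void **) &Ax,
    &Ap_size, &Aj_size, &Ax_size, &iso, NULL, NULL) ;

// ... walk Aj[Ap[i]..Ap[i+1]) here; an iso matrix uses Ax[0] for all edges ...

// repack so A_cur is usable again; GraphBLAS retakes ownership of the arrays
GxB_Matrix_pack_CSR (A_cur, &Ap, &Aj, (void **) &Ax,
    Ap_size, Aj_size, Ax_size, iso, false, NULL) ;
```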
Summary
Addresses Dr. Davis's review feedback that the initial Leiden implementation "doesn't exploit much of GraphBLAS" and is therefore slow. The dominant cost was per-node GraphBLAS overhead inside the Phase 1 / Phase 2 hot loops; this PR removes it without changing algorithm semantics.
Changes
- Once per outer aggregation level, materialize `A_cur` into portable CSR arrays (`Ap`, `Aj`, `Ax`) via a single `GrB_Matrix_extractTuples_FP64` + counting-sort scatter. The Phase 1 / Phase 2 inner loops now walk `Aj[Ap[i]..Ap[i+1])` directly (see the sketch after this list) instead of calling `GrB_Col_extract` + `GrB_Vector_nvals` + `GrB_Vector_extractTuples_FP64` for every node × every inner iteration × both phases (the dominant cost).
- Compute `k_arr[i]` from CSR row sums, removing the per-level `GrB_Matrix_reduce_Monoid` + per-element `GrB_Vector_extractElement_FP64` loop.
- `GrB_Matrix_build_FP64` for `S_mat`: single-call construction instead of `n_cur` `GrB_Matrix_setElement_FP64` calls + an implicit `GrB_Matrix_wait`. Existing scratch buffers (`nbrs_j`, `dirty_list`, `nbrs_v`) are reused for the build inputs.
- `GrB_Vector_build_INT64` for the output: single-call construction instead of `n` `GrB_Vector_setElement_INT64` calls (also fixed in the empty-graph fallback).
- CSR/tuple buffers grow on demand and are released through `LG_FREE_WORK` on every error path.

Algorithm semantics (node iteration order, tie-breaking with `>` vs `>=`, self-loop skipping, modularity formula, Leiden parent-community refinement constraint) are preserved.
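For concreteness, the shape of the new inner scan (illustrative; `comm` is a stand-in name for the community-label array, not necessarily the code's own):

```c
// One CSR row walk replaces GrB_Col_extract + GrB_Vector_nvals +
// GrB_Vector_extractTuples_FP64 per node.
for (GrB_Index i = 0 ; i < n_cur ; i++)
{
    for (GrB_Index p = Ap [i] ; p < Ap [i+1] ; p++)
    {
        GrB_Index j = Aj [p] ;
        if (j == i) continue ;       // self-loops skipped, as before
        double w = Ax [p] ;
        // accumulate w into the tally for neighbor community comm[j], then
        // pick the best move by the unchanged modularity-gain rule
    }
}
```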
Testing

`test_leiden` still passes: Q ≈ 0.42 on Zachary's karate club matches published Louvain/Leiden benchmarks for that graph.
Memory / Performance Impact
- Asymptotic per-level cost in the hot loops drops from `O(n_cur · iters · GrB-call-overhead)` to `O(nnz · iters)`, with no GraphBLAS overhead in the hot loop.
- Scratch memory: `max(nnz)`-sized arrays (`Ap`, `Aj`, `Ax`, `I_tup`, `J_tup`, `X_tup`) plus an `n`-sized cursor and an `iota` index array. Buffers grow on demand and are reused across levels.
- Everything goes through the standard `GrB_*` API.

Related Issues
Follow-up to #401 (community: implement Leiden community detection).