Skip to content

Latest commit

 

History

History
541 lines (410 loc) · 26.2 KB

File metadata and controls

541 lines (410 loc) · 26.2 KB

AGENTS.md — Pathrex

Project Overview

Pathrex is a Rust library and CLI tool for benchmarking queries on edge-labeled graphs constrained by regular languages and context-free languages. It uses SuiteSparse:GraphBLAS (via LAGraph) for sparse Boolean matrix operations and decomposes a graph by edge label into one Boolean adjacency matrix per label.

Repository Layout

pathrex/
├── Cargo.toml                  # Crate manifest (edition 2024)
├── build.rs                    # Links LAGraph + LAGraphX; optionally regenerates FFI bindings
├── src/
│   ├── lib.rs                  # Modules: eval, formats, graph, rpq, sparql, utils, lagraph_sys
│   ├── main.rs                 # Binary entry point (placeholder)
│   ├── lagraph_sys.rs          # FFI module — includes generated bindings
│   ├── lagraph_sys_generated.rs# Bindgen output (checked in, regenerated in CI)
│   ├── utils.rs                # Public helpers: CountingBuilder, CountOutput, VecSource,
│   │                           #   grb_ok! and la_ok! macros, build_graph
│   ├── graph/
│   │   ├── mod.rs              # Core traits (GraphBuilder, GraphDecomposition, GraphSource,
│   │   │                       #   Backend, Graph<B>), error types, RAII wrappers, GrB init
│   │   └── inmemory.rs         # InMemory marker, InMemoryBuilder, InMemoryGraph
│   ├── eval/
│   │   └── mod.rs              # Evaluator, PreparedEvaluator, ResultCount traits
│   ├── rpq/
│   │   ├── mod.rs              # RPQ query types, RpqError, RPQ marker subtraits
│   │   ├── nfarpq.rs           # NfaRpqEvaluator (LAGraph_RegularPathQuery)
│   │   └── rpqmatrix.rs        # Matrix-plan RPQ evaluator
│   ├── sparql/
│   │   └── mod.rs              # parse_rpq / extract_rpq → RpqQuery (spargebra)
│   └── formats/
│       ├── mod.rs              # FormatError enum, re-exports
│       ├── csv.rs              # Csv<R> — CSV → Edge iterator (CsvConfig, ColumnSpec)
│       ├── mm.rs               # MatrixMarket directory loader (vertices.txt, edges.txt, *.txt)
│       └── rdf.rs              # Rdf — unified RDF parser (N-Triples, Turtle) → Edge iterator
├── tests/
│   ├── inmemory_tests.rs       # Integration tests for InMemoryBuilder / InMemoryGraph
│   ├── mm_tests.rs             # Integration tests for MatrixMarket format
│   ├── nfarpq_tests.rs         # Integration tests for NfaRpqEvaluator
│   └── rpqmatrix_tests.rs      # Integration tests for matrix-plan RPQ evaluator
├── deps/
│   └── LAGraph/                # Git submodule (SparseLinearAlgebra/LAGraph)
└── .github/workflows/ci.yml   # CI: build GraphBLAS + LAGraph, cargo build & test

Build & Dependencies

System prerequisites

Dependency Purpose
SuiteSparse:GraphBLAS Sparse matrix engine (libgraphblas)
LAGraph Graph algorithm library on top of GraphBLAS (liblagraph)
cmake Building LAGraph from source
libclang-dev / clang Required by bindgen when regenerate-bindings feature is active

Building

# Ensure submodules are present
git submodule update --init --recursive

# Build and install SuiteSparse:GraphBLAS system-wide
git clone --depth 1 https://github.com/DrTimothyAldenDavis/GraphBLAS.git
cd GraphBLAS && make compact && sudo make install && cd ..

# Build LAGraph inside the submodule (no system-wide install required)
cd deps/LAGraph && make && cd ../..

# Build pathrex
cargo build

# Run tests
LD_LIBRARY_PATH=deps/LAGraph/build/src:deps/LAGraph/build/experimental:/usr/local/lib cargo test

How build.rs handles linking

build.rs performs two jobs:

  1. Native linking. It emits six Cargo directives:

    • cargo:rustc-link-lib=dylib=graphblas — dynamically links libgraphblas.
    • cargo:rustc-link-search=native=/usr/local/lib — adds the system GraphBLAS install path to the native library search path.
    • cargo:rustc-link-lib=dylib=lagraph — dynamically links liblagraph.
    • cargo:rustc-link-search=native=deps/LAGraph/build/src — adds the submodule's core build output to the native library search path.
    • cargo:rustc-link-lib=dylib=lagraphx — dynamically links liblagraphx (experimental algorithms).
    • cargo:rustc-link-search=native=deps/LAGraph/build/experimental — adds the experimental build output to the native library search path.

    LAGraph does not need to be installed system-wide; building the submodule in deps/LAGraph/ is sufficient for compilation and linking. SuiteSparse:GraphBLAS must be installed system-wide (sudo make install).

    At runtime the OS dynamic linker (ld.so) does not use Cargo's link search paths — it only consults LD_LIBRARY_PATH, rpath, and the system library cache. Set LD_LIBRARY_PATH=/usr/local/lib after a system-wide LAGraph install, or include the submodule build paths if not installing system-wide.

  2. Optional FFI binding regeneration (feature regenerate-bindings). When the feature is active, regenerate_bindings() runs bindgen against deps/LAGraph/include/LAGraph.h and deps/LAGraph/include/LAGraphX.h (always from the submodule — no system path search), plus GraphBLAS.h (searched in /usr/local/include/suitesparse and /usr/include/suitesparse). The generated Rust file is written to src/lagraph_sys_generated.rs. Only a curated allowlist of GraphBLAS/LAGraph types and functions is exposed (see the allowlist_* calls in build.rs).

Feature flags

Feature Effect
regenerate-bindings Runs bindgen at build time to regenerate src/lagraph_sys_generated.rs from LAGraph.h, LAGraphX.h (both from deps/LAGraph/include) and GraphBLAS.h. Without this feature the checked-in bindings are used as-is.

Pre-generated FFI bindings

The file src/lagraph_sys_generated.rs is checked into version control. CI regenerates it with --features regenerate-bindings. Do not hand-edit this file.

Architecture & Key Abstractions

Edge

Edge is the universal currency between format parsers and graph builders: { source: String, target: String, label: String }.

GraphSource trait

GraphSource<B> is implemented by any data source that knows how to feed itself into a specific [GraphBuilder]:

Csv<R>, MatrixMarket, and Rdf implement GraphSource<InMemoryBuilder> (see src/graph/inmemory.rs), so they can be passed to [GraphBuilder::load] and [Graph::try_from].

GraphBuilder trait

GraphBuilder accumulates edges and produces a GraphDecomposition:

InMemoryBuilder also exposes lower-level helpers outside the trait:

Backend trait & Graph<B> handle

Backend associates a marker type with a concrete builder/graph pair:

pub trait Backend {
    type Graph: GraphDecomposition;
    type Builder: GraphBuilder<Graph = Self::Graph>;
}

Graph<B> is a zero-sized handle parameterised by a Backend:

InMemory is the concrete backend marker type.

GraphDecomposition trait

GraphDecomposition is the read-only query interface:

Generic evaluator abstraction (src/eval/)

src/eval/mod.rs defines query-language-agnostic evaluator traits:

  • Evaluator uses associated types for Query, Result, Error, and Prepared. The graph backend stays a method-level generic (G: GraphDecomposition) so one evaluator type can run against any graph backend selected at the call site.
  • PreparedEvaluator represents prepared (query, graph) state that can be executed repeatedly, which is used by benchmark timing loops.
  • ResultCount is separate from Evaluator::Result; only CLI runners that need counts require this bound, leaving room for future evaluators with richer result types.

InMemoryBuilder / InMemoryGraph

InMemoryBuilder is the primary GraphBuilder implementation. It collects edges in RAM, then build() calls GraphBLAS to create one GrB_Matrix per label via COO format, wraps each in an LAGraph_Graph, and returns an InMemoryGraph.

Multiple CSV sources can be chained with repeated .load() calls; all edges are merged into a single graph.

Node ID representation: Internally, InMemoryBuilder uses HashMap<usize, String> for id_to_node (changed from Vec<String> to support sparse/pre-assigned IDs from MatrixMarket). The set_node_map() method allows bulk-installing a node mapping, which is used by the MatrixMarket loader.

Format parsers

Three built-in parsers are available, each yielding Iterator<Item = Result<Edge, FormatError>> and pluggable into GraphBuilder::load() via GraphSource<InMemoryBuilder> (see src/graph/inmemory.rs).

Csv<R>

Csv<R> parses delimiter-separated edge files.

Configuration is via CsvConfig:

Field Default Description
source_column Index(0) Column for the source node (by index or name)
target_column Index(1) Column for the target node
label_column Index(2) Column for the edge label
has_header true Whether the first row is a header
delimiter b',' Field delimiter byte

ColumnSpec is either Index(usize) or Name(String). Name-based lookup requires has_header: true.

MatrixMarket directory format

MatrixMarket loads an edge-labeled graph from a directory with:

  • vertices.txt — one line per node: <node_name> <1-based-index> on disk; get_node_id returns the matching 0-based matrix index
  • edges.txt — one line per label: <label_name> <1-based-index> (selects n.txt)
  • <n>.txt — MatrixMarket adjacency matrix for label with index n

Names in mapping files may be written with SPARQL-style angle brackets (e.g. <Article1>). parse_index_map strips a single pair of surrounding </> so dictionary keys match short labels (Article1), aligning with IRIs after RpqQuery::strip_base on SPARQL-derived queries.

The loader uses LAGraph_MMRead to parse each .txt file into a GrB_Matrix, then wraps it in an LAGraph_Graph. Vertex indices from vertices.txt are converted to 0-based and installed via InMemoryBuilder::set_node_map().

Helper functions:

MatrixMarket implements GraphSource<InMemoryBuilder> in src/graph/inmemory.rs (see the impl at line 215): vertices.txt maps are converted from 1-based file indices to 0-based matrix ids before set_node_map; edges.txt indices are unchanged for n.txt lookup.

Rdf — Unified RDF Parser

Rdf is a unified parser for RDF formats using oxttl and oxrdf. It supports both N-Triples (.nt) and Turtle (.ttl) formats via the RdfFormat enum.

Each triple (subject, predicate, object) becomes an Edge where:

  • source — subject IRI or blank-node ID (_:label).
  • target — object IRI or blank-node ID; triples whose object is an RDF literal yield Err(FormatError::LiteralAsNode) (callers may filter these out).
  • label — full predicate IRI string (including fragment #… when present).

Constructor:

  • Rdf::from_path(path) — auto-detects format from file extension (.nt → N-Triples, .ttl → Turtle). Parses in parallel using memory-mapping and rayon.

Format detection via RdfFormat::from_path(path):

Extension Format
.nt, .ntriples RdfFormat::NTriples
.ttl, .turtle RdfFormat::Turtle

Example usage:

use pathrex::formats::Rdf;
use pathrex::graph::{Graph, InMemory};

// Auto-detect from extension
let graph = Graph::<InMemory>::try_from(
    Rdf::from_path("data.ttl")?
)?;

SPARQL parsing (src/sparql/mod.rs)

The sparql module uses the spargebra crate to parse SPARQL 1.1 query strings and build a pathrex-native RpqQuery for RPQ evaluators.

Supported query form: SELECT queries with exactly one triple or property path pattern in the WHERE clause. Relative IRIs such as <knows> require a BASE declaration (or PREFIX / full IRIs). Example:

BASE <http://example.org/>
SELECT ?x ?y WHERE { ?x <knows>/<likes>* ?y . }

Key public items:

  • parse_rpq(sparql) — parses a SPARQL string with SparqlParser and returns an RpqQuery.
  • extract_rpq(query) — validates a parsed [spargebra::Query] is a SELECT with a single path pattern and returns an RpqQuery. Use this when you construct a custom SparqlParser (e.g. with prefix declarations) and call parse_query yourself.
  • ExtractError — error enum for extraction failures (NotSelect, NotSinglePath, UnsupportedSubject, UnsupportedObject, VariablePredicate). Converts to RpqError::Extract via #[from].

Call RpqQuery::strip_base when graph edge labels are short names and the parsed query contains full IRIs sharing a common prefix.

The module handles spargebra's desugaring of sequence paths (?x <a>/<b>/<c> ?y) from a chain of BGP triples back into a single path expression.

RPQ evaluation (src/rpq/)

The rpq module provides an abstraction for evaluating Regular Path Queries (RPQs) over edge-labeled graphs using GraphBLAS/LAGraph.

Key public items:

NfaRpqResult wraps a [GraphblasVector] of reachable target vertices. When the subject is a variable, every vertex is used as a source and LAGraph_RegularPathQuery returns the union of targets — individual (source, target) pairs are not preserved.

RpqMatrixEvaluator (src/rpq/rpqmatrix.rs)

RpqMatrixEvaluator compiles [PathExpr] into a Boolean matrix plan over label adjacency matrices and runs [LAGraph_RPQMatrix]. It returns RpqMatrixResult: the path-relation nnz plus a [GraphblasMatrix] duplicate of the result matrix (full reachability relation for the path). Subject/object do not filter the matrix; a named subject is only validated to exist. Bound objects are not supported yet ([RpqError::UnsupportedPath]). NTriples<R> parses W3C N-Triples RDF files using oxttl and oxrdf. Each triple (subject, predicate, object) becomes an Edge where:

  • source — subject IRI or blank-node ID (_:label).
  • target — object IRI or blank-node ID; triples whose object is an RDF literal yield Err(FormatError::LiteralAsNode) (callers may filter these out).
  • label — full predicate IRI string (including fragment #… when present).

Constructor:

SPARQL parsing (src/sparql/mod.rs)

The rpq module provides an abstraction for evaluating Regular Path Queries (RPQs) over edge-labeled graphs using GraphBLAS/LAGraph.

Key public items:

NfaRpqEvaluator (src/rpq/nfarpq.rs)

NfaRpqEvaluator implements [RpqEvaluator] by:

  1. Converting a [PathExpr] into an Nfa via Thompson's construction (Nfa::from_path_expr()).
  2. Eliminating ε-transitions via epsilon closure (NfaBuilder::epsilon_closure()).
  3. Building one LAGraph_Graph per NFA label transition (Nfa::build_lagraph_matrices()).
  4. Calling [LAGraph_RegularPathQuery] with the NFA matrices, data-graph matrices, start/final states, and source vertices.

type Result = NfaRpqResult ([GraphblasVector] of reachable targets).

Supported path operators match [PathExpr] variants above. Reverse and NegatedPropertySet from SPARQL map to [RpqError::UnsupportedPath] when they appear in extracted paths.

Subject/object resolution: [Endpoint::Variable] means "all vertices"; [Endpoint::Named] resolves to a single vertex via GraphDecomposition::get_node_id().

NfaRpqResult wraps a [GraphblasVector] of reachable target vertices. When the subject is a variable, every vertex is used as a source and LAGraph_RegularPathQuery returns the union of targets — individual (source, target) pairs are not preserved.

RpqMatrixEvaluator (src/rpq/rpqmatrix.rs)

RpqMatrixEvaluator compiles [PathExpr] into a Boolean matrix plan over label adjacency matrices and runs [LAGraph_RPQMatrix]. It returns RpqMatrixResult: the path-relation nnz plus a [GraphblasMatrix] duplicate of the result matrix (full reachability relation for the path). Subject/object do not filter the matrix; a named subject is only validated to exist. Bound objects are not supported yet ([RpqError::UnsupportedPath]).

CLI dispatch (src/cli/dispatch.rs)

With the bench feature enabled, src/cli/dispatch.rs is the single mapping from Algo variants to concrete evaluator types. dispatch_query and dispatch_bench each perform one exhaustive match per requested algorithm, then call generic runners (run_query_for_evaluator<E> and run_bench_for_evaluator<E>) that are monomorphized for the selected evaluator.

Adding a new algorithm requires a new Algo variant, its Display arm, one dispatch_query arm, one dispatch_bench arm, an impl Evaluator for the evaluator type, and an impl ResultCount for any result type used by CLI count reporting.

FFI layer

lagraph_sys exposes raw C bindings for GraphBLAS and LAGraph. Safe Rust wrappers live in graph::mod:

  • LagraphGraph — RAII wrapper around LAGraph_Graph (calls LAGraph_Delete on drop). Also provides LagraphGraph::from_coo() to build directly from COO arrays.
  • GraphblasVector — RAII wrapper around GrB_Vector (derives Debug).
  • GraphblasMatrix — RAII wrapper around GrB_Matrix (dup + free on drop).
  • ensure_grb_init() — internal one-time LAGraph_Init via std::sync::Once. Called automatically by RAII-wrapped constructors (LagraphGraph::from_coo, LagraphGraph::from_matrix, ThreadScope::enter) and by load_mm_file. Crate-private; no other code should call it.

Macros & helpers (src/utils.rs)

Two #[macro_export] macros handle FFI error mapping:

  • grb_ok!(expr) — evaluates a GraphBLAS call inside unsafe, maps the i32 return to Result<(), GraphError::GraphBlas(info)>.
  • la_ok!(fn::path(args…)) — evaluates a LAGraph call, automatically appending the required *mut i8 message buffer, and maps failure to GraphError::LAGraph(info, msg).

A convenience function is also provided:

  • build_graph(edges) — builds an InMemoryGraph from a slice of (&str, &str, &str) triples (source, target, label). Used by integration tests.

Coding Conventions

  • Rust edition 2024.
  • Error handling via thiserror derive macros; three main error enums: GraphError, FormatError, and RpqError.
  • FormatError converts into GraphError via #[from] FormatError on the GraphError::Format variant.
  • GraphError converts into RpqError via #[from] GraphError on the RpqError::Graph variant, enabling ? propagation in evaluators.
  • Unsafe FFI calls are confined to lagraph_sys, graph/mod.rs, graph/inmemory.rs, rpq/nfarpq.rs. All raw pointers are wrapped in RAII types that free resources on drop.
  • unsafe impl Send + Sync is provided for LagraphGraph, GraphblasVector, and GraphblasMatrix because GraphBLAS handles are thread-safe after init.
  • Unit tests live in #[cfg(test)] mod tests blocks inside each module. Integration tests that need GraphBLAS live in tests/inmemory_tests.rs, tests/mm_tests.rs, tests/nfarpq_tests.rs.

Testing

# Run all tests (LAGraph installed system-wide)
LD_LIBRARY_PATH=/usr/local/lib cargo test --verbose

# If LAGraph is NOT installed system-wide (only built in the submodule):
LD_LIBRARY_PATH=deps/LAGraph/build/src:deps/LAGraph/build/experimental:/usr/local/lib cargo test --verbose

Tests in src/graph/mod.rs use CountingBuilder / CountOutput / VecSource from src/utils.rs — these do not call into GraphBLAS and run without native libraries.

Tests in src/formats/csv.rs and src/formats/rdf.rs are pure Rust and need no native dependencies.

Tests in src/sparql/mod.rs are pure Rust and need no native dependencies.

Tests in src/rpq/nfarpq.rs (NFA construction unit tests) are pure Rust and need no native dependencies.

Tests in src/graph/inmemory.rs, tests/inmemory_tests.rs, tests/mm_tests.rs, tests/nfarpq_tests.rs, and tests/rpqmatrix_tests.rs call real GraphBLAS/LAGraph and require the native libraries to be present.

CI

The GitHub Actions workflow (.github/workflows/ci.yml) runs on every push and PR across stable, beta, and nightly toolchains:

  1. Checks out with submodules: recursive.
  2. Installs cmake, libclang-dev, clang.
  3. Builds and installs SuiteSparse:GraphBLAS from source (sudo make install).
  4. Builds and installs LAGraph from the submodule (sudo make install).
  5. cargo build --features regenerate-bindings — rebuilds FFI bindings.
  6. LD_LIBRARY_PATH=/usr/local/lib cargo test --verbose — runs the full test suite.