Implement regular path query algorithm with paths #13
Open
suvorovrain wants to merge 11 commits into
This commit adds an implementation of the regular path query algorithm based on the linear-algebra approach to graph processing. The algorithm finds a set of nodes in an edge-labelled directed graph. These nodes are reachable by paths that start from one of the source nodes and whose edge labels form a word from the specified regular language. The algorithm is based on breadth-first search over adjacency matrices. Regular languages are defined by a non-deterministic finite automaton (NFA). The algorithm considers the paths whose "label words" are accepted by the specified NFA. The algorithm takes the following inputs:
* A regular automaton adjacency matrix decomposition.
* A graph adjacency matrix decomposition.
* An array of the starting node indices.
The result is a vector v with v[i] = 1 iff node i is reachable by a path satisfying the provided regular constraints.
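The reachability procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of this commit: sparse Boolean matrices are modelled as sets of `(row, column)` pairs keyed by label, and the function name `rpq_reachable` as well as the explicit `nfa_start` and `nfa_final` arguments are hypothetical.

```python
def rpq_reachable(nfa, nfa_start, nfa_final, graph, n_vertices, sources):
    """Return v with v[i] == 1 iff vertex i is reachable by an accepted path."""

    def rows(matrix):
        # Index a sparse Boolean matrix {(i, j), ...} by row.
        by_row = {}
        for i, j in matrix:
            by_row.setdefault(i, set()).add(j)
        return by_row

    nfa_rows = {label: rows(m) for label, m in nfa.items()}
    graph_rows = {label: rows(m) for label, m in graph.items()}

    # The frontier is a sparse Boolean matrix over (NFA state, graph vertex).
    frontier = {(q, s) for q in nfa_start for s in sources}
    visited = set(frontier)
    while frontier:
        next_frontier = set()
        for label, a_rows in nfa_rows.items():
            g_rows = graph_rows.get(label, {})
            for q, v in frontier:
                for q2 in a_rows.get(q, ()):
                    for v2 in g_rows.get(v, ()):
                        next_frontier.add((q2, v2))
        # Keep only pairs that were not seen before (the BFS step).
        frontier = next_frontier - visited
        visited |= frontier

    result = [0] * n_vertices
    for q, v in visited:
        if q in nfa_final:
            result[v] = 1
    return result


# Query a b*: NFA 0 -a-> 1, 1 -b-> 1, start {0}, final {1}.
example_nfa = {"a": {(0, 1)}, "b": {(1, 1)}}
# Graph: 0 -a-> 1 -b-> 2 -b-> 3, and an unrelated edge 0 -c-> 4.
example_graph = {"a": {(0, 1)}, "b": {(1, 2), (2, 3)}, "c": {(0, 4)}}
example_result = rpq_reachable(example_nfa, {0}, {1}, example_graph, 5, [0])
```

In the example, vertices 1, 2 and 3 are marked, while vertex 4 is not, because the label `c` does not occur in the query.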
This patch makes the regular path query algorithm work with 2-RPQs. 2-RPQs are RPQs extended with the ability to traverse edges in the direction opposite to the one in which they are stored. E.g. the SPARQL 2-RPQ `Alice ^<mother> <daughter> ?x` could be used to find Alice and all of her sisters by getting all daughters of Alice's mother. 2-RPQ support is provided by adding two extra parameters to the RPQ algorithm. The first marks some of the provided labels as inverted. The second inverts the whole query, which allows executing single-destination RPQs (e.g. `?x <Son> Bob` gets Bob's parents).
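One common way to evaluate an inverted label is to use the transposed adjacency matrix of that label. The sketch below illustrates this idea only; it is not the interface added by this patch. The helper names (`transpose`, `follow`) and the toy data are hypothetical, and an edge `(u, v)` in the `son` matrix is assumed to mean "v is a son of u".

```python
def transpose(matrix):
    # Transposing a sparse Boolean matrix reverses every edge.
    return {(j, i) for i, j in matrix}


def follow(matrix, vertices):
    # One traversal step: all targets of edges leaving the given vertices.
    return {j for i, j in matrix if i in vertices}


# Toy graph (hypothetical data): 0 = Ann, 1 = Dan, 2 = Bob, 3 = Carl.
son = {(0, 2), (0, 3), (1, 2)}

# Single-destination query `?x <son> Bob`: start from Bob, walk `son` backwards.
parents_of_bob = follow(transpose(son), {2})

# 2-RPQ `Bob ^<son> <son> ?x`: one inverted step, then one forward step.
sons_of_bobs_parents = follow(son, parents_of_bob)
```

Here `parents_of_bob` contains Ann and Dan, and `sons_of_bobs_parents` contains Bob and Carl.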
This patch provides a workaround for benchmarking the 2-RPQ algorithm on a few real-world datasets such as Wikidata or yago-2s: it allows duplicates in MatrixMarket files that correspond to Boolean matrices, since most publicly available graphs are likely to contain duplicates.
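For a Boolean matrix, duplicate entries can be merged with a logical OR without changing the matrix. The following minimal reader for coordinate-format pattern files shows that effect; it is a sketch for illustration, the function name `read_boolean_mtx` is hypothetical, and it does not reproduce the reader modified by this patch.

```python
def read_boolean_mtx(text):
    """Parse a coordinate-format pattern matrix, tolerating duplicate entries."""
    # Lines starting with '%' are the banner or comments.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, _declared_entries = map(int, lines[0].split())
    entries = set()
    for ln in lines[1:]:
        i, j = ln.split()[:2]
        # Indices in the file are 1-based; a set merges duplicates (logical OR).
        entries.add((int(i) - 1, int(j) - 1))
    return n_rows, n_cols, entries


sample = """%%MatrixMarket matrix coordinate pattern general
% the edge (1, 2) appears twice
3 3 3
1 2
2 3
1 2
"""
shape_and_entries = read_boolean_mtx(sample)
```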
Full description TBD.
Handle too many paths via a custom arena-based linear allocator that is cleared at the end of the 2RPQ ALL PATHS procedure. It is used to construct matrix elements that hold too many paths. It also offers OOM detection.
This patch introduces ALL SHORTEST PATH semantics in the regular path query algorithm. The key insight is very similar to the one behind the reachability (i.e. ENDPOINTS) semantics described in detail in [^1]. The idea of SINGLE SOURCE ALL SHORTEST PATH semantics is, for a given query $Q$, a graph $G$, and a vertex $s$, to find for every vertex $v$ all minimum-length paths from $s$ to $v$ that satisfy $Q$. The implementation combines the custom semirings for ALL PATHS with filtering of already-visited pairs of NFA states and graph vertices.
[^1] https://arxiv.org/abs/2412.10287
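The combination of path-carrying frontiers and filtering of visited (NFA state, vertex) pairs can be sketched as a level-by-level search. This is a simplified illustration, not the implementation of this patch: the name `all_shortest_paths` is hypothetical, matrices are modelled as sets of pairs keyed by label, and paths are stored as tuples of vertices, so parallel edges with different labels collapse into one path.

```python
def all_shortest_paths(nfa, nfa_start, nfa_final, graph, source):
    """Map each reachable vertex to the set of its shortest accepted paths."""
    # A level maps (NFA state, vertex) -> set of paths of equal length.
    level = {(q, source): {(source,)} for q in nfa_start}
    visited = set(level)
    best = {}
    while level:
        for (q, v), paths in level.items():
            if q not in nfa_final:
                continue
            if v not in best:
                best[v] = set(paths)
            elif len(next(iter(best[v]))) == len(next(iter(paths))):
                # Another final state reached v at the same level.
                best[v] |= paths
        next_level = {}
        for (q, v), paths in level.items():
            for label, transitions in nfa.items():
                for q1, q2 in transitions:
                    if q1 != q:
                        continue
                    for v1, v2 in graph.get(label, ()):
                        if v1 != v or (q2, v2) in visited:
                            continue  # pair seen on an earlier, shorter level
                        # "Multiply" extends paths, "add" unions them.
                        next_level.setdefault((q2, v2), set()).update(
                            p + (v2,) for p in paths)
        visited |= set(next_level)
        level = next_level
    return best


# Query a b*: NFA 0 -a-> 1, 1 -b-> 1, start {0}, final {1}.
sp_nfa = {"a": {(0, 1)}, "b": {(1, 1)}}
# Graph: 0 -a-> 1, 0 -a-> 2, 1 -b-> 3, 2 -b-> 3, 1 -b-> 2.
sp_graph = {"a": {(0, 1), (0, 2)}, "b": {(1, 3), (2, 3), (1, 2)}}
sp_result = all_shortest_paths(sp_nfa, {0}, {1}, sp_graph, 0)
```

In the example, vertex 3 gets both paths of length two, while the longer path 0, 1, 2 is dropped because the pair (state 1, vertex 2) was already visited one level earlier.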
Algorithm for Evaluating 2-RPQ Queries with Path Storing
This PR introduces an algorithm for evaluating 2-RPQ queries with path storing.
The algorithm supports three path semantics:
* `ALL-SHORTEST-PATHS`
* `ALL-SIMPLE-PATHS`
* `ALL-TRAILS`
These semantics correspond to the path semantics used in the MillenniumDB Path Query Challenge.
Algorithm idea
The algorithm is based on the standard linear-algebra approach for evaluating
RPQs. In the classical version, RPQ evaluation is reduced to
reachability computation over Boolean matrices: graph labels are represented as
adjacency matrices, the query is represented as a finite automaton, and matrix
operations propagate reachable graph/automaton states.
This PR generalizes that approach by replacing the Boolean semiring with custom
path semirings. Instead of storing only whether a state is reachable, matrix
elements store path information. Semiring operations define how paths are
extended, combined, and filtered according to the selected semantics.
In particular:
* `ALL-SIMPLE-PATHS`;
* `ALL-TRAILS`;
* `ALL-SHORTEST-PATHS`.
As a result, the Boolean reachability algorithm can be seen as a special case of
the same framework, while path-producing semantics are implemented by changing
the underlying semiring.
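To make the "same framework, different semiring" idea concrete, the sketch below keeps the traversal fixed and swaps only the rule that decides whether a path may be extended. It is an illustration under assumptions, not the semirings defined in this PR: the names `all_paths`, `keeps_simple` and `keeps_trail` are hypothetical, a simple path is taken to mean "no repeated vertex" and a trail "no repeated edge", and a path is stored as a tuple that alternates vertices and labels.

```python
def keeps_simple(path, label, target):
    # Vertices sit at the even positions of the path tuple.
    return target not in path[0::2]


def keeps_trail(path, label, target):
    # Edges are (vertex, label, vertex) triples along the path.
    edges = set(zip(path[0::2], path[1::2], path[2::2]))
    return (path[-1], label, target) not in edges


def all_paths(nfa, nfa_start, nfa_final, graph, source, keeps):
    """Collect accepted paths per end vertex; `keeps` plays the filter role."""
    frontier = {(q, source): {(source,)} for q in nfa_start}
    answers = {}
    # Terminates because simple paths and trails have bounded length.
    while frontier:
        for (q, v), paths in frontier.items():
            if q in nfa_final:
                answers.setdefault(v, set()).update(paths)
        next_frontier = {}
        for (q, v), paths in frontier.items():
            for label, transitions in nfa.items():
                for q1, q2 in transitions:
                    if q1 != q:
                        continue
                    for v1, v2 in graph.get(label, ()):
                        if v1 != v:
                            continue
                        extended = {p + (label, v2) for p in paths
                                    if keeps(p, label, v2)}
                        if extended:
                            next_frontier.setdefault((q2, v2), set()).update(extended)
        frontier = next_frontier
    return answers


# Query a+: NFA 0 -a-> 1, 1 -a-> 1. Graph: 0 -a-> 1, 1 -a-> 0, 1 -a-> 2.
ap_nfa = {"a": {(0, 1), (1, 1)}}
ap_graph = {"a": {(0, 1), (1, 0), (1, 2)}}
simple_answers = all_paths(ap_nfa, {0}, {1}, ap_graph, 0, keeps_simple)
trail_answers = all_paths(ap_nfa, {0}, {1}, ap_graph, 0, keeps_trail)
```

On this graph the trail semantics additionally returns the closed walk 0, 1, 0, which repeats a vertex but no edge, while the simple-path semantics rejects it.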
One step of the algorithm with `ALL-SIMPLE` semantics is shown in the following figure:
[Figure: one algorithm step under `ALL-SIMPLE` semantics]
Memory management
Path-storing evaluation may create a large number of intermediate path objects.
To reduce allocation overhead, the implementation uses a custom allocator for
path data structures. The allocator centralizes ownership of intermediate path
objects and allows the algorithm to allocate paths efficiently during semiring
operations, instead of relying on many small heap allocations. This approach also makes
it possible to work with variable-size objects through custom structures in GraphBLAS
primitives, whose element size must be constant.
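The allocator idea can be illustrated with a small bump (linear) arena in which every path is referred to by a fixed-size handle `(offset, length)`, while the variable-length vertex data lives in one shared buffer. This is a conceptual sketch in Python under assumptions, not the allocator from this PR; the names `PathArena` and `ArenaOutOfMemory` are hypothetical.

```python
from array import array


class ArenaOutOfMemory(MemoryError):
    """Raised when the arena has no room left for a new path."""


class PathArena:
    def __init__(self, capacity):
        # One contiguous buffer of 64-bit vertex ids, allocated once.
        self._buf = array("q", [0]) * capacity
        self._top = 0

    def alloc_path(self, vertices):
        n = len(vertices)
        if self._top + n > len(self._buf):
            raise ArenaOutOfMemory("path arena exhausted")
        offset = self._top
        self._buf[offset:offset + n] = array("q", vertices)
        self._top += n
        # The handle has a constant size regardless of the path length.
        return (offset, n)

    def extend(self, handle, vertex):
        # Copy the old path and append one vertex; the old handle stays valid.
        offset, n = handle
        return self.alloc_path(list(self._buf[offset:offset + n]) + [vertex])

    def read(self, handle):
        offset, n = handle
        return tuple(self._buf[offset:offset + n])

    def used(self):
        return self._top

    def clear(self):
        # Release every path at once by resetting the bump pointer.
        self._top = 0


arena = PathArena(capacity=8)
first = arena.alloc_path([0, 1])
second = arena.extend(first, 3)
```

Because handles have a constant size, they could be stored in a fixed-size user-defined matrix element, and clearing the arena at the end of a query releases all intermediate paths in one step.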