42 changes: 30 additions & 12 deletions AGENTS.md
@@ -27,7 +27,8 @@ pathrex/
│ └── formats/
│ ├── mod.rs # FormatError enum, re-exports
│ ├── csv.rs # Csv<R> — CSV → Edge iterator (CsvConfig, ColumnSpec)
│ └── mm.rs # MatrixMarket directory loader (vertices.txt, edges.txt, *.txt)
│ ├── mm.rs # MatrixMarket directory loader (vertices.txt, edges.txt, *.txt)
│ └── nt.rs # NTriples<R> — N-Triples → Edge iterator (full predicate IRI labels)
├── tests/
│ ├── inmemory_tests.rs # Integration tests for InMemoryBuilder / InMemoryGraph
│ └── mm_tests.rs # Integration tests for MatrixMarket format
@@ -119,7 +120,7 @@ regenerates it with `--features regenerate-bindings`. **Do not hand-edit this fi

### Edge

[`Edge`](src/graph/mod.rs:154) is the universal currency between format parsers and graph
[`Edge`](src/graph/mod.rs:158) is the universal currency between format parsers and graph
builders: `{ source: String, target: String, label: String }`.
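As a rough illustration of the shape described above, here is a minimal stand-in for `Edge`. The derives and the `edge` helper are assumptions made for this sketch and are not taken from `src/graph/mod.rs`.

```rust
/// Stand-in for the crate's `Edge`; the derives are an assumption of this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Edge {
    pub source: String,
    pub target: String,
    pub label: String,
}

/// Hypothetical helper (not part of pathrex) that builds an `Edge` from string slices.
pub fn edge(source: &str, label: &str, target: &str) -> Edge {
    Edge {
        source: source.to_owned(),
        target: target.to_owned(),
        label: label.to_owned(),
    }
}
```

Keeping all three fields as owned `String`s means a parser can hand edges to a builder without tying them to the lifetime of its input buffer.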

### GraphSource trait
@@ -130,8 +131,9 @@ feed itself into a specific [`GraphBuilder`]:
- [`apply_to(self, builder: B) -> Result<B, B::Error>`](src/graph/mod.rs:165) — consumes the
source and returns the populated builder.

[`Csv<R>`](src/formats/csv.rs:52) implements `GraphSource<InMemoryBuilder>` directly, so it
can be passed to [`GraphBuilder::load`].
[`Csv<R>`](src/formats/csv.rs), [`MatrixMarket`](src/formats/mm.rs), and [`NTriples<R>`](src/formats/nt.rs)
implement `GraphSource<InMemoryBuilder>` (see [`src/graph/inmemory.rs`](src/graph/inmemory.rs)), so they
can be passed to [`GraphBuilder::load`] and [`Graph::try_from`].
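The relationship between a source and a builder can be sketched with simplified stand-in traits. Only the `apply_to` signature and the existence of `load` come from the text above; `push_edge`, `NodeSetBuilder`, and `IterSource` are hypothetical names invented for this sketch, and the real traits likely carry more methods.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Edge {
    pub source: String,
    pub target: String,
    pub label: String,
}

/// Simplified stand-in for `GraphBuilder`; `push_edge` is an assumed method name.
pub trait GraphBuilder: Sized {
    type Error;

    fn push_edge(&mut self, edge: Edge) -> Result<(), Self::Error>;

    /// Mirrors the documented `load`: the builder hands itself to the source.
    fn load<S: GraphSource<Self>>(self, source: S) -> Result<Self, Self::Error> {
        source.apply_to(self)
    }
}

/// Stand-in for `GraphSource`, using the `apply_to` signature quoted above.
pub trait GraphSource<B: GraphBuilder> {
    fn apply_to(self, builder: B) -> Result<B, B::Error>;
}

/// Hypothetical builder that only records distinct node names.
#[derive(Default)]
pub struct NodeSetBuilder {
    pub nodes: BTreeSet<String>,
}

impl GraphBuilder for NodeSetBuilder {
    type Error = String;

    fn push_edge(&mut self, edge: Edge) -> Result<(), String> {
        self.nodes.insert(edge.source);
        self.nodes.insert(edge.target);
        Ok(())
    }
}

/// Hypothetical source wrapping any iterator of fallible edges.
pub struct IterSource<I>(pub I);

impl<I> GraphSource<NodeSetBuilder> for IterSource<I>
where
    I: Iterator<Item = Result<Edge, String>>,
{
    fn apply_to(self, mut builder: NodeSetBuilder) -> Result<NodeSetBuilder, String> {
        for edge in self.0 {
            // Stop at the first parse error or builder error.
            builder.push_edge(edge?)?;
        }
        Ok(builder)
    }
}
```

Because the source consumes itself in `apply_to`, each format decides how to populate the builder, which is presumably what lets a directory loader and an edge iterator share the same entry point.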

### GraphBuilder trait

@@ -194,12 +196,13 @@ which is used by the MatrixMarket loader.

### Format parsers

Two built-in parsers are available:
Three built-in parsers are available, each yielding
`Iterator<Item = Result<Edge, FormatError>>` and pluggable into
`GraphBuilder::load()` via `GraphSource<InMemoryBuilder>` (see [`src/graph/inmemory.rs`](src/graph/inmemory.rs)).

#### CSV format
#### `Csv<R>`

[`Csv<R>`](src/formats/csv.rs:52) yields `Iterator<Item = Result<Edge, FormatError>>` and is
directly pluggable into `GraphBuilder::load()` via its `GraphSource<InMemoryBuilder>` impl.
[`Csv<R>`](src/formats/csv.rs) parses delimiter-separated edge files.

Configuration is via [`CsvConfig`](src/formats/csv.rs:17):

@@ -216,7 +219,7 @@ Name-based lookup requires `has_header: true`.

#### MatrixMarket directory format

[`MatrixMarket`](src/formats/mm.rs:160) loads an edge-labeled graph from a directory with:
[`MatrixMarket`](src/formats/mm.rs:159) loads an edge-labeled graph from a directory with:

- `vertices.txt` — one line per node: `<node_name> <1-based-index>` on disk; [`get_node_id`](src/graph/mod.rs:199) returns the matching **0-based** matrix index
- `edges.txt` — one line per label: `<label_name> <1-based-index>` (selects `n.txt`)
@@ -228,12 +231,27 @@ converted to 0-based and installed via [`InMemoryBuilder::set_node_map()`](src/g

Helper functions:

- [`load_mm_file(path)`](src/formats/mm.rs:64) — reads a single MatrixMarket file into a
- [`load_mm_file(path)`](src/formats/mm.rs:39) — reads a single MatrixMarket file into a
`GrB_Matrix`.
- [`parse_index_map(path)`](src/formats/mm.rs) — parses `<name> <index>` lines; indices must be **>= 1** and **unique** within the file.
- [`parse_index_map(path)`](src/formats/mm.rs:81) — parses `<name> <index>` lines; indices must be **>= 1** and **unique** within the file.

`MatrixMarket` implements `GraphSource<InMemoryBuilder>` in [`src/graph/inmemory.rs`](src/graph/inmemory.rs): `vertices.txt` maps are converted from 1-based file indices to 0-based matrix ids before [`set_node_map`](src/graph/inmemory.rs:67); `edges.txt` indices are unchanged for `n.txt` lookup.
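The index rules above (indices **>= 1**, **unique**, then shifted to 0-based for `vertices.txt`) can be sketched as follows. This is a hypothetical stand-in and not the crate's `parse_index_map`: it takes text instead of a path, reports errors as `String`, skips blank lines, and folds the 1-based to 0-based shift into the same function. All of those choices are assumptions of this sketch.

```rust
use std::collections::{HashMap, HashSet};

/// Parses `<name> <1-based-index>` lines and returns a name -> 0-based index map.
/// Hypothetical stand-in for illustration only.
pub fn parse_index_map_zero_based(text: &str) -> Result<HashMap<String, usize>, String> {
    let mut map = HashMap::new();
    let mut seen = HashSet::new();

    for (i, line) in text.lines().enumerate() {
        let line_no = i + 1;
        let line = line.trim();
        if line.is_empty() {
            continue; // assumption: blank lines are ignored
        }

        // Require exactly two whitespace-separated fields.
        let mut parts = line.split_whitespace();
        let (name, raw) = match (parts.next(), parts.next(), parts.next()) {
            (Some(n), Some(r), None) => (n, r),
            _ => return Err(format!("line {}: expected `<name> <index>`", line_no)),
        };

        let index: usize = raw
            .parse()
            .map_err(|_| format!("line {}: invalid index `{}`", line_no, raw))?;
        if index == 0 {
            return Err(format!("line {}: index must be >= 1", line_no));
        }
        if !seen.insert(index) {
            return Err(format!("line {}: duplicate index {}", line_no, index));
        }

        // File indices are 1-based; matrix ids are 0-based.
        map.insert(name.to_owned(), index - 1);
    }
    Ok(map)
}
```

Rejecting index `0` before subtracting avoids an underflow on `usize` and matches the stated requirement that file indices start at 1.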

#### `NTriples<R>`

[`NTriples<R>`](src/formats/nt.rs:51) parses [W3C N-Triples](https://www.w3.org/TR/n-triples/)
RDF files using `oxttl` and `oxrdf`. Each triple `(subject, predicate, object)` becomes an
[`Edge`](src/graph/mod.rs:158) where:

- `source` — subject IRI or blank-node ID (`_:label`).
- `target` — object IRI or blank-node ID; triples whose object is an RDF
literal yield `Err(FormatError::LiteralAsNode)` (callers may filter these out).
- `label` — full predicate IRI string (including fragment `#…` when present).

Constructor:

- [`NTriples::new(reader)`](src/formats/nt.rs:56) — parses the stream; each predicate IRI is copied verbatim to the edge label.
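The triple-to-edge mapping listed above can be sketched without the RDF crates. The `Term` enum below is a simplified stand-in for the `oxrdf` types, and `triple_to_edge` is a hypothetical function; unlike the real parser, this sketch also lets the subject be a `Term`, purely to keep the example short.

```rust
/// Simplified stand-in for RDF terms (not the `oxrdf` types).
#[derive(Debug, PartialEq, Eq)]
pub enum Term {
    Iri(String),
    Blank(String),
    Literal(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum FormatError {
    LiteralAsNode,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Edge {
    pub source: String,
    pub target: String,
    pub label: String,
}

/// IRIs are used as-is, blank nodes get a `_:` prefix, literals are rejected.
fn node_id(term: Term) -> Result<String, FormatError> {
    match term {
        Term::Iri(iri) => Ok(iri),
        Term::Blank(id) => Ok(format!("_:{}", id)),
        Term::Literal(_) => Err(FormatError::LiteralAsNode),
    }
}

/// Hypothetical helper: the predicate IRI is copied verbatim into the label.
pub fn triple_to_edge(subject: Term, predicate: &str, object: Term) -> Result<Edge, FormatError> {
    Ok(Edge {
        source: node_id(subject)?,
        target: node_id(object)?,
        label: predicate.to_owned(),
    })
}
```

Returning an error for literal objects, instead of silently dropping them, leaves the choice to the caller, who can filter out `LiteralAsNode` when only node-to-node edges matter.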

### FFI layer

[`lagraph_sys`](src/lagraph_sys.rs) exposes raw C bindings for GraphBLAS and
@@ -284,7 +302,7 @@ Tests in `src/graph/mod.rs` use `CountingBuilder` / `CountOutput` / `VecSource`
[`src/utils.rs`](src/utils.rs) — these do **not** call into GraphBLAS and run without
native libraries.

Tests in `src/formats/csv.rs` are pure Rust and need no native dependencies.
Tests in `src/formats/csv.rs` and `src/formats/nt.rs` are pure Rust and need no native dependencies.

Tests in `src/graph/inmemory.rs` and [`tests/inmemory_tests.rs`](tests/inmemory_tests.rs)
call real GraphBLAS/LAGraph and require the native libraries to be present.
19 changes: 19 additions & 0 deletions src/formats/csv.rs
@@ -230,6 +230,25 @@ mod tests {
assert!(edges.is_empty());
}

#[test]
fn test_non_ascii() {
let csv = "source,target,label\n\
人甲,人乙,认识\n\
Алиса,Боб,знает\n";
let edges: Vec<_> = make_csv(csv).collect();
assert_eq!(edges.len(), 2);

let e0 = edges[0].as_ref().unwrap();
assert_eq!(e0.source, "人甲");
assert_eq!(e0.target, "人乙");
assert_eq!(e0.label, "认识");

let e1 = edges[1].as_ref().unwrap();
assert_eq!(e1.source, "Алиса");
assert_eq!(e1.target, "Боб");
assert_eq!(e1.label, "знает");
}

#[test]
fn test_graph_source_impl() {
use crate::graph::{GraphBuilder, GraphDecomposition, InMemoryBuilder};
18 changes: 17 additions & 1 deletion src/formats/mod.rs
@@ -4,20 +4,27 @@
//!
//! ```no_run
//! use pathrex::graph::{Graph, InMemory, GraphDecomposition};
//! use pathrex::formats::Csv;
//! use pathrex::formats::{Csv, NTriples};
//! use std::fs::File;
//!
//! // Build from CSV in one line
//! let g = Graph::<InMemory>::try_from(
//! Csv::from_reader(File::open("edges.csv").unwrap()).unwrap()
//! ).unwrap();
//!
//! // Build from N-Triples in one line
//! let g2 = Graph::<InMemory>::try_from(
//! NTriples::new(File::open("data.nt").unwrap())
//! ).unwrap();
//! ```

pub mod csv;
pub mod mm;
pub mod nt;

pub use csv::Csv;
pub use mm::MatrixMarket;
pub use nt::NTriples;

use thiserror::Error;

@@ -49,4 +56,13 @@ pub enum FormatError {
line: usize,
reason: String,
},

/// An error produced by the N-Triples parser.
#[error("N-Triples parse error: {0}")]
NTriples(String),

/// An RDF literal appeared as the object of a triple, where a node IRI or
/// blank node was expected.
#[error("RDF literal cannot be used as a graph node")]
LiteralAsNode,
}
210 changes: 210 additions & 0 deletions src/formats/nt.rs
@@ -0,0 +1,210 @@
//! N-Triples edge iterator for the formats layer.
//!
//! ```no_run
//! use pathrex::formats::NTriples;
//! use pathrex::formats::FormatError;
//!
//! # let reader = std::io::empty();
//! let iter = NTriples::new(reader)
//! .filter_map(|r| match r {
//! Err(FormatError::LiteralAsNode) => None, // skip
//! other => Some(other),
//! });
//! ```
//!
//! To load into a graph:
//!
//! ```no_run
//! use pathrex::graph::{Graph, InMemory, GraphDecomposition};
//! use pathrex::formats::NTriples;
//! use std::fs::File;
//!
//! let graph = Graph::<InMemory>::try_from(
//! NTriples::new(File::open("data.nt").unwrap())
//! ).unwrap();
//! ```

use std::io::Read;

use oxrdf::{NamedOrBlankNode, Term};
use oxttl::NTriplesParser;
use oxttl::ntriples::ReaderNTriplesParser;

use crate::formats::FormatError;
use crate::graph::Edge;

/// An iterator that reads N-Triples and yields `Result<Edge, FormatError>`.
///
/// # Example
///
/// ```no_run
/// use pathrex::formats::nt::NTriples;
/// use std::fs::File;
///
/// let file = File::open("data.nt").unwrap();
/// let iter = NTriples::new(file);
/// for result in iter {
/// let edge = result.unwrap();
/// println!("{} --{}--> {}", edge.source, edge.label, edge.target);
/// }
/// ```
pub struct NTriples<R: Read> {
    inner: ReaderNTriplesParser<R>,
}

impl<R: Read> NTriples<R> {
    pub fn new(reader: R) -> Self {
        Self {
            inner: NTriplesParser::new().for_reader(reader),
        }
    }

    fn subject_to_node_id(subject: NamedOrBlankNode) -> String {
        match subject {
            NamedOrBlankNode::NamedNode(n) => n.into_string(),
            NamedOrBlankNode::BlankNode(b) => format!("_:{}", b.as_str()),
        }
    }

    fn object_to_node_id(object: Term) -> Result<String, FormatError> {
        match object {
            Term::NamedNode(n) => Ok(n.into_string()),
            Term::BlankNode(b) => Ok(format!("_:{}", b.as_str())),
            Term::Literal(_) => Err(FormatError::LiteralAsNode),
        }
    }
}

impl<R: Read> Iterator for NTriples<R> {
Collaborator

<...> <http://a.org/knows> <...>
<...> <http://b.org/knows> <...>

If we use LocalName, we lose the distinction between these two predicates. Is that acceptable?

Collaborator Author

Decided to drop the local-name extraction strategy and always use the full IRI.
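To make the concern in this thread concrete, here is a small sketch of what a local-name strategy might do. The `local_name` function is hypothetical (the text after the last `#` or `/`) and is not the code that was removed from the PR.

```rust
/// Hypothetical local-name extraction: the text after the last `#` or `/`.
pub fn local_name(iri: &str) -> &str {
    match iri.rfind(|c: char| c == '#' || c == '/') {
        // `#` and `/` are single-byte characters, so `pos + 1` is a valid boundary.
        Some(pos) => &iri[pos + 1..],
        None => iri,
    }
}
```

Under this strategy `http://a.org/knows` and `http://b.org/knows` both map to the label `knows`, so two distinct predicates collapse into one edge label; keeping the full IRI avoids that collision.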

    type Item = Result<Edge, FormatError>;

    fn next(&mut self) -> Option<Self::Item> {
        let triple = match self.inner.next()? {
            Ok(t) => t,
            Err(e) => return Some(Err(FormatError::NTriples(e.to_string()))),
        };

        let source = Self::subject_to_node_id(triple.subject.into());
        let label = triple.predicate.as_str().to_owned();
        let target = match Self::object_to_node_id(triple.object) {
            Ok(t) => t,
            Err(e) => return Some(Err(e)),
        };

        Some(Ok(Edge {
            source,
            target,
            label,
        }))
    }
}

#[cfg(test)]
mod tests {
Member

Could you please add a test involving non-ASCII characters? I am especially interested in Chinese and Cyrillic support.

Collaborator Author

Added a test with Russian and Chinese versions of Alice and Bob.

use super::*;

fn parse(nt: &str) -> Vec<Result<Edge, FormatError>> {
NTriples::new(nt.as_bytes()).collect()
}

#[test]
fn test_basic_ntriples() {
let nt = "<http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> .\n\
<http://example.org/Bob> <http://example.org/likes> <http://example.org/Charlie> .\n";
let edges = parse(nt);
assert_eq!(edges.len(), 2);

let e0 = edges[0].as_ref().unwrap();
assert_eq!(e0.source, "http://example.org/Alice");
assert_eq!(e0.target, "http://example.org/Bob");
assert_eq!(e0.label, "http://example.org/knows");

let e1 = edges[1].as_ref().unwrap();
assert_eq!(e1.source, "http://example.org/Bob");
assert_eq!(e1.target, "http://example.org/Charlie");
assert_eq!(e1.label, "http://example.org/likes");
}

#[test]
fn test_blank_node_subject_and_object() {
let nt = "_:b1 <http://example.org/knows> _:b2 .\n";
let edges = parse(nt);
assert_eq!(edges.len(), 1);

let e = edges[0].as_ref().unwrap();
assert_eq!(e.source, "_:b1");
assert_eq!(e.target, "_:b2");
}

#[test]
fn test_literal_object_yields_error() {
let nt = "<http://example.org/Alice> <http://example.org/name> \"Alice\" .\n";
let edges = parse(nt);
assert_eq!(edges.len(), 1);
assert!(
matches!(edges[0], Err(FormatError::LiteralAsNode)),
"literal object should yield LiteralAsNode error"
);
}

#[test]
fn test_caller_can_skip_literal_triples() {
let nt = "<http://example.org/Alice> <http://example.org/knows> <http://example.org/Bob> .\n\
<http://example.org/Alice> <http://example.org/name> \"Alice\" .\n\
<http://example.org/Bob> <http://example.org/knows> <http://example.org/Charlie> .\n";
let edges: Vec<_> = NTriples::new(nt.as_bytes())
.filter_map(|r| match r {
Err(FormatError::LiteralAsNode) => None,
other => Some(other),
})
.collect();

assert_eq!(edges.len(), 2, "literal triple should be skipped");
assert!(edges.iter().all(|r| r.is_ok()));
}

#[test]
fn test_predicate_with_fragment_is_full_iri_string() {
let nt =
"<http://example.org/Alice> <http://example.org/ns#knows> <http://example.org/Bob> .\n";
let edges = parse(nt);
assert_eq!(
edges[0].as_ref().unwrap().label,
"http://example.org/ns#knows"
);
}

#[test]
fn test_non_ascii_in_iris() {
let nt = "<http://example.org/人甲> <http://example.org/关系/认识> <http://example.org/人乙> .\n\
<http://example.org/Алиса> <http://example.org/знает> <http://example.org/Боб> .\n";
let edges = parse(nt);
assert_eq!(edges.len(), 2);

let e0 = edges[0].as_ref().unwrap();
assert_eq!(e0.source, "http://example.org/人甲");
assert_eq!(e0.target, "http://example.org/人乙");
assert_eq!(e0.label, "http://example.org/关系/认识");

let e1 = edges[1].as_ref().unwrap();
assert_eq!(e1.source, "http://example.org/Алиса");
assert_eq!(e1.target, "http://example.org/Боб");
assert_eq!(e1.label, "http://example.org/знает");
}

#[test]
fn test_ntriples_graph_source() {
use crate::graph::{GraphBuilder, GraphDecomposition, InMemoryBuilder};

let nt = "<http://example.org/A> <http://example.org/knows> <http://example.org/B> .\n\
<http://example.org/B> <http://example.org/knows> <http://example.org/C> .\n";
let iter = NTriples::new(nt.as_bytes());

let graph = InMemoryBuilder::default()
.load(iter)
.expect("load should succeed")
.build()
.expect("build should succeed");
assert_eq!(graph.num_nodes(), 3);
}
}