Commit 969687b

jmjava and cursoragent committed
Add ingestion pipeline, directory support, user profiles, and fault tolerance
IngestionRunner (ApplicationRunner) ingests configured URLs and local directories on startup when guide.reload-content-on-startup=true and prints a structured INGESTION COMPLETE banner summarizing results.

Ingestion is fault-tolerant at every level: per-URL, per-directory, and per-document failures are collected with reasons into IngestionResult and never block remaining items.

DataManager depends on ChunkingContentElementRepository from rag-core instead of a custom RagStore wrapper, using the library's existing storage abstraction. Stats use ContentElementRepositoryInfo directly.

GuideProperties gains a directories list for local repo ingestion and robust path resolution (tilde, absolute, and relative paths).

User profiles live under scripts/user-config/ (gitignored). fresh-ingest.sh wipes and re-ingests from scratch; append-ingest.sh adds without clearing. Both read GUIDE_PROFILE from .env and pass the config location to Spring.

Includes .env.example, INGESTION-TESTING.md, and 97 passing tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 6112d2d commit 969687b

24 files changed, +1242 -46 lines changed

.env.example

Lines changed: 17 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,17 @@
1+
# Copy to .env and fill in your values:
2+
# cp .env.example .env
3+
4+
# Your personal profile name (loads application-<GUIDE_PROFILE>.yml)
5+
# Create your config: cp scripts/user-config/application-user.yml.example scripts/user-config/application-<yourname>.yml
6+
GUIDE_PROFILE=user
7+
8+
# OpenAI API key (required for embeddings and chat)
9+
OPENAI_API_KEY=sk-proj-your-key-here
10+
11+
# Neo4j (optional — defaults shown)
12+
# NEO4J_USERNAME=neo4j
13+
# NEO4J_PASSWORD=brahmsian
14+
# NEO4J_URI=bolt://localhost:7687
15+
16+
# Discord bot token (optional — only needed for Discord integration)
17+
# DISCORD_TOKEN=your-discord-token

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -1,6 +1,7 @@
11
# Gradle
22
.gradle/
33
**/build/
4+
**/bin/
45

56
# MCP resources
67
embabel-agent-api/src/main/resources/mcp/**
@@ -33,6 +34,11 @@ embabel-agent-api/src/main/resources/mcp/**
3334
.env
3435
.envrc
3536

37+
# Personal application overrides (set GUIDE_PROFILE in .env; default profile is "user")
38+
# Ignore all personal profile files except the checked-in example
39+
scripts/user-config/application-*.yml
40+
!scripts/user-config/application-*.yml.example
41+
3642
# Temporary files
3743
*.tmp
3844
*.bak

README.md

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,8 @@ curl -X POST http://localhost:1337/api/v1/data/load-references
3535

3636
To see stats on data, make a GET request or browse to http://localhost:1337/api/v1/data/stats
3737

38+
RAG content storage uses the `ChunkingContentElementRepository` interface from the `embabel-agent-rag-core` library. The default backend is Neo4j via `DrivineStore`. You can plug in other backends by providing a different `ChunkingContentElementRepository` bean.
39+
3840
## Viewing and Deleting Data
3941

4042
Go to the Neo Browser at http://localhost:7474/browser/

scripts/INGESTION-TESTING.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
# Testing Guide
2+
3+
## Run all tests
4+
5+
```bash
6+
./mvnw test
7+
```
8+
9+
Runs all 97 tests (unit + integration). Integration tests use Testcontainers to spin up Neo4j automatically — no local Neo4j needed.
10+
11+
## Run specific test classes
12+
13+
```bash
14+
# Single class
15+
./mvnw test -Dtest=IngestionResultTest
16+
17+
# Multiple classes
18+
./mvnw test -Dtest="IngestionResultTest,IngestionRunnerTest,DataManagerControllerTest"
19+
20+
# Single method
21+
./mvnw test -Dtest="IngestionRunnerTest#summary banner contains URL results"
22+
```
23+
24+
## Test coverage by area
25+
26+
### Ingestion pipeline (new)
27+
28+
| Test class | Type | What it covers |
29+
|---|---|---|
30+
| `IngestionResultTest` | Unit | `IngestionResult` record: totals, `hasFailures()`, duration |
31+
| `IngestionRunnerTest` | Unit | `IngestionRunner`: calls `loadReferences`, prints banner with URLs/dirs/stats/port, `formatDuration` |
32+
| `DataManagerControllerTest` | Unit | REST endpoints: `GET /stats`, `POST /load-references` returns `IngestionResult` |
33+
| `DataManagerLoadReferencesIntegrationTest` | Integration | Full pipeline: DataManager → Neo4j. Ingests sample directory, verifies structured result + documents/chunks in store |
34+
35+
Run just these:
36+
37+
```bash
38+
./mvnw test -Dtest="IngestionResultTest,IngestionRunnerTest,DataManagerControllerTest,DataManagerLoadReferencesIntegrationTest"
39+
```
40+
41+
### Other test areas
42+
43+
| Test class | Type | What it covers |
44+
|---|---|---|
45+
| `GuidePropertiesPathResolutionTest` | Unit | Path resolution (`~/`, absolute, relative) |
46+
| `HubApiControllerTest` | Integration | Hub REST API (register, login, sessions, JWT) |
47+
| `HubServiceTest` | Integration | User registration validation |
48+
| `DrivineGuideUserRepositoryTest` | Integration | Neo4j user repository (Drivine) |
49+
| `GuideUserRepositoryDefaultImplTest` | Integration | Neo4j user repository (GraphView) |
50+
| `GuideUserServiceTest` | Integration | Anonymous web user service |
51+
| `McpSecurityTest` | Integration | MCP endpoints are publicly accessible |
52+
53+
## Using local Neo4j (faster iteration)
54+
55+
By default, tests use Testcontainers (slower startup, fully isolated). For faster runs during development:
56+
57+
1. Start Neo4j:
58+
59+
```bash
60+
docker compose up neo4j -d
61+
```
62+
63+
2. Run tests with local Neo4j:
64+
65+
```bash
66+
USE_LOCAL_NEO4J=true ./mvnw test
67+
```
68+
69+
## Manual testing of fresh-ingest.sh
70+
71+
To test the full ingestion flow end-to-end:
72+
73+
1. Set up your `.env` and personal profile (see `scripts/README.md`)
74+
2. Run:
75+
76+
```bash
77+
./scripts/fresh-ingest.sh
78+
```
79+
80+
3. Watch for the **INGESTION COMPLETE** banner with:
81+
- Time elapsed
82+
- Loaded/failed URLs
83+
- Ingested/failed directories
84+
- RAG store stats (documents, chunks, elements)
85+
- Port and MCP endpoint
86+
87+
4. Verify the REST API:
88+
89+
```bash
90+
# Stats
91+
curl http://localhost:1337/api/v1/data/stats
92+
93+
# Trigger ingestion manually (returns JSON IngestionResult)
94+
curl -X POST http://localhost:1337/api/v1/data/load-references
95+
```
96+
97+
5. Verify MCP:
98+
99+
```bash
100+
curl -i --max-time 3 http://localhost:1337/sse
101+
```
102+
103+
Should return `Content-Type: text/event-stream`.
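
One way to script that check, as a minimal sketch: the helper name `is_event_stream` is hypothetical (it is not part of the repository), and it only inspects response headers passed on stdin.

```bash
# Succeeds if the HTTP response headers on stdin declare an SSE content type.
is_event_stream() {
  # HTTP/1.1 headers end in CRLF, so strip the carriage returns first.
  # Header names are case-insensitive, hence grep -i.
  tr -d '\r' | grep -qi '^content-type:[[:space:]]*text/event-stream'
}
```

Against a running Guide instance this could be used as `curl -s -i --max-time 3 http://localhost:1337/sse | is_event_stream && echo "SSE OK"`.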

scripts/README.md

Lines changed: 43 additions & 1 deletion
@@ -1,3 +1,45 @@
11
# Shell scripts
22

3-
- `shell.sh` runs the application in interactive shell mode.
3+
| Script | Purpose |
4+
|---|---|
5+
| `fresh-ingest.sh` | Wipes Neo4j RAG data and re-ingests everything from scratch. Use for first-time setup or when you want a clean slate. |
6+
| `append-ingest.sh` | Re-ingests without clearing existing data. Use when you've added new URLs or directories. Comment out already-ingested items in your profile to avoid re-processing them. |
7+
| `shell.sh` | Runs the application in interactive shell mode. |
8+
9+
Both ingestion scripts start Neo4j in Docker, load your personal profile, and print an **INGESTION COMPLETE** banner when done.
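
The wait for Neo4j in both scripts is a poll-with-timeout loop (up to 60 seconds, checked every 3 seconds through `cypher-shell`). The same pattern as a reusable function might look like the sketch below; `wait_until` is a hypothetical helper for illustration, not something the scripts define.

```bash
# Runs a command repeatedly until it succeeds or the time budget is used up.
# Usage: wait_until <max_wait_seconds> <interval_seconds> <command...>
wait_until() {
  local max_wait="$1" interval="$2" elapsed=0
  shift 2
  while [ "$elapsed" -lt "$max_wait" ]; do
    # Discard the probe's output; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  # The budget ran out without a successful probe.
  return 1
}
```

For example, `wait_until 60 3 docker exec embabel-neo4j cypher-shell -u neo4j -p "$NEO4J_PASSWORD" "RETURN 1"` would mirror the readiness check the scripts perform.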
10+
11+
## Personal profiles
12+
13+
Both scripts read `GUIDE_PROFILE` from `.env` (default: `user`).
14+
Each developer can have their own Spring profile:
15+
16+
```bash
17+
cp scripts/user-config/application-user.yml.example scripts/user-config/application-yourname.yml
18+
# Edit to taste, then:
19+
echo 'GUIDE_PROFILE=yourname' >> .env
20+
./scripts/fresh-ingest.sh
21+
```
22+
23+
This loads `application-yourname.yml` with your URLs, directories, and settings.
24+
See `scripts/user-config/README.md` for full details.
25+
26+
## Using append-ingest.sh
27+
28+
Since `append-ingest.sh` doesn't clear the store, you should comment out URLs and directories that are already ingested in your profile to avoid re-processing them. For example:
29+
30+
```yaml
31+
guide:
32+
urls:
33+
# - https://docs.embabel.com/embabel-agent/guide/0.3.5-SNAPSHOT/ # already ingested
34+
- https://some-new-url.com # new, will be ingested
35+
directories:
36+
# - ~/github/jmjava/guide # already ingested
37+
- ~/github/jmjava/new-repo # new, will be ingested
38+
```
39+
40+
Then run `./scripts/append-ingest.sh`. The new content is added alongside existing data in Neo4j.
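
The directory entries above use `~`-prefixed paths, and the commit message says `GuideProperties` resolves tilde, absolute, and relative paths. The sketch below shows that kind of resolution in shell. It is not the project's implementation: `resolve_path` is a hypothetical name, and joining relative paths to a base directory given as the second argument is an assumption.

```bash
# Resolves a configured path:
#   "~" or "~/..."  -> expanded against $HOME
#   "/..."          -> returned unchanged (already absolute)
#   anything else   -> joined to a base directory (assumed: 2nd arg, default $PWD)
resolve_path() {
  local path="$1" base="${2:-$PWD}"
  case "$path" in
    "~")   printf '%s\n' "$HOME" ;;
    "~/"*) printf '%s\n' "$HOME/${path#??}" ;;  # drop the leading "~/"
    /*)    printf '%s\n' "$path" ;;
    *)     printf '%s\n' "$base/$path" ;;
  esac
}
```

The tilde has to be handled explicitly because a `~` read from a YAML file is an ordinary character; only the shell expands it, and only when it appears unquoted on a command line.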
41+
42+
## Tips
43+
44+
- **If ingestion seems stuck** on a URL: the thread is blocked on fetch -> parse -> embed. Try lowering `embedding-batch-size` to 20, or temporarily remove the slow URL.
45+
- **Speed up ingestion**: increase `embedding-batch-size` (default 50) or `max-chunk-size` (default 4000).

scripts/append-ingest.sh

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
#!/usr/bin/env bash
2+
# Re-ingest content WITHOUT clearing Neo4j first.
3+
# Existing RAG data is kept; new/updated content is added on top.
4+
# IngestionRunner prints the summary when done.
5+
#
6+
# Set GUIDE_PROFILE in .env to use your own profile (default: "user").
7+
# e.g. GUIDE_PROFILE=menke → loads application-menke.yml
8+
set -e
9+
10+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
11+
GUIDE_ROOT="$(dirname "$SCRIPT_DIR")"
12+
cd "$GUIDE_ROOT"
13+
14+
if [ -f .env ]; then
15+
echo "Loading .env..."
16+
set -a
17+
source .env
18+
set +a
19+
fi
20+
21+
GUIDE_PORT="${GUIDE_PORT:-1337}"
22+
EXISTING_PID=$(lsof -ti :"$GUIDE_PORT" 2>/dev/null | head -1)
23+
if [ -n "$EXISTING_PID" ]; then
24+
echo "Killing existing process on port $GUIDE_PORT (PID $EXISTING_PID)..."
25+
kill "$EXISTING_PID" 2>/dev/null || true
26+
sleep 1
27+
kill -9 "$EXISTING_PID" 2>/dev/null || true
28+
sleep 1
29+
fi
30+
31+
echo "Ensuring Neo4j is up (Docker)..."
32+
docker compose up neo4j -d
33+
34+
NEO4J_BOLT_PORT="${NEO4J_BOLT_PORT:-7687}"
35+
echo "Waiting for Neo4j on port $NEO4J_BOLT_PORT..."
36+
max_wait=60
37+
elapsed=0
38+
while [ $elapsed -lt $max_wait ]; do
39+
if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "RETURN 1" >/dev/null 2>&1; then
40+
echo "Neo4j is ready."
41+
break
42+
fi
43+
sleep 3
44+
elapsed=$((elapsed + 3))
45+
echo " ... ${elapsed}s"
46+
done
47+
if [ $elapsed -ge $max_wait ]; then
48+
echo "Neo4j did not become ready in time."
49+
exit 1
50+
fi
51+
52+
echo "Keeping existing RAG data (append mode)."
53+
54+
GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
55+
export SPRING_PROFILES_ACTIVE="local,${GUIDE_PROFILE}"
56+
export NEO4J_URI="${NEO4J_URI:-bolt://localhost:${NEO4J_BOLT_PORT}}"
57+
export NEO4J_HOST="${NEO4J_HOST:-localhost}"
58+
59+
# Force ingestion on startup (IngestionRunner prints the summary)
60+
export GUIDE_RELOADCONTENTONSTARTUP=true
61+
62+
echo ""
63+
echo "Starting Guide with profiles: $SPRING_PROFILES_ACTIVE"
64+
echo "Neo4j: $NEO4J_URI"
65+
echo ""
66+
echo "Ingestion will append to existing data."
67+
echo "Watch for the INGESTION COMPLETE banner."
68+
echo "Press Ctrl+C to stop."
69+
echo ""
70+
71+
# Run in foreground so Ctrl+C kills it directly
72+
# Include scripts/user-config/ so Spring Boot finds personal profile files
73+
./mvnw -DskipTests spring-boot:run -Dspring-boot.run.arguments="--spring.config.additional-location=file:./scripts/user-config/"

scripts/fresh-ingest.sh

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
1+
#!/usr/bin/env bash
2+
# Wipe Neo4j RAG data and re-ingest everything from scratch.
3+
# Starts Neo4j (Docker), clears all ContentElement nodes, then runs Guide
4+
# with reload-content-on-startup=true. IngestionRunner prints the summary.
5+
#
6+
# Set GUIDE_PROFILE in .env to use your own profile (default: "user").
7+
# e.g. GUIDE_PROFILE=menke → loads application-menke.yml
8+
set -e
9+
10+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
11+
GUIDE_ROOT="$(dirname "$SCRIPT_DIR")"
12+
cd "$GUIDE_ROOT"
13+
14+
if [ -f .env ]; then
15+
echo "Loading .env..."
16+
set -a
17+
source .env
18+
set +a
19+
fi
20+
21+
GUIDE_PORT="${GUIDE_PORT:-1337}"
22+
EXISTING_PID=$(lsof -ti :"$GUIDE_PORT" 2>/dev/null | head -1)
23+
if [ -n "$EXISTING_PID" ]; then
24+
echo "Killing existing process on port $GUIDE_PORT (PID $EXISTING_PID)..."
25+
kill "$EXISTING_PID" 2>/dev/null || true
26+
sleep 1
27+
kill -9 "$EXISTING_PID" 2>/dev/null || true
28+
sleep 1
29+
fi
30+
31+
echo "Ensuring Neo4j is up (Docker)..."
32+
docker compose up neo4j -d
33+
34+
NEO4J_BOLT_PORT="${NEO4J_BOLT_PORT:-7687}"
35+
echo "Waiting for Neo4j on port $NEO4J_BOLT_PORT..."
36+
max_wait=60
37+
elapsed=0
38+
while [ $elapsed -lt $max_wait ]; do
39+
if docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "RETURN 1" >/dev/null 2>&1; then
40+
echo "Neo4j is ready."
41+
break
42+
fi
43+
sleep 3
44+
elapsed=$((elapsed + 3))
45+
echo " ... ${elapsed}s"
46+
done
47+
if [ $elapsed -ge $max_wait ]; then
48+
echo "Neo4j did not become ready in time."
49+
exit 1
50+
fi
51+
52+
echo "Clearing RAG content in Neo4j (ContentElement nodes)..."
53+
docker exec embabel-neo4j cypher-shell -u "${NEO4J_USERNAME:-neo4j}" -p "${NEO4J_PASSWORD:-brahmsian}" "MATCH (c:ContentElement) DETACH DELETE c" 2>/dev/null || true
54+
echo "RAG content cleared."
55+
56+
GUIDE_PROFILE="${GUIDE_PROFILE:-user}"
57+
export SPRING_PROFILES_ACTIVE="local,${GUIDE_PROFILE}"
58+
export NEO4J_URI="${NEO4J_URI:-bolt://localhost:${NEO4J_BOLT_PORT}}"
59+
export NEO4J_HOST="${NEO4J_HOST:-localhost}"
60+
61+
# Force ingestion on startup (IngestionRunner prints the summary)
62+
export GUIDE_RELOADCONTENTONSTARTUP=true
63+
64+
echo ""
65+
echo "Starting Guide with profiles: $SPRING_PROFILES_ACTIVE"
66+
echo "Neo4j: $NEO4J_URI"
67+
echo ""
68+
echo "Ingestion will run automatically on startup."
69+
echo "Watch for the INGESTION COMPLETE banner."
70+
echo "Press Ctrl+C to stop."
71+
echo ""
72+
73+
# Run in foreground so Ctrl+C kills it directly
74+
# Include scripts/user-config/ so Spring Boot finds personal profile files
75+
./mvnw -DskipTests spring-boot:run -Dspring-boot.run.arguments="--spring.config.additional-location=file:./scripts/user-config/"

scripts/user-config/README.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
# Personal config
2+
3+
Each developer can have their own Spring profile with personal settings (URLs, directories, paths, etc.).
4+
5+
## Quick start
6+
7+
```bash
8+
cp scripts/user-config/application-user.yml.example scripts/user-config/application-myname.yml
9+
# Edit to taste, then:
10+
echo 'GUIDE_PROFILE=myname' >> .env
11+
./scripts/fresh-ingest.sh
12+
```
13+
14+
## How it works
15+
16+
- The scripts (`fresh-ingest.sh`, `append-ingest.sh`) read `GUIDE_PROFILE` from `.env` (default: `user`)
17+
- Spring profiles become `local,<GUIDE_PROFILE>` → loads `application-<GUIDE_PROFILE>.yml`
18+
- The scripts pass `--spring.config.additional-location=file:./scripts/user-config/` so Spring picks up profiles from this directory
19+
- Personal profiles in `scripts/user-config/` are gitignored (only the `.example` is checked in)
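
The profile handling described above comes down to one default and two derived strings. A minimal sketch, with hypothetical function names (the scripts inline this logic rather than defining functions):

```bash
# Prints the Spring profile list the scripts activate.
# An unset or empty GUIDE_PROFILE falls back to "user".
active_profiles() {
  printf 'local,%s\n' "${GUIDE_PROFILE:-user}"
}

# Prints the personal config file that profile corresponds to.
profile_config_file() {
  printf 'scripts/user-config/application-%s.yml\n' "${GUIDE_PROFILE:-user}"
}
```

With `GUIDE_PROFILE=menke`, `active_profiles` prints `local,menke` and `profile_config_file` prints `scripts/user-config/application-menke.yml`.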
20+
21+
## Ingestion on startup
22+
23+
The `IngestionRunner` only activates when `guide.reload-content-on-startup` is `true`. The default in `application.yml` is `false`, so normal runs (`./mvnw test`, `./mvnw spring-boot:run`) never trigger ingestion. Only the scripts set this flag: `fresh-ingest.sh` and `append-ingest.sh` both export `GUIDE_RELOADCONTENTONSTARTUP=true` before launching the app.
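
The environment variable name follows Spring Boot's documented convention for binding properties from environment variables: replace dots with underscores, remove dashes, and convert to uppercase. A small sketch of that mapping, where `property_to_env` is a hypothetical helper for illustration only:

```bash
# Converts a Spring property name to the matching environment variable name.
property_to_env() {
  printf '%s\n' "$1" \
    | tr '.' '_' \
    | tr -d '-' \
    | tr '[:lower:]' '[:upper:]'
}
```

Calling `property_to_env guide.reload-content-on-startup` prints `GUIDE_RELOADCONTENTONSTARTUP`, the name the scripts export.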
24+
25+
## Failure recovery
26+
27+
Ingestion is resilient at every level -- a single failure never prevents the remaining items from being processed:
28+
29+
- **URLs**: each URL is ingested independently. If one times out or returns an error, the rest continue.
30+
- **Directories**: each configured directory is ingested independently. A missing or unreadable directory doesn't block others.
31+
- **Documents within a directory**: each file is written to the store individually. A single unparseable file (e.g. corrupt encoding) doesn't stop the remaining files in that directory from being ingested.
32+
33+
All failures are collected with their source and reason into the `IngestionResult`, which is:
34+
- Printed in the **INGESTION COMPLETE** banner (so you can see what failed and why at a glance)
35+
- Returned as JSON from `POST /api/v1/data/load-references` for programmatic inspection
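
The collect-and-continue behaviour described above can be sketched in shell as follows. This only illustrates the pattern and is not the project's code: `ingest_one` is a hypothetical stand-in that fails for any source containing "bad", and the result variables are made-up names rather than fields of `IngestionResult`.

```bash
# Hypothetical stand-in for the real per-item ingestion step.
ingest_one() {
  case "$1" in
    *bad*) echo "simulated fetch error" >&2; return 1 ;;
    *)     return 0 ;;
  esac
}

# Attempts every source. A failure is recorded with its reason
# and the loop moves on to the next item instead of aborting.
ingest_all() {
  local src reason nl='
'
  loaded_count=0
  failed_count=0
  failures=""   # one "source: reason" entry per line
  for src in "$@"; do
    if reason="$(ingest_one "$src" 2>&1)"; then
      loaded_count=$((loaded_count + 1))
    else
      failed_count=$((failed_count + 1))
      failures="${failures:+$failures$nl}$src: $reason"
    fi
  done
}
```

After `ingest_all` returns, the counters and the `failures` text hold enough to print a summary, which is the role the banner and the JSON response play in the real pipeline.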
36+
37+
## MCP tools
38+
39+
All ingested content -- both URLs and local directories -- is immediately available through the MCP tools (`docs_vectorSearch`, `docs_textSearch`, etc.). The MCP tools and the ingestion pipeline share the same Neo4j store, so there is no separate sync step. Once ingestion completes, MCP clients (Cursor, Claude Desktop, etc.) can search the content right away.
