
Add continuous graph updates via Git webhook and poll watcher#615

Open
Copilot wants to merge 6 commits into staging from copilot/add-continuous-graph-updates

Conversation


Copilot AI commented Mar 13, 2026

Full re-indexing on every change is too slow for large repos. This adds an incremental update engine that computes a diff between two commit SHAs and only touches affected files, plus two trigger modes: a GitHub/GitLab push webhook and a background poll watcher.

Core engine — api/git_utils/incremental_update.py

  • incremental_update(repo_name, from_sha, to_sha, ignore=[]) — resolves both SHAs via pygit2, classifies file changes (added/modified/deleted), checks out to_sha, deletes stale nodes/edges, re-analyses changed files, and persists the new commit bookmark in Redis. Idempotent (from_sha == to_sha is a no-op). Accepts abbreviated or full SHAs. The diff-classification step is sketched after this list.
  • fetch_remote(repo_path) — runs git fetch origin via subprocess
  • get_remote_head(repo_path, branch) — returns remote branch HEAD SHA
  • repo_local_path(repo_name) — resolves clone path; respects REPOSITORIES_DIR env override
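
A minimal sketch of the diff-classification step, assuming pygit2 1.14+ for the enums module; classify_changes is exported by this PR but its real signature may differ from this illustration:

import pygit2
from pygit2.enums import DeltaStatus  # pygit2 1.14+

def classify_changes_sketch(repo_path: str, from_sha: str, to_sha: str):
    """Return (added, modified, deleted) file paths between two commits."""
    repo = pygit2.Repository(repo_path)
    # revparse_single resolves abbreviated as well as full SHAs
    old = repo.revparse_single(from_sha).peel(pygit2.Commit)
    new = repo.revparse_single(to_sha).peel(pygit2.Commit)
    added, modified, deleted = [], [], []
    for delta in repo.diff(old.tree, new.tree).deltas:
        if delta.status == DeltaStatus.ADDED:
            added.append(delta.new_file.path)
        elif delta.status == DeltaStatus.MODIFIED:
            modified.append(delta.new_file.path)
        elif delta.status == DeltaStatus.DELETED:
            deleted.append(delta.old_file.path)
    return added, modified, deleted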

Webhook endpoint — POST /api/webhook

Accepts GitHub/GitLab push event payloads. When WEBHOOK_SECRET is set, validates X-Hub-Signature-256 with hmac.compare_digest (timing-safe). Ignores pushes to untracked branches (200 response, no retry). Resolves the target repo by URL-matching against indexed repos (normalises .git suffix and case).

// GitHub push event → triggers incremental update for the matched repo
{ "ref": "refs/heads/main", "before": "<sha>", "after": "<sha>",
  "repository": { "clone_url": "https://github.com/org/repo.git" } }

Background poll watcher

Started via FastAPI lifespan on startup (cancelled cleanly on shutdown). At each POLL_INTERVAL tick, fetches all indexed repos, compares stored commit SHA against origin/<TRACKED_BRANCH>, and calls incremental_update if behind. Handles short vs. full SHA comparison correctly (prefix match only when lengths differ).
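
A minimal sketch of how such a lifespan-managed watcher can be wired; _poll_all_repos here is a stub standing in for the PR's per-repo check:

import asyncio
import contextlib
from fastapi import FastAPI

POLL_INTERVAL = 60  # seconds; 0 disables the watcher

def _poll_all_repos() -> None:
    ...  # stub: fetch each indexed repo and run incremental_update if behind

async def _poll_loop() -> None:
    loop = asyncio.get_running_loop()
    while True:
        await asyncio.sleep(POLL_INTERVAL)
        # the per-repo work is synchronous, so keep it off the event loop
        await loop.run_in_executor(None, _poll_all_repos)

@contextlib.asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(_poll_loop()) if POLL_INTERVAL > 0 else None
    yield
    if task is not None:
        task.cancel()  # cancelled cleanly on shutdown
        with contextlib.suppress(asyncio.CancelledError):
            await task

app = FastAPI(lifespan=lifespan)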

Configuration — new env vars (documented in .env.template)

Variable         Default   Purpose
WEBHOOK_SECRET   (empty)   HMAC-SHA256 secret for webhook signature validation
TRACKED_BRANCH   main      Branch to watch for updates
POLL_INTERVAL    60        Seconds between poll checks; 0 disables the watcher
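
For instance, a minimal .env snippet (values illustrative):

WEBHOOK_SECRET=change-me
TRACKED_BRANCH=main
POLL_INTERVAL=60
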
Original prompt

This section details the original issue to resolve

<issue_title>Continuous graph updates via Git webhook / branch watcher</issue_title>
<issue_description>## Summary

Add the ability for code-graph to stay in sync with a repository by automatically updating the graph on each commit to a tracked branch (e.g. main). Instead of re-indexing the entire codebase on every change, the system should compute a diff-based incremental update — only processing files that were added, modified, or deleted in the commit.

Motivation

Currently code-graph requires a full re-index to reflect codebase changes. For large repositories this is slow and wasteful. Continuous incremental updates would make code-graph viable as a live knowledge source for AI-assisted development tools (e.g. Claude Code via MCP), CI pipelines, and developer dashboards — where the graph must reflect the latest state of main at all times.

Proposed Behavior

  1. Trigger: On each push/merge to the tracked branch, the system receives a notification (Git webhook, polling, or filesystem watch).
  2. Diff extraction: Determine which files were added, modified, or deleted in the commit(s) since the last indexed commit SHA.
  3. Incremental graph update (a deletion sketch follows this list):
    • Deleted files — remove all nodes and edges originating from those files.
    • Modified files — remove existing nodes/edges for the file, re-parse, and re-insert.
    • Added files — parse and insert new nodes and edges.
    • Cross-file edges — recompute edges (calls, imports, inheritance) that involve any touched file, and prune stale edges whose targets no longer exist.
  4. Bookmark: Persist the last successfully indexed commit SHA so the system can resume correctly after restarts or failures.
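
A sketch of the per-file deletion step against FalkorDB using its Python client; the label-free MATCH and the path property are assumptions about the graph schema:

from falkordb import FalkorDB

db = FalkorDB(host="localhost", port=6379)
g = db.select_graph("my-repo")  # hypothetical graph name

def delete_file_nodes(path: str) -> None:
    # DETACH DELETE removes the file's nodes and any dangling edges in one step
    g.query("MATCH (n {path: $path}) DETACH DELETE n", {"path": path})

for path in ["src/removed.py", "src/changed.py"]:  # deleted + modified files
    delete_file_nodes(path)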

Design Considerations

  • Atomicity — Graph updates for a single commit should be applied as a transaction so queries never see a half-updated state. Consider wrapping the delete + re-insert cycle in a FalkorDB transaction or using a shadow-graph swap approach for larger changesets.
  • Batch commits — If the watcher falls behind (e.g. service was down), it should be able to squash multiple commits into a single cumulative diff rather than replaying one-by-one.
  • Trigger modes — Support at least two modes:
    • Webhook — HTTP endpoint that receives a GitHub/GitLab push event payload.
    • Poll — Periodically check the remote branch HEAD and update if it has advanced.
    • (Optional) Filesystem watch — for local-only setups using inotify/fswatch on a bare repo.
  • Concurrency — Graph reads (MCP queries, API requests) should not be blocked during an update. Consider read/write isolation or short lock windows.
  • Idempotency — Re-processing the same commit SHA should be a no-op.
  • Logging & observability — Each update cycle should log: trigger commit SHA, files affected, nodes/edges added/removed, duration, and any parse errors (with the update continuing past unparseable files).

Suggested Implementation Phases

Phase 1 — Core incremental update engine

  • Given a before/after commit SHA pair, compute the file diff, update the graph accordingly, and persist the new bookmark.
  • Unit-testable in isolation (no webhook needed, just call with two SHAs).

Phase 2 — Trigger integration

  • Add the webhook HTTP endpoint (GitHub/GitLab push event format).
  • Add the poll-based watcher as an alternative.
  • Configuration: tracked branch name, poll interval, webhook secret.

Phase 3 — Robustness & observability

  • Batch catch-up for missed commits.
  • Metrics endpoint or structured logs (commits processed, lag, errors).
  • Graceful handling of force-pushes / rebases (detect non-fast-forward and trigger a full re-index as fallback; a detection sketch follows).
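
One way to detect a non-fast-forward push with pygit2 (a sketch; is_fast_forward is a hypothetical helper, not part of this PR):

import pygit2

def is_fast_forward(repo_path: str, from_sha: str, to_sha: str) -> bool:
    repo = pygit2.Repository(repo_path)
    old = repo.revparse_single(from_sha).peel(pygit2.Commit).id
    new = repo.revparse_single(to_sha).peel(pygit2.Commit).id
    # fast-forward means the new commit descends from the old one
    return old == new or repo.descendant_of(new, old)

# if not is_fast_forward(path, bookmark, pushed_sha): fall back to a full re-index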

Acceptance Criteria

  • Pushing a commit to main that adds a new function results in the corresponding node and call-edges appearing in the graph within a configurable time window (default < 30s for webhook mode).
  • Renaming/moving a function removes the old node and creates a new one with correct edges.
  • Deleting a file removes all its nodes and any dangling edges.
  • The system recovers cleanly after a restart, picking up from the last indexed SHA.
  • A full re-index can still be triggered manually as a fallback.
    </issue_description>

Comments on the Issue (you are @copilot in this section)



Summary by CodeRabbit

Release Notes

  • New Features

    • Added webhook endpoint to trigger incremental graph updates via push events
    • Added automatic background polling to keep the graph continuously updated based on repository changes
    • Introduced configurable webhook security (HMAC-SHA256 validation) and polling intervals
  • Documentation

    • Updated configuration guide with new environment variables for webhook and polling settings
    • Documented the new webhook endpoint for receiving push events

- Add api/git_utils/incremental_update.py with incremental_update(),
  fetch_remote(), get_remote_head(), and repo_local_path() helpers
- Export new functions from api/git_utils/__init__.py
- Add POST /api/webhook endpoint with HMAC-SHA256 validation,
  branch filtering, and repo URL matching
- Add background poll watcher via FastAPI lifespan (_poll_loop,
  _poll_all_repos, _poll_repo)
- Add WEBHOOK_SECRET, TRACKED_BRANCH, POLL_INTERVAL env vars
- Document new env vars in .env.template
- Add tests/test_webhook.py with unit tests

Co-authored-by: gkorland <753206+gkorland@users.noreply.github.com>
        logger.exception(
            "Webhook: incremental update failed for '%s': %s", repo_name, exc
        )
        return JSONResponse({"status": "error", "detail": str(exc)}, status_code=500)

Check warning

Code scanning / CodeQL

Information exposure through an exception (Medium)

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix

In general, the fix is to avoid sending raw exception details back to the client. Instead, log the full exception on the server (which is already being done via logger.exception) and return a generic, non-sensitive error message (and optionally a simple error code) in the HTTP response. This preserves current behavior in terms of status codes and success/error structure, but eliminates leakage of internal details.

Concretely in api/index.py, within the webhook function’s try/except block, we should change the JSONResponse returned in the except clause at line 679 so that it no longer contains str(exc) in the detail field. A suitable replacement is a constant, generic message such as "Internal server error during webhook processing". Because logger.exception already records the stack trace and exception message, no extra logging changes are needed. No new imports or helpers are required; we only adjust the literal string in the JSON payload.


Suggested changeset 1
api/index.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/index.py b/api/index.py
--- a/api/index.py
+++ b/api/index.py
@@ -676,7 +676,13 @@
         logger.exception(
             "Webhook: incremental update failed for '%s': %s", repo_name, exc
         )
-        return JSONResponse({"status": "error", "detail": str(exc)}, status_code=500)
+        return JSONResponse(
+            {
+                "status": "error",
+                "detail": "Internal server error during webhook processing",
+            },
+            status_code=500,
+        )
 
     return {"status": "success", **result}
 
EOF

Copilot AI changed the title [WIP] Add continuous graph updates via Git webhook Add continuous graph updates via Git webhook and poll watcher Mar 13, 2026
Copilot AI requested a review from gkorland March 13, 2026 17:30
@gkorland gkorland marked this pull request as ready for review March 14, 2026 08:54
gkorland and others added 3 commits March 14, 2026 10:59
Resolve PR #615 conflicts, preserve the continuous update configuration, and stabilize the webhook incremental-update tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gkorland gkorland requested a review from Copilot March 14, 2026 17:27
@gkorland commented

@CodeRabbit review


coderabbitai bot commented Mar 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Mar 14, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6307529-7c56-43a9-ab3f-1308907d5e6c

📝 Walkthrough

This pull request implements continuous incremental graph updates via webhooks and background polling. It adds configuration for webhook secrets, branch tracking, and polling intervals; introduces the incremental update engine for diff-based graph synchronization; exposes a POST /api/webhook endpoint with HMAC-SHA256 signature validation; and includes a background poll-watcher task with comprehensive test coverage.

Changes

Cohort / File(s) — Summary

Configuration — .env.template
Added three environment variables: WEBHOOK_SECRET (for HMAC validation), TRACKED_BRANCH (default: main), and POLL_INTERVAL (polling interval in seconds; 0 disables).

Documentation — README.md
Documented new environment variables and described the /api/webhook endpoint for receiving push events and triggering incremental graph updates.

Core Incremental Update Engine — api/git_utils/incremental_update.py
Implements incremental graph updates: repo_local_path() resolves repository paths; fetch_remote() pulls latest refs; get_remote_head() retrieves branch HEAD; incremental_update() computes file diffs, removes deleted/modified nodes from the graph, and inserts/updates added/modified files. Returns a summary with counts of changes and persists the updated commit bookmark in Redis.

API Integration & Webhook Handler — api/index.py
Added POST /api/webhook endpoint with optional HMAC-SHA256 signature validation, branch filtering via TRACKED_BRANCH, and invocation of incremental_update(). Integrated background poll-watcher task managed via FastAPI lifespan context manager to periodically fetch and update indexed repositories. Added repository URL matching utilities for webhook-to-repo resolution.

Public API Refactoring — api/git_utils/__init__.py
Replaced wildcard imports with explicit, named re-exports and an __all__ list to clearly define the public API (GitRepoName, GitGraph, build_commit_graph, classify_changes, fetch_remote, get_remote_head, incremental_update, is_ignored, repo_local_path, switch_commit).

Comprehensive Test Suite — tests/test_webhook.py
Tests for URL matching logic, webhook endpoint behavior (open mode and HMAC-secured mode), payload validation, signature verification, and incremental update unit tests covering idempotence and error handling. Includes payload generation and signing utilities.

Sequence Diagram(s)

sequenceDiagram
    actor Client as Push Event/<br/>Polling Loop
    participant API as FastAPI Server<br/>/api/webhook
    participant Repo as Git Repository
    participant Analyzer as Source Analyzer
    participant Graph as FalkorDB<br/>Graph Database
    participant Redis as Redis<br/>(Bookmark)
    
    alt Webhook Path
        Client->>API: POST /api/webhook<br/>(with signature & payload)
        API->>API: Validate HMAC-SHA256<br/>Verify branch match
        API->>Repo: Extract repo from payload
    else Polling Path
        Client->>API: Background poll-watcher<br/>triggers periodically
        API->>Repo: Fetch remote & check HEAD
    end
    
    API->>Repo: Resolve from_sha, to_sha<br/>(current & latest)
    Repo->>API: Return commit objects
    API->>Repo: Compute file diff<br/>(added, modified, deleted)
    Repo->>API: Return file changeset
    
    API->>Repo: Checkout target commit
    Repo->>Repo: Update working tree
    
    Note over Graph,Analyzer: Process Changed Files
    
    par Remove Deleted/Modified
        API->>Graph: DELETE nodes & edges<br/>for removed/modified files
    and Analyze & Insert Changed
        API->>Analyzer: Analyze added/<br/>modified files
        Analyzer->>API: Return AST/symbols
        API->>Graph: INSERT new nodes/<br/>UPDATE existing edges
    end
    
    Graph->>Graph: Return change summary<br/>(added, modified, deleted counts)
    
    API->>Redis: Persist new commit SHA<br/>as bookmark
    Redis->>API: Acknowledge
    
    API->>Client: Return 200 with<br/>update summary

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A webhook hops, a poller twitches,
Git diffs dance through incremental switches,
No full re-graph, just delta delights,
Graphs stay fresh through the day and nights!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 55.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title accurately and concisely summarizes the main changes: adding Git webhook and poll watcher capabilities for continuous graph updates.
  • Linked Issues Check — ✅ Passed. The PR implements core requirements from #614: incremental update engine with diff-based file changes, webhook endpoint with signature validation, poll watcher with configurable intervals, and persistence of commit bookmarks via Redis.
  • Out of Scope Changes Check — ✅ Passed. All changes align with #614 objectives. The explicit __all__ in __init__.py clarifies the public API for the new incremental_update module without introducing unrelated functionality.



Copilot AI left a comment

Pull request overview

This PR adds a continuous/incremental graph update mechanism to avoid full re-indexes on every repo change, integrating both a push-webhook trigger and a background poll-watcher into the FastAPI backend.

Changes:

  • Added api/git_utils/incremental_update.py to compute file-level diffs between two SHAs and update the FalkorDB graph incrementally while persisting the new commit bookmark in Redis.
  • Added POST /api/webhook plus URL-matching helpers and FastAPI lifespan-managed poll-watcher to trigger incremental updates automatically.
  • Documented new environment variables and the new webhook endpoint in .env.template and README.md, plus added unit tests for the webhook and incremental-update helpers.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

File — Description
api/index.py — Adds webhook endpoint, URL matching, poll-watcher loop, and lifespan wiring to trigger incremental updates.
api/git_utils/incremental_update.py — Implements diff-based incremental update flow (checkout, delete stale file nodes, re-analyze changed files, update Redis bookmark).
api/git_utils/__init__.py — Replaces wildcard export with explicit exports, including the new incremental update helpers.
tests/test_webhook.py — Adds unit tests for URL matching, webhook behavior (open/secured), and basic incremental_update edge cases.
README.md — Documents new env vars and the /api/webhook endpoint.
.env.template — Adds WEBHOOK_SECRET, TRACKED_BRANCH, and POLL_INTERVAL configuration.


Comment on lines +194 to +201
    logger.info(
        "Poll: new commits detected for '%s' (%s -> %s), updating …",
        repo_name, current_sha, remote_head,
    )
    try:
        result = incremental_update(repo_name, current_sha, remote_head)
        logger.info("Poll: '%s' updated — %s", repo_name, result)
    except Exception as exc:
Comment on lines +497 to +500
    def _update() -> dict:
        path = repo_local_path(repo_name)
        if path.exists():
            fetch_remote(path)
Comment on lines +438 to +482
"""Receive a GitHub/GitLab push event and trigger an incremental graph update.

When ``WEBHOOK_SECRET`` is set the endpoint validates the
``X-Hub-Signature-256`` header using HMAC-SHA256; requests with a missing
or invalid signature are rejected with **401 Unauthorized**.

Only pushes to the branch configured via ``TRACKED_BRANCH`` (default
``main``) trigger an update; pushes to other branches are acknowledged
with a ``200 ignored`` response so that GitHub does not retry them.

The repository is identified by matching the ``repository.clone_url``
field in the payload against the URLs stored for already-indexed
repositories.
"""
body = await request.body()

# Validate HMAC-SHA256 signature when a secret is configured
if WEBHOOK_SECRET:
sig_header = request.headers.get("X-Hub-Signature-256", "")
mac = hmac.new(WEBHOOK_SECRET.encode(), body, hashlib.sha256)
expected_sig = "sha256=" + mac.hexdigest()
if not hmac.compare_digest(sig_header, expected_sig):
raise HTTPException(status_code=401, detail="Invalid webhook signature")

try:
payload = await request.json()
except Exception:
raise HTTPException(status_code=400, detail="Invalid JSON payload")

ref = payload.get("ref", "")
before = payload.get("before", "")
after = payload.get("after", "")
repo_url = payload.get("repository", {}).get("clone_url", "")

# Only process pushes to the configured tracked branch
expected_ref = f"refs/heads/{TRACKED_BRANCH}"
if ref != expected_ref:
logger.debug("Webhook: ignoring push to '%s' (tracking '%s')", ref, expected_ref)
return {"status": "ignored", "reason": f"Branch not tracked: {ref}"}

if not before or not after or not repo_url:
raise HTTPException(
status_code=400,
detail="Payload missing required fields: ref, before, after, repository.clone_url",
)
Comment on lines +120 to +129
    if from_sha == to_sha:
        logger.info(
            "incremental_update: from_sha == to_sha (%s); nothing to do", from_sha
        )
        return {
            "files_added": 0,
            "files_modified": 0,
            "files_deleted": 0,
            "commit": to_sha,
        }
Comment on lines +155 to +193
def _poll_repo(repo_name: str) -> None:
    """Fetch remote and apply incremental updates for *repo_name* if behind.

    This function is intentionally synchronous so it can be safely offloaded
    to ``asyncio``'s default ``ThreadPoolExecutor``.
    """
    path = repo_local_path(repo_name)
    if not path.exists():
        logger.debug("Poll: local clone not found for '%s', skipping", repo_name)
        return

    try:
        fetch_remote(path)
    except Exception as exc:
        logger.warning("Poll: git fetch failed for '%s': %s", repo_name, exc)
        return

    remote_head = get_remote_head(path, TRACKED_BRANCH)
    if not remote_head:
        return

    current_sha = get_repo_commit(repo_name)
    if not current_sha:
        logger.debug("Poll: no stored commit for '%s', skipping", repo_name)
        return

    # Handle comparison between short (7-char) and full (40-char) SHAs: a short
    # stored SHA is a valid prefix of a full remote SHA for the same commit.
    # We only apply prefix matching when the stored SHA is shorter.
    if len(current_sha) < len(remote_head):
        up_to_date = remote_head.startswith(current_sha)
    elif len(current_sha) > len(remote_head):
        up_to_date = current_sha.startswith(remote_head)
    else:
        up_to_date = current_sha == remote_head
    if up_to_date:
        logger.debug("Poll: '%s' is up-to-date at %s", repo_name, current_sha)
        return

coderabbitai bot left a comment

Actionable comments posted: 5


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cea312f7-19cf-400b-a510-c28520dca5e1

📥 Commits

Reviewing files that changed from the base of the PR and between 887b82f and 40441bf.

📒 Files selected for processing (6)
  • .env.template
  • README.md
  • api/git_utils/__init__.py
  • api/git_utils/incremental_update.py
  • api/index.py
  • tests/test_webhook.py

Comment on lines +166 to +185
    repo.checkout_tree(to_commit.tree, strategy=CheckoutStrategy.FORCE)
    repo.set_head_detached(to_commit.id)

    # Apply graph changes
    g = Graph(repo_name)

    files_to_remove = deleted + modified
    if files_to_remove:
        logger.info("Removing %d file(s) from graph", len(files_to_remove))
        g.delete_files(files_to_remove)

    files_to_add = added + modified
    if files_to_add:
        logger.info("Inserting/updating %d file(s) in graph", len(files_to_add))
        analyzer.analyze_files(files_to_add, repo_path, g)

    # Persist the new commit bookmark using the short ID for consistency
    # with the rest of the system (build_commit_graph, analyze_sources …)
    new_commit_short = to_commit.short_id
    set_repo_commit(repo_name, new_commit_short)

⚠️ Potential issue | 🔴 Critical

Guard the entire update with a repo-scoped lock.

This function force-checks out to_sha, mutates graph state, and then advances the Redis bookmark. api/index.py can reach it from both the poll watcher and /api/webhook, so overlapping runs for the same repo can interleave and leave the checkout, graph, and bookmark describing different commits. Wrap the whole mutation in an exclusive repo-scoped lock here (or a distributed lock if you run multiple workers).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/git_utils/incremental_update.py` around lines 166 - 185, Wrap the entire
mutation sequence that performs repo.checkout_tree/to_commit set_head_detached,
Graph(repo_name) updates, analyzer.analyze_files(...) and set_repo_commit(...)
in an exclusive repo-scoped lock (keyed on repo_name) so concurrent runs for the
same repo cannot interleave; acquire the lock before calling repo.checkout_tree
and hold it until after set_repo_commit (or until commit of all graph changes),
use a distributed-lock primitive if you have multiple workers, set a sensible
timeout and ensure the lock is always released in a finally/cleanup block and
that errors are logged/propagated while still releasing the lock.
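
A process-local sketch of such a lock, keyed by repo name; a Redis-based distributed lock would be needed with multiple workers, and the timeout value is arbitrary:

import threading
from collections import defaultdict

from api.git_utils import incremental_update  # exported by this PR

_repo_locks: defaultdict[str, threading.Lock] = defaultdict(threading.Lock)

def locked_incremental_update(repo_name: str, from_sha: str, to_sha: str) -> dict:
    lock = _repo_locks[repo_name]
    if not lock.acquire(timeout=300):  # avoid waiting forever on a stuck update
        raise TimeoutError(f"update already in progress for '{repo_name}'")
    try:
        # checkout, graph mutation, and bookmark write all happen under the lock
        return incremental_update(repo_name, from_sha, to_sha)
    finally:
        lock.release()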

Comment on lines +172 to +180
    files_to_remove = deleted + modified
    if files_to_remove:
        logger.info("Removing %d file(s) from graph", len(files_to_remove))
        g.delete_files(files_to_remove)

    files_to_add = added + modified
    if files_to_add:
        logger.info("Inserting/updating %d file(s) in graph", len(files_to_add))
        analyzer.analyze_files(files_to_add, repo_path, g)

⚠️ Potential issue | 🔴 Critical

Recompute reverse dependency edges, not just the changed files.

g.delete_files(deleted + modified) removes the old definitions and their relationships, but analyzer.analyze_files(files_to_add, repo_path, g) only revisits added + modified. From the provided SourceAnalyzer.analyze_files() snippet, untouched callers/importers never get reprocessed, so inbound cross-file edges to the changed files stay missing or stale after the update.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/git_utils/incremental_update.py` around lines 172 - 180, The current
update deletes definitions for deleted+modified files via g.delete_files and
re-analyzes only added+modified via analyzer.analyze_files, which leaves inbound
callers/importers stale; after computing files_to_add and files_to_remove,
collect the transitive set of dependent files that import or call those changed
files (using the graph g's reverse edges / dependency lookup), add those
dependent filenames to files_to_add (excluding already-deleted files), and then
call analyzer.analyze_files on this expanded files_to_add so
SourceAnalyzer.analyze_files reprocesses untouched callers and restores inbound
edges; ensure you use g's dependency/query methods (the graph instance g) to
find dependents before calling analyzer.analyze_files.
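
A sketch of what that dependent-file expansion could look like; the CALLS/IMPORTS edge types and the path property are guesses about the graph schema:

def dependent_files(g, changed_paths: list[str]) -> set[str]:
    """Paths of files whose entities call or import anything in changed_paths."""
    res = g.query(
        "MATCH (src)-[:CALLS|IMPORTS]->(dst) "
        "WHERE dst.path IN $paths AND NOT src.path IN $paths "
        "RETURN DISTINCT src.path",
        {"paths": changed_paths},
    )
    return {row[0] for row in res.result_set}

# expand the re-analysis set before deleting: changed files plus their inbound
# dependents, minus files removed in this update
# files_to_add = sorted(set(files_to_add) | (dependent_files(g, files_to_remove) - set(deleted)))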

Comment on lines +114 to +122
# HMAC-SHA256 secret shared with GitHub/GitLab. Leave unset to skip
# signature validation (not recommended for production).
WEBHOOK_SECRET: str = os.getenv("WEBHOOK_SECRET", "")

# Branch whose pushes trigger incremental graph updates.
TRACKED_BRANCH: str = os.getenv("TRACKED_BRANCH", "main")

# Seconds between automatic poll checks (0 = disabled).
POLL_INTERVAL: int = int(os.getenv("POLL_INTERVAL", "60"))

⚠️ Potential issue | 🔴 Critical

Do not leave /api/webhook anonymously writable by default.

WEBHOOK_SECRET defaults to empty, and this mutating route does not use token_required, so a stock deployment exposes an unauthenticated endpoint that can fetch repos and rewrite graph state. Make webhook auth mandatory at startup, or fall back to token_required when webhook signing is not configured. As per coding guidelines "api/index.py: Use token_required decorator for mutating endpoint authorization".

Also applies to: 436-460

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/index.py` around lines 114 - 122, The webhook endpoint is anonymously
writable because WEBHOOK_SECRET defaults to empty; update startup and the
/api/webhook handler so webhook auth is mandatory: at startup (when reading
WEBHOOK_SECRET) fail fast with a clear error if it's empty in production mode,
or modify the webhook route handler to apply token_required when WEBHOOK_SECRET
is not set (use the existing token_required decorator) so the mutating endpoint
is never unauthenticated; ensure you update the webhook handler (the
/api/webhook function) to prefer HMAC verification when WEBHOOK_SECRET is set
and fallback to token_required otherwise, and add a clear log message indicating
which auth mode is in effect.
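
A compact sketch of the suggested fail-fast; the ENVIRONMENT flag is hypothetical:

import os

WEBHOOK_SECRET = os.getenv("WEBHOOK_SECRET", "")

# refuse to start an internet-facing deployment without webhook auth
if not WEBHOOK_SECRET and os.getenv("ENVIRONMENT") == "production":
    raise RuntimeError("WEBHOOK_SECRET must be set when ENVIRONMENT=production")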

api/index.py Outdated
Comment on lines +455 to +470
if WEBHOOK_SECRET:
sig_header = request.headers.get("X-Hub-Signature-256", "")
mac = hmac.new(WEBHOOK_SECRET.encode(), body, hashlib.sha256)
expected_sig = "sha256=" + mac.hexdigest()
if not hmac.compare_digest(sig_header, expected_sig):
raise HTTPException(status_code=401, detail="Invalid webhook signature")

try:
payload = await request.json()
except Exception:
raise HTTPException(status_code=400, detail="Invalid JSON payload")

ref = payload.get("ref", "")
before = payload.get("before", "")
after = payload.get("after", "")
repo_url = payload.get("repository", {}).get("clone_url", "")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

GitLab push webhook secret header name and repository URL fields in the push payload. Does GitLab send X-Hub-Signature-256 and repository.clone_url, or does it use X-Gitlab-Token plus repository.git_http_url/project.git_http_url?

💡 Result:

  • Secret/verification header: GitLab’s configured Secret token is sent in the X-Gitlab-Token HTTP header (not X-Hub-Signature-256, which is GitHub’s). [1]

    • Separately, GitLab has/introduced webhook signing headers X-Gitlab-Signature and X-Gitlab-Signature-Timestamp (HMAC-SHA256), but these are GitLab-specific headers—still not X-Hub-Signature-256. [2]
  • Repository URL fields in Push payload: GitLab’s push payload includes repository.git_http_url / repository.git_ssh_url and also a project object with project.git_http_url / project.git_ssh_url (and notes older project.http_url is deprecated). It does not use repository.clone_url (that’s a GitHub-style field). [3]

Sources
[1] GitLab Docs — Webhooks (Secret token header X-Gitlab-Token) (docs.gitlab.com)
[2] GitLab issue — webhook signing headers X-Gitlab-Signature / X-Gitlab-Signature-Timestamp (gitlab.com)
[3] GitLab Docs — Webhook events (Push event payload fields incl. repository.git_http_url, project.git_http_url, deprecations) (docs.gitlab.com)


Webhook handler only supports GitHub, not GitLab as documented.

The code checks for the X-Hub-Signature-256 header and reads repository.clone_url, which are GitHub-specific. GitLab webhooks use X-Gitlab-Token or X-Gitlab-Signature headers and provide repository URLs in repository.git_http_url or project.git_http_url. This will cause GitLab deliveries to be rejected or mis-parsed. Either add GitLab-specific logic here or update documentation to reflect GitHub-only support.

🧰 Tools
🪛 Ruff (0.15.5)

[warning] 464-464: Do not catch blind exception: Exception

(BLE001)


[warning] 465-465: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/index.py` around lines 455 - 470, The webhook handler currently enforces
GitHub-specific headers and payload fields (it reads X-Hub-Signature-256 into
sig_header, builds expected_sig from WEBHOOK_SECRET, and reads
repository.clone_url into repo_url), which rejects GitLab webhooks; update the
logic in the handler to detect GitLab deliveries by checking for X-Gitlab-Token
or X-Gitlab-Signature when X-Hub-Signature-256 is absent, validate the
token/signature using the configured secret (respecting GitLab’s verification
method), and when parsing payload fall back to repository.git_http_url or
project.git_http_url if repository.clone_url is missing; alternatively, if you
prefer to keep GitHub-only behavior, update documentation to state the webhook
supports GitHub only and explicitly fail with a clear message when GitLab
headers are present.
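
A sketch of the provider branching this comment suggests; the header names and payload fields come from the web-query result above, and the GitHub signature check is assumed to happen elsewhere:

import hmac

def extract_push_info(headers, payload, secret: str) -> tuple[bool, str]:
    """Return (authorized, repo_url) for a GitHub or GitLab push delivery."""
    if "X-Gitlab-Token" in headers:
        # GitLab sends the configured secret verbatim in this header
        ok = hmac.compare_digest(headers["X-Gitlab-Token"], secret)
        repo_url = (payload.get("repository", {}).get("git_http_url")
                    or payload.get("project", {}).get("git_http_url", ""))
        return ok, repo_url
    # GitHub: body HMAC arrives in X-Hub-Signature-256, validated elsewhere
    return True, payload.get("repository", {}).get("clone_url", "")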

Comment on lines +497 to +505
    def _update() -> dict:
        path = repo_local_path(repo_name)
        if path.exists():
            fetch_remote(path)
        return incremental_update(repo_name, before, after)

    loop = asyncio.get_running_loop()
    try:
        result = await loop.run_in_executor(None, _update)

⚠️ Potential issue | 🔴 Critical

Use the stored repo bookmark as from_sha.

incremental_update() assumes from_sha matches the graph's current commit, but the webhook path always forwards payload.before. If the service missed an earlier delivery or restarted behind, the stored bookmark can lag behind before, and this update will skip the missing diff range while still moving the graph to after. Read get_repo_commit(repo_name) here and fall back to a full reindex when it does not line up with the push event.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/index.py` around lines 497 - 505, The code calls
incremental_update(repo_name, before, after) using the webhook payload.before;
instead, retrieve the stored bookmark via get_repo_commit(repo_name) and use
that as from_sha when calling incremental_update (i.e., pass
get_repo_commit(repo_name) as the first/from argument), but detect mismatch: if
the stored bookmark is missing or does not equal the graph's current commit that
lines up with the push history (or cannot reach payload.before), fall back to
performing a full reindex of the repo (call the existing full reindex routine)
rather than running a partial incremental_update; update the _update closure
(and its call site around repo_local_path, fetch_remote, and
loop.run_in_executor) to implement this branching logic so the graph never
advances past gaps.
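
The suggested branching, sketched against the _update closure shown above; full_reindex is a hypothetical stand-in for the project's existing re-index entry point:

def _update() -> dict:
    path = repo_local_path(repo_name)
    if path.exists():
        fetch_remote(path)
    stored = get_repo_commit(repo_name)
    # trust the stored bookmark, not the payload's "before"; any gap between
    # the two means history was missed, so rebuild from scratch instead
    if not stored or not (before.startswith(stored) or stored.startswith(before)):
        return full_reindex(repo_name)  # hypothetical full re-index entry point
    return incremental_update(repo_name, stored, after)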

Reprocess dependent files during incremental updates, add repo-scoped update locking, harden webhook auth and provider handling, and fall back to full reindex when the stored bookmark no longer matches incoming history.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    if poll_task is not None:
        poll_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await poll_task


Development

Successfully merging this pull request may close these issues.

Continuous graph updates via Git webhook / branch watcher

3 participants