
mesh: set preferred source for WireGuard routes #4

Merged

kvaps merged 2 commits into main from fix/mesh-route-src on Feb 14, 2026
Conversation

@kvaps kvaps (Member) commented Feb 13, 2026

Set Route.Src for routes installed via the kilo interface to the node's private IP when available.

Without an explicit source, the kernel may pick the WireGuard overlay address (for example 100.66.0.x). In environments like Azure SDN this can break return traffic because the overlay address is not routable by the underlay.

Changes in this PR:

  • set Src on WireGuard-routed netlink.Route entries in Topology.Routes;
  • keep behavior unchanged when a node has no private IP (source remains unset);
  • update TestRoutes expectations to assert Src in logical/full topology scenarios.
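
A minimal sketch of the idea (illustrative only; the function and variable names below are assumptions made for this description, not the exact code in pkg/mesh/routes.go):

package mesh

import (
	"net"

	"github.com/vishvananda/netlink"
)

// preferredSrc returns the node's private IP when one is known, or nil so the
// kernel keeps selecting the source address (the previous behavior).
func preferredSrc(privateIP *net.IPNet) net.IP {
	if privateIP == nil {
		return nil
	}
	return privateIP.IP
}

// wireGuardRoute builds a route over the kilo interface with Src pinned to the
// preferred source; dst and ifaceIndex stand in for the real topology inputs.
func wireGuardRoute(dst *net.IPNet, ifaceIndex int, src net.IP) *netlink.Route {
	return &netlink.Route{
		Dst:       dst,
		LinkIndex: ifaceIndex,
		Src:       src, // nil means the kernel may still pick the overlay address
	}
}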

Summary by CodeRabbit

  • Improvements

    • Improved route handling by selecting a preferred source address for WireGuard traffic, enhancing routing reliability.
  • Tests

    • Updated tests to validate the revised route configurations and source-address behavior.
  • New Features

    • Added a Kubernetes deployment and RBAC resources to synchronize Azure route tables from the cluster, enabling automated route updates across subnets.

Set Route.Src for routes installed via the kilo interface to the node's private IP when available.

Without an explicit source, the kernel may pick the WireGuard overlay address (for example 100.66.0.x). In environments like Azure SDN this can break return traffic because the overlay address is not routable by the underlay.

Update route expectations in TestRoutes to assert Src across the affected logical and full topology cases.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@gemini-code-assist

Summary of Changes

Hello @kvaps, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances network routing stability by ensuring that WireGuard-based routes use the node's private IP as the source address. This change prevents potential routing failures in specific cloud environments, such as Azure SDN, where the default kernel-selected overlay IP might not be routable by the underlying network fabric. The modification is conditional, preserving existing behavior when a private IP is not configured, and is validated with updated test cases.

Highlights

  • WireGuard Route Source: Explicitly set the source IP (Route.Src) for routes installed via the Kilo interface to the node's private IP when available.
  • Azure SDN Compatibility: Addressed an issue in environments like Azure SDN where the kernel might incorrectly pick the WireGuard overlay address as the source, breaking return traffic.
  • Test Coverage: Updated existing route tests to assert that the Src field is correctly set in logical and full topology scenarios.


Changelog
  • pkg/mesh/routes.go
    • Introduced logic to determine and set a preferred source IP for WireGuard routes based on the node's private IP.
    • Applied the determined source IP to various netlink.Route entries for WireGuard interfaces.
  • pkg/mesh/routes_test.go
    • Updated test expectations across multiple scenarios to include assertions for the Src field in WireGuard routes.

@coderabbitai

coderabbitai bot commented Feb 13, 2026


📝 Walkthrough

Adds propagation of a preferred route source IP (Src) derived from a node's private IP across route construction and encapsulation logic in the mesh package, updates tests to include Src, and adds a new Kubernetes manifest/deployment to synchronize Azure route tables from a cluster-side pod.

Changes

  • Routes source propagation (pkg/mesh/routes.go): introduces a local src (from t.privateIP.IP) and propagates Src into netlink.Route constructions for the tunl/mesh/encapsulation paths; encapsulateRoute conditionally applies route.Src when a subnet is present (see the sketch after this list).
  • Route tests (pkg/mesh/routes_test.go): augments test route literals with Src set to the corresponding node InternalIP.IP across TestRoutes to reflect the new route Src expectations.
  • Azure route sync deployment (manifests/kilo-azure-route-sync-deployment.yaml): adds a ServiceAccount, ClusterRole, ClusterRoleBinding, and a Deployment running an Azure CLI-based sync script that watches nodes for a leader, logs into Azure, and creates/updates route-table entries per the AZURE_ROUTES configuration.
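
To illustrate the conditional part described above, a rough sketch (this is not Kilo's actual encapsulateRoute signature; the names here are assumptions): the preferred source is only applied when a local subnet is known.

package mesh

import (
	"net"

	"github.com/vishvananda/netlink"
)

// applyEncapsulationSrc is a simplified stand-in for the encapsulation path:
// when a local subnet is known, the route is pointed at the tunnel interface
// and the preferred source is applied; otherwise the route is left untouched.
func applyEncapsulationSrc(route *netlink.Route, subnet *net.IPNet, tunlIndex int, src net.IP) *netlink.Route {
	if subnet == nil {
		return route
	}
	route.LinkIndex = tunlIndex
	if src != nil {
		route.Src = src
	}
	return route
}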

Sequence Diagram(s)

mermaid
sequenceDiagram
participant SyncPod as Sync Pod
participant K8sAPI as Kubernetes API
participant LeaderNode as Leader Node (topology label)
participant AzureAPI as Azure REST/API
participant RouteTable as Azure Route Table

SyncPod->>K8sAPI: watch Nodes with topology label
K8sAPI-->>SyncPod: node list/events
SyncPod->>K8sAPI: identify leader node
SyncPod-->>LeaderNode: confirm leader (read labels/status)
SyncPod->>AzureAPI: az login (service principal)
AzureAPI-->>SyncPod: auth token
SyncPod->>AzureAPI: update subnet route-table association
AzureAPI-->>RouteTable: create/update routes (VirtualAppliance next-hop)
RouteTable-->>SyncPod: operation result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop where routes find light and flow,
I tuck a source where packets go,
I peek at leaders, call Azure near,
And stitch the paths so data's clear,
A fluffy sync, in bash and byte, I cheer!

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the pull request title 'mesh: set preferred source for WireGuard routes' directly and accurately describes the main change: setting the Src field on WireGuard routes to use the node's private IP as the preferred source address.
  • Merge Conflict Detection ✅ Passed: no merge conflicts detected when merging into main.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of the kernel picking unroutable overlay IP addresses for WireGuard routes in certain environments by explicitly setting the source IP. The changes in pkg/mesh/routes.go correctly introduce a preferred source IP based on the node's private IP and apply it to the netlink.Route entries. The corresponding updates in pkg/mesh/routes_test.go ensure that the new behavior is properly tested and validated. The code is clear, concise, and directly implements the described solution.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
manifests/kilo-azure-route-sync-deployment.yaml (3)

115-129: kubectl get -w has no reconnect logic — watch silently exits on API server disconnect.

If the watch connection drops (API server restart, network blip, HTTP2 timeout), the while read loop terminates and the container exits. With restartPolicy: Always (Deployment default) the pod restarts, but it will re-run az login and re-download kubectl each time.

Consider wrapping the watch in an outer while true loop with a brief backoff, or use a controller/informer pattern for resilience.

♻️ Sketch: reconnecting watch loop
+         while true; do
           kubectl get node -w -l topology.kubernetes.io/zone=azure --no-headers \
             -o 'custom-columns=NAME:.metadata.name,LEADER:.metadata.annotations.kilo\.squat\.ai/leader,IP:.status.addresses[?(@.type=="InternalIP")].address' \
             | while read -r n leader ip; do
                 echo "$(date -Iseconds) event node=${n} leader=${leader} ip=${ip}"
                 [ "$leader" = "true" ] || continue
                 az network vnet subnet update \
                   -g "$AZURE_RESOURCE_GROUP" \
                   --vnet-name "$AZURE_VNET_NAME" \
                   -n "$AZURE_SUBNET_NAME" \
                   --route-table "$AZURE_ROUTE_TABLE" >/dev/null
 
                 sync_all_routes "$ip"
 
                 echo "$(date -Iseconds) synced routes to leader ${n} (${ip})"
               done
+           echo "$(date -Iseconds) watch disconnected, reconnecting in 5s..."
+           sleep 5
+         done

76-82: Hardcoded Azure resource names reduce reusability.

AZURE_ROUTE_TABLE, AZURE_VNET_NAME, AZURE_SUBNET_NAME, and AZURE_ROUTES are all hardcoded for a specific environment (workers-serverscom, cozystack-vnet). Consider sourcing these from a ConfigMap or additional Secret refs so the manifest can be reused across environments without modification.


95-101: Silent error swallowing on route create may mask permission or config issues.

The || true on Line 97 is intended to handle "route already exists" but also suppresses auth errors, invalid resource group, etc. The subsequent update (Line 98-100) will catch some of these, but consider checking the create exit code or stderr for specific failure reasons, or using az network route-table route show first to determine if the route exists.

Comment on lines +43 to +82
spec:
  serviceAccountName: kilo-azure-route-sync
  containers:
  - name: sync
    image: mcr.microsoft.com/azure-cli:2.67.0
    imagePullPolicy: IfNotPresent
    env:
    - name: AZURE_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: ClientID
    - name: AZURE_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: ClientSecret
    - name: AZURE_TENANT_ID
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: TenantID
    - name: AZURE_SUBSCRIPTION_ID
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: SubscriptionID
    - name: AZURE_RESOURCE_GROUP
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: ResourceGroup
    - name: AZURE_ROUTE_TABLE
      value: kilo-routes-workers-serverscom
    - name: AZURE_VNET_NAME
      value: cozystack-vnet
    - name: AZURE_SUBNET_NAME
      value: workers-serverscom
    - name: AZURE_ROUTES
      value: to-serverscom=192.168.102.0/23

⚠️ Potential issue | 🟠 Major

Missing security context — container runs as root with writable filesystem.

Static analysis correctly flags multiple security posture gaps. At minimum, add a securityContext to the container:

🔒 Proposed security hardening
       containers:
       - name: sync
         image: mcr.microsoft.com/azure-cli:2.67.0
         imagePullPolicy: IfNotPresent
+        securityContext:
+          readOnlyRootFilesystem: false   # az cli needs writable home; consider tmpfs
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          runAsUser: 65534
+          capabilities:
+            drop: ["ALL"]

Note: readOnlyRootFilesystem: true may conflict with az aks install-cli writing to /usr/local/bin/ and az login caching tokens. If so, mount specific paths as emptyDir volumes. But at a minimum, allowPrivilegeEscalation: false and capabilities.drop: ALL should be set.

Also missing: resource requests/limits — without them, this pod can be OOMKilled or starve neighbors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
spec:
  serviceAccountName: kilo-azure-route-sync
  containers:
  - name: sync
    image: mcr.microsoft.com/azure-cli:2.67.0
    imagePullPolicy: IfNotPresent
    securityContext:
      readOnlyRootFilesystem: false   # az cli needs writable home; consider tmpfs
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 65534
      capabilities:
        drop: ["ALL"]
    env:
    - name: AZURE_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: ClientID
    - name: AZURE_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: ClientSecret
    - name: AZURE_TENANT_ID
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: TenantID
    - name: AZURE_SUBSCRIPTION_ID
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: SubscriptionID
    - name: AZURE_RESOURCE_GROUP
      valueFrom:
        secretKeyRef:
          name: cluster-autoscaler-azure-azure-cluster-autoscaler
          key: ResourceGroup
    - name: AZURE_ROUTE_TABLE
      value: kilo-routes-workers-serverscom
    - name: AZURE_VNET_NAME
      value: cozystack-vnet
    - name: AZURE_SUBNET_NAME
      value: workers-serverscom
    - name: AZURE_ROUTES
      value: to-serverscom=192.168.102.0/23
🧰 Tools
🪛 Trivy (0.69.1)

[error] 43-82: Root file system is not read-only. Container 'sync' of Deployment 'kilo-azure-route-sync' should set 'securityContext.readOnlyRootFilesystem' to true. (Rule: KSV-0014, IaC/Kubernetes)

[error] 43-82: Default security context configured. Container kilo-azure-route-sync in the cozy-cluster-autoscaler-azure namespace is using the default security context. (Rule: KSV-0118, IaC/Kubernetes)

[error] 40-82: Default security context configured. Deployment kilo-azure-route-sync in the cozy-cluster-autoscaler-azure namespace is using the default security context, which allows root privileges. (Rule: KSV-0118, IaC/Kubernetes)

🤖 Prompt for AI Agents
In `@manifests/kilo-azure-route-sync-deployment.yaml` around lines 43-82: the container "sync" in the kilo-azure-route-sync Deployment runs as root with no securityContext and no resource requests/limits. Add a pod/container securityContext for the "sync" container (referencing name: sync and image mcr.microsoft.com/azure-cli:2.67.0) setting allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], runAsNonRoot: true, and a non-zero runAsUser (or runAsUser/runAsGroup at pod level), and optionally readOnlyRootFilesystem: true. If you enable readOnlyRootFilesystem, mount emptyDir volumes for the paths the azure-cli writes to (e.g. /usr/local/bin and any az cache/token dir) and map them into the container. Also add reasonable resources.requests and resources.limits for CPU/memory to avoid OOM/starvation.

az login --service-principal -u "$AZURE_CLIENT_ID" -p "$AZURE_CLIENT_SECRET" --tenant "$AZURE_TENANT_ID" >/dev/null
az account set --subscription "$AZURE_SUBSCRIPTION_ID"

az aks install-cli --install-location /usr/local/bin/kubectl >/dev/null

⚠️ Potential issue | 🟡 Minor

Installing kubectl at runtime is fragile.

az aks install-cli downloads kubectl on every pod start, adding a network dependency and unpinned binary version. If the download endpoint is unavailable or rate-limited, the pod fails to start.

Consider building a custom image that bundles both az and kubectl at a pinned version, or use the bitnami/kubectl image with az installed, to make startup deterministic.

🤖 Prompt for AI Agents
In `@manifests/kilo-azure-route-sync-deployment.yaml` at line 89: the runtime call "az aks install-cli --install-location /usr/local/bin/kubectl >/dev/null" is fragile and must be replaced by using an image that already contains a pinned kubectl and az. Update the Deployment/container spec to use a custom image (or a base like bitnami/kubectl with az installed) that bundles a specific kubectl version and the az CLI instead of installing at startup, or build and publish a pinned-version image and reference it in the container image field so the startup no longer performs "az aks install-cli".

@kvaps kvaps force-pushed the fix/mesh-route-src branch from 95836f7 to 150671c on February 14, 2026 01:32
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps force-pushed the fix/mesh-route-src branch from 150671c to a9fedec on February 14, 2026 01:34
@kvaps kvaps merged commit b727cfc into main Feb 14, 2026
11 checks passed