mesh: set preferred source for WireGuard routes#4
Conversation
Set Route.Src for routes installed via the kilo interface to the node private IP when available. Without an explicit source, the kernel may pick the WireGuard overlay address (for example 100.66.0.x). In environments like Azure SDN this can break return traffic because the overlay address is not routable by the underlay. Update route expectations in TestRoutes to assert Src across the affected logical and full topology cases. Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Summary of ChangesHello @kvaps, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances network routing stability by ensuring that WireGuard-based routes use the node's private IP as the source address. This change prevents potential routing failures in specific cloud environments, such as Azure SDN, where the default kernel-selected overlay IP might not be routable by the underlying network fabric. The modification is conditional, preserving existing behavior when a private IP is not configured, and is validated with updated test cases. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughAdds propagation of a preferred route source IP (Src) derived from a node's private IP across route construction and encapsulation logic in the mesh package, updates tests to include Src, and adds a new Kubernetes manifest/deployment to synchronize Azure route tables from a cluster-side pod. Changes
Sequence Diagram(s)mermaid SyncPod->>K8sAPI: watch Nodes with topology label Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
The pull request effectively addresses the issue of the kernel picking unroutable overlay IP addresses for WireGuard routes in certain environments by explicitly setting the source IP. The changes in pkg/mesh/routes.go correctly introduce a preferred source IP based on the node's private IP and apply it to the netlink.Route entries. The corresponding updates in pkg/mesh/routes_test.go ensure that the new behavior is properly tested and validated. The code is clear, concise, and directly implements the described solution.
d2f4ee2 to
95836f7
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@manifests/kilo-azure-route-sync-deployment.yaml`:
- Line 89: The runtime call "az aks install-cli --install-location
/usr/local/bin/kubectl >/dev/null" is fragile and must be replaced by using an
image that already contains a pinned kubectl and az; update the
Deployment/container spec to use a custom image (or a base like bitnami/kubectl
with az installed) that bundles a specific kubectl version and az CLI instead of
installing at startup, or build and publish a pinned-version image and reference
that image in the container image field so the startup no longer performs "az
aks install-cli".
- Around line 43-82: The container "sync" in the kilo-azure-route-sync
Deployment runs as root with no securityContext and no resource requests/limits;
add a pod/container securityContext for the "sync" container (referencing name:
sync and image mcr.microsoft.com/azure-cli:2.67.0) setting
allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], runAsNonRoot: true
and a non-zero runAsUser (or runAsUser/runAsGroup at pod level), and optionally
readOnlyRootFilesystem: true — if you enable readOnlyRootFilesystem, mount
emptyDir volumes for paths the azure-cli writes to (eg /usr/local/bin and any az
cache/token dir) and map them into the container; also add reasonable
resources.requests and resources.limits for CPU/memory to avoid OOM/starvation.
🧹 Nitpick comments (3)
manifests/kilo-azure-route-sync-deployment.yaml (3)
115-129:kubectl get -whas no reconnect logic — watch silently exits on API server disconnect.If the watch connection drops (API server restart, network blip, HTTP2 timeout), the
while readloop terminates and the container exits. WithrestartPolicy: Always(Deployment default) the pod restarts, but it will re-runaz loginand re-downloadkubectleach time.Consider wrapping the watch in an outer
while trueloop with a brief backoff, or use a controller/informer pattern for resilience.♻️ Sketch: reconnecting watch loop
+ while true; do kubectl get node -w -l topology.kubernetes.io/zone=azure --no-headers \ -o 'custom-columns=NAME:.metadata.name,LEADER:.metadata.annotations.kilo\.squat\.ai/leader,IP:.status.addresses[?(@.type=="InternalIP")].address' \ | while read -r n leader ip; do echo "$(date -Iseconds) event node=${n} leader=${leader} ip=${ip}" [ "$leader" = "true" ] || continue az network vnet subnet update \ -g "$AZURE_RESOURCE_GROUP" \ --vnet-name "$AZURE_VNET_NAME" \ -n "$AZURE_SUBNET_NAME" \ --route-table "$AZURE_ROUTE_TABLE" >/dev/null sync_all_routes "$ip" echo "$(date -Iseconds) synced routes to leader ${n} (${ip})" done + echo "$(date -Iseconds) watch disconnected, reconnecting in 5s..." + sleep 5 + done
76-82: Hardcoded Azure resource names reduce reusability.
AZURE_ROUTE_TABLE,AZURE_VNET_NAME,AZURE_SUBNET_NAME, andAZURE_ROUTESare all hardcoded for a specific environment (workers-serverscom,cozystack-vnet). Consider sourcing these from a ConfigMap or additional Secret refs so the manifest can be reused across environments without modification.
95-101: Silent error swallowing on route create may mask permission or config issues.The
|| trueon Line 97 is intended to handle "route already exists" but also suppresses auth errors, invalid resource group, etc. The subsequentupdate(Line 98-100) will catch some of these, but consider checking thecreateexit code or stderr for specific failure reasons, or usingaz network route-table route showfirst to determine if the route exists.
| spec: | ||
| serviceAccountName: kilo-azure-route-sync | ||
| containers: | ||
| - name: sync | ||
| image: mcr.microsoft.com/azure-cli:2.67.0 | ||
| imagePullPolicy: IfNotPresent | ||
| env: | ||
| - name: AZURE_CLIENT_ID | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | ||
| key: ClientID | ||
| - name: AZURE_CLIENT_SECRET | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | ||
| key: ClientSecret | ||
| - name: AZURE_TENANT_ID | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | ||
| key: TenantID | ||
| - name: AZURE_SUBSCRIPTION_ID | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | ||
| key: SubscriptionID | ||
| - name: AZURE_RESOURCE_GROUP | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | ||
| key: ResourceGroup | ||
| - name: AZURE_ROUTE_TABLE | ||
| value: kilo-routes-workers-serverscom | ||
| - name: AZURE_VNET_NAME | ||
| value: cozystack-vnet | ||
| - name: AZURE_SUBNET_NAME | ||
| value: workers-serverscom | ||
| - name: AZURE_ROUTES | ||
| value: to-serverscom=192.168.102.0/23 |
There was a problem hiding this comment.
Missing security context — container runs as root with writable filesystem.
Static analysis correctly flags multiple security posture gaps. At minimum, add a securityContext to the container:
🔒 Proposed security hardening
containers:
- name: sync
image: mcr.microsoft.com/azure-cli:2.67.0
imagePullPolicy: IfNotPresent
+ securityContext:
+ readOnlyRootFilesystem: false # az cli needs writable home; consider tmpfs
+ allowPrivilegeEscalation: false
+ runAsNonRoot: true
+ runAsUser: 65534
+ capabilities:
+ drop: ["ALL"]Note: readOnlyRootFilesystem: true may conflict with az aks install-cli writing to /usr/local/bin/ and az login caching tokens. If so, mount specific paths as emptyDir volumes. But at a minimum, allowPrivilegeEscalation: false and capabilities.drop: ALL should be set.
Also missing: resource requests/limits — without them, this pod can be OOMKilled or starve neighbors.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| spec: | |
| serviceAccountName: kilo-azure-route-sync | |
| containers: | |
| - name: sync | |
| image: mcr.microsoft.com/azure-cli:2.67.0 | |
| imagePullPolicy: IfNotPresent | |
| env: | |
| - name: AZURE_CLIENT_ID | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: ClientID | |
| - name: AZURE_CLIENT_SECRET | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: ClientSecret | |
| - name: AZURE_TENANT_ID | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: TenantID | |
| - name: AZURE_SUBSCRIPTION_ID | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: SubscriptionID | |
| - name: AZURE_RESOURCE_GROUP | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: ResourceGroup | |
| - name: AZURE_ROUTE_TABLE | |
| value: kilo-routes-workers-serverscom | |
| - name: AZURE_VNET_NAME | |
| value: cozystack-vnet | |
| - name: AZURE_SUBNET_NAME | |
| value: workers-serverscom | |
| - name: AZURE_ROUTES | |
| value: to-serverscom=192.168.102.0/23 | |
| spec: | |
| serviceAccountName: kilo-azure-route-sync | |
| containers: | |
| - name: sync | |
| image: mcr.microsoft.com/azure-cli:2.67.0 | |
| imagePullPolicy: IfNotPresent | |
| securityContext: | |
| readOnlyRootFilesystem: false # az cli needs writable home; consider tmpfs | |
| allowPrivilegeEscalation: false | |
| runAsNonRoot: true | |
| runAsUser: 65534 | |
| capabilities: | |
| drop: ["ALL"] | |
| env: | |
| - name: AZURE_CLIENT_ID | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: ClientID | |
| - name: AZURE_CLIENT_SECRET | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: ClientSecret | |
| - name: AZURE_TENANT_ID | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: TenantID | |
| - name: AZURE_SUBSCRIPTION_ID | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: SubscriptionID | |
| - name: AZURE_RESOURCE_GROUP | |
| valueFrom: | |
| secretKeyRef: | |
| name: cluster-autoscaler-azure-azure-cluster-autoscaler | |
| key: ResourceGroup | |
| - name: AZURE_ROUTE_TABLE | |
| value: kilo-routes-workers-serverscom | |
| - name: AZURE_VNET_NAME | |
| value: cozystack-vnet | |
| - name: AZURE_SUBNET_NAME | |
| value: workers-serverscom | |
| - name: AZURE_ROUTES | |
| value: to-serverscom=192.168.102.0/23 |
🧰 Tools
🪛 Trivy (0.69.1)
[error] 43-82: Root file system is not read-only
Container 'sync' of Deployment 'kilo-azure-route-sync' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 43-82: Default security context configured
container kilo-azure-route-sync in cozy-cluster-autoscaler-azure namespace is using the default security context
Rule: KSV-0118
(IaC/Kubernetes)
[error] 40-82: Default security context configured
deployment kilo-azure-route-sync in cozy-cluster-autoscaler-azure namespace is using the default security context, which allows root privileges
Rule: KSV-0118
(IaC/Kubernetes)
🤖 Prompt for AI Agents
In `@manifests/kilo-azure-route-sync-deployment.yaml` around lines 43 - 82, The
container "sync" in the kilo-azure-route-sync Deployment runs as root with no
securityContext and no resource requests/limits; add a pod/container
securityContext for the "sync" container (referencing name: sync and image
mcr.microsoft.com/azure-cli:2.67.0) setting allowPrivilegeEscalation: false,
capabilities.drop: ["ALL"], runAsNonRoot: true and a non-zero runAsUser (or
runAsUser/runAsGroup at pod level), and optionally readOnlyRootFilesystem: true
— if you enable readOnlyRootFilesystem, mount emptyDir volumes for paths the
azure-cli writes to (eg /usr/local/bin and any az cache/token dir) and map them
into the container; also add reasonable resources.requests and resources.limits
for CPU/memory to avoid OOM/starvation.
| az login --service-principal -u "$AZURE_CLIENT_ID" -p "$AZURE_CLIENT_SECRET" --tenant "$AZURE_TENANT_ID" >/dev/null | ||
| az account set --subscription "$AZURE_SUBSCRIPTION_ID" | ||
|
|
||
| az aks install-cli --install-location /usr/local/bin/kubectl >/dev/null |
There was a problem hiding this comment.
Installing kubectl at runtime is fragile.
az aks install-cli downloads kubectl on every pod start, adding a network dependency and unpinned binary version. If the download endpoint is unavailable or rate-limited, the pod fails to start.
Consider building a custom image that bundles both az and kubectl at a pinned version, or use the bitnami/kubectl image with az installed, to make startup deterministic.
🤖 Prompt for AI Agents
In `@manifests/kilo-azure-route-sync-deployment.yaml` at line 89, The runtime call
"az aks install-cli --install-location /usr/local/bin/kubectl >/dev/null" is
fragile and must be replaced by using an image that already contains a pinned
kubectl and az; update the Deployment/container spec to use a custom image (or a
base like bitnami/kubectl with az installed) that bundles a specific kubectl
version and az CLI instead of installing at startup, or build and publish a
pinned-version image and reference that image in the container image field so
the startup no longer performs "az aks install-cli".
95836f7 to
150671c
Compare
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
150671c to
a9fedec
Compare
Set Route.Src for routes installed via the kilo interface to the node private IP when available.
Without an explicit source, the kernel may pick the WireGuard overlay address (for example
100.66.0.x). In environments like Azure SDN this can break return traffic because the overlay address is not routable by the underlay.Changes in this PR:
Srcon WireGuard-routednetlink.Routeentries inTopology.Routes;TestRoutesexpectations to assertSrcin logical/full topology scenarios.Summary by CodeRabbit
Improvements
Tests
New Features