@@ -19,12 +19,16 @@ When the SDK is pointed at the cloud API, calling this method should raise an er
| Waiting strategy | SDK polls `GET /status/metrics.json` for detector readiness |
| Config persistence | Runtime config written to YAML file on shared PVC; read on startup by all containers |

- ## Implementation
+ ---
+
+ ## Implemented

### Edge Endpoint: `POST /device-api/v1/edge/configure`

**File**: `app/api/routes/edge_config.py`, registered in `app/api/api.py`

+ Currently supports **merge mode only**: incoming config is merged into the existing config.
+
On receiving a config POST, the handler:
1. Merges incoming JSON into the existing `RootEdgeConfig` (global_config, edge_inference_configs, detectors)
2. Validates the merged config via Pydantic (returns 400 on invalid config)
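The merge-then-validate steps can be sketched as follows. This is a minimal stand-in, assuming Pydantic v2: `RootEdgeConfig` here is a simplified dict-based model and `deep_merge` is an assumed merge strategy, not the actual handler code.

```python
from pydantic import BaseModel

class RootEdgeConfig(BaseModel):
    """Stand-in for the real config model; field types simplified to dicts."""
    global_config: dict = {}
    edge_inference_configs: dict = {}
    detectors: dict = {}

def deep_merge(base: dict, incoming: dict) -> dict:
    """Recursively overlay incoming onto base: nested dicts merge, scalars replace."""
    merged = dict(base)
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def merge_and_validate(existing: dict, incoming: dict) -> RootEdgeConfig:
    merged = deep_merge(existing, incoming)
    # A pydantic ValidationError here is what the handler maps to HTTP 400.
    return RootEdgeConfig.model_validate(merged)
```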
@@ -68,7 +72,7 @@ of which PVC subdirectory it has mounted.
  every 5 seconds until all configured detectors have `status: "ready"`, or raises on timeout
- Direct `requests.post()` call, not OpenAPI-generated

- ### End-to-End Flow
+ ### End-to-End Flow (merge mode)

1. SDK POSTs config to edge endpoint
2. Edge endpoint merges incoming config with existing config
@@ -81,7 +85,138 @@ of which PVC subdirectory it has mounted.
9. SDK polls `/status/metrics.json` -- status monitor reads runtime config from PVC and reports pod readiness via Kubernetes API
10. All detectors show `status: "ready"` -> SDK returns

- ### Known Limitations
+ ### Cloud vs. Edge Detection
+
+ | Scenario | What happens |
+ |---|---|
+ | SDK -> cloud (`api.groundlight.ai`) | Cloud returns 404 -> SDK raises error |
+ | SDK -> edge endpoint (with new route) | FastAPI handles it -> returns 200 |
+ | SDK -> old edge endpoint (without new route) | FastAPI returns 404 -> nginx falls back to cloud -> 404 -> SDK raises error |
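From the SDK's point of view, the table reduces to "treat a 404 as an unsupported endpoint." A minimal sketch; the function name and error message are illustrative, not the actual SDK code:

```python
import requests

def post_edge_config(base_url: str, config: dict) -> dict:
    """POST a config; a 404 means we are not talking to an up-to-date edge endpoint."""
    resp = requests.post(f"{base_url}/device-api/v1/edge/configure", json=config)
    if resp.status_code == 404:
        # Either the cloud API, or an old edge endpoint whose nginx fell back to the cloud.
        raise RuntimeError("edge configuration requires an up-to-date edge endpoint")
    resp.raise_for_status()
    return resp.json()
```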
+
+ ---
+
+ ## Planned: Replace Mode
+
+ ### Problem
+
+ The current merge mode can only **add or update** detectors. It cannot remove detectors
+ that are no longer desired. If a user wants to go from 5 detectors to 3, the 2 dropped
+ detectors' inference pods keep running, wasting resources.
+
+ ### Proposed SDK Change
+
+ Add a `replace` parameter to `configure_edge()`:
+
+ - `replace=False` (default): Current merge behavior. Incoming config is merged into the existing config.
+ - `replace=True`: Incoming config fully replaces the existing config. Detectors not in the
+   new config are removed (their inference pods are deleted).
+
+ ### Required Edge Endpoint Changes
+
+ The edge endpoint currently has **no code to delete** Kubernetes Deployments, Services,
+ database records, or model files for detectors. All of this needs to be added.
+
+ **Existing infrastructure we can use:**
+ - `get_edge_inference_deployment_name(detector_id)` and `get_edge_inference_service_name(detector_id)`
+   in `app/core/edge_inference.py` map detector IDs to K8s resource names.
+ - The service account (`edge-endpoint-service-account`) already has RBAC permissions to
+   `delete` both `deployments` and `services`. These permissions are currently unused.
+ - `InferenceDeploymentManager` in `app/core/kubernetes_management.py` already has a K8s
+   client and namespace context. Adding a `delete_inference_deployment()` method here is natural.
+ - `get_detector_models_dir(repository_root, detector_id)` returns the model directory
+   (`{MODEL_REPOSITORY_PATH}/{detector_id}/`). Deleting this directory removes all model
+   files (primary + oodd + all versions).
+
+ **What needs to be built:**
+
+ 1. **`InferenceDeploymentManager.delete_inference_deployment(detector_id, is_oodd)`**
+    (`app/core/kubernetes_management.py`)
+    - Call `delete_namespaced_deployment()` to remove the inference Deployment
+    - Call `delete_namespaced_service()` to remove the inference Service
+    - The naming functions already exist to map detector ID -> resource names
+
+ 2. **`DatabaseManager.delete_inference_deployment_record(model_name)`**
+    (`app/core/database.py`)
+    - Delete the DB record so the model updater doesn't recreate the deployment on its
+      next refresh cycle
+
+ 3. **Model file cleanup**
+    - Delete `{MODEL_REPOSITORY_PATH}/{detector_id}/` (the entire detector directory,
+      which contains `primary/` and `oodd/` subdirs with versioned model files)
+    - Can use `shutil.rmtree()` (already used by `delete_model_version()`)
+
+ 4. **Replace logic in `edge_config.py` handler**
+    - Accept a `replace` flag in the POST body
+    - If `replace=True`: use the incoming config as-is instead of merging
+    - Diff old vs new detector sets: `removed = old_detector_ids - new_detector_ids`
+    - **Deletion must complete before new pods roll out.** The edge endpoint must wait
+      for removed detector pods to fully terminate (not just in `Terminating` state)
+      before writing DB records for new detectors. This prevents OOM from old and new
+      pods competing for the same finite GPU/memory resources.
+    - For each removed detector:
+      a. Call `InferenceDeploymentManager.delete_inference_deployment()` (primary + oodd)
+      b. Poll until the pods are fully gone (not just Terminating)
+      c. Call `DatabaseManager.delete_inference_deployment_record()` (primary + oodd)
+      d. Delete model files from disk
+      e. Clean up `EdgeInferenceManager` state (inference_client_urls, oodd URLs,
+         escalation tracking)
+    - After all deletions complete: proceed with new/retained detectors (same flow as
+      current merge mode)
+    - `configure_edge(detectors=[], replace=True)` is valid and removes all detector pods.
+
+ 5. **SDK changes**
+    - Add `replace: bool = False` parameter to `configure_edge()`
+    - Pass `replace` flag in POST body
+    - When `replace=True` and `wait > 0`: wait for removed pods to terminate AND for
+      new/retained pods to become ready
+
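The K8s deletion, termination wait, and disk cleanup from the list above could look roughly like this. Resource names, label selectors, and helper names are assumptions (the real mapping lives in `get_edge_inference_deployment_name()` and friends), and the DB-record and in-memory state cleanup steps are omitted:

```python
import shutil
import time
from pathlib import Path

def detectors_to_remove(old_ids: set, new_ids: set) -> set:
    """Replace-mode diff: detectors in the old config but not the new one."""
    return old_ids - new_ids

def delete_detector_resources(detector_id: str, namespace: str, model_repo: Path,
                              timeout_s: float = 120.0) -> None:
    """Delete Deployment/Service, wait for pods to fully vanish, then clean disk."""
    from kubernetes import client  # lazy import: only available inside the cluster

    apps, core = client.AppsV1Api(), client.CoreV1Api()
    # Assumed naming scheme; the real one comes from get_edge_inference_deployment_name().
    for name in (f"inference-{detector_id}", f"inference-oodd-{detector_id}"):
        apps.delete_namespaced_deployment(name=name, namespace=namespace)
        core.delete_namespaced_service(name=name, namespace=namespace)

    deadline = time.monotonic() + timeout_s
    # Poll until pods are fully gone (not merely Terminating), so new pods don't
    # compete with old ones for GPU/memory. Label selector is assumed.
    while core.list_namespaced_pod(namespace, label_selector=f"app={detector_id}").items:
        if time.monotonic() > deadline:
            raise TimeoutError(f"pods for {detector_id} did not terminate in time")
        time.sleep(2)

    # Removes primary + oodd + all model versions for this detector.
    shutil.rmtree(model_repo / detector_id, ignore_errors=True)
```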
+ ### Ordering Guarantee
+
+ When `replace=True`, the edge endpoint enforces this sequence:
+
+ ```
+ 1. Delete removed detector deployments/services
+ 2. Wait for removed pods to fully terminate
+ 3. Clean up DB records + model files for removed detectors
+ 4. Write DB records for new detectors
+ 5. Model updater picks up new detectors and creates deployments
+ ```
+
+ This ensures old pods release their resources before new pods are scheduled,
+ preventing resource exhaustion on memory/GPU-constrained edge devices.
+
+ ### Decisions
+
+ - **Async**: The POST handler returns immediately. Deletion and re-creation happen
+   in a FastAPI background task. The SDK polls `/status/metrics.json` for completion.
+ - **Termination time**: Inference pods use the K8s default of 30s
+   `terminationGracePeriodSeconds`. No custom value is set.
+ - **Partial failure**: Error out. If deletion of any detector fails, the background
+   task logs the error and stops. The config is left in a partially cleaned state;
+   the user can retry.
+
+ ### Implementation Details
+
+ **Edge endpoint needs an `InferenceDeploymentManager` on `AppState`.**
+ Currently only the model-updater container creates one. The edge-endpoint container
+ has the K8s service account and RBAC permissions but doesn't use them. We add
+ an `InferenceDeploymentManager` to `AppState`, guarded by the existing
+ `DEPLOY_DETECTOR_LEVEL_INFERENCE` env var (only set in K8s, not Docker tests).
+
+ **Background task flow** (runs after the POST returns):
+ 1. Delete K8s Deployments + Services for each removed detector
+ 2. Poll until pods are fully terminated (not just Terminating)
+ 3. Delete DB records for removed detectors
+ 4. Delete model files from PVC (`shutil.rmtree({MODEL_REPOSITORY_PATH}/{detector_id}/)`)
+ 5. Write DB records for new detectors (model updater picks these up)
+
+ **SDK polling**: When `replace=True` and `wait > 0`, the SDK waits until:
+ - Removed detector IDs no longer appear in `/status/metrics.json`
+ - New/retained detector IDs all show `status: "ready"`
+
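That SDK-side wait loop can be sketched as follows. The endpoint path comes from this doc; the field layout inside `metrics.json` is an assumption:

```python
import time

import requests

def wait_for_replace(base_url: str, keep_ids: set, removed_ids: set,
                     timeout_s: float = 300.0) -> None:
    """Poll metrics.json until removed detectors vanish and kept ones are ready."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        metrics = requests.get(f"{base_url}/status/metrics.json", timeout=10).json()
        detectors = metrics.get("detectors", {})  # assumed: {detector_id: {"status": ...}}
        if removed_ids.isdisjoint(detectors) and all(
            detectors.get(d, {}).get("status") == "ready" for d in keep_ids
        ):
            return
        time.sleep(5)  # matches the SDK's 5-second polling interval
    raise TimeoutError("edge endpoint did not reach the desired detector state")
```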
+ ---
+
+ ## Known Limitations

- **Multiprocess in-memory state**: Edge endpoint runs multiple uvicorn workers. The in-memory
  config update only applies to the worker that handles the POST. Other workers retain stale
@@ -92,14 +227,6 @@ of which PVC subdirectory it has mounted.
- **File write race**: If two concurrent config POSTs hit different workers, the last write wins.
  This is acceptable for now; atomic file writes or a lock file can be added later if needed.

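If the last-write-wins race ever needs closing, an atomic write is the usual fix: write to a temp file on the same filesystem, then `os.replace()` it over the config so readers see either the old or the new file, never a partial one. A sketch:

```python
import os
import tempfile

def atomic_write_text(path: str, data: str) -> None:
    """Atomically replace `path` with `data` (POSIX rename semantics)."""
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic: readers never observe a partial file
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Note this only removes torn reads/writes; two concurrent POSTs still race to be the last rename, which is the acceptable behavior described above.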
95- ### Cloud vs. Edge Detection
96-
97- | Scenario | What happens |
98- | ---| ---|
99- | SDK -> cloud (` api.groundlight.ai ` ) | Cloud returns 404 -> SDK raises error |
100- | SDK -> edge endpoint (with new route) | FastAPI handles it -> returns 200 |
101- | SDK -> old edge endpoint (without new route) | FastAPI returns 404 -> nginx falls back to cloud -> 404 -> SDK raises error |
102-
## Future Work

- Define proper Pydantic models in the SDK for config validation