feat(devx): local dev setup for control plane and full end-to-end flow (MLI-6681) #823

Open
lilyz-ai wants to merge 12 commits into main from lilyz-ai/mli-6681-control-plane-local-devx

lilyz-ai commented May 7, 2026

Summary

Adds a complete local development workflow for model-engine so developers can iterate on both control plane code and the full endpoint lifecycle without cloud credentials or prod images.

Control-plane-only mode (make dev-server):

  • Spins up Postgres + Redis via docker-compose
  • LOCAL=true activates fake queue/docker/k8s implementations (mirrors CIRCLECI=true)
  • Full gateway API available at :5000 with auth skipped — no k8s cluster needed

Full end-to-end mode (make dev-server-full + make dev-service-builder + make dev-k8s-cacher):

  • make kind-up + make kind-image creates a local kind cluster and loads model-engine:local into it
  • Service Builder picks up endpoint creation tasks from local Redis and creates real k8s Deployments in kind
  • K8s Cacher polls kind and writes endpoint status back to Redis
  • Echo server (model-engine:local) used as the inference container — no GPU required

Code fixes included:

  • service_builder/celery.py + celery_task_queue_gateway.py: onprem cloud provider now uses redis Celery backend instead of s3 — without this, the Service Builder writes results to Redis but the Gateway looks in S3, leaving endpoints stuck in PENDING
  • dependencies.py: LOCAL=true + cloud_provider=onprem falls through to real OnPremQueueEndpointResourceDelegate instead of the fake
  • env_vars.py: GIT_TAG defaults to "local" when LOCAL=true so k8s templates reference the correct model-engine:local image
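
The backend mismatch described above can be illustrated with a minimal sketch. Only the onprem-to-redis change comes from this PR; the function name `select_backend_protocol` is hypothetical, and the `azure`/`gcp`/`aws` branches are assumptions about the existing mapping, not a copy of the real module-level code.

```python
def select_backend_protocol(cloud_provider: str) -> str:
    """Pick a Celery result-backend protocol for a cloud provider.

    Illustrative only: the azure/gcp/default branches are assumed.
    The PR's change is that "onprem" now resolves to "redis" instead of
    falling through to "s3".
    """
    if cloud_provider == "azure":
        return "abs"  # assumption: Azure Blob Storage backend
    if cloud_provider in ("gcp", "onprem"):
        return "redis"  # onprem added by this PR
    return "s3"  # assumption: default for aws


if __name__ == "__main__":
    # Gateway and Service Builder must agree, so both would call the same logic.
    print(select_backend_protocol("onprem"))
```

If the Gateway and the Service Builder evaluate this selection differently, one side writes results where the other never looks, which is the stuck-in-PENDING symptom the PR describes.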

New files:

  • docker-compose.local.yml — Postgres 15 + Redis 7 with healthchecks and persistent volume
  • service_configs/service_config_local.yaml — HMI config for local services
  • model_engine_server/core/configs/local-full.yaml — onprem infra config for kind
  • Makefile — all dev targets in one place

Test plan

  • make dev-up && make dev-migrate && make dev-server — gateway starts, GET /v1/model-endpoints returns 200
  • make kind-up && make kind-image — kind cluster created, model-engine:local loaded
  • make dev-server-full + make dev-service-builder + make dev-k8s-cacher — all three processes start cleanly
  • POST a sync CPU endpoint with the echo server image → pod appears in kubectl --context kind-llm-engine get pods -n model-engine and endpoint transitions to READY
  • Existing unit tests pass: make test
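
The first test-plan step could be scripted as a small smoke check. This is a sketch, not part of the PR: the helper name `gateway_status` and the injectable `opener` parameter are hypothetical, and the base URL assumes the gateway listens on localhost port 5000 as stated in the summary.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def gateway_status(
    base_url: str = "http://localhost:5000",
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> Optional[int]:
    """Return the HTTP status of GET /v1/model-endpoints, or None if unreachable."""
    url = f"{base_url}/v1/model-endpoints"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


if __name__ == "__main__":
    status = gateway_status()
    print("gateway status:", status)
```

A result of 200 would match the expectation in the test plan; `None` suggests `make dev-server` is not running yet.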

Closes MLI-6681

🤖 Generated with Claude Code

Greptile Summary

  • Introduces a complete local dev workflow: make dev-server (control-plane-only with fake k8s/queue/docker) and make dev-server-full + make dev-service-builder + make dev-k8s-cacher (full end-to-end via kind), backed by Postgres 15 + Redis 7 via docker-compose.
  • Fixes the onprem Celery backend mismatch: both celery_task_queue_gateway.py and service_builder/celery.py now select redis as backend_protocol for onprem (previously s3), consistent with dependencies.py which already routed onprem to Redis task queues.
  • LOCAL=true now defaults GIT_TAG to "local" (avoiding the missing-tag error), routes the gateway to the real OnPremQueueEndpointResourceDelegate when cloud_provider=onprem, and uses Redis task-queue gateways instead of SQS.
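
The GIT_TAG defaulting rule can be sketched as follows. The function name `resolve_git_tag`, the exact string comparison against `"true"`, and the error type are assumptions for illustration; the actual env_vars.py may parse and fail differently.

```python
from typing import Mapping


def resolve_git_tag(env: Mapping[str, str]) -> str:
    """Resolve the image tag, defaulting to "local" in LOCAL mode."""
    tag = env.get("GIT_TAG")
    if tag:
        # An explicit tag always wins.
        return tag
    if env.get("LOCAL") == "true":
        # Matches the model-engine:local image loaded into kind.
        return "local"
    # Assumption: outside LOCAL mode a missing tag is an error.
    raise ValueError("GIT_TAG must be set when LOCAL is not true")


if __name__ == "__main__":
    import os

    os.environ.setdefault("LOCAL", "true")
    print(resolve_git_tag(os.environ))
```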

Confidence Score: 5/5

Safe to merge — all previously flagged P1s are addressed and no new issues found.

All three prior P1 findings (cache_redis_aws_url, ML_INFRA_SERVICES_CONFIG_PATH not pinned, celery_task_queue_gateway.py backend mismatch) are resolved in the current diff. The backend_protocol fix is applied symmetrically in both the Gateway and Service Builder. local-full.yaml sets celery_broker_type_redis: true, keeping broker and backend consistent for the onprem/kind path. No new logic errors identified.

No files require special attention.

Important Files Changed

  • model-engine/model_engine_server/infra/gateways/celery_task_queue_gateway.py: Adds onprem to the redis backend_protocol condition, fixing the Gateway/Service Builder result-backend mismatch for onprem/local deployments.
  • model-engine/model_engine_server/service_builder/celery.py: Mirrors the celery_task_queue_gateway.py fix by adding onprem to the redis backend_protocol tuple, so Service Builder and Gateway agree on where Celery results are stored.
  • model-engine/model_engine_server/api/dependencies.py: Two targeted changes: LOCAL && !onprem falls through to FakeQueueDelegate; LOCAL is added to the Redis task-queue branch so the control-plane server connects to local Redis instead of SQS.
  • model-engine/Makefile: New dev-workflow Makefile; ML_INFRA_SERVICES_CONFIG_PATH is now explicitly pinned in both LOCAL_ENV and FULL_LOCAL_ENV, and dev-up delegates readiness polling to docker compose --wait, which has its own timeout.
  • model-engine/service_configs/service_config_local.yaml: Uses cache_redis_onprem_url (not cache_redis_aws_url), correctly bypassing cloud-provider assertions; dummy SQS/billing values are safe for local use.
  • model-engine/model_engine_server/core/configs/local-full.yaml: New onprem/kind infra config; celery_broker_type_redis: true ensures Service Builder uses the Redis broker, consistent with the gateway's Redis task-queue path for onprem.
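
The delegate routing described for dependencies.py might look roughly like the sketch below. It returns labels rather than real classes; the function name `choose_queue_delegate`, the label strings, and the branch ordering are assumptions, and only the rule "LOCAL with a non-onprem provider gets the fake, LOCAL with onprem gets the real on-prem delegate" is taken from the review summary.

```python
def choose_queue_delegate(local: bool, circleci: bool, cloud_provider: str) -> str:
    """Return a label for the queue delegate the gateway would use."""
    if circleci or (local and cloud_provider != "onprem"):
        # Control-plane-only mode: no real queues are created.
        return "fake"
    if cloud_provider == "onprem":
        # Full local flow: stands in for OnPremQueueEndpointResourceDelegate.
        return "onprem"
    # Assumption: every other provider uses its own cloud-specific delegate.
    return "cloud"


if __name__ == "__main__":
    print(choose_queue_delegate(local=True, circleci=False, cloud_provider="onprem"))
```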

Reviews (9): Last reviewed commit: "docs(devx): replace curl endpoint exampl..."

lilyz-ai and others added 2 commits May 7, 2026 01:59
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a one-command local development workflow for the model engine control
plane so developers can iterate on gateway/service-builder code without
building prod images or touching live infra.

- docker-compose.local.yml: spins up Postgres 15 + Redis 7
- service_configs/service_config_local.yaml: HMI config for local services
- Makefile: dev-up / dev-migrate / dev-server / dev-down / test targets
- LOCAL=true env var now activates fake queue/docker implementations
  (parallel to existing CIRCLECI=true path) and skips GIT_TAG requirement
- README: new "Control Plane Local Setup" section with full walkthrough

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread model-engine/service_configs/service_config_local.yaml Outdated
Comment thread model-engine/Makefile Outdated
Comment thread model-engine/Makefile Outdated
…G_PATH

- service_config_local.yaml: switch from cache_redis_aws_url to
  cache_redis_onprem_url so the Redis URL is resolved before the
  cloud_provider assertion fires — fixes startup failure for non-AWS configs
- Makefile: pin ML_INFRA_SERVICES_CONFIG_PATH to default.yaml so local
  dev is not affected by a developer's ambient infra config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread model-engine/README.md
lilyz-ai and others added 2 commits May 7, 2026 02:32
- README: add ML_INFRA_SERVICES_CONFIG_PATH to the manual env-var snippet
  so developers with non-AWS ambient configs don't accidentally hit
  the cloud_provider assertion
- docker-compose.local.yml: mount a named volume for Postgres so the
  database survives dev-down/dev-up cycles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the manual until-loops in dev-up with `docker compose up --wait`,
which blocks until healthchecks pass and exits non-zero if they fail —
eliminating the infinite-spin on container crash.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
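
For readers scripting the same startup outside of make, the `--wait` behavior from the commit above can be reproduced with a short Python wrapper. This is a sketch: the helper names are hypothetical, and the compose file path assumes the script runs from the directory containing docker-compose.local.yml.

```python
import subprocess
from typing import List


def compose_up_command(compose_file: str = "docker-compose.local.yml") -> List[str]:
    """Build the docker compose invocation that waits for healthchecks."""
    # --wait blocks until services are running/healthy and implies detached mode.
    return ["docker", "compose", "-f", compose_file, "up", "--wait"]


def dev_up(compose_file: str = "docker-compose.local.yml") -> None:
    """Start Postgres and Redis; raise if a healthcheck fails."""
    # check=True turns a non-zero exit code into CalledProcessError,
    # so a crashed container surfaces as an error instead of a hang.
    subprocess.run(compose_up_command(compose_file), check=True)


if __name__ == "__main__":
    dev_up()
```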

lilyz-ai commented May 7, 2026

@greptile review


lilyz-ai commented May 7, 2026

/greptile

lilyz-ai and others added 2 commits May 7, 2026 03:07
Extends the local dev setup so the complete control plane → Service Builder
→ k8s inference pod flow can be tested locally without cloud credentials.

Changes:
- local-full.yaml: new onprem infra config pointing to localhost Redis/kind
- dependencies.py: LOCAL=true + cloud_provider=onprem falls through to real
  Redis queue delegate instead of the fake (enabling full k8s flow)
- service_builder/celery.py: fix onprem to use redis backend not s3
- env_vars.py: default GIT_TAG to "local" when LOCAL=true so k8s templates
  reference the correct model-engine:local image loaded into kind
- Makefile: kind-up/kind-down/kind-image targets + dev-server-full,
  dev-service-builder, dev-k8s-cacher targets using FULL_LOCAL_ENV
- README: full end-to-end setup section with step-by-step instructions,
  example endpoint creation, and flow table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gateway's module-level backend_protocol had the same aws/gcp/azure
mapping as service_builder/celery.py. Without this fix, the Service Builder
writes task results to Redis but the Gateway looks in S3, leaving endpoints
stuck in PENDING under the kind-based full local flow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lilyz-ai commented May 7, 2026

@greptile review


lilyz-ai commented May 7, 2026

/greptile

lilyz-ai changed the title from "feat(devx): local control plane dev setup (MLI-6681)" to "feat(devx): local dev setup for control plane and full end-to-end flow (MLI-6681)" on May 7, 2026
lilyz-ai and others added 5 commits May 7, 2026 03:30
The exporter package was imported unconditionally under the OTEL_AVAILABLE
flag which only checked the base SDK, not the exporter. Include it in the
try block so OTEL_AVAILABLE stays False when the exporter is absent, fixing
the ImportError that caused run_unit_tests_server to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
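
The import-guard pattern from the commit above can be sketched like this. The specific exporter module shown (the OTLP gRPC trace exporter) is an assumption; the commit message does not say which exporter package the code imports.

```python
# Both imports live in one try block, so OTEL_AVAILABLE is True only when
# the base package AND the exporter package are installed.
try:
    from opentelemetry import trace  # noqa: F401  (base API/SDK)
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (  # noqa: F401
        OTLPSpanExporter,  # assumption: the exporter used may differ
    )

    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False


def tracing_enabled() -> bool:
    """Callers check this flag instead of importing OpenTelemetry directly."""
    return OTEL_AVAILABLE


if __name__ == "__main__":
    print("OpenTelemetry available:", tracing_enabled())
```

With the exporter import outside the try block, a machine that has the base SDK but not the exporter would raise ImportError at module load, which is the unit-test failure the commit describes.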
…chema gateway

- Reformat correlation.py and celery.py to satisfy black
- Move noqa comment to the from...import( line so ruff F401 is suppressed correctly
- Pass schema_generator=GenerateJsonSchema() (new required kwarg) to
  get_definitions() and get_openapi_path() in live_model_endpoints_schema_gateway,
  creating a fresh instance per route since pydantic rejects reuse

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oes not have this param

The param was added to fix a local test failure (FastAPI 0.110.0 requires it)
but FastAPI 0.135.1 (pinned in requirements.txt, used by CI) does not accept it,
causing mypy call-arg errors. Revert to the original signature.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xample

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>