diff --git a/docs/blog/posts/dstack-sky.md b/docs/blog/posts/dstack-sky.md
index 78d35641c0..2665e8022b 100644
--- a/docs/blog/posts/dstack-sky.md
+++ b/docs/blog/posts/dstack-sky.md
@@ -121,15 +121,14 @@ model: mixtral
```
-If it has a `model` mapping, the model will be accessible
-at `https://gateway.<project name>.sky.dstack.ai` via the OpenAI compatible interface.
+The service endpoint will be accessible at `https://<run name>.<project name>.sky.dstack.ai` via the OpenAI-compatible interface.
```python
from openai import OpenAI
client = OpenAI(
- base_url="https://gateway.<project name>.sky.dstack.ai",
+ base_url="https://<run name>.<project name>.sky.dstack.ai/v1",
  api_key="<dstack token>"
)
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
index 0f6bf07bb8..e582c4459f 100644
--- a/docs/docs/concepts/services.md
+++ b/docs/docs/concepts/services.md
@@ -68,7 +68,7 @@ Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
`dstack apply` automatically provisions instances and runs the service.
-If a [gateway](gateways.md) is not configured, the service’s endpoint will be accessible at
+If you do not have a [gateway](gateways.md) created, the service endpoint will be accessible at
`<dstack server URL>/proxy/services/<project name>/<run name>/`.
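The same proxy endpoint can also be consumed with the OpenAI Python client. Below is a minimal sketch, assuming the project is `main`, the run is `llama31`, the `dstack` server listens on `localhost:3000`, and the service exposes an OpenAI-compatible API (these names mirror the curl example that follows and are placeholders, not fixed values):

```python
from openai import OpenAI

# Point the client at the dstack server's service proxy endpoint.
# `main` (project) and `llama31` (run) are placeholder values; replace them
# with your own project and run names, and pass a dstack user token.
client = OpenAI(
    base_url="http://localhost:3000/proxy/services/main/llama31/v1",
    api_key="<dstack token>",
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```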
@@ -90,37 +90,50 @@ $ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
-If the service defines the [`model`](#model) property, the model can be accessed with
-the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`,
-or via `dstack` UI.
+If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with `Bearer <dstack token>`.
-If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with
-`Bearer <dstack token>`.
+## Configuration options
-??? info "Gateway"
- Running services for development purposes doesn’t require setting up a [gateway](gateways.md).
+
- However, you'll need a gateway in the following cases:
+### Gateway
- * To use auto-scaling or rate limits
- * To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
- * To enable HTTPS for the endpoint and map it to your domain
- * If your service requires WebSockets
- * If your service cannot work with a [path prefix](#path-prefix)
+A service may need a [gateway](gateways.md) in the following cases:
-
+* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
+* To use a custom router, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
+* To enable HTTPS for the endpoint and map it to your domain
+* If your service requires WebSockets
+* If your service cannot work with a [path prefix](#path-prefix)
-
+
- If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
- `https://<run name>.<gateway domain>/`.
+
- If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint
- at `https://gateway.<gateway domain>/`.
+If you want `dstack` to explicitly validate that a gateway is used, you can set the [`gateway`](../reference/dstack.yml/service.md#gateway) property in the service configuration to `true`. In this case, `dstack` will raise an error during `dstack apply` if no default gateway is created.
-## Configuration options
+You can also set the `gateway` property to the name of a specific gateway, if required.
+
+If you have a [gateway](gateways.md) created, the service endpoint will be accessible at `https://<run name>.<gateway domain>/`:
+
+
+ +```shell +$ curl https://llama31.example.com/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -H 'Authorization: Bearer <dstack token>' \ + -d '{ + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "messages": [ + { + "role": "user", + "content": "Compose a poem that explains the concept of recursion in programming." + } + ] + }' +``` -!!! info "No commands" - If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set). +
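The gateway endpoint can be called the same way from the OpenAI Python client. Here is a minimal sketch with streaming, assuming the run `llama31` is served under the `example.com` gateway domain (as in the curl example above) and that the underlying service supports streaming responses:

```python
from openai import OpenAI

# Same request through the gateway endpoint, streamed chunk by chunk.
# `llama31.example.com` mirrors the curl example above; substitute your
# own run name and gateway domain.
client = OpenAI(
    base_url="https://llama31.example.com/v1",
    api_key="<dstack token>",
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming.",
        }
    ],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the completion.
    print(chunk.choices[0].delta.content or "", end="")
```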
### Replicas and scaling
@@ -215,12 +228,6 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
??? info "Disaggregated serving"
Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
-### Model
-
-If the service is running a chat model with an OpenAI-compatible interface,
-set the [`model`](#model) property to make the model accessible via `dstack`'s
-global OpenAI-compatible endpoint, and also accessible via `dstack`'s UI.
-
### Authorization
By default, the service enables authorization, meaning the service endpoint requires a `dstack` user token.
@@ -359,7 +366,7 @@ set [`strip_prefix`](../reference/dstack.yml/service.md#strip_prefix) to `false`
If your app cannot be configured to work with a path prefix, you can host it on a dedicated domain name by setting up a [gateway](gateways.md).
-### Rate limits { #rate-limits }
+### Rate limits
If you have a [gateway](gateways.md), you can configure rate limits for your service using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property.
@@ -408,6 +415,11 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients
+### Model
+
+If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page.
+In this case, `dstack` will use the service's `/v1/chat/completions` endpoint.
+
### Resources
If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a
diff --git a/examples/accelerators/tenstorrent/README.md b/examples/accelerators/tenstorrent/README.md
index bbb0b2207e..78475fb52e 100644
--- a/examples/accelerators/tenstorrent/README.md
+++ b/examples/accelerators/tenstorrent/README.md
@@ -97,7 +97,7 @@ at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/tt-inference-server/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ diff --git a/examples/inference/nim/README.md b/examples/inference/nim/README.md index 1b125dda83..cae9880e56 100644 --- a/examples/inference/nim/README.md +++ b/examples/inference/nim/README.md @@ -78,13 +78,12 @@ Provisioning... ```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill-deepseek.<gateway domain>/`.
## Source code
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index d0c42bf2ee..8a38944b1f 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -12,7 +12,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
```yaml
type: service
- name: deepseek-r1-nvidia
+ name: deepseek-r1
image: lmsysorg/sglang:latest
env:
@@ -38,7 +38,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
```yaml
type: service
- name: deepseek-r1-amd
+ name: deepseek-r1
image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
@@ -69,20 +69,19 @@ $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
-Submit the run deepseek-r1-amd? [y/n]: y
+Submit the run deepseek-r1? [y/n]: y
Provisioning...
---> 100%
```
-Once the service is up, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -107,7 +106,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ !!! info "SGLang Model Gateway" If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details. -> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the OpenAI-compatible endpoint is available at `https://gateway./`. +> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1./`. ## Source code diff --git a/examples/inference/tgi/README.md b/examples/inference/tgi/README.md index 6984ec2ff6..7090780b27 100644 --- a/examples/inference/tgi/README.md +++ b/examples/inference/tgi/README.md @@ -82,13 +82,12 @@ Provisioning... ```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/llama4-scout/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -110,8 +109,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama4-scout.<gateway domain>/`.
## Source code
diff --git a/examples/inference/trtllm/README.md b/examples/inference/trtllm/README.md
index 02ae55f464..5d1e030288 100644
--- a/examples/inference/trtllm/README.md
+++ b/examples/inference/trtllm/README.md
@@ -330,13 +330,12 @@ Provisioning...
## Access the endpoint
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -359,8 +358,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill.<gateway domain>/`.
## Source code
diff --git a/examples/inference/vllm/README.md b/examples/inference/vllm/README.md
index 12bfd18ac1..d315618b94 100644
--- a/examples/inference/vllm/README.md
+++ b/examples/inference/vllm/README.md
@@ -78,13 +78,12 @@ Provisioning...
```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`.
## Source code
diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md
index bca90ac35c..41d73e9e99 100644
--- a/examples/llms/deepseek/README.md
+++ b/examples/llms/deepseek/README.md
@@ -179,7 +179,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
```yaml
type: service
- name: deepseek-r1-nvidia
+ name: deepseek-r1
image: lmsysorg/sglang:latest
env:
@@ -203,7 +203,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
```yaml
type: service
- name: deepseek-r1-nvidia
+ name: deepseek-r1
image: vllm/vllm-openai:latest
env:
@@ -255,20 +255,19 @@ $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
-Submit the run deepseek-r1-amd? [y/n]: y
+Submit the run deepseek-r1? [y/n]: y
Provisioning...
---> 100%
```
-Once the service is up, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -290,8 +289,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ ```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
## Fine-tuning
diff --git a/examples/llms/llama/README.md b/examples/llms/llama/README.md
index 573d4b8037..3f2b8ab54d 100644
--- a/examples/llms/llama/README.md
+++ b/examples/llms/llama/README.md
@@ -171,7 +171,7 @@ at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +curl http://127.0.0.1:3000/proxy/services/main/llama4-scout/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ diff --git a/examples/llms/llama31/README.md b/examples/llms/llama31/README.md index ff288e3c87..b99362cde0 100644 --- a/examples/llms/llama31/README.md +++ b/examples/llms/llama31/README.md @@ -179,13 +179,12 @@ Provisioning...
-Once the service is up, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
```shell -$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ +$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \ -X POST \ -H 'Authorization: Bearer <dstack token>' \ -H 'Content-Type: application/json' \ @@ -207,8 +206,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`.
[//]: # (TODO: How to prompting and tool calling)