5 changes: 2 additions & 3 deletions docs/blog/posts/dstack-sky.md
@@ -121,15 +121,14 @@ model: mixtral
```
</div>

If it has a `model` mapping, the model will be accessible
at `https://gateway.<project name>.sky.dstack.ai` via the OpenAI compatible interface.
The service endpoint will be accessible at `https://<run name>.<project name>.sky.dstack.ai` via the OpenAI-compatible interface.

```python
from openai import OpenAI


client = OpenAI(
base_url="https://gateway.<project name>.sky.dstack.ai",
base_url="https://<run name>.<project name>.sky.dstack.ai/v1",
api_key="<dstack token>"
)
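
# Illustrative usage, not part of the original snippet: query the model
# through the OpenAI-compatible endpoint. "mixtral" is assumed to match
# the `model` mapping shown in the configuration above.
completion = client.chat.completions.create(
    model="mixtral",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)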

72 changes: 42 additions & 30 deletions docs/docs/concepts/services.md
@@ -68,7 +68,7 @@ Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:

`dstack apply` automatically provisions instances and runs the service.

If a [gateway](gateways.md) is not configured, the service’s endpoint will be accessible at
If you do not have a [gateway](gateways.md) created, the service endpoint will be accessible at
`<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">
@@ -90,37 +90,50 @@ $ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \

</div>

If the service defines the [`model`](#model) property, the model can be accessed with
the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`,
or via `dstack` UI.
<!-- If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with `Bearer <dstack token>`. -->

If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with
`Bearer <dstack token>`.
## Configuration options

??? info "Gateway"
Running services for development purposes doesn’t require setting up a [gateway](gateways.md).
<!-- !!! info "No commands"
If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set). -->

However, you'll need a gateway in the following cases:
### Gateway

* To use auto-scaling or rate limits
* To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
Here are cases where a service may need a gateway:

<!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
a gateway is already pre-configured for you. -->
* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
* To enable support for a custom router, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)

If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
`https://<run name>.<gateway domain>/`.
<!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
a gateway is already pre-configured for you. -->

If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint
at `https://gateway.<gateway domain>/`.
If you want `dstack` to explicitly validate that a gateway is used, you can set the [`gateway`](../reference/dstack.yml/service.md#gateway) property in the service configuration to `true`. In this case, `dstack` will raise an error during `dstack apply` if a default gateway is not created.

## Configuration options
You can also set the `gateway` property to the name of a specific gateway, if required.
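
For example, here is a minimal sketch of a service that must run behind a gateway (the name, image, and port are placeholders):

```yaml
type: service
name: llama31

image: vllm/vllm-openai:latest
port: 8000

# Fail `dstack apply` unless a gateway is available;
# a gateway name can be used here instead of `true`
gateway: true
```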

If you have a [gateway](gateways.md) created, the service endpoint will be accessible at `https://<run name>.<gateway domain>/`:

<div class="termy">

```shell
$ curl https://llama31.example.com/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'
```

!!! info "No commands"
If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set).
</div>

### Replicas and scaling

@@ -215,12 +228,6 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
??? info "Disaggregated serving"
Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
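
A rough sketch of these options is shown below; the replica range follows the service reference, while the `rps` metric and target value are illustrative assumptions:

```yaml
# Allow the service to scale between zero and two replicas
replicas: 0..2

scaling:
  # Scale based on requests per second
  metric: rps
  target: 10
```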

### Model

If the service is running a chat model with an OpenAI-compatible interface,
set the [`model`](#model) property to make the model accessible via `dstack`'s
global OpenAI-compatible endpoint, and also accessible via `dstack`'s UI.

### Authorization

By default, the service enables authorization, meaning the service endpoint requires a `dstack` user token.
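
Authorization can be turned off in the service configuration; a minimal sketch, assuming the `auth` property from the [service reference](../reference/dstack.yml/service.md):

```yaml
# Disable token authorization for the service endpoint
auth: false
```
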
@@ -359,7 +366,7 @@ set [`strip_prefix`](../reference/dstack.yml/service.md#strip_prefix) to `false`
If your app cannot be configured to work with a path prefix, you can host it
on a dedicated domain name by setting up a [gateway](gateways.md).
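
If your app can serve requests under the prefix itself, a minimal sketch of the `strip_prefix` option referenced above:

```yaml
# Forward requests with the /proxy/services/... prefix intact
# instead of stripping it before they reach the app
strip_prefix: false
```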

### Rate limits { #rate-limits }
### Rate limits

If you have a [gateway](gateways.md), you can configure rate limits for your service
using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property.
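
For example (a sketch; the `prefix`, `rps`, and `burst` values are illustrative, see the reference above for the exact schema):

```yaml
rate_limits:
  # Limit the OpenAI-compatible chat endpoint: on average 1 request
  # per second, with bursts of up to 9 requests
  - prefix: /v1/chat/completions
    rps: 1
    burst: 9
```
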
@@ -408,6 +415,11 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients

</div>

### Model

If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page.
In this case, `dstack` will use the service's `/v1/chat/completions` endpoint.
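
For example, a sketch of a vLLM-style service (the image, command, port, and model name are placeholders):

```yaml
type: service
name: llama31

image: vllm/vllm-openai:latest
commands:
  - vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
port: 8000

# Make the served model show up on the `Models` page
model: meta-llama/Meta-Llama-3.1-8B-Instruct
```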

### Resources

If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a
8 changes: 3 additions & 5 deletions examples/inference/nim/README.md
@@ -78,13 +78,12 @@ Provisioning...
```
</div>

If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \

</div>

When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill-deepseek.<gateway domain>/`.

## Source code

13 changes: 6 additions & 7 deletions examples/inference/sglang/README.md
@@ -12,7 +12,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B

```yaml
type: service
name: deepseek-r1-nvidia
name: deepseek-r1

image: lmsysorg/sglang:latest
env:
@@ -38,7 +38,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B

```yaml
type: service
name: deepseek-r1-amd
name: deepseek-r1

image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
@@ -69,20 +69,19 @@ $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49

Submit the run deepseek-r1-amd? [y/n]: y
Submit the run deepseek-r1? [y/n]: y

Provisioning...
---> 100%
```
</div>

Once the service is up, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -107,7 +106,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
!!! info "SGLang Model Gateway"
If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
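
A sketch of such a gateway configuration (the backend, region, and domain are placeholders):

```yaml
type: gateway
name: sglang-gateway

backend: aws
region: us-east-1
domain: example.com

# Use the SGLang Model Gateway as the router
router: sglang
```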

> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, or rate limits), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.

## Source code

8 changes: 3 additions & 5 deletions examples/inference/tgi/README.md
@@ -82,13 +82,12 @@ Provisioning...
```
</div>

If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
$ curl http://127.0.0.1:3000/proxy/services/main/llama4-scout/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -110,8 +109,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \

</div>

When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama4-scout.<gateway domain>/`.

## Source code

8 changes: 3 additions & 5 deletions examples/inference/trtllm/README.md
@@ -330,13 +330,12 @@ Provisioning...

## Access the endpoint

If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -359,8 +358,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \

</div>

When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill.<gateway domain>/`.

## Source code

8 changes: 3 additions & 5 deletions examples/inference/vllm/README.md
@@ -78,13 +78,12 @@ Provisioning...
```
</div>

If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \

</div>

When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`.

## Source code

14 changes: 6 additions & 8 deletions examples/llms/deepseek/README.md
@@ -179,7 +179,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.

```yaml
type: service
name: deepseek-r1-nvidia
name: deepseek-r1

image: lmsysorg/sglang:latest
env:
@@ -203,7 +203,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.

```yaml
type: service
name: deepseek-r1-nvidia
name: deepseek-r1

image: vllm/vllm-openai:latest
env:
@@ -255,20 +255,19 @@ $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49

Submit the run deepseek-r1-amd? [y/n]: y
Submit the run deepseek-r1? [y/n]: y

Provisioning...
---> 100%
```
</div>

Once the service is up, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -290,8 +289,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
```
</div>

When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.

## Fine-tuning

8 changes: 3 additions & 5 deletions examples/llms/llama31/README.md
@@ -179,13 +179,12 @@ Provisioning...

</div>

Once the service is up, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer &lt;dstack token&gt;' \
-H 'Content-Type: application/json' \
@@ -207,8 +206,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \

</div>

When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`.

[//]: # (TODO: How to prompting and tool calling)
