Simple Python server that does inference using a GPU. Used as a demo on my NVIDIA DGX Spark (GB10 Grace Blackwell Superchip). I pulled my models from Hugging Face and ran them locally; in theory, I could drop in different models at will.
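Here's a minimal sketch of what the serving side might look like, assuming a FastAPI app plus a diffusers text-to-image pipeline; the model path is hypothetical, auth is omitted for brevity, and only the `/generate` route matches the actual API used below:

```python
# sketch only -- illustrative, not the repo's actual code
import io

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

# Load the pipeline once at startup and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "models/my-model",  # hypothetical local checkout under models/
    torch_dtype=torch.float16,
).to("cuda")

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    image = pipe(req.prompt).images[0]  # PIL.Image
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")
```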
With this one GPU, inference is roughly 32x faster: on CPU it took 30 minutes, while on my GB10 it finished in 55 seconds. That is a testament to why GPUs are so highly sought after. This isn't the first time I've made ML work faster for someone, but it's exciting to have this kind of technology sitting on my desktop rather than in the cloud. It's also a lot cheaper, because if I forget to turn this GPU off, I won't rack up a hefty bill.
The GitHub repo is a mirror of the GitLab repo where I actually develop.
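# development: build and run with app/ and models/ mounted for live editing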
$ docker build -f Dockerfile.dev -t image-server-dev:latest .
$ docker run --privileged --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm --ipc=host -p 8110:80 -v ${PWD}/models:/app/models -v ${PWD}/app:/app/app -w /app image-server-dev:latest
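# production: app code is baked into the image; models/ is still mounted at run time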
$ docker build --no-cache -t image-server:latest .
$ docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v "$PWD/models:/app/models" -p 8110:80 image-server:latest
$ curl -X POST http://127.0.0.1:8110/generate -u <some_user>:<some_secret> -H "Content-Type: application/json" -d '{"prompt":"cinematic neon street, rain reflections"}' --output image.png
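If you'd rather hit the endpoint from Python, the equivalent of the curl call is sketched below using the requests library (same placeholder credentials and prompt):

```python
# client sketch using the requests library; user/secret are placeholders
import requests

resp = requests.post(
    "http://127.0.0.1:8110/generate",
    auth=("<some_user>", "<some_secret>"),  # HTTP basic auth, as in the curl example
    json={"prompt": "cinematic neon street, rain reflections"},
    timeout=300,  # generation can take a while
)
resp.raise_for_status()
with open("image.png", "wb") as f:
    f.write(resp.content)
```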
# open a shell inside the container
$ docker exec -it <container_id> bash
# look at logs
$ docker logs <container_id>
# press Ctrl+C to stop the container (from the terminal running it, of course)
$ docker ps
$ docker rm -f <container_id> # if necessary
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' # drop kernel page/dentry/inode caches after a run (must do)
This app uses a models/ dir that is ignored by git. You should have models/ locally in the root of this application, at the same level as app/.
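Concretely, the layout should look something like this (the Dockerfile names are taken from the build commands above; everything else is illustrative):

```
.
├── app/              # application code
├── models/           # git-ignored; place downloaded model weights here
├── Dockerfile
└── Dockerfile.dev
```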
We should not use uv to develop, for two reasons:

- Dependence on NVIDIA's custom PyTorch image
  - You need `nvcr.io/nvidia/pytorch:25.09-py3` because the GB10 GPU is new and only supported through NVIDIA's specialized PyTorch build.
  - Any other environment (like one built by `uv` or a vanilla Python base image) will be missing the custom CUDA/PyTorch needed to run on GPUs.
- `uv` creates an isolated virtual environment
  - `uv` automatically builds a `.venv` that's isolated from the system-level Python in the container.
  - You can't easily install the NVIDIA-specific PyTorch build into that `.venv`, since it depends on system-level libraries and dependencies already preinstalled in the base image.
  - This breaks GPU access, because the `.venv`'s packages don't link correctly to those CUDA libs (see the quick check after this list).
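A quick way to verify which environment you're in: inside the NVIDIA base image this should report a CUDA-enabled PyTorch and find the GB10, while inside a `uv`-built `.venv` it will typically report that CUDA is unavailable. This is a generic check, nothing repo-specific:

```python
# gpu_check.py - run inside the container to confirm the CUDA-enabled build
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```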
While SSH'd into your machine, run the following:
$ ifconfig | grep "inet "
# copy the one that starts with "192."
Open a new terminal session on your local machine (localhost) and perform an scp:
$ scp [username]@[ip-address-from-above]:/path/to/file /path/to/destination
You should see your image on your local machine.
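For example, with hypothetical values filled in:

$ scp jane@192.168.1.42:/home/jane/image.png ~/Downloads/image.png # hypothetical username, IP, and paths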