@fmoessbauer

This series removes the A100-specific hard-coded settings and adds a parser that automatically detects all available partitions.
The parser does not depend on NVML; it extracts the partition data from the output of nvidia-smi.

We internally tested this on an A30 GPU.
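As a rough illustration of the approach described above, the following Go sketch extracts GPU instance profiles from text in the shape of `nvidia-smi mig -lgip` output using a regular expression. It is not the code from this PR: the type `GPUInstanceProfile`, the function `parseProfiles`, the regular expression, and the hand-written `sampleOutput` table are all assumptions made for illustration, and the column layout of real nvidia-smi output may differ between driver versions.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// GPUInstanceProfile is a hypothetical type holding one row of the profile table.
type GPUInstanceProfile struct {
	GPU   int    // index of the GPU the profile belongs to
	Name  string // profile name, e.g. "1g.6gb"
	ID    int    // profile ID as printed by nvidia-smi
	Free  int    // instances that can still be created
	Total int    // maximum number of instances
}

// sampleOutput is hand-written for illustration (assumed layout),
// not captured from real hardware.
const sampleOutput = `+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.6gb        14     4/4        5.81       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.12gb        5     2/2        11.69      No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
`

// profileRow matches data rows only: GPU index, "MIG <name>", profile ID, free/total.
// Borders, headers, and continuation rows do not match and are skipped.
var profileRow = regexp.MustCompile(`^\|\s*(\d+)\s+MIG\s+(\S+)\s+(\d+)\s+(\d+)/(\d+)\s`)

// parseProfiles returns every profile row found in out. It returns an error
// when no row matches, so a changed output format fails loudly instead of
// silently yielding an empty list.
func parseProfiles(out string) ([]GPUInstanceProfile, error) {
	var profiles []GPUInstanceProfile
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		m := profileRow.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		nums := make([]int, 0, 4)
		for _, i := range []int{1, 3, 4, 5} {
			n, err := strconv.Atoi(m[i])
			if err != nil {
				return nil, fmt.Errorf("parsing %q: %w", m[i], err)
			}
			nums = append(nums, n)
		}
		profiles = append(profiles, GPUInstanceProfile{
			GPU: nums[0], Name: m[2], ID: nums[1], Free: nums[2], Total: nums[3],
		})
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if len(profiles) == 0 {
		return nil, fmt.Errorf("no GPU instance profiles found; output format may have changed")
	}
	return profiles, nil
}
```

Calling `parseProfiles(sampleOutput)` yields two entries, for `1g.6gb` and `2g.12gb`. In a real plugin the input would come from running the command, for example via `os/exec`. Returning an error on zero matches is one way to surface a format change early.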

@crystalzhaizhai
Contributor

crystalzhaizhai commented Jul 18, 2022

Hi Felix,

Thank you for the suggestion!

GKE doesn't support all partition sizes. Only the hard-coded partition sizes are supported by GKE.

@fmoessbauer
Author

Hi @crystalzhaizhai, we also plan to use this plugin independently of GKE, including with smaller Nvidia cards. This is currently not possible because only the A100 is supported. While we could fork the plugin, we believe that working together adds more value in the long run: more people fixing bugs, adding features, and maintaining the plugin. One example is #241, which this PR supersedes by adding a generic solution that will work for all Nvidia cards.

Large OSS projects (like k8s) would never have become so popular without the efforts of developers from various companies working together, independent of the features they need for "their" product.

@crystalzhaizhai
Contributor

crystalzhaizhai commented Jul 25, 2022

Hi, if you use the device plugin independently of GKE, feel free to make your own version. Nvidia also provides its own device plugin with a more generic solution to this problem, which you can use as another reference. GKE maintains its own device plugin instead of using Nvidia's because GKE has its own requirements.

Thank you for your efforts! I appreciate the value that external contributors bring and will discuss with my team how we could set up the process so that design options can be discussed sufficiently.

@crystalzhaizhai
Contributor

crystalzhaizhai commented Jul 25, 2022

I brought your solution to my team, and we agree that the direction you suggested can make the device plugin more scalable. We do have one concern: this PR parses the partition data from the nvidia-smi output using a regexp. How confident are we that the nvidia-smi output will keep the form that the hard-coded regexp expects? Because the output is large, any small change in it may cause the parser to stop working. We are very happy to see external contributors. Let's stay in the loop about how we can move forward from what you have right now.

@fmoessbauer
Author

fmoessbauer commented Jul 26, 2022

> I brought your solution to my team and we agree the direction you suggested can make the device plugin more scalable.

Glad to hear that.

> how confident we are for nvidia-smi output will keep in the form which regexp hard-coded

We cannot really be confident. It would be better to replace all interfacing with nvidia-smi by NVML. The official NVML bindings from Nvidia also appear to be in good shape now. We already use them internally at Siemens (to query which values to add to /etc/nvidia/gpu_config.json), so the required code has already been written.

When I started on the parser, I had a look at #52, which proposed rewriting the plugin for NVML but was never finished. That's why I decided to parse the output. But I'm willing to rewrite at least the parser part using NVML. Later PRs can then refactor other parts of the plugin. IMO this is better than one large refactoring PR.

I'm happy to hear your thoughts.

This patch adds a parser to parse the output of
nvidia-smi mig -lgip to get the possible GPU partitions.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>

This patch replaces the static discovery and mapping of
GPU profiles (and sizes) by a dynamic discovery.

By that, the plugin supports any partitionable GPUs.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>

This patch removes some sanity checks from nvidia_gpu that use hard-coded partition sizes.
By that, we make the plugin compatible with other NVIDIA cards like the A30.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
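
The dynamic discovery described in the commit messages above could, under some assumptions, derive partition sizes from the profile names instead of a hard-coded table. The Go sketch below relies on the MIG naming convention `<compute slices>g.<memory>gb` (for example `1g.6gb` on an A30). The names `partitionSize`, `parsePartitionSize`, and `buildSizeMap` are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// partitionSize is a hypothetical type describing one MIG partition size.
type partitionSize struct {
	ComputeSlices int // number before "g", e.g. 2 in "2g.12gb"
	MemoryGB      int // number before "gb", e.g. 12 in "2g.12gb"
}

// profileName matches names such as "1g.6gb" or "4g.24gb".
// It is anchored only at the start, so a trailing suffix is tolerated (assumption).
var profileName = regexp.MustCompile(`^(\d+)g\.(\d+)gb`)

// parsePartitionSize derives the size from a profile name.
func parsePartitionSize(name string) (partitionSize, error) {
	m := profileName.FindStringSubmatch(name)
	if m == nil {
		return partitionSize{}, fmt.Errorf("unrecognized MIG profile name %q", name)
	}
	slices, err := strconv.Atoi(m[1])
	if err != nil {
		return partitionSize{}, fmt.Errorf("parsing compute slices in %q: %w", name, err)
	}
	mem, err := strconv.Atoi(m[2])
	if err != nil {
		return partitionSize{}, fmt.Errorf("parsing memory size in %q: %w", name, err)
	}
	return partitionSize{ComputeSlices: slices, MemoryGB: mem}, nil
}

// buildSizeMap builds the name-to-size mapping from discovered profile names,
// taking the place of a static, card-specific table.
func buildSizeMap(names []string) (map[string]partitionSize, error) {
	sizes := make(map[string]partitionSize, len(names))
	for _, name := range names {
		size, err := parsePartitionSize(name)
		if err != nil {
			return nil, err
		}
		sizes[name] = size
	}
	return sizes, nil
}
```

For example, `buildSizeMap([]string{"1g.6gb", "4g.24gb"})` returns a map with two entries, so supporting a new card would not require editing a table.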
fmoessbauer force-pushed the fm/parse-gpu-instances-v2 branch from 3c64c42 to 1a9427b on September 15, 2022 11:36
@fmoessbauer
Author

Replaced by #250
