Add support for non A100 GPUs #242

fmoessbauer · 2022-07-14T09:37:05Z

This series removes the A100-specific hard-coded settings and adds a parser that automatically detects all available partitions.
The parser does not depend on NVML; it extracts the partition data from the output of nvidia-smi.
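
To illustrate the idea of extracting partition data from nvidia-smi output, here is a minimal sketch in Python. It is not the code from this series. The sample text, the regular expression, and the names `SAMPLE`, `PROFILE_ROW` and `parse_gpu_instance_profiles` are assumptions made for illustration, and the real `nvidia-smi mig -lgip` layout may differ between driver versions.

```python
import re

# Illustrative sample only (assumption): modelled on the table printed by
# `nvidia-smi mig -lgip`; the numbers are not measurements from a real device.
SAMPLE = """\
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.6gb        14     4/4        5.81       No     14     1     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.12gb        5     2/2        11.69      No     28     2     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.24gb        0     1/1        23.44      No     56     4     0   |
|                                                             4     1     1   |
+-----------------------------------------------------------------------------+
"""

# Hypothetical pattern: matches only the first row of each profile entry,
# i.e. "| <gpu>  MIG <name>  <id>  <free>/<total>  <memory GiB> ...".
PROFILE_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+MIG\s+(?P<name>\d+g\.\d+gb)\s+(?P<id>\d+)\s+"
    r"(?P<free>\d+)/(?P<total>\d+)\s+(?P<mem>\d+(?:\.\d+)?)\s"
)


def parse_gpu_instance_profiles(text):
    """Return one dict per profile row found in the given table text."""
    profiles = []
    for line in text.splitlines():
        match = PROFILE_ROW.match(line)
        if match is None:
            continue  # borders, headers and continuation rows are skipped
        profiles.append(
            {
                "gpu": int(match.group("gpu")),
                "name": match.group("name"),
                "profile_id": int(match.group("id")),
                "free_instances": int(match.group("free")),
                "total_instances": int(match.group("total")),
                "memory_gib": float(match.group("mem")),
            }
        )
    return profiles


if __name__ == "__main__":
    for profile in parse_gpu_instance_profiles(SAMPLE):
        print(profile["name"], profile["profile_id"], profile["total_instances"])
```

Because nothing about the A100 is hard-coded, the same function would pick up whatever profiles the table lists, which is the property this series is after.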

We internally tested this on an A30 GPU.

crystalzhaizhai · 2022-07-18T18:56:39Z

Hi Felix,

Thank you for the suggestion!

GKE doesn't support all partition sizes. Only the hard-coded partition sizes are supported by GKE.

fmoessbauer · 2022-07-19T07:27:21Z

Hi @crystalzhaizhai, we also plan to use this plugin independently of GKE, including with smaller Nvidia cards. This is currently not possible because only the A100 is supported. While we could fork the plugin, we believe that working together adds more value in the long run: more people fixing bugs, adding features and maintaining the plugin. An example is #241, which is superseded by this PR, as this one adds a generic solution that will work for all Nvidia cards.

Large OSS projects (like k8s) would never have become so popular without the efforts of developers from various companies working together, independent of the features they need for "their" product.

crystalzhaizhai · 2022-07-25T18:07:47Z

Hi, if you use the device plugin independently of GKE, feel free to make your own version. Nvidia also provides its own device plugin with a more generic solution to this problem, which you can use as another reference. GKE maintains its own device plugin instead of using Nvidia's because GKE has its own requirements.

Thank you for your efforts! I appreciate the value that external contributors bring and will discuss with my team how we could set up the process so that design options can be sufficiently discussed.

crystalzhaizhai · 2022-07-25T20:20:53Z

I brought your solution to my team and we agree the direction you suggested can make the device plugin more scalable. We do have one concern: this PR uses regular expressions to parse the output of nvidia-smi for the partition data. How confident are we that the nvidia-smi output will keep the form the hard-coded regexps expect? Because the output is large, any small change in it may cause the parser to stop working. We are very happy to see external contributors. Let's stay in the loop about how we can move forward from what you have right now.
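
One way to limit the risk described here is to validate the table header before trusting any parsed rows, so that a layout change fails loudly instead of silently yielding wrong partitions. The following is a minimal sketch in Python under stated assumptions: the column names in `EXPECTED_COLUMNS` are taken from a typical `nvidia-smi mig -lgip` header, and `UnexpectedLayoutError` and `check_header` are hypothetical names, not part of this PR.

```python
# Assumption: the first header row of the table starts with these columns.
EXPECTED_COLUMNS = ["GPU", "Name", "ID", "Instances", "Memory"]


class UnexpectedLayoutError(ValueError):
    """Raised when the table header does not match what the parser expects."""


def check_header(text):
    """Raise UnexpectedLayoutError unless the expected header row is present."""
    for line in text.splitlines():
        # Drop the table borders and split the row into column names.
        tokens = line.strip("|+ ").split()
        if tokens[:2] != ["GPU", "Name"]:
            continue  # not the header row
        if tokens[: len(EXPECTED_COLUMNS)] != EXPECTED_COLUMNS:
            raise UnexpectedLayoutError(f"unexpected columns: {tokens}")
        return
    raise UnexpectedLayoutError("no header row found")


if __name__ == "__main__":
    check_header("| GPU   Name   ID   Instances   Memory   P2P |")
    print("header looks as expected")
```

A check like this does not make regexp parsing robust, but it turns a format drift into an explicit error at startup.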

fmoessbauer · 2022-07-26T10:39:35Z

> I brought your solution to my team and we agree the direction you suggested can make the device plugin more scalable.

Glad to hear that.

> How confident are we that the nvidia-smi output will keep the form the hard-coded regexps expect?

We cannot really be confident. It would be better to use NVML to replace all the interfacing with nvidia-smi. It also looks like the official NVML bindings from Nvidia are now in good shape. We already use them internally at Siemens (to query which values to add to /etc/nvidia/gpu_config.json), so the required code has already been written.

When I started on the parser, I had a look at #52, which proposed rewriting the plugin to use NVML but was never finished. That's why I decided to parse the output. But I'm willing to rewrite at least the parser part using NVML. Later PRs can then refactor other parts of the plugin. IMO this is better than having one large refactoring PR.

I'm happy to hear your thoughts.

fmoessbauer · 2022-09-15T11:41:32Z

Replaced by #250

fmoessbauer mentioned this pull request Jul 15, 2022

Add new mig partition for the new accelerator type #241

Merged

fmoessbauer added 3 commits September 14, 2022 17:23

add parser to discover available GPU partitions

c851475

This patch adds a parser to parse the output of nvidia-smi mig -lgip to get the possible GPU partitions.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>

dynamically discover supported profiles of GPU

859b5f7

This patch replaces the static discovery and mapping of GPU profiles (and sizes) with dynamic discovery. As a result, the plugin supports any partitionable GPU.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>

remove partitioning checks only working for A100

1a9427b

This patch removes some sanity checks from nvidia_gpu that use hard-coded partition sizes. This makes the plugin compatible with other NVIDIA cards like the A30.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>

fmoessbauer force-pushed the fm/parse-gpu-instances-v2 branch from 3c64c42 to 1a9427b Compare September 15, 2022 11:36

fmoessbauer closed this Sep 15, 2022
