Add support for non A100 GPUs #242
Hi Felix, thank you for the suggestion! GKE doesn't support all partition sizes. Only the hard-coded partition sizes are supported by GKE.
Hi @crystalzhaizhai, we also plan to use this plugin independently of GKE, including with smaller Nvidia cards. This is currently not possible, as only the A100 is supported. While we could fork the plugin, we believe that working together adds more value in the long run: more people fixing bugs, adding features and maintaining the plugin. An example would be #241, which is superseded by this PR, as this one adds a generic solution that will work for all Nvidia cards. Large OSS projects (like k8s) would never have gained their popularity without the efforts of developers from various companies working together, independent of the features they need for "their" product.
Hi, if you use the device plugin independently of GKE, feel free to make your own version. Nvidia also provides its own device plugin with a more generic solution to this problem, which you can use as another reference. The reason GKE makes its own specific device plugin and doesn't use Nvidia's device plugin is that GKE has its own requirements. Thank you for the efforts! I appreciate the value from external contributors and will discuss with my team how we could formulate the process so that design options can be sufficiently discussed.
I brought your solution to my team, and we agree that the direction you suggested can make the device plugin more scalable. We do have one concern: this PR parses the partition data from the output of nvidia-smi using regular expressions. How confident are we that the nvidia-smi output will keep the format that the hard-coded regexps expect? As the output is large, any small change in it may cause the parser to stop working. We are very happy to see external contributors. Let's stay in the loop about how we can move forward from what you have right now.
Glad to hear that.
We cannot really be confident. It would be better to use NVML to replace all the interfacing with nvidia-smi. It also looks like the official NVML bindings from Nvidia are now in good shape. We already use them internally at Siemens (to query which values to add to …).
When starting with the parser, I had a look at #52, which proposed rewriting the plugin for NVML but was never finished. That's why I decided to parse the output. But I'm willing to rewrite at least the parser part using NVML. Later PRs can then refactor other parts of the plugin. IMO this is better than having one large refactoring PR. I'm happy to hear your thoughts.
This patch adds a parser to parse the output of nvidia-smi mig -lgip to get the possible GPU partitions. Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
This patch replaces the static discovery and mapping of GPU profiles (and sizes) by a dynamic discovery. By that, the plugin supports any partitionable GPUs. Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
This patch removes some sanity checks from nvidia_gpu that use hard-coded partition sizes. By that, we make the plugin compatible with other NVIDIA cards like the A30. Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Replaced by #250
This series removes the A100-specific hard-coded settings and adds a parser to automatically detect all available partitions.
The parser does not depend on NVML but uses the output of nvidia-smi to extract the partition data. We internally tested this on an A30 GPU.
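
To make the approach and the robustness concern raised in the review concrete, here is a minimal Python sketch of regex-based parsing of nvidia-smi mig -lgip output. It is not the code from this PR (the plugin itself is not written in Python). The names MigProfile, ROW_RE and parse_lgip are hypothetical, and SAMPLE is a hand-written table modeled on the general layout of that command's output, with illustrative values rather than a captured log. The sketch raises an error when no profile row matches, so a change in the output format would fail loudly instead of silently yielding zero partitions.

```python
import re
from typing import List, NamedTuple


class MigProfile(NamedTuple):
    """One GPU instance profile row (hypothetical structure for illustration)."""
    gpu: int
    name: str
    profile_id: int
    free: int
    total: int
    memory_gib: float


# Hand-written sample modeled on the table layout; values are illustrative only.
SAMPLE = """\
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.6gb        19     4/4        5.81       No     14     1     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.12gb       14     2/2        11.69      No     28     2     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
"""

# Matches only the first line of each profile entry:
# GPU index, "MIG <name>", profile ID, free/total instances, memory in GiB.
ROW_RE = re.compile(
    r"^\|\s+(\d+)\s+MIG\s+(\S+)\s+(\d+)\s+(\d+)/(\d+)\s+(\d+(?:\.\d+)?)\s"
)


def parse_lgip(output: str) -> List[MigProfile]:
    """Extract profile rows; raise ValueError if the format is not recognized."""
    profiles = []
    for line in output.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # borders, headers and continuation rows are skipped
        gpu, name, pid, free, total, mem = match.groups()
        profiles.append(
            MigProfile(int(gpu), name, int(pid), int(free), int(total), float(mem))
        )
    if not profiles:
        # Fail loudly: an empty result most likely means the layout changed.
        raise ValueError("no MIG profile rows found; output format may have changed")
    return profiles


if __name__ == "__main__":
    for profile in parse_lgip(SAMPLE):
        print(profile)
```

In real use the text would come from running the command on a node with a partitionable GPU, for example via subprocess.run. The sketch also shows why the concern is valid: the pattern depends on column order and on the MIG prefix, so querying the same data through NVML would avoid this coupling.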