
Conversation

@einsteinXue

This PR is only for code review. The Makefile has not been modified yet.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2025
@openshift-ci

openshift-ci bot commented Mar 17, 2025

Hi @einsteinXue. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 17, 2025
@thom311
Contributor

thom311 commented Mar 17, 2025

The CR has two fields ManualReboot and ManualUpgradeSDK.

From the meaning and usage of those fields, this sounds imperative (the user sets the flag, and then an action happens) rather than describing the end state we want to reach.

For ManualUpgradeSDK, shouldn't we instead configure the desired FirmwareVersion (which might be set to "latest" to upgrade automatically to the latest release)? The operator would then compare the DPU's firmware version with the desired one and trigger an upgrade when they differ. This requires that the operator can check the current firmware version, and do so relatively cheaply.
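
To make the comparison concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the `FirmwareSpec` type, the `DesiredVersion` field, and the `needsUpgrade` helper are hypothetical names, not part of the dpu-operator API, and how the operator learns the current and latest versions is left open.

```go
package main

import "fmt"

// FirmwareSpec is a hypothetical spec fragment: the user states the
// firmware version they want, or "latest".
type FirmwareSpec struct {
	DesiredVersion string
}

// needsUpgrade resolves the desired version and reports whether it
// differs from what the DPU currently runs. current is the version
// reported by the DPU; latest is the newest version the operator knows.
// An empty desired version means "do nothing".
func needsUpgrade(spec FirmwareSpec, current, latest string) (target string, upgrade bool) {
	target = spec.DesiredVersion
	if target == "latest" {
		target = latest
	}
	return target, target != "" && target != current
}

func main() {
	spec := FirmwareSpec{DesiredVersion: "latest"}
	target, upgrade := needsUpgrade(spec, "1.2.0", "1.3.0")
	fmt.Println(target, upgrade)
}
```

The sketch compares versions for equality only, so a desired version older than the current one would also trigger a change. Whether downgrades should be allowed is a separate design decision.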

For ManualReboot, that approach wouldn't work. In that case, maybe the user should create a DpuReboot custom resource that represents a one-shot job, similar to a Kubernetes Job. The operator would watch the CR, trigger the reboot, and update the status of the CR. The operator could even implement this by starting a Kubernetes Job (which already implements concepts like retry).

Maybe what I said is not that different from what is implemented (at least for ManualReboot). However, the "Manual" prefix makes these fields read as imperative commands, when the API should aim for a more declarative feel. For ManualReboot, this seems to be mostly a naming issue.

@wizhaoredhat
Contributor

wizhaoredhat commented Mar 17, 2025

I agree. This is how the Bare Metal Operator does this:
https://book.metal3.io/bmo/reboot_annotation

Do you see something wrong with this approach?
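
For comparison, a small sketch of how an annotation-driven trigger could be detected. As I understand the linked page, the Bare Metal Operator reacts to the `reboot.metal3.io` annotation, optionally with a `/{key}` suffix. The `rebootRequested` helper below is hypothetical and is not BMO code.

```go
package main

import (
	"fmt"
	"strings"
)

// rebootAnnotation is the annotation name described in the linked
// Bare Metal Operator documentation.
const rebootAnnotation = "reboot.metal3.io"

// rebootRequested reports whether the annotations contain either the
// bare reboot annotation or a suffixed form such as "reboot.metal3.io/<key>".
func rebootRequested(annotations map[string]string) bool {
	for key := range annotations {
		if key == rebootAnnotation || strings.HasPrefix(key, rebootAnnotation+"/") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(rebootRequested(map[string]string{"reboot.metal3.io": ""}))
	fmt.Println(rebootRequested(map[string]string{"reboot.metal3.io/maintenance": ""}))
	fmt.Println(rebootRequested(map[string]string{"example.com/other": "x"}))
}
```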

@wizhaoredhat
Contributor

@einsteinXue Is this tested to be working in your setup?

@einsteinXue einsteinXue force-pushed the synaxg_plugin_dev branch 5 times, most recently from e038c1c to f962c6d Compare March 20, 2025 07:22
@einsteinXue
Author

@einsteinXue Is this tested to be working in your setup?

Not tested yet. What I would like to confirm first is whether the current implementation is feasible.

@einsteinXue
Author

Thanks for the comments. I will consider them.

@einsteinXue
Author

https://book.metal3.io/bmo/reboot_annotation

Let me read this. Thank you!

@bn222
Contributor

bn222 commented Apr 17, 2025

For ManualUpgradeSDK, shouldn't we instead configure the (desired) FirmwareVersion (which might be set to "latest", to automatically upgrade to the latest). The operator would then compare the DPU's firmware version with the desired one, and trigger an upgrade as requested. This requires, that the operator can check the current firmware version (and do so in a relatively non-expensive way).

I agree. If the user wants to bring the system into a state where the firmware is upgraded, they should just specify the desired state, and the upgrade should then happen as part of reconciliation.

ManualReboot

We considered using a Job for this, but ended up with something that looks like what is implemented in this PR. I see how this resembles a Job, but I don't see how that would be easy to create. I prefer to have a ManualRebootRequested field that the user sets to true and that we then reconcile back to false by doing the reboot, just like it's done here: 5a05067#diff-10032ffdd17d4bb235e7916d1a7ce1514ffea6d6a0bbc19b27e580cca4ee54f2R55

@thom311 ^

@einsteinXue: In general, I agree with what you're proposing here, although these fields will move to another CR that is being added by another PR. That PR is currently blocked by other PRs, so we will have to wait. Some of the changes in the current PR are sound, though, so it will be a matter of rebasing and moving the fields into the CR that will be introduced.

@einsteinXue
Author

einsteinXue commented Apr 18, 2025

I agree, if the users wants to bring the system into a state where the firmware is upgraded, then he should just specify the desired state, and the upgrade should then happen as part of a reconciliation.

This is under development. Thanks to @wizhaoredhat's suggestion, I can already pull the desired firmware version from quay.io.
How to get the current firmware version is still an open question.

In general, I agree with what you're proposing here, although these fields will move to another CR which are added by another PR. That PR is current blocked by other PRs so we will have to wait, although some of the changes in the current PR are sound and it's going to be a matter of rebasing and moving the fields into the CR that will be introduced.

OK, please notify me when the dependent PR is merged. I will then rebase.

@thom311 or @bn222, could you please share some information about how to create VFs within dpu-operator? As you can see in this PR, we use gRPC to transmit the SDK package to the DPU, and gRPC relies on VFs.

@bn222
Contributor

bn222 commented Apr 18, 2025

Currently, the number of VFs is hardcoded to a predefined value. Once the DpuConfig CR is added, users will be able to change the number.

@synaxgcom

@einsteinXue : In general, I agree with what you're proposing here, although these fields will move to another CR which are added by another PR. That PR is current blocked by other PRs so we will have to wait, although some of the changes in the current PR are sound and it's going to be a matter of rebasing and moving the fields into the CR that will be introduced.

@bn222 Hi Balazs, what is the status of the dependent PR? Has it been merged? Can I start rebasing my SynaXG plugin code?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2025
@bn222
Contributor

bn222 commented Jul 10, 2025

Hi @synaxgcom, we are very close to getting it merged. All the preparatory work has been finished. We are going to continue work on the DPU CRs next sprint and merge it.

I recommend you start rebasing today, because the final piece will not add much to what's here today.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2025
@bn222
Contributor

bn222 commented Sep 25, 2025

Rebase on top of #574

Add a reboot-requested field in that struct and reconcile it.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2025
@einsteinXue
Author

Rebase on top of #574

Add a reboot requested field in that struct and reconcile it

OK, will do. Thanks!

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 2, 2025
@openshift-ci

openshift-ci bot commented Dec 2, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: einsteinXue
Once this PR has been reviewed and has the lgtm label, please assign wizhaoredhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.


6 participants