If no expert is found in a parameter that has "expert" in its name, the loop should continue #7685
Draft
LckyLke wants to merge 9 commits into deepspeedai:master from
Conversation
… moe Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
1. `modal-accelerate` now needs `uv` installed explicitly since the image change to the 2025 one.
2. Moved the accelerate repo cloning into the job, since the original way was incorrect: it was caching some accelerate version and not updating it.
3. Annotated how to actually test the CI work when changing the workflow, as `pull_request_target` will not run the updated .py/.yaml files.
--------- Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
add Masahiro's explanation of why that code is there. --------- Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
As we lost the V100s, disable first so that it stops interfering with PRs, then port to Modal. Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
…er (deepspeedai#7658) This PR allows separate learning rates for the Muon and Adam parts of the Muon optimizer. Follows up deepspeedai#7657. Signed-off-by: Guokai Ma <guokai.ma@intel.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
…ntinue Otherwise: `TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'` Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
…o import moe" This reverts commit 2f232b9. Signed-off-by: Luke Friedrichs <lukefriedrichs@gmail.com>
Collaborator
@stas00, FYI
Author
Converted this to a draft because just continuing is not sufficient: the parameter is not saved at all in this case, so loading the model again then fails.
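To illustrate the point, here is a minimal sketch (names and structure are illustrative, not DeepSpeed's actual implementation) of what a more complete fix might look like: instead of dropping a parameter whose name merely contains "expert", route it to the non-expert state dict so it still gets checkpointed and can be loaded back:

```python
import re

def split_state_dict(state_dict):
    """Split a state dict into per-expert params and everything else.

    Keys that contain 'expert' in their name but carry no expert index
    (e.g. a hypothetical 'experts_mask' buffer) fall through to the
    shared dict instead of being silently skipped, so a later load
    round-trips correctly.
    """
    expert_sd, shared_sd = {}, {}
    for key, value in state_dict.items():
        # Only keys with a numeric expert index are true expert params.
        m = re.search(r"experts\.(\d+)\.", key)
        if m:
            expert_sd[key] = value
        else:
            shared_sd[key] = value  # still saved with the regular checkpoint
    return expert_sd, shared_sd
```

The design choice here is that "is an expert parameter" is decided by the presence of an expert index, not by a substring match on the name, which is what trips up the current code.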
Collaborator
I haven't gotten to saving checkpoints yet, so I don't have an understanding of this code yet. It's interesting that someone is using this old implementation! @LckyLke, we are working on modernizing the original DS-MoE here: snowflakedb/ArcticTraining#272 - currently qwen3-moe and qwen3-next are supported, but no checkpoint saving yet... that will come later.
Author
@stas00 thanks for the info, I will definitely check it out :)
I have implemented some custom logic in the deepspeed_moe classes, and having "expert" in any parameter name breaks the checkpoint saving function.

The warning triggers because the code finds a parameter that has "expert" in its name but is not actually an expert parameter:

[WARNING] [engine.py:3597:_save_moe_checkpoint] No expert found in key transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask.

But since the loop does not continue, this error still happens:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

A simple continue fixes this :)
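For context, a minimal sketch of the failure mode and the proposed fix (function and key names are illustrative, not DeepSpeed's actual code):

```python
import re

def get_expert_id(key):
    # Returns the expert index embedded in the parameter name,
    # or None when no numeric index is present.
    m = re.search(r"experts\.(\d+)\.", key)
    return m.group(1) if m else None

def collect_expert_params(state_dict_keys):
    expert_params = {}
    for key in state_dict_keys:
        if "expert" in key and "moe" in key:
            expert_id = get_expert_id(key)
            if expert_id is None:
                # e.g. '...deepspeed_moe.gate.wg.experts_mask' matches the
                # substring check but carries no expert index; without this
                # guard, int(None) below raises the TypeError from the log.
                print(f"No expert found in key {key}")
                continue
            expert_params.setdefault(int(expert_id), []).append(key)
    return expert_params
```

With the `continue` in place, the mask-like key is skipped instead of being passed to `int()` as `None`.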