Streamline KD & QAD transformers Trainers #708
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

    @@            Coverage Diff             @@
    ##             main     #708      +/-   ##
    ==========================================
    - Coverage   74.69%   74.62%    -0.07%
    ==========================================
      Files         192      192
      Lines       18946    18989       +43
    ==========================================
    + Hits        14152    14171       +19
    - Misses       4794     4818       +24
examples/llm_qat/README.md (outdated):

      > **_NOTE:_** `launch.sh` defaults to use `LlamaDecoderLayer` as the transformer layer class. If your model uses a different class, you need to pass `--fsdp_transformer_layer_cls_to_wrap <your_layer_class>` to the `launch.sh` script. For example, for `Qwen/Qwen3-8B`, specify `--fsdp_transformer_layer_cls_to_wrap Qwen3DecoderLayer` as an additional argument.
    - > **_NOTE:_** The script defaults to using FSDP1. To use FSDP2, pass "--use_fsdp2 True" to the `launch.sh` script. Note that FSDP2 is less stable than FSDP1 currently. Use it with caution.
    + > **_NOTE:_** The script defaults to using FSDP1. To use FSDP2, pass "--backend=fsdp2" to the `launch.sh` script. Note that FSDP2 is less stable than FSDP1 currently. Use it with caution.
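The two README notes above describe separate `launch.sh` flags that can be combined. As a minimal sketch (the helper function is hypothetical; the flag names are taken verbatim from the notes, and `launch.sh` is assumed to live in the example directory), assembling the command line might look like:

```python
# Hypothetical helper assembling a launch.sh command line from the
# README notes above. Only the flag names come from the README; the
# helper itself is illustrative, not part of the repository.
def build_launch_cmd(layer_cls: str, use_fsdp2: bool = False) -> list[str]:
    cmd = ["./launch.sh", "--fsdp_transformer_layer_cls_to_wrap", layer_cls]
    if use_fsdp2:
        # FSDP2 backend, per the flag updated in this PR.
        cmd.append("--backend=fsdp2")
    return cmd

# e.g. for Qwen/Qwen3-8B, whose layer class is Qwen3DecoderLayer:
print(build_launch_cmd("Qwen3DecoderLayer", use_fsdp2=True))
```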
Is this statement still valid? "Note that FSDP2 is less stable than FSDP1 currently. Use it with caution."
I doubt it, but I don't have proof. I don't have proof that it is less stable either.
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Force-pushed from bde5788 to 190e4d2.
Force-pushed from c4c0d19 to 9f3b0f8.
Force-pushed from 9f3b0f8 to 50cb0f4.
Force-pushed from f6e1196 to 796b023.
    - # Note: QAD doesn't work with FSDP wrapped model. We quantize model before the wrapper.
    - # The drawback is that we can't train a model that is bigger than a single GPU memory.
    - # And memory efficient loading doesn't work.
    + # Note: FSDP memory efficient loading doesn't work.
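The removed comment above encodes an ordering constraint: with QAD, quantization must happen on the unwrapped model before FSDP sharding, which is why the full model must fit in a single GPU's memory and memory-efficient loading is unavailable. A toy sketch of that ordering, with `quantize` and `fsdp_wrap` as stand-ins for the real ModelOpt and PyTorch APIs (not the actual implementations):

```python
# Stand-in functions illustrating the required call order only;
# the real code uses ModelOpt quantization and torch FSDP wrapping.
def quantize(model: dict) -> dict:
    # Must see the plain, unwrapped model (QAD constraint).
    return {**model, "quantized": True}

def fsdp_wrap(model: dict) -> dict:
    # Sharding happens only after quantization has been applied.
    return {**model, "fsdp_wrapped": True}

model = {"name": "toy"}
model = quantize(model)   # 1. quantize the unwrapped model
model = fsdp_wrap(model)  # 2. only then wrap with FSDP
print(model)
```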
Won't be needed if we do https://github.com/NVIDIA/Model-Optimizer/pull/708/files#r2677776721.
realAsma left a comment:
Looks great!!
What does this PR do?
Type of change: Refactor and stabilization
Overview:
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
Additional Information