New Transformer (General-purpose, Quantization resilient / Enables 1.58-bit ternary training without STE / Long-term memory / Stable convergence), New Optimizer (General-purpose), and New Scheduler (General-purpose) #1964
muooon started this conversation in Show and tell
Hello everyone,
I am sharing this post to contribute back to the open-source community.
I have immense respect and gratitude for the pioneers of bits and bytes and their incredible achievements.
I am only an amateur, but I have released three general-purpose tools. Each is fully independent and is not optimized for any specific task.
1. New Transformer Architecture: D-RNA
2. New Optimizer Family: emo series (includes 5 variations)
3. New Scheduler: emoPulse (derived from the emo optimizer)
All three work well even with ternary (1.58-bit) quantization and adapt well to other quantization methods.
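For readers new to the term, here is a minimal sketch of what ternary (1.58-bit) weights look like. It uses the widely known absmean rounding scheme as an assumption for illustration; it is not taken from the D-RNA or emo code, and the function names `ternary_quantize` and `dequantize` are hypothetical.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 with one shared scale (absmean scheme, assumed)."""
    # The shared scale is the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights from the ternary codes."""
    return [c * scale for c in codes]

# Three possible states per weight carry log2(3) bits, which is about 1.58.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.2, 0.02]
codes, scale = ternary_quantize(weights)
approx = dequantize(codes, scale)
print(codes, round(scale, 3), round(bits_per_weight, 3))
```

Small weights collapse to 0 and large ones saturate at plus or minus the scale, which is why training under this constraint is usually considered hard.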
I am not certain whether this directly serves the bits and bytes ecosystem, but I would greatly appreciate it if you could take a look at these three projects when you have a moment.
D-RNA : https://github.com/muooon/DRNA
1.58bit sample : https://github.com/muooon/DRNA/tree/drna/158b_train_sample
emo optimizer : https://github.com/muooon/EmoSens
emo scheduler : https://github.com/muooon/EmoSens/tree/v3.9.0_ecc/scheduler
Key features of D-RNA:
It can be used to build security-adaptive models, public-key models, and private-key models. The architecture lets you freeze the base model and expand it with LoRA in an MoE (Mixture of Experts) fashion.
D-RNA is highly versatile and can be used in many different ways.
It remains compatible with standard Transformers, so you can easily port existing weights over.
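To make the "freeze the base model and expand it using LoRA in an MoE fashion" idea concrete, here is a generic pure-Python sketch. It is only an assumption about what such a layer could look like, not the D-RNA implementation; `LoRAExpert`, `moe_lora_forward`, and the softmax gate are hypothetical names and choices.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class LoRAExpert:
    """Low-rank update B @ A; only A and B would be trained (hypothetical helper)."""
    def __init__(self, d_in, d_out, rank, rng):
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start

    def delta(self, x):
        return matvec(self.B, matvec(self.A, x))

def moe_lora_forward(x, base_w, experts, gate_w):
    """Frozen base output plus a gate-weighted sum of LoRA expert updates."""
    out = matvec(base_w, x)             # base_w is never updated
    gates = softmax(matvec(gate_w, x))  # one gate score per expert
    for g, expert in zip(gates, experts):
        for i, d in enumerate(expert.delta(x)):
            out[i] += g * d
    return out, gates

rng = random.Random(0)
d_in, d_out, rank = 4, 3, 2
base_w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
experts = [LoRAExpert(d_in, d_out, rank, rng) for _ in range(2)]
gate_w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(len(experts))]
x = [1.0, -0.5, 0.25, 2.0]
out, gates = moe_lora_forward(x, base_w, experts, gate_w)
```

Because each `B` starts at zero, the layer initially reproduces the frozen base output exactly, so adding experts does not disturb ported weights until the adapters are trained.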
Key features of the emo optimizers and the emo scheduler:
Both the emo optimizers and the scheduler provide Auto-LR (automatic learning rate) capabilities.
This mechanism derives the learning rate directly from the loss rather than from gradient measurements, which keeps convergence stable even under quantization constraints.
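As a rough illustration of what "deriving the learning rate from the loss" can mean, the sketch below adjusts the rate from two moving averages of the loss. This rule is an assumption made up for illustration; it is not the emoPulse or emo optimizer update, and `LossDrivenLR` and all its parameters are hypothetical. See the repositories linked above for the actual mechanism.

```python
class LossDrivenLR:
    """Toy scheduler: adjusts the learning rate using only the loss history."""

    def __init__(self, base_lr=1e-3, min_lr=1e-6, max_lr=1e-2, fast=0.3, slow=0.03):
        self.lr = base_lr
        self.min_lr, self.max_lr = min_lr, max_lr
        self.fast, self.slow = fast, slow  # smoothing factors for the two averages
        self.fast_ema = None               # short-term loss average
        self.slow_ema = None               # long-term loss average

    def step(self, loss):
        if self.fast_ema is None:
            # First observation: initialise both averages, keep the base rate.
            self.fast_ema = self.slow_ema = loss
            return self.lr
        self.fast_ema += self.fast * (loss - self.fast_ema)
        self.slow_ema += self.slow * (loss - self.slow_ema)
        # Positive when the recent loss is below the longer-run average.
        trend = (self.slow_ema - self.fast_ema) / (abs(self.slow_ema) + 1e-12)
        # Grow the rate while the loss improves, shrink it otherwise (bounded).
        factor = 1.0 + max(-0.5, min(0.5, trend))
        self.lr = max(self.min_lr, min(self.max_lr, self.lr * factor))
        return self.lr

sched = LossDrivenLR()
falling = [sched.step(1.0 - 0.05 * i) for i in range(10)]  # loss keeps improving

sched2 = LossDrivenLR()
rising = [sched2.step(1.0 + 0.05 * i) for i in range(10)]  # loss keeps getting worse
```

No gradient statistics are read anywhere in `step`, so a rule of this shape behaves the same whether the weights are full precision or quantized.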