A from-scratch implementation of a decoder-only, transformer-based language model. You can find my articles about this project here: https://davids.bearblog.dev/mathematical-foundation-of-self-attention/ https://davids.bearblog.dev/mathematical-and-architectural-analysis-of-decoder-only-transformers/
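At its core, a decoder-only transformer is a stack of blocks that each combine masked (causal) self-attention with a position-wise feed-forward network. The snippet below is a minimal sketch of one such block, assuming PyTorch; it is not the code from this repository, and the names (`DecoderBlock`, `d_model`, `n_heads`, `d_ff`) are purely illustrative.

```python
# Minimal sketch of a single decoder-only transformer block (assumes PyTorch).
# Illustrative only; not this repository's actual implementation.
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        # Pre-norm masked self-attention with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        # Position-wise feed-forward network with a residual connection.
        x = x + self.ff(self.norm2(x))
        return x


if __name__ == "__main__":
    block = DecoderBlock()
    tokens = torch.randn(2, 16, 256)   # (batch, sequence length, d_model)
    print(block(tokens).shape)         # torch.Size([2, 16, 256])
```

A full model stacks several of these blocks on top of a token embedding and positional encoding, then projects the final hidden states back to vocabulary logits for next-token prediction.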
This project gave me insight and perspective on language models, and made me realize how much data is required and how compute-intensive language modelling is as a task: language must not only be semantically and grammatically correct, but also extremely concise and informative.
Some stats about one of the completed training runs:

Visual representation of how attention mechanisms work:
Attention mechanisms applied to images; the white blur shows where the model attends.
Image credits: https://arxiv.org/abs/1502.03044
Attention mechanisms at work on next-word prediction.
Image credits: https://arxiv.org/abs/1706.03762
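What these attention maps visualise is, roughly, the softmax-normalised similarity between each query and every key, i.e. softmax(QK^T / sqrt(d_k)) from the second paper. Below is a minimal, hypothetical sketch of that computation, again assuming PyTorch; the function name and shapes are illustrative, not this repository's API.

```python
# Minimal sketch of the attention-weight matrix visualised in the figures above,
# i.e. softmax(QK^T / sqrt(d_k)). Assumes PyTorch; illustrative only.
import torch


def attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Each row is one query position's probability distribution over the key positions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # scaled dot-product similarities
    return torch.softmax(scores, dim=-1)            # rows sum to 1


if __name__ == "__main__":
    q = torch.randn(6, 32)        # 6 query positions, head dimension 32
    k = torch.randn(6, 32)
    w = attention_weights(q, k)
    print(w.sum(dim=-1))          # tensor of ones: each row is a distribution
    # Plotting w as a heatmap gives the kind of attention map shown above.
```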