
Tokenization speedup and llama3-like weight init #432

Merged
le1nux merged 15 commits into main from improve_data_writeout_perf on Mar 6, 2026

Conversation

@le1nux (Member) commented Feb 17, 2026

What does this PR do?

  • improves throughput of tokenization
  • adds llama3-like weight initialization
  • fixes failing warmstart torchrun test

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • [ ] I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux requested a review from BlueCrescent February 27, 2026 17:46
@le1nux le1nux marked this pull request as ready for review February 27, 2026 17:46
Comment on lines +111 to +114
raise ValueError(
f"Token values out of range for {token_size_in_bytes} bytes: "
f"min={min_val}, max={max_val}, allowed=[0, {max_allowed}]"
)
Member:
Maybe it would be helpful to identify the faulty token (as in the previous implementation), or better yet report both its index in token_data (via enumerate) and its index in arr (via argmax/argmin) here.
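A minimal sketch of what this suggestion could look like, assuming arr is a NumPy array of token ids and that the bounds check mirrors the excerpt above. The function name check_token_range and its signature are hypothetical, not from the PR; only the error-message format and the names token_size_in_bytes, min_val, max_val, and max_allowed come from the diff.

```python
import numpy as np

def check_token_range(arr: np.ndarray, token_size_in_bytes: int) -> None:
    """Validate that every token id fits into token_size_in_bytes unsigned bytes,
    reporting the position of the first offending value."""
    max_allowed = 2 ** (8 * token_size_in_bytes) - 1
    min_val, max_val = int(arr.min()), int(arr.max())
    if min_val < 0 or max_val > max_allowed:
        # argmin/argmax return the flat index of the extreme value,
        # which is exactly the offending position in arr.
        bad_idx = int(arr.argmin()) if min_val < 0 else int(arr.argmax())
        raise ValueError(
            f"Token values out of range for {token_size_in_bytes} bytes: "
            f"min={min_val}, max={max_val}, allowed=[0, {max_allowed}]; "
            f"first offending index in arr: {bad_idx} (value {int(arr[bad_idx])})"
        )
```

This keeps the original message intact while appending the index, so downstream log parsing of the existing prefix is unaffected.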

@le1nux le1nux requested a review from rrutmann March 6, 2026 12:38
f"Expected: >={self.settings.training_target.num_target_tokens}"
)
dataset_tokens = len(self.train_dataset) * self.settings.step_profile.sequence_length
expected_tokens = self.settings.training_target.num_target_tokens
Collaborator:
The name expected_tokens is a bit misleading here, since this is not the expected size of the dataset but rather the num_target_tokens. Maybe rename it to num_target_tokens?
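A hedged sketch of the check with the suggested rename applied, assuming the surrounding code compares the dataset's available tokens against the configured training target. The wrapper function validate_token_budget and its parameters are hypothetical; dataset_tokens, sequence_length, and num_target_tokens follow the names in the excerpt above.

```python
def validate_token_budget(num_samples: int, sequence_length: int, num_target_tokens: int) -> None:
    """Ensure the dataset holds at least num_target_tokens tokens for training."""
    # Total tokens the dataset can supply at the configured sequence length.
    dataset_tokens = num_samples * sequence_length
    if dataset_tokens < num_target_tokens:
        raise ValueError(
            f"Dataset too small: {dataset_tokens} tokens available. "
            f"Expected: >={num_target_tokens}"
        )
```

Using num_target_tokens directly avoids the ambiguity: the variable names the configured target, not an expectation about the dataset's size.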

@le1nux le1nux changed the title refactor: improved the throughput of the tokenized file writer by usi… Tokenization speedup and llama3-like weight init Mar 6, 2026
@le1nux le1nux merged commit 627f84e into main Mar 6, 2026
3 checks passed
@le1nux le1nux deleted the improve_data_writeout_perf branch March 6, 2026 18:40
3 participants