
Tokenization speedup and llama3-like weight init #432

Merged
le1nux merged 15 commits into main from improve_data_writeout_perf on Mar 6, 2026

Conversation

@le1nux (Member) commented Feb 17, 2026

What does this PR do?

  • improves throughput of tokenization
  • adds llama3-like weight initialization
  • fixes failing warmstart torchrun test

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • [ ] I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux requested a review from BlueCrescent February 27, 2026 17:46
@le1nux le1nux marked this pull request as ready for review February 27, 2026 17:46
Comment on lines +111 to +114
raise ValueError(
f"Token values out of range for {token_size_in_bytes} bytes: "
f"min={min_val}, max={max_val}, allowed=[0, {max_allowed}]"
)
Member:
Maybe it would be helpful to identify the faulty token (as in the previous implementation), or better yet report both its index in token_data (via enumerate) and its index in arr (via argmax/argmin) here.
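A minimal sketch of what this suggestion could look like, assuming arr is a NumPy array of token ids and that the bounds check mirrors the excerpt above. The function name check_token_range and its signature are hypothetical, not from the PR; only the error-message format and the names token_size_in_bytes, min_val, max_val, and max_allowed come from the diff.

```python
import numpy as np

def check_token_range(arr: np.ndarray, token_size_in_bytes: int) -> None:
    """Validate that every token id fits into token_size_in_bytes unsigned bytes,
    reporting the position of the first offending value."""
    max_allowed = 2 ** (8 * token_size_in_bytes) - 1
    min_val, max_val = int(arr.min()), int(arr.max())
    if min_val < 0 or max_val > max_allowed:
        # argmin/argmax return the flat index of the extreme value,
        # which is exactly the offending position in arr.
        bad_idx = int(arr.argmin()) if min_val < 0 else int(arr.argmax())
        raise ValueError(
            f"Token values out of range for {token_size_in_bytes} bytes: "
            f"min={min_val}, max={max_val}, allowed=[0, {max_allowed}]; "
            f"first offending index in arr: {bad_idx} (value {int(arr[bad_idx])})"
        )
```

This keeps the original message intact while appending the index, so downstream log parsing of the existing prefix is unaffected.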

@le1nux le1nux requested a review from rrutmann March 6, 2026 12:38
f"Expected: >={self.settings.training_target.num_target_tokens}"
)
dataset_tokens = len(self.train_dataset) * self.settings.step_profile.sequence_length
expected_tokens = self.settings.training_target.num_target_tokens
Collaborator:
The name expected_tokens is a bit misleading here, since this is not the expected size of the dataset but rather the num_target_tokens. Maybe rename it to num_target_tokens?
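A hedged sketch of the check with the suggested rename applied, assuming the surrounding code compares the dataset's available tokens against the configured training target. The wrapper function validate_token_budget and its parameters are hypothetical; dataset_tokens, sequence_length, and num_target_tokens follow the names in the excerpt above.

```python
def validate_token_budget(num_samples: int, sequence_length: int, num_target_tokens: int) -> None:
    """Ensure the dataset holds at least num_target_tokens tokens for training."""
    # Total tokens the dataset can supply at the configured sequence length.
    dataset_tokens = num_samples * sequence_length
    if dataset_tokens < num_target_tokens:
        raise ValueError(
            f"Dataset too small: {dataset_tokens} tokens available. "
            f"Expected: >={num_target_tokens}"
        )
```

Using num_target_tokens directly avoids the ambiguity: the variable names the configured target, not an expectation about the dataset's size.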

@le1nux le1nux changed the title refactor: improved the throughput of the tokenized file writer by usi… Tokenization speedup and llama3-like weight init Mar 6, 2026
@le1nux le1nux merged commit 627f84e into main Mar 6, 2026
3 checks passed
@le1nux le1nux deleted the improve_data_writeout_perf branch March 6, 2026 18:40
3 participants