Conversation
@property
def bos_token_id(self):
    raise NotImplementedError("Missing <bos>")

@property
def eos_token_id(self):
    raise NotImplementedError("Missing <eos>")
It's quite annoying, but we use HF's already, so we shouldn't collapse tokens IMO.
tests/test_dataloaders.py
Very basic test ... I couldn't think of a more robust one.
hf_tokenizer_kwargs = {}
if vocab_extra_ids > 0:
    # TODO @thomasw21 we might need to concatenate to a pre-existing list?
    hf_tokenizer_kwargs["additional_special_tokens"] = [f"<extra_id_{_id}>" for _id in range(vocab_extra_ids)]
I think we'll want something cleaner here, but since we're not using additional_special_tokens in our tokenizer I'd say it's okay to override that value. I think at some point we'll push another tokenizer on the hub with the additional tokens. cc @SaulLu
Yeah, this will override the previous special tokens if they exist, right? Does it not work to add them later with self.tokenizer.add_special_tokens?
Not very familiar with this part of the tokenizer code, but I would also have expected it to expand the vocabulary instead of re-using existing extra tokens :)
OK, yes, if the embedding is padded later then this is the right way.
I just checked: whether they are added in the from_pretrained method or with the add_special_tokens method, in both cases the tokens will be added, but the value of the additional_special_tokens property will be overwritten.
If we want to keep the previously added tokens in the additional_special_tokens property, we need to do:
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, **hf_tokenizer_kwargs)
new_special_tokens = {
    "additional_special_tokens": self.tokenizer.additional_special_tokens + [
        f"<extra_id_{_id}>"
        for _id in range(vocab_extra_ids)
        if f"<extra_id_{_id}>" not in self.tokenizer.additional_special_tokens
    ]
}
self.tokenizer.add_special_tokens(new_special_tokens)
Hum, let's not over-engineer this ... we're not using any right now. I can add a warning saying we're going to overwrite the additional tokens (otherwise I have to switch the logic a bit for no reason).
Fine with a warning, but is there anything wrong with @lucile's solution? In any case I don't think this is essential, I was just curious; as long as it is documented I don't think we should spend too much time on it.
The issue is that the MLMDataset needs to be able to query the sentinel tokens. Right now I assume that all additional_special_tokens are sentinel tokens. So now we need to build a tokenizer that has specific MLM tokens; I can try and do that, I just didn't want to do it out of laziness :D
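For concreteness, a minimal sketch of that assumption (hypothetical helper, not the PR's actual code): treat every entry of additional_special_tokens as a sentinel and resolve the ids through the HF tokenizer.
from transformers import AutoTokenizer

def get_sentinel_token_ids(tokenizer):
    # Assumption from the thread: every additional special token
    # (e.g. "<extra_id_0>", "<extra_id_1>", ...) is a sentinel token.
    return tokenizer.convert_tokens_to_ids(tokenizer.additional_special_tokens)

# "t5-small" is just a placeholder checkpoint that already ships with <extra_id_*> tokens.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
all_sentinel_token_ids = get_sentinel_token_ids(tokenizer)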
spans_start[1:], np.full((1,), len(sample), dtype=np.int32)]
)

sentinel_token_ids = all_sentinel_token_ids[:num_noise_spans]
Given num_noise_spans is always the same, it might be slightly faster to store sentinel_token_ids as a class attribute of MLMDataset & feed it as an argument to the func (see the sketch below).
I wonder if it wouldn't be better to make num_noise_spans probabilistic instead of deterministic.
I also have a strong intuition that we'll want to change those values. But the idea is to have T5 MLM here and rely on their numbers.
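A minimal sketch of the class-attribute idea suggested above (hypothetical names such as build_mlm_sample and get_raw_sample; not the PR's actual code): since num_noise_spans is fixed, the slice can be computed once in the dataset's constructor and passed down.
class MLMDataset:
    def __init__(self, tokenizer, num_noise_spans):
        all_sentinel_token_ids = tokenizer.convert_tokens_to_ids(
            tokenizer.additional_special_tokens
        )
        # Computed once instead of re-slicing for every sample.
        self.sentinel_token_ids = all_sentinel_token_ids[:num_noise_spans]

    def __getitem__(self, idx):
        sample = self.get_raw_sample(idx)  # hypothetical helper
        return build_mlm_sample(sample, sentinel_token_ids=self.sentinel_token_ids)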
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
up to num_items
"""
mask_indices = np.arange(num_items - 1) < (num_segments - 1)
# TODO @thomasw21 handle random state correctly, ie synchronized across TP.
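For context, a sketch of the T5-style random segmentation this mask feeds into (paraphrased from the T5 recipe; not necessarily the exact code in this PR):
import numpy as np

def random_segmentation(num_items, num_segments, rng=np.random):
    # Partition num_items positions into num_segments non-empty segments,
    # uniformly at random.
    mask_indices = np.arange(num_items - 1) < (num_segments - 1)
    rng.shuffle(mask_indices)
    first_in_segment = np.pad(mask_indices, [[1, 0]])
    segment_id = np.cumsum(first_in_segment)
    # Segment lengths = how many positions carry each segment id.
    return np.bincount(segment_id, minlength=num_segments)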
This scares me a bit because TP random-state things are hard to debug, but tbh we should just test ASAP to see if the loss goes down at the expected rate.
Ah yes, I need to double-check that. I can have a go at it; I had forgotten about this TODO.
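One possible way to address that TODO, sketched under the assumption of a torch.distributed setup where the tensor-parallel group and its source rank are available (e.g. via Megatron-style mpu helpers); this is not what the PR currently does:
import numpy as np
import torch
import torch.distributed as dist

def make_tp_synced_rng(base_seed, tp_group, tp_src_rank):
    # Broadcast the seed from the first rank of the tensor-parallel group so
    # that every TP rank draws identical noise masks for the same sample.
    seed = torch.tensor([base_seed], dtype=torch.int64, device="cuda")
    dist.broadcast(seed, src=tp_src_rank, group=tp_group)
    return np.random.RandomState(int(seed.item()))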
TevenLeScao
left a comment
Read through the PR and didn't catch anything worrying. Let's just test it ASAP.
@Muennighoff Waiting for your approval.
Muennighoff
left a comment
Nice job! Is the plan roughly as follows?
Merge this -> Finish & merge MTF (& figure out a plan for multilingual retention) -> Try out MLM+MTF on a small bloom model -> Try out Enc-Dec+MLM+MTF on a small bloom model -> Try out the best option on bloom176B
Not exactly; in priority order (in case we have idle compute we go to the next item):
Modify universal checkpoint parameter patterns based on the specific model configuration. This commit adds support for the llama family of models. Signed-off-by: Moshe Island <misland@habana.ai> Co-authored-by: Moshe Island <misland@habana.ai>
This is the MLM adaptation part.
@thomasw21