- This implementation is the same as Transformers.Bert, with a tiny embeddings tweak.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE (implemented in BPE.jl) as its tokenizer (same as GPT-2) and uses a different pre-training scheme. A minimal sketch of the byte-to-character mapping behind byte-level BPE appears after this list.
- RoBERTa doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token (or </s>); see the pair-formatting sketch after this list.
- We can also wrap CamemBERT (the French version of BERT) around RoBERTa.
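
The "byte-level" part of the tokenizer works by first mapping every raw byte to a printable character, so that BPE merges can run on ordinary strings and no input ever produces an unknown symbol. The sketch below is a minimal, illustrative version of that mapping in plain Julia; the names `bytes_to_unicode` and `byte2char` are made up for this example and are not part of BPE.jl or Transformers.jl.

```julia
# Illustrative byte-to-character table in the style of GPT-2's byte-level BPE.
# Printable Latin-1 bytes map to themselves; every other byte is shifted past
# 255 so it still has a visible stand-in character.
function bytes_to_unicode()
    keep = vcat(Int('!'):Int('~'), Int('¡'):Int('¬'), Int('®'):Int('ÿ'))
    table = Dict{UInt8,Char}(UInt8(b) => Char(b) for b in keep)
    n = 0
    for b in 0x00:0xff
        if !haskey(table, b)
            table[b] = Char(256 + n)  # shifted stand-in for a non-printable byte
            n += 1
        end
    end
    return table
end

byte2char = bytes_to_unicode()
# Every byte of the UTF-8 encoding becomes one visible character, so even
# emoji and accented letters never fall outside the tokenizer's alphabet.
visible = join(byte2char[b] for b in codeunits("héllo 🤗"))
```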
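
Because there are no token_type_ids, a sequence pair is presented to RoBERTa purely through separator tokens. This is a minimal sketch of that formatting with plain strings; real inputs would of course go through the tokenizer, and the variable names here are only for illustration.

```julia
# BERT:    [CLS] A [SEP] B [SEP]   plus token_type_ids (0s for A, 1s for B)
# RoBERTa: <s> A </s></s> B </s>   with no segment ids at all
bos, eos = "<s>", "</s>"
segment_a = "The quick brown fox"
segment_b = "jumps over the lazy dog"

# Single segment: <s> A </s>
single_input = string(bos, " ", segment_a, " ", eos)

# Segment pair: the two </s> in the middle are all that marks the boundary.
pair_input = string(bos, " ", segment_a, " ", eos, eos, " ", segment_b, " ", eos)
# "<s> The quick brown fox </s></s> jumps over the lazy dog </s>"
```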