I'd just like you to know that code under permissive licenses with attribution requirements is possibly unsuitable for inclusion in a training set. I'm bringing this to your attention not as a lawyer, but as a maintainer. Ask your own counsel. However, attribution requirements usually mean that derivatives must retain attribution to the original author. LLMs are apparently well known to occasionally output exact copies of code they were trained on, but without satisfying the attribution requirements, which suggests this practice could be illegal.
I therefore request that you, at the very least, process opt-out requests retroactively for pre-existing data sets to fix this. However, just to stress this again, I'm not a lawyer and this isn't legal advice. But at least from the outside, this looks troubling.
For example, it appears you included repositories of mine that have attribution requirements:

I don't understand how StarCoder would possibly satisfy them.