Hi there,
In your Text Classification chapter, in the "Pretraining a word embedding" section, you use a sigmoid as the last activation before... sparse_categorical_crossentropy? The results look to be the same as with softmax (some all-too-forgiving normalization pipeline behind the scenes, I'm assuming), but this looks like a typo, no? All the other models in the chapter use the sigmoid because of the binary sentiment task. This part of the book also has no comment about it, and Chapter 6 has a neat reminder about matching last-layer activations to loss functions, so it's not ideal from a pedagogical point of view.
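For what it's worth, here's a small NumPy sketch of what I *think* is happening (this is my assumption, not the book's code): if the crossentropy renormalizes the predictions to sum to 1 before taking the log, then a sigmoid head still produces a valid distribution and training doesn't blow up, even though the losses differ from the softmax ones in general.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_cce(probs, labels, eps=1e-7):
    # Assumed behavior: rescale each row to sum to 1, then -log p[label].
    probs = probs / probs.sum(axis=-1, keepdims=True)
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.log(probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, -1.0, 0.5],
                   [0.1, 3.0, -2.0]])
labels = np.array([0, 1])

loss_sig = sparse_cce(sigmoid(logits), labels)
loss_soft = sparse_cce(softmax(logits), labels)

# Both are finite, positive losses -- so the sigmoid head "works" --
# but they are not the same numbers, which is why the choice matters.
print(loss_sig)
print(loss_soft)
```

So the model would still train either way, which would explain why the typo (if it is one) went unnoticed.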
I hope I haven't overlooked something major — let me know your thoughts!