Implementation of Continuous Bag-of-Words (CBOW) in PyTorch.
Features:
- Train a CBOW from scratch
- Log training to TensorBoard
- Visualize embeddings with t-SNE/PCA/UMAP using TensorBoard
- Implements a `most_similar` function with the same behavior and results as the `most_similar` function implemented by the Gensim library
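As a sketch of what training a CBOW from scratch involves: the model can be as small as an embedding layer averaged over the context words, followed by a linear projection to vocabulary-sized logits. The following is a minimal PyTorch sketch with illustrative sizes, not the exact model in this repo:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW: average the context-word embeddings,
    then project to vocabulary-sized logits for the center word."""
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2 * window) of word indices
        mean = self.embeddings(context).mean(dim=1)  # (batch, embed_dim)
        return self.proj(mean)                       # (batch, vocab_size)

# Toy sizes for illustration only.
model = CBOW(vocab_size=100, embed_dim=16)
context = torch.randint(0, 100, (4, 6))  # batch of 4, window size 3
logits = model(context)                  # shape: (4, 100)
```

A model like this is typically trained with `nn.CrossEntropyLoss` against the center-word index.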
Note:
This project was developed using Windows 11 with Python 3.10.0.
Clone this repo, create a new environment (recommended) and install the dependencies:
```
pip install -r requirements.txt
```

Download the dataset WikiText-2 or WikiText-103 here and move it into the `dataset` folder.
Edit the config.toml accordingly, then:
```
python main.py
```

To use TensorBoard (setting scalars to show all datapoints):
```
tensorboard --logdir .\experiment\wikitext-2\ --samples_per_plugin scalars=300000
```

The `compute_analogies.py` script computes the analogies and summarizes them using `word-test.v1.txt`, the original test set file from the word2vec paper.
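Each entry in that test set is solved by vector arithmetic followed by a nearest-neighbor lookup (e.g. vec("king") - vec("man") + vec("woman") should land near vec("queen")). A minimal sketch of that scoring step, using toy vectors rather than the actual evaluation script:

```python
import numpy as np

def solve_analogy(a, b, c, vectors, vocab):
    """Return the word closest to vec(b) - vec(a) + vec(c),
    excluding the three query words (as in the word2vec evaluation)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    ia, ib, ic = vocab.index(a), vocab.index(b), vocab.index(c)
    target = unit[ib] - unit[ia] + unit[ic]
    sims = unit @ (target / np.linalg.norm(target))
    for i in np.argsort(-sims):  # best match first
        if i not in (ia, ib, ic):
            return vocab[i]

# Toy vectors constructed so the analogy holds.
vocab = ["man", "king", "woman", "queen"]
vectors = np.array([[1.0, 0.0],   # man
                    [1.0, 1.0],   # king
                    [0.0, 0.2],   # woman
                    [0.0, 1.2]])  # queen
answer = solve_analogy("man", "king", "woman", vectors, vocab)  # "queen"
```

The Top1/Top5 columns in the results table below count how often the expected word appears among the 1 or 5 nearest neighbors of the target vector.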
To run the original trained word2vec (it will download the model):
```
python compute_analogies.py word2vec-google-news-300
```

The results from the above script can be seen in the table below.
To run with a trained word2vec, use the path from a txt file containing the word vectors:
```
python compute_analogies.py <path-to-txt-word-vectors>
```

`most_similar` is a function from Gensim which retrieves the top-N most similar embeddings. The goal of the `most_similar_implementation_check.py` script is to assert that both `most_similar` implementations return equal results.
To run the original trained word2vec (it will download the model):
```
python most_similar_implementation_check.py word2vec-google-news-300
```

Or use the path from a txt file containing the word vectors:
```
python most_similar_implementation_check.py <path-to-txt-word-vectors>
```

| Analogy Class | OOV | Not OOV | Top1 | Top5 | Total |
|---|---|---|---|---|---|
| capital-common-countries | 0 (0.00%) | 506 (100.00%) | 421 (83.20%) | 482 (95.26%) | 506 |
| capital-world | 0 (0.00%) | 4524 (100.00%) | 3580 (79.13%) | 4124 (91.16%) | 4524 |
| currency | 0 (0.00%) | 866 (100.00%) | 304 (35.10%) | 431 (49.77%) | 866 |
| city-in-state | 0 (0.00%) | 2467 (100.00%) | 1749 (70.90%) | 2127 (86.22%) | 2467 |
| family | 0 (0.00%) | 506 (100.00%) | 428 (84.58%) | 482 (95.26%) | 506 |
| gram1-adjective-to-adverb | 0 (0.00%) | 992 (100.00%) | 283 (28.53%) | 509 (51.31%) | 992 |
| gram2-opposite | 0 (0.00%) | 812 (100.00%) | 347 (42.73%) | 457 (56.28%) | 812 |
| gram3-comparative | 0 (0.00%) | 1332 (100.00%) | 1210 (90.84%) | 1295 (97.22%) | 1332 |
| gram4-superlative | 0 (0.00%) | 1122 (100.00%) | 980 (87.34%) | 1102 (98.22%) | 1122 |
| gram5-present-participle | 0 (0.00%) | 1056 (100.00%) | 825 (78.12%) | 1004 (95.08%) | 1056 |
| gram6-nationality-adjective | 0 (0.00%) | 1599 (100.00%) | 1438 (89.93%) | 1527 (95.50%) | 1599 |
| gram7-past-tense | 0 (0.00%) | 1560 (100.00%) | 1029 (65.96%) | 1459 (93.53%) | 1560 |
| gram8-plural | 0 (0.00%) | 1332 (100.00%) | 1197 (89.86%) | 1275 (95.72%) | 1332 |
| gram9-plural-verbs | 0 (0.00%) | 870 (100.00%) | 591 (67.93%) | 785 (90.23%) | 870 |
| Total | 0 (0.00%) | 19544 (100.00%) | 14382 (73.59%) | 17059 (87.29%) | 19544 |