Implementation of Continuous Bag-of-Words (CBOW) in PyTorch.
Features:
- Train a CBOW from scratch
- Log training to TensorBoard
- Visualize embeddings with t-SNE/PCA/UMAP using TensorBoard
- Implements a `most_similar` function with the same behavior and results as the `most_similar` function implemented by the Gensim library
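As a sketch of what training a CBOW from scratch involves: the model can be as small as an embedding layer averaged over the context words, followed by a linear projection to vocabulary-sized logits. The following is a minimal PyTorch sketch with illustrative sizes, not the exact model in this repo:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW: average the context-word embeddings,
    then project to vocabulary-sized logits for the center word."""
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, vocab_size)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, 2 * window) of word indices
        mean = self.embeddings(context).mean(dim=1)  # (batch, embed_dim)
        return self.proj(mean)                       # (batch, vocab_size)

# Toy sizes for illustration only.
model = CBOW(vocab_size=100, embed_dim=16)
context = torch.randint(0, 100, (4, 6))  # batch of 4, window size 3
logits = model(context)                  # shape: (4, 100)
```

A model like this is typically trained with `nn.CrossEntropyLoss` against the center-word index.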
Note:
This project was developed using Windows 11 with Python 3.10.0.
Clone this repo, create a new environment (recommended) and install the dependencies:
```
pip install -r requirements.txt
```

Download the dataset WikiText-2 or WikiText-103 here and move it into the `dataset` folder.
Edit the config.toml accordingly, then:
```
python main.py
```

To use TensorBoard (setting scalars to show all datapoints):
```
tensorboard --logdir .\experiment\wikitext-2\ --samples_per_plugin scalars=300000
```

The `compute_analogies.py` script computes the analogies and summarizes them using `word-test.v1.txt`, the original test set file from the word2vec paper.
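Each entry in that test set is solved by vector arithmetic followed by a nearest-neighbor lookup (e.g. vec("king") - vec("man") + vec("woman") should land near vec("queen")). A minimal sketch of that scoring step, using toy vectors rather than the actual evaluation script:

```python
import numpy as np

def solve_analogy(a, b, c, vectors, vocab):
    """Return the word closest to vec(b) - vec(a) + vec(c),
    excluding the three query words (as in the word2vec evaluation)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    ia, ib, ic = vocab.index(a), vocab.index(b), vocab.index(c)
    target = unit[ib] - unit[ia] + unit[ic]
    sims = unit @ (target / np.linalg.norm(target))
    for i in np.argsort(-sims):  # best match first
        if i not in (ia, ib, ic):
            return vocab[i]

# Toy vectors constructed so the analogy holds.
vocab = ["man", "king", "woman", "queen"]
vectors = np.array([[1.0, 0.0],   # man
                    [1.0, 1.0],   # king
                    [0.0, 0.2],   # woman
                    [0.0, 1.2]])  # queen
answer = solve_analogy("man", "king", "woman", vectors, vocab)  # "queen"
```

The Top1/Top5 columns in the results table below count how often the expected word appears among the 1 or 5 nearest neighbors of the target vector.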
To run the original trained word2vec (it will download the model):
```
python compute_analogies.py word2vec-google-news-300
```

The results from the above script can be seen in the table below.
To run with a trained word2vec, use the path from a txt file containing the word vectors:
```
python compute_analogies.py <path-to-txt-word-vectors>
```

`most_similar` is a function from Gensim which retrieves the top-N most similar embeddings. The goal of the `most_similar_implementation_check.py` script is to assert that both `most_similar` implementations return equal results.
To run the original trained word2vec (it will download the model):
```
python most_similar_implementation_check.py word2vec-google-news-300
```

Or use the path from a txt file containing the word vectors:
```
python most_similar_implementation_check.py <path-to-txt-word-vectors>
```

| Analogy Class | OOV | Not OOV | Top1 | Top5 | Total |
|---|---|---|---|---|---|
| capital-common-countries | 0 (0.00%) | 506 (100.00%) | 421 (83.20%) | 482 (95.26%) | 506 |
| capital-world | 0 (0.00%) | 4524 (100.00%) | 3580 (79.13%) | 4124 (91.16%) | 4524 |
| currency | 0 (0.00%) | 866 (100.00%) | 304 (35.10%) | 431 (49.77%) | 866 |
| city-in-state | 0 (0.00%) | 2467 (100.00%) | 1749 (70.90%) | 2127 (86.22%) | 2467 |
| family | 0 (0.00%) | 506 (100.00%) | 428 (84.58%) | 482 (95.26%) | 506 |
| gram1-adjective-to-adverb | 0 (0.00%) | 992 (100.00%) | 283 (28.53%) | 509 (51.31%) | 992 |
| gram2-opposite | 0 (0.00%) | 812 (100.00%) | 347 (42.73%) | 457 (56.28%) | 812 |
| gram3-comparative | 0 (0.00%) | 1332 (100.00%) | 1210 (90.84%) | 1295 (97.22%) | 1332 |
| gram4-superlative | 0 (0.00%) | 1122 (100.00%) | 980 (87.34%) | 1102 (98.22%) | 1122 |
| gram5-present-participle | 0 (0.00%) | 1056 (100.00%) | 825 (78.12%) | 1004 (95.08%) | 1056 |
| gram6-nationality-adjective | 0 (0.00%) | 1599 (100.00%) | 1438 (89.93%) | 1527 (95.50%) | 1599 |
| gram7-past-tense | 0 (0.00%) | 1560 (100.00%) | 1029 (65.96%) | 1459 (93.53%) | 1560 |
| gram8-plural | 0 (0.00%) | 1332 (100.00%) | 1197 (89.86%) | 1275 (95.72%) | 1332 |
| gram9-plural-verbs | 0 (0.00%) | 870 (100.00%) | 591 (67.93%) | 785 (90.23%) | 870 |
| Total | 0 (0.00%) | 19544 (100.00%) | 14382 (73.59%) | 17059 (87.29%) | 19544 |