diff --git a/.dockerignore b/.dockerignore
index d11b915f9..3585458ef 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -1,3 +1,3 @@
 *
-!datasets/
+!dataset/
 !docker/
\ No newline at end of file
diff --git a/dataset/README.md b/dataset/README.md
index d08f4cf67..664a689d3 100644
--- a/dataset/README.md
+++ b/dataset/README.md
@@ -31,7 +31,7 @@ python3 dataset/dataset_setup.py \
 --
 ```
 
-The complete benchmark uses 6 different datasets:
+The complete benchmark uses 7 different datasets:
 
 - [OGBG](#ogbg)
 - [WMT](#wmt)
@@ -39,6 +39,7 @@
 - [Imagenet](#imagenet)
 - [Criteo 1TB](#criteo1tb)
 - [Librispeech](#librispeech)
+- [Fineweb-EDU 10B](#fineweb-edu-10b)
 
 Some dataset setups will require you to sign a third-party agreement with the
 dataset owners in order to get the download URLs.
@@ -456,11 +457,50 @@ python3 librispeech_preprocess.py --data_dir=$DATA_DIR/librispeech --tokenizer_v
 ```
 
 ### Fineweb-EDU 10B
-From `algorithmic-efficiency` run:
+
+The preprocessing script will download and tokenize a 10 billion token sample of FineWeb-Edu from Hugging Face. The raw dataset is saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/validation split in `data_dir/fineweb_edu_10B`.
 
 ```bash
 python3 dataset/dataset_setup.py \
 --data_dir $DATA_DIR \
 --temp_dir $DATA_DIR/tmp \
 --fineweb_edu
-```
\ No newline at end of file
+```
+
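+Under the hood, the script performs roughly the following Hugging Face `datasets` calls (a simplified sketch of `download_finewebedu` in `dataset/dataset_setup.py`; the tokenization and splitting steps are omitted):
+
+```python
+import datasets as hf_datasets
+
+# Download the 10BT sample of FineWeb-Edu, caching files in the temp directory.
+ds = hf_datasets.load_dataset(
+    'HuggingFaceFW/fineweb-edu',
+    name='sample-10BT',
+    split='train',
+    cache_dir='/path/to/tmp',
+)
+
+# Persist the raw dataset so tokenization can be re-run without re-downloading.
+ds.save_to_disk('/path/to/tmp/fwedu_10B_raw')
+```
+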
+The final directory structure should look like this:
+
+```bash
+$DATA_DIR
+├── fineweb_edu_10B
+│   ├── fwedu_10B_tokenized
+│   │   ├── data-00000-of-00080.arrow
+│   │   ├── data-00001-of-00080.arrow
+│   │   ├── data-00002-of-00080.arrow
+│   │   ├── [...]
+│   │   ├── data-00078-of-00080.arrow
+│   │   ├── data-00079-of-00080.arrow
+│   │   ├── dataset_info.json
+│   │   └── state.json
+│   ├── train
+│   │   ├── 11814516993635243069
+│   │   │   ├── 00000000.shard
+│   │   │   └── 00000000.snapshot
+│   │   ├── 1309159339089188891
+│   │   ├── 13196585434617636667
+│   │   ├── 13328239765396585889
+│   │   ├── 13443989554399185472
+│   │   ├── 17062004665044410656
+│   │   ├── 832373293846386485
+│   │   ├── 9244072261762587327
+│   │   ├── dataset_spec.pb
+│   │   └── snapshot.metadata
+│   └── val
+│       ├── 8122001362029945413
+│       │   ├── 00000000.shard
+│       │   └── 00000000.snapshot
+│       ├── dataset_spec.pb
+│       └── snapshot.metadata
+```
+
+In total, it should contain 88 files (via `find -type f | wc -l`) for a total of 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).
+
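+To spot-check the processed data, you can load the tokenized dataset back from disk (a minimal sketch; the exact column names depend on the tokenization step, so `input_ids` below is an assumption):
+
+```python
+import os
+
+from datasets import load_from_disk
+
+# Load the tokenized dataset written by dataset_setup.py.
+data_dir = os.environ['DATA_DIR']
+ds = load_from_disk(os.path.join(data_dir, 'fineweb_edu_10B', 'fwedu_10B_tokenized'))
+
+print(ds)                       # row count and column names
+print(ds[0]['input_ids'][:16])  # first token ids of the first document (column name assumed)
+```
+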
diff --git a/dataset/dataset_setup.py b/dataset/dataset_setup.py
index de5e9d271..30bd9dec6 100644
--- a/dataset/dataset_setup.py
+++ b/dataset/dataset_setup.py
@@ -782,29 +782,21 @@ def download_finewebedu(
 ):
   """Download FineWebEdu-10B."""
 
-  if not skip_download:
-    data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
-    tmp_dir = tmp_dir if tmp_dir is not None else '/tmp'
-    cache_dir = (
-      os.path.join(tmp_dir, 'lm')
-      if tmp_dir is not None
-      else os.path.expanduser('~/.cache/huggingface/datasets')
-    )
-
-    _maybe_mkdir(data_dir)
-    _maybe_mkdir(tmp_dir)
-    _maybe_mkdir(cache_dir)
+  data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
+  _maybe_mkdir(data_dir)
+  _maybe_mkdir(tmp_dir)
 
+  if not skip_download:
     os.environ['TMPDIR'] = tmp_dir
     ds = hf_datasets.load_dataset(
       'HuggingFaceFW/fineweb-edu',
       name='sample-10BT',
       split='train',
-      cache_dir=cache_dir,
+      cache_dir=tmp_dir,
     )
     ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
-  else:
+  elif not skip_tokenization:
     ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
 
   if not skip_tokenization:
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
index ae11154a9..c82556137 100644
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -297,7 +297,7 @@ algorithm in `algorithms/target_setting_algorithms/`.
 We also have regression tests available in
 [.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
 that can be run semi-automatically. The regression tests are shorter end-to-end
-submissions run in a containerized environment across all 8 workloads, in both
+submissions run in a containerized environment across all 9 workloads, in both
 the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
 and are triggered for pull requests that target the main branch. Typically these
 PRs will be from the `dev` branch so the tests will run containers based on
diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
index e8ea1734e..afb10e798 100644
--- a/docs/GETTING_STARTED.md
+++ b/docs/GETTING_STARTED.md
@@ -219,11 +219,11 @@ Users that wish to customize their images are invited to check and modify the
 
 ## Download the Data
 
-The workloads in this benchmark use 6 different datasets across 8 workloads. You
+The workloads in this benchmark use 7 different datasets across 9 workloads. You
 may choose to download some or all of the datasets as you are developing your
-submission, but your submission will be scored across all 8 workloads. For
+submission, but your submission will be scored across all 9 workloads. For
 instructions on obtaining and setting up the datasets see
-[datasets/README](/datasets/README.md#dataset-setup).
+[dataset/README](/dataset/README.md#dataset-setup).
 
 ## Develop your Submission