2 changes: 1 addition & 1 deletion .dockerignore
@@ -1,3 +1,3 @@
*
!datasets/
!dataset/
!docker/
46 changes: 43 additions & 3 deletions dataset/README.md
@@ -31,14 +31,15 @@ python3 dataset/dataset_setup.py \
--<optional_flags>
```

The complete benchmark uses 6 different datasets:
The complete benchmark uses 7 different datasets:

- [OGBG](#ogbg)
- [WMT](#wmt)
- [FastMRI](#fastmri)
- [Imagenet](#imagenet)
- [Criteo 1TB](#criteo1tb)
- [Librispeech](#librispeech)
- [Fineweb-edu 10B](#fineweb-edu-10b)

Some dataset setups will require you to sign a third-party agreement with the dataset owners in order to get the download URLs.

@@ -456,11 +457,50 @@ python3 librispeech_preprocess.py --data_dir=$DATA_DIR/librispeech --tokenizer_v
```

### Fineweb-EDU 10B

The preprocessing script will download and tokenize a 10 billion token sample of FineWeb-Edu from Hugging Face. The raw dataset will be saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/validation split in `data_dir/fineweb_edu_10B`.

From `algorithmic-efficiency` run:

```bash
python3 dataset/dataset_setup.py \
--data_dir $DATA_DIR \
--temp_dir $DATA_DIR/tmp \
--fineweb_edu
```

<details>
<summary>The final directory structure should look like this:</summary>

```bash
$DATA_DIR
├──fineweb_edu_10B
│ ├── fwedu_10B_tokenized
│ │ ├── data-00000-of-00080.arrow
│ │ ├── data-00001-of-00080.arrow
│ │ ├── data-00002-of-00080.arrow
│ │ ├── [...]
│ │ ├── data-00078-of-00080.arrow
│ │ ├── data-00079-of-00080.arrow
│ │ ├── dataset_info.json
│ │ └── state.json
│ ├── train
│ │ ├── 11814516993635243069
│ │ │ └── 00000000.shard
│ │ │ └── 00000000.snapshot
│ │ ├── 1309159339089188891
│ │ ├── 13196585434617636667
│ │ ├── 13328239765396585889
│ │ ├── 13443989554399185472
│ │ ├── 17062004665044410656
│ │ ├── 832373293846386485
│ │ ├── 9244072261762587327
│ │ ├── dataset_spec.pb
│ │ └── snapshot.metadata
│ └── val
│ ├── 8122001362029945413
│ │ └── 00000000.shard
│ │ └── 00000000.snapshot
│ ├── dataset_spec.pb
│ └── snapshot.metadata
```
In total, it should contain 88 files (via `find -type f | wc -l`) and take up 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).
</details>
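
If you want to sanity-check the result before training, a minimal sketch along these lines should work. This is an illustrative snippet, not part of the setup script; it assumes the tokenized split was written with Hugging Face `datasets.save_to_disk`, which matches the `dataset_info.json`/`state.json` layout shown above.

```python
# Illustrative sanity check of the tokenized FineWeb-Edu sample.
# Adjust the path below if your layout differs from the tree above.
import os

from datasets import load_from_disk

data_dir = os.environ['DATA_DIR']
tokenized = load_from_disk(
    os.path.join(data_dir, 'fineweb_edu_10B', 'fwedu_10B_tokenized'))

print(tokenized)            # number of rows and column names
print(tokenized[0].keys())  # fields of the first tokenized example
```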
20 changes: 6 additions & 14 deletions dataset/dataset_setup.py
@@ -782,29 +782,21 @@ def download_finewebedu(
):
"""Download FineWebEdu-10B."""

if not skip_download:
data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
tmp_dir = tmp_dir if tmp_dir is not None else '/tmp'
cache_dir = (
os.path.join(tmp_dir, 'lm')
if tmp_dir is not None
else os.path.expanduser('~/.cache/huggingface/datasets')
)

_maybe_mkdir(data_dir)
_maybe_mkdir(tmp_dir)
_maybe_mkdir(cache_dir)
data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
_maybe_mkdir(data_dir)
_maybe_mkdir(tmp_dir)

if not skip_download:
os.environ['TMPDIR'] = tmp_dir

ds = hf_datasets.load_dataset(
'HuggingFaceFW/fineweb-edu',
name='sample-10BT',
split='train',
cache_dir=cache_dir,
cache_dir=tmp_dir,
)
ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
else:
elif not skip_tokenization:
ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))

if not skip_tokenization:
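For readers skimming the hunk above, a self-contained sketch of the download-and-staging flow after this change might look roughly as follows. Function and flag names mirror the diff, but this is a sketch, not the repository's implementation: the repo's `_maybe_mkdir` helper is replaced by `os.makedirs` so the snippet runs on its own, and the tokenization step that follows is omitted.

```python
# Rough sketch of the FineWeb-Edu download flow: pull the 10BT sample from
# the Hugging Face Hub using tmp_dir as the cache, stage the raw dataset on
# disk, or reload a previously downloaded copy.
import os

import datasets as hf_datasets


def download_finewebedu_sketch(data_dir,
                               tmp_dir='/tmp',
                               skip_download=False,
                               skip_tokenization=False):
  data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
  os.makedirs(data_dir, exist_ok=True)
  os.makedirs(tmp_dir, exist_ok=True)

  if not skip_download:
    # Keep Hugging Face temp files on the (large) temp volume.
    os.environ['TMPDIR'] = tmp_dir
    ds = hf_datasets.load_dataset(
        'HuggingFaceFW/fineweb-edu',
        name='sample-10BT',
        split='train',
        cache_dir=tmp_dir,
    )
    ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
  elif not skip_tokenization:
    # Re-use the previously downloaded copy instead of hitting the Hub again.
    ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
  # ...tokenization and train/validation splitting would follow here.
```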
2 changes: 1 addition & 1 deletion docs/CONTRIBUTING.md
@@ -297,7 +297,7 @@ algorithm in `algorithms/target_setting_algorithms/`.
We also have regression tests available in
[.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
that can be run semi-automatically. The regression tests are shorter end-to-end
submissions run in a containerized environment across all 8 workloads, in both
submissions run in a containerized environment across all 9 workloads, in both
the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
and are triggered for pull requests that target the main branch. Typically these
PRs will be from the `dev` branch so the tests will run containers based on
6 changes: 3 additions & 3 deletions docs/GETTING_STARTED.md
@@ -219,11 +219,11 @@ Users that wish to customize their images are invited to check and modify the

## Download the Data

The workloads in this benchmark use 6 different datasets across 8 workloads. You
The workloads in this benchmark use 6 different datasets across 9 workloads. You
may choose to download some or all of the datasets as you are developing your
submission, but your submission will be scored across all 8 workloads. For
submission, but your submission will be scored across all 9 workloads. For
instructions on obtaining and setting up the datasets see
[datasets/README](/datasets/README.md#dataset-setup).
[dataset/README](/dataset/README.md#dataset-setup).

## Develop your Submission
