Currently MaxText has three data input pipelines:
| Pipeline | Dataset formats | Features | Limitations |
|---|---|---|---|
| **[Grain](data_input_pipeline/data_input_grain.md)** (recommended) | [ArrayRecord](https://github.com/google/array_record) (random access, available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview) or via [conversion](https://github.com/google/array_record/tree/main/beam))<br>[Parquet](https://arrow.apache.org/docs/python/parquet.html) (sequential access) | With ArrayRecord: fully deterministic, resilient to preemption, global shuffle.<br>With Parquet: performant, fully deterministic, resilient to preemption, hierarchical shuffle. | |
| **[Hugging Face](data_input_pipeline/data_input_hf.md)** | Datasets on the [Hugging Face Hub](https://huggingface.co/datasets)<br>Local/Cloud Storage datasets in json, parquet, arrow, csv, or txt (sequential access) | No download needed; supports multiple formats. | Limited scalability when using the Hugging Face Hub (no limit when using Cloud Storage);<br>non-deterministic after preemption (deterministic without preemption). |
| **[TFDS](data_input_pipeline/data_input_tfds.md)** | TFRecord (sequential access), available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview) | Performant. | Only supports TFRecord;<br>non-deterministic after preemption (deterministic without preemption). |

For setting and verifying performance goals that maximize accelerator utilization, see the [Optimizing Performance](data_input_pipeline/data_pipeline_perf.md) guide.
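The Hugging Face pipeline's convenience and its Hub scalability limit both stem from streaming. As a minimal illustrative sketch (not MaxText's actual implementation), the following streams a Hub dataset and shards the stream per host using the `datasets` library; the dataset name and the use of JAX process indices here are illustrative assumptions.

```python
# Minimal sketch of Hugging Face streaming with per-host sharding.
# The dataset name is a placeholder, not a MaxText default.
import jax
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Streaming avoids downloading the full dataset, but every host still
# reads through the Hub, which is where the scalability limit comes from.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Give each host a disjoint shard of the stream.
ds = split_dataset_by_node(ds, rank=jax.process_index(), world_size=jax.process_count())

for example in ds.take(2):
    print(example["text"][:80])
```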
## Multihost dataloading best practice
Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address key issues such as concurrent access with data uniqueness (each host must read a distinct portion of the data) and uneven completion (hosts may exhaust their data at different times).

The approaches to solve these challenges depend on whether your dataset supports random access.

### Random access dataset
Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.<br>
In MaxText, this is best supported by the ArrayRecord format using the Grain input pipeline. This approach gracefully handles the key challenges:
* **Concurrent access and uniqueness**: Grain assigns a unique set of indices to each host. ArrayRecord allows different hosts to read from different indices in the same file (a runnable sketch of this appears after the note below).
* **Uneven completion**: Data indices are distributed evenly among hosts. Without packing, the data imbalance between hosts will be at most one batch. To handle the final steps where some hosts run out of data, you can enable the `generate_padding_batch_train`/`generate_padding_batch_eval` flag in `src/MaxText/config/base.yml` or through command line arguments. This directs hosts to generate empty "padding" batches until the training or evaluation steps are met.
```{note}
When sequence packing is enabled, the difference in the number of packed examples per host can be larger. The `generate_padding_batch_train`/`generate_padding_batch_eval` flags still solve this.

However, as more hosts begin generating padding, you will observe a decrease in `total_weights` and a slower change in the training loss.

If all hosts exhaust their data before the target step count is reached, both `total_weights` and loss will drop to 0.
```
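To make the index assignment concrete, here is a minimal sketch of per-host sharding with Grain's Python API; the file path, seed, and worker count are illustrative placeholders rather than MaxText's actual configuration.

```python
# Minimal sketch of per-host index sharding with Grain + ArrayRecord.
# The path, seed, and batch handling are placeholders, not MaxText defaults.
import jax
import grain.python as grain

# Random-access source: any record can be read directly by index.
source = grain.ArrayRecordDataSource(["data/shard-00000.array_record"])  # hypothetical path

# The sampler hands each host a disjoint, evenly split set of indices,
# so concurrent reads of the same files never overlap.
sampler = grain.IndexSampler(
    num_records=len(source),
    num_epochs=1,
    shard_options=grain.ShardOptions(
        shard_index=jax.process_index(),  # this host's shard
        shard_count=jax.process_count(),  # total data-loading hosts
        drop_remainder=True,
    ),
    shuffle=True,  # global shuffle over record indices
    seed=42,
)

loader = grain.DataLoader(data_source=source, sampler=sampler, worker_count=0)
for record in loader:
    ...  # feed records to the training step
```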
### Sequential access dataset
* **Concurrent access and uniqueness**: Sequential-access datasets (e.g., Parquet, JSON, TFRecord) cannot be accessed by index, so they require a different strategy: file-based sharding, where each host is given exclusive access to a specific subset of data files. **Key requirement**: `(Number of data files) % (Number of data-loading hosts) == 0`. If the file count isn't a multiple of the host count, the files will be distributed unevenly. For example, with 10 files and 8 hosts, some hosts will get two files while others get one, significantly worsening the "uneven completion" problem. If you have fewer files than hosts, performance will be severely degraded because all hosts must concurrently read all the files. A sketch of this sharding arithmetic follows this list.
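To illustrate the divisibility requirement, here is a small, self-contained Python sketch (a hypothetical helper, not MaxText code) that assigns files to hosts round-robin and shows the imbalance when the counts don't divide evenly.

```python
# Hypothetical illustration of file-based sharding across hosts.
def shard_files(files: list[str], host_index: int, host_count: int) -> list[str]:
    """Give each host an exclusive, round-robin subset of the data files."""
    if len(files) % host_count != 0:
        # Uneven split: some hosts get more files and finish later,
        # worsening the "uneven completion" problem.
        print(f"warning: {len(files)} files not divisible by {host_count} hosts")
    return files[host_index::host_count]

files = [f"data/part-{i:05d}.parquet" for i in range(10)]  # placeholder names
for host in range(8):
    print(host, shard_files(files, host, 8))  # hosts 0-1 get 2 files, the rest get 1
```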