Currently MaxText has three data input pipelines:

|
::::{grid} 1 2 2 2
:gutter: 2

:::{grid-item-card} 🌾 Grain (Recommended)
:link: data_input_pipeline/data_input_grain
:link-type: doc

**Features**: With ArrayRecord: fully deterministic, resilient to preemption, and global shuffle. With Parquet: performant, fully deterministic, resilient to preemption, and hierarchical shuffle.

**Formats**: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (sequential access).
:::

:::{grid-item-card} 🤗 Hugging Face
:link: data_input_pipeline/data_input_hf
:link-type: doc

**Features**: No download needed; supports multiple formats.

**Formats**: datasets on the [Hugging Face Hub](https://huggingface.co/datasets); local/Cloud Storage datasets in json, parquet, arrow, csv, and txt (sequential access).

**Limitations**: limited scalability when using the Hugging Face Hub (no limit with Cloud Storage); non-deterministic after preemption (deterministic without preemption).
:::

:::{grid-item-card} 💾 TFDS
:link: data_input_pipeline/data_input_tfds
:link-type: doc

**Features**: performant.

**Formats**: TFRecord (sequential access), available through [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview).

**Limitations**: only supports TFRecord; non-deterministic after preemption (deterministic without preemption).
:::

:::{grid-item-card} ⚡ Optimizing Performance
:link: data_input_pipeline/data_pipeline_perf
:link-type: doc

A guide to setting and verifying performance goals that maximize accelerator utilization.
:::

::::
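The pipeline is selected through the training config. As a rough sketch, assuming the `dataset_type` and `grain_*` keys from MaxText's `configs/base.yml` (key names and accepted values may differ across MaxText versions, and the bucket path is a placeholder), a Grain/ArrayRecord run might be launched like this:

```shell
# Hypothetical invocation: the data path is a placeholder, and the
# grain_* key names should be checked against your checkout's base.yml.
python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=grain-demo \
  dataset_type=grain \
  grain_file_type=arrayrecord \
  grain_train_files='gs://my-bucket/data/*.array_record' \
  grain_worker_count=2
```

Switching to another pipeline is a matter of changing `dataset_type` (and the matching per-pipeline keys) rather than changing any training code.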
|
## Multihost dataloading best practice

Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:
|