
Commit fb9c5fd

Merge branch 'main' into resolve_local_world_size
2 parents 23cbe8c + bf0e6d3 commit fb9c5fd

199 files changed

Lines changed: 17259 additions & 1446 deletions

Note: large commits have some content hidden by default; only a subset of the 199 changed files is shown below.

.github/workflows/build_and_deploy_documentation.yml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ jobs:
       run: |
         sudo apt-get update
         sudo apt-get install git -y
-        python -m pip install torch==2.6.0
+        python -m pip install torch==2.7.1
         python -m pip install --upgrade pip setuptools wheel
         export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
         python -m pip install -e .

.github/workflows/tests_full.yml

Lines changed: 2 additions & 2 deletions
@@ -23,11 +23,11 @@ jobs:
         sudo apt-get update
         sudo apt-get install curl -y # required by coveralls
         sudo apt-get install git -y
-        python -m pip install torch==2.6.0
+        python -m pip install torch==2.7.1
         python -m pip install --upgrade pip setuptools wheel
         export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
         python -m pip install ninja # Lowers compilation time of flash attention significantly
-        python -m pip install flash-attn==2.7.4.post1 --no-build-isolation
+        python -m pip install flash-attn==2.8.0.post2 --no-build-isolation
         python -m pip install -e .[tests]
     - name: Run tests
       run: |

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -164,8 +164,15 @@ tests/tmp/*
 *wandb_storage*
 .coverage/*
 *.pbin
+tutorials/scaling_up2/experiments
 tutorials/scaling_up/experiments
 tutorials/profiling/experiments
 tutorials/instruction_tuning/prepared_data
 config_files/instruction_tuning
 data/lorem_ipsum_instruct.jsonl
+tutorials/scaling_up/logs*
+tutorials/scaling_up/experiments_old/*
+results/*
+tutorials/einsum_transformer/experiments/*
+tutorials/warmstart/experiments/*
+

CHANGELOG_DEV.md

Lines changed: 32 additions & 0 deletions
@@ -186,3 +186,35 @@ There are now three AC variants:
 * adds support for Tensor Parallelism (including Sequence Parallelism).
 * adds a debugging toolkit to track the input and output tensors during a forward pass, gradients during the backward pass, and weight tensors.
   Tensors can be either normal Tensors or DTensors.
+
+## PR #389 Benchmark Tooling
+* adds benchmark tooling to modalities, enabling scaling benchmarks across a varying number of nodes and the Cartesian product of configurable hyperparameters.
+
+**Breaking Changes**
+* Renaming: EvaluationResultToDiscSubscriberConfig.output_path -> EvaluationResultToDiscSubscriberConfig.output_file_path
+
+## PR #410 MFU now incorporates dp_degree instead of world_size
+
+This PR fixes the MFU and throughput calculations by taking the data-parallel (DP) degree into account instead of the world size. When parallelization strategies are applied on top of FSDP, the world size differs from the DP degree, and the throughput and MFU metric calculations must reflect this, as done by this PR.
+
+**Breaking Changes**
+* Existing configs need to be adapted to correctly use the DP degree rather than the world size.
+
+## PR #425 Monitoring improvements
+This PR improves training monitoring and logging across runs, along with some other changes made while testing scalability.
+
+**General Changes**
+* Configurable multi-layer FSDP units
+* Option to provide an experiment root path to modalities
+* Added a steppable profiler (e.g., for tracing forward/backward passes)
+* Fix: hybrid sharding is now correctly configurable
+* Completely refactored the profiling
+* Improved error handling: errors are now captured and stored as JSON
+* Added tutorials on the Einsum Transformer (example model integration) and profiling
+
+**Breaking Changes**
+* experiments_root_path is now exposed at the API level
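
To make the PR #410 change concrete, here is a minimal sketch of the corrected arithmetic, assuming conventional throughput and MFU definitions; all names (e.g., `flops_per_token`, `peak_flops_per_gpu`) are illustrative, not the Modalities API:

```python
# Sketch of the PR #410 fix: throughput scales with the data-parallel degree,
# not the world size. Names are assumptions for illustration only.

def tokens_per_second(local_micro_batch_size: int, sequence_length: int,
                      gradient_accumulation_steps: int, dp_degree: int,
                      step_time_s: float) -> float:
    # Only data-parallel ranks consume distinct batches; TP/PP ranks inside a
    # replica work on the same tokens, so multiplying by world_size would
    # overcount throughput.
    global_tokens_per_step = (local_micro_batch_size * sequence_length
                              * gradient_accumulation_steps * dp_degree)
    return global_tokens_per_step / step_time_s


def mfu(observed_tokens_per_s: float, flops_per_token: float,
        world_size: int, peak_flops_per_gpu: float) -> float:
    # MFU still normalizes by all GPUs: every device contributes peak FLOPs,
    # whether it is a DP replica or a TP shard.
    achieved_flops_per_s = observed_tokens_per_s * flops_per_token
    return achieved_flops_per_s / (world_size * peak_flops_per_gpu)


# Example: 8 GPUs with TP degree 2 => dp_degree = 4, not 8.
print(tokens_per_second(2, 4096, 1, dp_degree=4, step_time_s=0.5))  # 65536.0
```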

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ authors:
 - family-names: Rutmann
   given-names: Richard
 title: 'Modalities: A PyTorch-native framework for distributed and reproducible foundation model training.'
-version: 0.3.2
+version: 0.5.0
 url: https://github.com/Modalities/modalities
 date-released: '2024-12-02'
 preferred-citation:

README.md

Lines changed: 197 additions & 129 deletions
Large diffs are not rendered by default.

config_files/training/config_example_coca.yaml

Lines changed: 3 additions & 2 deletions
@@ -25,21 +25,22 @@ settings:
     gradient_accumulation_steps: 1
     local_train_micro_batch_size: 1
     sequence_length: 256
+    dp_degree: ${settings.cuda_env.world_size}
   training_target:
     num_target_tokens:
       component_key: number_conversion
       variant_key: num_tokens_from_num_steps
       config:
         num_steps: ${settings.training_target.num_target_steps}
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree: ${settings.cuda_env.world_size}
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         sequence_length: ${settings.step_profile.sequence_length}
         gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
     num_target_steps: # for the batch progress subscriber
       component_key: number_conversion
       variant_key: num_steps_from_num_samples
       config:
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree: ${settings.cuda_env.world_size}
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         global_num_samples: ${settings.coca_example_settings.train_num_samples}
         gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
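
For orientation, a sketch of the arithmetic these `number_conversion` variants appear to implement, with the `num_ranks` -> `dp_degree` rename applied; the Python function names and the floor-division choice are illustrative assumptions, not the Modalities implementation:

```python
# Illustrative arithmetic behind the number_conversion variants above.
# Function names are stand-ins for this sketch, not the Modalities API.

def num_tokens_from_num_steps(num_steps: int, dp_degree: int,
                              local_micro_batch_size: int,
                              sequence_length: int,
                              gradient_accumulation_steps: int) -> int:
    # Tokens consumed per optimizer step, summed over all data-parallel ranks.
    return (num_steps * dp_degree * local_micro_batch_size
            * sequence_length * gradient_accumulation_steps)

def num_steps_from_num_samples(global_num_samples: int, dp_degree: int,
                               local_micro_batch_size: int,
                               gradient_accumulation_steps: int) -> int:
    # Samples consumed per optimizer step determine how many full steps fit.
    samples_per_step = (dp_degree * local_micro_batch_size
                        * gradient_accumulation_steps)
    return global_num_samples // samples_per_step

# With values like those above (dp_degree == world_size in this config),
# e.g. 100 steps, dp_degree=8, micro batch 1, sequence length 256, grad-accum 1:
print(num_tokens_from_num_steps(100, 8, 1, 256, 1))  # 204800
```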

config_files/training/config_lorem_ipsum_long_fsdp1.yaml

Lines changed: 3 additions & 2 deletions
@@ -26,21 +26,22 @@ settings:
     gradient_accumulation_steps: 2
     local_train_micro_batch_size: 1
     sequence_length: 256
+    dp_degree: ${settings.cuda_env.world_size}
   training_target:
     num_target_tokens:
       component_key: number_conversion
       variant_key: num_tokens_from_packed_mem_map_dataset_continuous
       config:
         dataset_path: ${settings.paths.train_dataset_path}
         sequence_length: ${settings.step_profile.sequence_length}
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree: ${settings.cuda_env.world_size}
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
     num_target_steps: # for the batch progress subscriber
       component_key: number_conversion
       variant_key: num_steps_from_num_tokens
       config:
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree: ${settings.cuda_env.world_size}
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         global_num_tokens: ${settings.training_target.num_target_tokens}
         sequence_length: ${settings.step_profile.sequence_length}

config_files/training/config_lorem_ipsum_long_fsdp1_warmstart.yaml

Lines changed: 4 additions & 3 deletions
@@ -26,21 +26,22 @@ settings:
     gradient_accumulation_steps: 2
     local_train_micro_batch_size: 1
     sequence_length: 256
+    dp_degree: ${settings.cuda_env.world_size}
   training_target:
     num_target_tokens:
       component_key: number_conversion
       variant_key: num_tokens_from_packed_mem_map_dataset_continuous
       config:
         dataset_path: ${settings.paths.train_dataset_path}
         sequence_length: ${settings.step_profile.sequence_length}
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree: ${settings.cuda_env.world_size}
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
     num_target_steps: # for the batch progress subscriber
       component_key: number_conversion
       variant_key: num_steps_from_num_tokens
       config:
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree: ${settings.cuda_env.world_size}
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         global_num_tokens: ${settings.training_target.num_target_tokens}
         sequence_length: ${settings.step_profile.sequence_length}
@@ -67,7 +68,7 @@ settings:
     variant_key: last_step_from_checkpoint_path
     config:
       checkpoint_path: ${settings.warmstart_checkpoint_paths.model_checkpoint_path}
-  warmstart_checkpoint_paths: ${warmstart_env:checkpoint_paths}
+  warmstart_checkpoint_paths: ${warmstart_env:checkpoint_paths} # use modalities warmstart [..] --last_checkpoint_info_file_path [..]

 collate_fn:
   component_key: collate_fn

config_files/training/config_lorem_ipsum_long_fsdp2.yaml

Lines changed: 40 additions & 21 deletions
@@ -18,29 +18,36 @@ settings:
   checkpointing_interval_in_steps: 32
   evaluation_interval_in_steps: 32
   consistency_enforcement:
-    enforce_tokens_per_step_consistency: true
+    enforce_tokens_per_step_consistency: false
     enforce_last_step_logged: false
     enforce_last_step_evaluated: false
     enforce_last_step_checkpointed: false
   step_profile:
     gradient_accumulation_steps: 1
     local_train_micro_batch_size: 1
     sequence_length: 256
+    dp_degree:
+      instance_key: dp_degree
+      pass_type: BY_REFERENCE
   training_target:
     num_target_tokens:
       component_key: number_conversion
       variant_key: num_tokens_from_packed_mem_map_dataset_continuous
       config:
         dataset_path: ${settings.paths.train_dataset_path}
         sequence_length: ${settings.step_profile.sequence_length}
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree:
+          instance_key: dp_degree
+          pass_type: BY_REFERENCE
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
     num_target_steps: # for the batch progress subscriber
       component_key: number_conversion
       variant_key: num_steps_from_num_tokens
       config:
-        num_ranks: ${settings.cuda_env.world_size}
+        dp_degree:
+          instance_key: dp_degree
+          pass_type: BY_REFERENCE
         local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
         global_num_tokens: ${settings.training_target.num_target_tokens}
         sequence_length: ${settings.step_profile.sequence_length}
@@ -64,7 +71,7 @@ train_dataset:
   config:
     raw_data_path: ${settings.paths.train_dataset_path}
     sequence_length: ${settings.step_profile.sequence_length}
-    sample_key: ${settings.referencing_keys.sample_key}
+    sample_key: ${settings.referencing_keys.sample_key}

 train_dataloader:
   component_key: data_loader
@@ -172,14 +179,23 @@ device_mesh:
   config:
     device_type: cuda
     data_parallel_replicate_degree: 1
-    data_parallel_shard_degree: ${settings.cuda_env.world_size} # i.e., fully sharded
+    data_parallel_shard_degree: -1
     world_size: ${settings.cuda_env.world_size}

+dp_degree:
+  component_key: number_conversion
+  variant_key: parallel_degree
+  config: # get the parallel degree from the device mesh
+    device_mesh:
+      instance_key: device_mesh
+      pass_type: BY_REFERENCE
+    parallelism_methods: [dp_shard, dp_replicate]
+
 app_state:
   component_key: app_state
   variant_key: raw
   config:
-    model:
+    model:
       instance_key: initialized_model
       pass_type: BY_REFERENCE
     optimizer:
@@ -289,7 +305,7 @@ optimizer:
     eps: 1e-8
     weight_decay: 1e-1
     weight_decay_groups_excluded: [embedding, layernorm]
-  wrapped_model:
+  wrapped_model:
     instance_key: initialized_model
     pass_type: BY_REFERENCE
@@ -302,6 +318,9 @@ gradient_clipper:
       pass_type: BY_REFERENCE
   norm_type: P2_NORM
   max_norm: 1.0
+  device_mesh:
+    instance_key: device_mesh
+    pass_type: BY_REFERENCE

 progress_subscriber:
   component_key: progress_subscriber
@@ -326,17 +345,17 @@ evaluation_subscriber:
     directory: wandb_storage
     config_file_path: ${settings.config_file_path}

-# mfu_calculator:
-#   component_key: mfu_calculator
-#   variant_key: gpt2
-#   config:
-#     n_layer: ${model_raw.config.n_layer}
-#     sequence_length: ${settings.step_profile.sequence_length}
-#     n_embd: ${model_raw.config.n_embd}
-#     world_size: ${settings.cuda_env.world_size}
-#     raw_model:
-#       instance_key: model_raw
-#       pass_type: BY_REFERENCE
-#     wrapped_model:
-#       instance_key: initialized_model
-#       pass_type: BY_REFERENCE
+mfu_calculator:
+  component_key: mfu_calculator
+  variant_key: gpt2
+  config:
+    n_layer: ${model_raw.config.n_layer}
+    sequence_length: ${settings.step_profile.sequence_length}
+    n_embd: ${model_raw.config.n_embd}
+    world_size: ${settings.cuda_env.world_size}
+    wrapped_model:
+      instance_key: initialized_model
+      pass_type: BY_REFERENCE
+    device_mesh:
+      instance_key: device_mesh
+      pass_type: BY_REFERENCE
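
In this config, `dp_degree` is derived from the device mesh rather than hard-coded to the world size, and `data_parallel_shard_degree: -1` lets the mesh infer the shard degree from whatever remains after the other parallelism dimensions. A minimal sketch of the idea behind the `parallel_degree` conversion, using a plain dict in place of a real torch DeviceMesh; this is not the Modalities implementation:

```python
# Sketch: the DP degree is the product of the sizes of the requested mesh
# dimensions. `mesh_dims` stands in for a real torch DeviceMesh; names and
# values are illustrative assumptions.

def parallel_degree(mesh_dims: dict[str, int],
                    parallelism_methods: list[str]) -> int:
    degree = 1
    for method in parallelism_methods:
        # Dimensions absent from the mesh contribute a factor of 1.
        degree *= mesh_dims.get(method, 1)
    return degree

# 8 GPUs split as dp_replicate=1, dp_shard=4, tp=2: throughput/MFU
# calculations should use dp_degree=4, while world_size stays 8.
mesh_dims = {"dp_replicate": 1, "dp_shard": 4, "tp": 2}
print(parallel_degree(mesh_dims, ["dp_shard", "dp_replicate"]))  # 4
```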
