**CHANGELOG_DEV.md** (17 additions, 1 deletion)
… Tensors can be either normal Tensors or DTensors.
This PR fixes the MFU and throughput calculations by taking the dp degree into account instead of the world size. When we use parallelization strategies on top of FSDP, the world size differs from the data parallel degree, and this needs to be reflected in the throughput and MFU metric calculations, as done by this PR.
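
A quick hedged illustration of the idea (the actual Modalities metric code is not shown in this changelog, so all names below are assumptions): tensor- and pipeline-parallel ranks process the same data, so the global token count per step scales with the dp degree rather than the world size.

```python
# Illustrative sketch only; function and variable names are assumptions,
# not the actual Modalities implementation.

def throughput_tokens_per_sec(local_tokens_per_step: int, dp_degree: int, step_time_s: float) -> float:
    # Only data-parallel replicas consume distinct batches, so the global
    # token count per step is the local token count times the dp degree.
    return local_tokens_per_step * dp_degree / step_time_s

def mfu(tokens_per_sec: float, flops_per_token: float, num_devices: int, peak_flops_per_device: float) -> float:
    # Achieved model FLOPs per second relative to the aggregate peak of all devices.
    return (tokens_per_sec * flops_per_token) / (num_devices * peak_flops_per_device)
```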
**Breaking Changes**

* Existing configs need to be adapted to correctly use the dp degree rather than the world size.
## PR #425 Monitoring improvements
This PR improves training monitoring and logging across runs, alongside some other changes we made while testing scalability.

**General Changes**

* Configurable multi-layer FSDP units
* Option to provide an experiment root path to Modalities
* Added a steppable profiler (e.g., for tracing forward/backward passes); see the sketch after this list
* Fix: hybrid sharding is now correctly configurable
* Completely refactored the profiling code
* Improved error handling: errors are now captured and stored as JSON (see the error-capture sketch below)
* Added tutorials on the Einsum Transformer (an example model integration) and on profiling
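
The steppable profiler's own API is not shown in this changelog; the sketch below illustrates the general pattern with the stock `torch.profiler`, where the training loop explicitly advances the profiler once per step so that the forward/backward passes of selected steps are traced. The toy model, data, and output path are assumptions.

```python
# A sketch of a "steppable" profiler built on torch.profiler; the actual
# Modalities profiler component may differ, so treat this as a pattern.
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(16, 16)          # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [torch.randn(8, 16) for _ in range(8)]

prof = profile(
    activities=[ProfilerActivity.CPU],   # add ProfilerActivity.CUDA on GPU runs
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./traces"),
)

prof.start()
for batch in data:
    loss = model(batch).sum()            # forward pass is traced on active steps
    loss.backward()                      # backward pass is traced on active steps
    optimizer.step()
    optimizer.zero_grad()
    prof.step()                          # advance the profiler by one step
prof.stop()
```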
**Breaking Changes**

* `experiments_root_path` is now exposed at the API level
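
As for the improved error handling, the serialization details are not part of this changelog; a minimal sketch of the capture-and-store-as-JSON idea, with a hypothetical file path and field names, could look like this:

```python
# Hypothetical sketch; field names and file layout are assumptions.
import json
import traceback
from pathlib import Path

def run_with_error_capture(train_fn, error_file: Path) -> None:
    try:
        train_fn()
    except Exception as exc:
        record = {
            "error_type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }
        error_file.write_text(json.dumps(record, indent=2))
        raise  # re-raise so the failure still surfaces to the launcher
```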
**docs/components/components.md** (1 addition)
| scheduler | constant_lr |[ConstantLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ConstantLR.html#torch.optim.lr_scheduler.ConstantLR)|[ConstantLRSchedulerConfig](../../src/modalities/config/config.py)|[LRScheduler](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)| Multiplies the learning rate of each parameter group by a small constant factor until the number of steps reaches a pre-defined milestone |
| scheduler | onecycle_lr |[OneCycleLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html#torch.optim.lr_scheduler.OneCycleLR)|[OneCycleLRSchedulerConfig](../../src/modalities/config/config.py)|[LRScheduler](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)| Sets the learning rate of each parameter group according to the 1cycle learning rate policy |
| scheduler | cosine_annealing_lr |[CosineAnnealingLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR)|[CosineAnnealingLRSchedulerConfig](../../src/modalities/config/config.py)|[LRScheduler](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)| Sets the learning rate of each parameter group using a cosine annealing schedule |
| scheduler | linear_warmup_cosine_annealing_lr |[LinearWarmupCosineAnnealingLRScheduler](../../src/modalities/optimizers/lr_schedulers.py)|[LinearWarmupCosineAnnealingLRSchedulerConfig](../../src/modalities/config/config.py)|[LRScheduler](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)| Linearly warms up to the base learning rate, then decays with cosine annealing for the remaining training steps |
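
The exact signature of `LinearWarmupCosineAnnealingLRScheduler` is defined in `src/modalities/optimizers/lr_schedulers.py` and not reproduced here; as a hedged sketch, an equivalent warmup-then-cosine schedule can be assembled from stock PyTorch schedulers:

```python
# Equivalent warmup + cosine-annealing schedule from stock PyTorch pieces;
# this is a sketch, not the Modalities scheduler class itself.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)            # toy model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 100, 1000
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Linearly ramp from ~0 up to the base learning rate.
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_steps),
        # Cosine-decay over the remaining training steps.
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)

for _ in range(total_steps):
    optimizer.step()
    scheduler.step()
```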