CHANGELOG_DEV.md: 32 additions & 0 deletions
@@ -186,3 +186,35 @@ There are now three AC variants:
* adds support for Tensor Parallelism (including Sequence Parallelism).
* adds a debugging toolkit to track the input and output tensors during a forward pass, gradients during the backward pass and weight tensors.
  Tensors can be either normal Tensors or DTensors.
## PR #389 Benchmark Tooling
192
+
* adds benchmarking tooling to modalities and allows for scaling benchmarks across varying number of nodes and the cartesian product of configurable hyper parameters.
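
To illustrate what a sweep over node counts and the Cartesian product of hyperparameters means, here is a minimal sketch. The function name `expand_grid` and the hyperparameter names are hypothetical and are not taken from the modalities API; the sketch only shows how the number of benchmark runs grows with each added option.

```python
from itertools import product


def expand_grid(num_nodes_options, hyperparams):
    """Return one run spec per (node count, hyperparameter combination)."""
    names = list(hyperparams)
    runs = []
    for num_nodes in num_nodes_options:
        # product(...) yields every combination of the listed values.
        for values in product(*(hyperparams[name] for name in names)):
            runs.append({"num_nodes": num_nodes, **dict(zip(names, values))})
    return runs


# Hypothetical sweep: 2 node counts x 2 batch sizes x 2 sequence lengths = 8 runs.
grid = expand_grid(
    num_nodes_options=[1, 2],
    hyperparams={"micro_batch_size": [1, 2], "sequence_length": [2048, 4096]},
)
```

Each entry in `grid` is a plain dictionary, so it could be merged into a run config by whatever mechanism the benchmark launcher uses.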
## PR #410 MFU incorporates dp_degree now instead of world_size
This PR fixes the MFU and throughput calculations by using the data parallel (dp) degree instead of the world size. When additional parallelization strategies are used on top of FSDP, the world size differs from the dp degree, and the throughput and MFU metrics must reflect this.

**Breaking Changes**
* Existing configs need to be adapted to correctly use dp degree rather than world size.
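
The effect of the fix can be sketched with a small worked example. This is not the modalities implementation: the function names are hypothetical, the setup assumes tensor parallelism is the only strategy layered on top of FSDP, and the MFU estimate uses the common approximation of 6 FLOPs per parameter per token.

```python
def tokens_per_second(micro_batch_size, sequence_length, dp_degree, step_time_s):
    # Only data-parallel replicas consume distinct samples; ranks that differ
    # only in their tensor-parallel index work on the same micro-batch.
    return micro_batch_size * sequence_length * dp_degree / step_time_s


def model_flops_utilization(tokens_per_s, num_params, world_size, peak_flops_per_device):
    # Assumption: ~6 FLOPs per parameter per token (forward + backward),
    # ignoring attention FLOPs. All devices count toward the available peak.
    achieved_flops_per_s = 6 * num_params * tokens_per_s
    return achieved_flops_per_s / (world_size * peak_flops_per_device)


world_size, tp_degree = 8, 2
dp_degree = world_size // tp_degree  # 4 data-parallel replicas

correct = tokens_per_second(2, 1024, dp_degree, step_time_s=1.0)
# What the metric would report if world_size were used in place of dp_degree.
inflated = tokens_per_second(2, 1024, world_size, step_time_s=1.0)
```

Under these assumptions, substituting the world size for the dp degree overstates throughput (and therefore MFU) by a factor equal to the tensor-parallel degree.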
## PR #425 Monitoring improvements
This PR improves training monitoring and logging across runs and includes several other changes made along the way while testing scalability.

**General Changes**
* Configurable multi-layer FSDP units
* Option to provide the experiment root path to modalities
* Added steppable profiler (e.g., for tracing of forward/backward passes)
* Fix: hybrid sharding is now correctly configurable
* Completely refactored the profiling
* Improved error handling. Errors are now captured and stored as JSON.
* Added tutorials on the Einsum Transformer (example model integration) and profiling

**Breaking Changes**
* experiments_root_path is now exposed at the API level
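
The error handling change listed under General Changes (errors captured and stored as JSON) could look roughly like the following. This is a minimal sketch, not the code from the PR: the helper name `run_and_record_errors` and the JSON field names are assumptions chosen for illustration.

```python
import json
import traceback
from datetime import datetime, timezone
from pathlib import Path


def run_and_record_errors(fn, error_file):
    """Run fn(); on failure, write the error as JSON and re-raise it."""
    try:
        return fn()
    except Exception as exc:
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "error_type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }
        # Hypothetical layout: one JSON document per failed run.
        Path(error_file).write_text(json.dumps(record, indent=2))
        raise
```

Re-raising after writing the file keeps the process exit status unchanged, so job schedulers still see the failure while the JSON record remains available for later inspection.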