This repository was archived by the owner on Aug 7, 2024. It is now read-only.

Commit 38c02fe

vkuzo authored and facebook-github-bot committed
update readme (#317)
Summary:
Pull Request resolved: #317

cleaning up the readme to reflect latest changes

Reviewed By: drisspg

Differential Revision: D59827460

fbshipit-source-id: aba3d31c6087ddfbf1892b86e31e058569770c50
1 parent de93990 commit 38c02fe

File tree: 1 file changed (+18, -15 lines)

README.md

Lines changed: 18 additions & 15 deletions
@@ -2,11 +2,12 @@
 
 This is an early version of a library for accelerating training with float8 in native PyTorch
 according to the recipes laid out in https://arxiv.org/pdf/2209.05433.pdf.
-The codebase strives to stay small, easily hackable, and debuggable with native PyTorch tooling.
-``torch.compile`` is supported out of the box. With ``torch.compile`` on, initial results show
+The codebase strives to stay small, easily hackable, debuggable with native PyTorch tooling,
+and composable with key systems such as autograd, ``torch.compile`` and distributed.
+With ``torch.compile`` on, initial results show
 throughput speedups of up to 1.2x on small scale (8 GPUs) LLaMa pretraining jobs.
 
-:warning: <em>See the [feature tracker](https://github.com/pytorch-labs/float8_experimental/issues/187) for upcoming features. Key features such as weight cast recomputation in backward and large scale distributed support are not ready yet.</em>
+:warning: <em>See the [feature tracker](https://github.com/pytorch-labs/float8_experimental/issues/187) for upcoming features.</em>
 
 :warning: <em>Backwards compatibility is not guaranteed at this point. The codebase is in active development and
 will change rapidly.</em>
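
The updated intro above emphasizes composability with autograd, ``torch.compile`` and distributed. As a rough illustration of the single-GPU workflow it refers to: swap the linear layers for their float8 variant, compile, and train as usual. The swap step below is a hypothetical placeholder, not float8_experimental's actual conversion helper (the repository's user-facing entry point is `Float8Linear`); everything else is standard PyTorch.

```python
# Minimal sketch of the workflow the intro describes. `swap_linears_to_float8`
# is a made-up stand-in for the library's real conversion entry point
# (Float8Linear); it is NOT float8_experimental's API.
import torch
import torch.nn as nn

def swap_linears_to_float8(model: nn.Module) -> nn.Module:
    # Placeholder: the real library would replace nn.Linear modules with
    # Float8Linear here. This sketch leaves the model unchanged.
    return model

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
model = swap_linears_to_float8(model)
model = torch.compile(model)  # compile composes with the swapped modules

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(3):
    x = torch.randn(32, 512)
    loss = model(x).sum()
    loss.backward()           # autograd composes with the forward (float8 in the real library)
    optimizer.step()
    optimizer.zero_grad()
```
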
@@ -25,7 +26,7 @@ pip install -e .
 pip install -e ".[dev]"
 ```
 
-# User API
+# Single GPU User API
 
 We provide two per-tensor scaling strategies: dynamic and delayed. See https://arxiv.org/pdf/2209.05433.pdf, Section 4.3 for more details. These strategies are configurable separately for activations (`x`), weights (`w`) and gradients (`dL_dY`).
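
The section renamed in this hunk describes two per-tensor scaling strategies, dynamic and delayed, configurable separately for `x`, `w` and `dL_dY`. Below is a minimal sketch of the difference, following the recipe in the linked paper rather than float8_experimental's implementation; all function names are illustrative.

```python
# Illustrative per-tensor scaling, per https://arxiv.org/pdf/2209.05433.pdf,
# Section 4.3. Not float8_experimental's code; function names are made up.
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448.0

def dynamic_scale(x: torch.Tensor) -> torch.Tensor:
    # Dynamic: compute the scale from the amax of the tensor being cast.
    amax = x.abs().max().float().clamp(min=1e-12)
    return E4M3_MAX / amax

def delayed_scale(amax_history: torch.Tensor) -> torch.Tensor:
    # Delayed: compute the scale from amaxes recorded in earlier iterations,
    # so the cast does not wait on a reduction over the current tensor.
    amax = amax_history.max().float().clamp(min=1e-12)
    return E4M3_MAX / amax

def cast_to_float8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Scale into range, saturate, and cast to float8 (e4m3 shown here).
    return (x * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)

x = torch.randn(16, 32)
x_fp8 = cast_to_float8(x, dynamic_scale(x))
x_back = x_fp8.float() / dynamic_scale(x)  # dequantize to inspect rounding error
```

Delayed scaling trades a slightly stale scale for not having to reduce over the current tensor on the hot path, which is the motivation for offering both strategies.
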

@@ -113,30 +114,32 @@ for _ in range(N_ITER):
     optimizer.step()
 ```
 
-# 🧭 Code Organization
+# Multi GPU User API
 
-* `float8_experimental/float8_linear.py`
-  - `Float8Linear` (main user facing entry point for Float8Linear)
-* `float8_experimental/float8_tensor.py`
-  - `Float8Tensor`, which allows `Float8Linear` to abide by the `x.dtype == x.grad.dtype` restriction
-  - `ScaledMMConfig` defines the semantics for matmul in the forward and backwards pass
+We compose with the `DTensor` based [distributed APIs](https://pytorch.org/docs/stable/distributed.tensor.parallel.html),
+such as FSDP, TP and SP. Please see the [torchtitan](https://github.com/pytorch/torchtitan) repository for e2e examples
+on using `float8_experimental` in a distributed setting.
 
 # Testing
 
 ```bash
 # run single-GPU unit tests
 pytest test/test_base.py
 
-# run a single-GPU integration test on SAM
-pytest test/test_sam.py
-
 # run single-GPU compile tests
 pytest test/test_compile.py
+
+# run single-GPU numerics integration tests
+pytest test/test_numerics_integration.py
+
 # run a two-GPU integration test on FSDP
 ./test/test_fsdp.sh
 
-# run integration tests for TP/SP (outdated)
-./test/test_tp.sh
+# run integration tests on the DTensor TP/SP integration
+./test/test_dtensor.sh
+
+# run integration tests on the FSDP2 integration
+python test/test_fsdp2/test_fsdp2_eager.py
 
 # run all of these tests
 ./test/test_everything.sh
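
The new Multi GPU section states that float8 composes with the `DTensor` based distributed APIs such as FSDP, TP and SP. The sketch below only shows the ordering idea for FSDP, with the float8 swap again standing in as a hypothetical placeholder; see the torchtitan repository for the real end-to-end setup. It assumes a process group has already been initialized (e.g. via `torchrun`).

```python
# Sketch of composing a float8-swapped model with FSDP. The swap helper is a
# hypothetical placeholder, not float8_experimental's API. Assumes
# torch.distributed is already initialized (e.g. launched with torchrun).
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def swap_linears_to_float8(model: nn.Module) -> nn.Module:
    # Placeholder for the library's real conversion entry point.
    return model

def build_model() -> nn.Module:
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
    model = swap_linears_to_float8(model)  # swap to float8 first...
    return FSDP(model)                     # ...then wrap with the sharding API
```
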
