Description
I've been running some benchmarks to measure tensorstore read bandwidth, and I've found significant overhead for non-standard dtypes (e.g. bfloat16), where tensorstore seems to zero-initialize the destination array before overwriting it with the data. Here's the benchmark that I've been running:
import time
import tensorstore as ts
dataset = ts.open(
    {
        "driver": "zarr3",
        "kvstore": {
            "driver": "gcs",
            "bucket": "...",
            "path": "...",
        },
        "open": True,
        "create": False,
    }
).result()
start = time.perf_counter()
array = dataset.read().result()
duration = time.perf_counter() - start
gb = array.nbytes / 1024**3
bandwidth = gb / duration
print(f"{gb:.2f} GB loaded in {duration:.2f} s ({bandwidth:.2f} GB/s)")where the tensor that I'm loading is about 55 GB of bfloat16 data.
If I run this benchmark using the current master branch of tensorstore, I see:
54.93 GB loaded in 58.35 s (0.94 GB/s)
but, if I disable zero initialization on reads (more details below), I get:
54.93 GB loaded in 31.47 s (1.75 GB/s)
which suggests that the current implementation's construction of the 55 GB array introduces roughly 27 s of overhead!
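As a rough sanity check on that magnitude, here's a minimal standalone C++ sketch, independent of tensorstore, that times value-initialized vs default-initialized allocation of a trivially-constructible buffer (the 4 GiB size is an arbitrary stand-in for the 55 GB array; adjust to taste):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  constexpr std::size_t n = std::size_t{4} << 30;  // 4 GiB stand-in for 55 GB

  auto t0 = std::chrono::steady_clock::now();
  auto* zeroed = new std::uint8_t[n]();  // value-initialization: writes n zero bytes
  auto t1 = std::chrono::steady_clock::now();
  auto* raw = new std::uint8_t[n];  // default-initialization: no writes for trivial types
  auto t2 = std::chrono::steady_clock::now();

  // Touch both buffers so the compiler can't optimize the allocations away.
  zeroed[n - 1] = 1;
  raw[n - 1] = 1;
  volatile std::uint8_t sink = zeroed[n - 1] + raw[n - 1];
  (void)sink;

  std::printf("value-initialized:   %.3f s\n", std::chrono::duration<double>(t1 - t0).count());
  std::printf("default-initialized: %.3f s\n", std::chrono::duration<double>(t2 - t1).count());

  delete[] zeroed;
  delete[] raw;
}

My expectation is that the first timing scales with the buffer size while the second is near-instant, though allocator and OS page-zeroing behavior can shift the exact numbers.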
I've narrowed the offending line down to tensorstore/data_type.cc, line 86 at commit 367ef7d:

r->construct(n, ptr.get());
I can avoid this overhead by hacking tensorstore to skip that r->construct(...) call, or, for bfloat16 specifically, by replacing the default constructor with:

- constexpr BFloat16() : rep_(0) {}
+ BFloat16() = default;
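For what it's worth, my understanding of why the one-line change works (sketched with stand-in types, not the real BFloat16): a defaulted constructor makes the type trivially default-constructible, so default-initialization stops writing anything, while explicit value-initialization still yields zero:

#include <cstdint>
#include <type_traits>

struct ZeroingBF16 {  // stand-in for the current definition
  constexpr ZeroingBF16() : rep_(0) {}
  std::uint16_t rep_;
};

struct DefaultedBF16 {  // stand-in for the proposed definition
  DefaultedBF16() = default;
  std::uint16_t rep_;
};

static_assert(!std::is_trivially_default_constructible_v<ZeroingBF16>);
static_assert(std::is_trivially_default_constructible_v<DefaultedBF16>);

int main() {
  DefaultedBF16 a;   // default-initialized: rep_ is indeterminate, no write
  DefaultedBF16 b{};  // value-initialized: rep_ == 0, as before
  (void)a;
  (void)b;
}

The flip side is that any code relying on BFloat16 x; being zero would silently break, which is why I'm unsure this is safe as-is.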
I wanted to ask here whether anyone has advice on the preferred approach for safely avoiding this overhead; one hypothetical shape for a general fix is sketched below. I'd be very happy to open a PR if we decide there's something general to do here. Thanks!!
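Purely as a hypothetical (the names below are made up, not tensorstore's actual API), one general shape could be to gate the construct step on a triviality trait, so that every trivially default-constructible dtype skips the pass automatically:

#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical: skip construction entirely when it would be a no-op anyway.
template <typename T>
void MaybeConstructElements(std::size_t n, void* ptr) {
  if constexpr (!std::is_trivially_default_constructible_v<T>) {
    std::uninitialized_default_construct_n(static_cast<T*>(ptr), n);
  }
  // With BFloat16's constructor defaulted, it would qualify as trivial and
  // take the no-op path; non-trivial types keep their current behavior.
}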