Input forwarding helps segmentation avoid unnecessary intermediate tensors, but it also hides those forwarded ops from the segmenter and schedulers, which can result in performance issues.
For example, if a cast op from bf16 to fp32 is forwarded, the normalization schedulers have to assume a 2x larger persistent buffer, because they cannot apply the input projection that would keep the buffer in bf16.
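To make the 2x concrete, here's a quick back-of-the-envelope in plain C++ (not nvfuser code; the one-buffer-per-reduction-row model is an assumption for illustration):

#include <cstdio>

int main() {
  const long long S = 64 * 1024; // elements per reduction row, as in the repro below
  // A persistent buffer holding one full reduction row costs S * sizeof(element).
  std::printf("bf16 buffer: %lld KB\n", S * 2 / 1024); // 128 KB if projected to the bf16 input
  std::printf("fp32 buffer: %lld KB\n", S * 4 / 1024); // 256 KB once the forwarded cast forces fp32
  return 0;
}

Whether a given buffer size fits on chip depends on the GPU and on the persistence strategy, which is why S has to be tuned per device.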
Here's a repro with the inner normalization scheduler. Run it as SEG=1 S=$((64 * 1024)) NVFUSER_DUMP=segmented_fusion ./bin/nvfuser_tests --gtest_filter='*FowardingMiss*'. The S parameter needs to be adjusted for the actual GPU: choose a size that fits in shared memory when the data type is bfloat16 but not when it is float, which is 64K on an RTX 6000.
TEST_F(SegmentationTest, FowardingMissProjectionToLowerPrecisionInput) {
  std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
  Fusion& fusion = *fusion_ptr;
  FusionGuard fg(&fusion);

  // Inner-normalization pattern: cast the bf16 input up to fp32, reduce
  // along the inner dimension, and add the broadcast result back.
  auto tv0 = makeSymbolicTensor(2, DataType::BFloat16);
  fusion.addInput(tv0);
  auto tv1 = castOp(DataType::Float, tv0);
  auto tv2 = set(tv1);
  auto tv3 = sum(tv2, {1});
  auto tv4 = broadcast(tv3, {false, true});
  auto tv5 = add(tv2, tv4);
  auto tv6 = castOp(DataType::BFloat16, tv5);
  fusion.addOutput(tv6);

  // Forces segmentation: an unrelated input-output pair with a segment_set.
  if (getenv("SEG")) {
    auto tv7 = makeSymbolicTensor(1, DataType::BFloat16);
    fusion.addInput(tv7);
    fusion.addOutput(segment_set(tv7));
  }

  // Requires the S environment variable; see the command line above.
  const int64_t size = atoi(getenv("S"));
  auto options = at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
  at::Tensor t0 = at::randn({128, size}, options);
  std::vector<c10::IValue> inputs = {t0};
  if (getenv("SEG")) {
    at::Tensor t1 = at::randn({10}, options);
    inputs.emplace_back(t1);
  }

  FusionExecutorCache executor_cache(std::move(fusion_ptr));
  auto outputs = executor_cache.runFusionWithInputs(inputs);
  testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
}
When run with SEG=1, the dump shows that the normalization scheduler is not used because the size of the persistent buffer in DataType::Float is too large. We would expect the scheduler to use the bfloat16 input as the persistent buffer, but that doesn't happen here since the cast op is forwarded and thus hidden from the scheduler.
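For context, here's a hypothetical sketch of what input projection would mean for this fusion, written with the same ops as the test (illustrative of the intended math, not an actual scheduler transform): only the bf16 input stays persistent, and the cast is recomputed on each read.

// Hypothetical projected form: tv0 (bf16) is the only persistent buffer,
// and the fp32 cast is recomputed per use instead of being kept live.
auto tv0 = makeSymbolicTensor(2, DataType::BFloat16);
auto tv3 = sum(castOp(DataType::Float, tv0), {1});  // first read: cast, then reduce
auto tv4 = broadcast(tv3, {false, true});
auto tv5 = add(castOp(DataType::Float, tv0), tv4);  // second read: re-cast the persistent bf16 data
auto tv6 = castOp(DataType::BFloat16, tv5);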
This isn't an issue when the segmentation step is avoided: if SEG=1 is omitted, the fusion is indeed scheduled as an inner persistent kernel without segmentation.
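For comparison, the unsegmented run is the same command with SEG=1 dropped (same S as before):

S=$((64 * 1024)) NVFUSER_DUMP=segmented_fusion ./bin/nvfuser_tests --gtest_filter='*FowardingMiss*'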
This seems like a fundamental issue with the forwarding approach since it hides actual ops from the schedulers.
The privatization approach introduced in #3776 should not have this problem and should be able to provide the same benefits if extended.