Feature request: Consider privatization instead of forwarding in fusion segmentation #3832

Open

naoyam opened this issue Feb 5, 2025 · 0 comments
Labels: Segmentation (Issues related to nvFuser Segmentation)
Input forwarding helps segmentation avoid unnecessary intermediate tensors, but it also hides those forwarded ops from the segmenter and schedulers, which can result in performance issues.

For example, if a cast op from bf16 to fp32 is forwarded, the normalization schedulers must assume a 2x larger persistent buffer, because the projection back to the lower-precision input is no longer visible to them.
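For reference, here's a minimal sketch of what that input projection amounts to at the fusion level (illustrative only; in practice the persistent-buffer analysis performs this rewrite rather than it being written by hand):

// Sketch of projecting the persistent buffer to the lower-precision input:
auto tv0 = makeSymbolicTensor(2, DataType::BFloat16);
auto tv1 = castOp(DataType::Float, tv0);
auto tv2 = sum(tv1, {1});
auto tv3 = broadcast(tv2, {false, true});
// Instead of keeping tv1 (float, 4 bytes/elem) live across the reduction,
// re-read tv0 (bfloat16, 2 bytes/elem) and redo the cast:
auto tv1r = castOp(DataType::Float, tv0);
auto tv4 = add(tv1r, tv3);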

Here's a repro with the inner normalization scheduler. Run it as SEG=1 S=$((64 * 1024)) NVFUSER_DUMP=segmented_fusion ./bin/nvfuser_tests --gtest_filter='*FowardingMiss*'. The S parameter needs to be adjusted for the actual GPU: choose a size where the persistent buffer fits in shared memory as bfloat16 but not as float (the float buffer is 2x larger). 64K is the right value for an RTX 6000.

TEST_F(SegmentationTest, FowardingMissProjectionToLowerPrecisionInput) {
  std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
  Fusion& fusion = *fusion_ptr;
  FusionGuard fg(&fusion);

  auto tv0 = makeSymbolicTensor(2, DataType::BFloat16);
  fusion.addInput(tv0);

  // This cast is subject to input forwarding, which hides it from the
  // segmenter and the schedulers.
  auto tv1 = castOp(DataType::Float, tv0);
  auto tv2 = set(tv1);
  auto tv3 = sum(tv2, {1});
  auto tv4 = broadcast(tv3, {false, true});
  // tv2 is used both before and after the reduction, so it becomes a
  // persistent buffer.
  auto tv5 = add(tv2, tv4);
  auto tv6 = castOp(DataType::BFloat16, tv5);
  fusion.addOutput(tv6);

  // Forces segmentation by adding an unrelated segment_set path
  if (getenv("SEG")) {
    auto tv7 = makeSymbolicTensor(1, DataType::BFloat16);
    fusion.addInput(tv7);
    fusion.addOutput(segment_set(tv7));
  }

  // S (see the run command above) controls the inner dimension size
  int64_t size = atoi(getenv("S"));
  auto options = at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
  at::Tensor t0 = at::randn({128, size}, options);
  std::vector<c10::IValue> inputs = {t0};
  if (getenv("SEG")) {
    at::Tensor t1 = at::randn({10}, options);
    inputs.emplace_back(t1);
  }

  FusionExecutorCache executor_cache(std::move(fusion_ptr));
  auto outputs = executor_cache.runFusionWithInputs(inputs);
  testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
}
With SEG=1, the segmented_fusion dump shows the following:

Segmented_Fusion{
groups:
  no_op{6}
  reduction{0, 1, 2, 3}
  transpose{4, 5}
edges:
  e{ reduction{0, 1, 2, 3} -> transpose{4, 5}(T4_g_float[iS8{i0}, bS9{1}]) }
  e{ reduction{0, 1, 2, 3} -> transpose{4, 5}(T2_g_float[iS4{i0}, iS5{i2}]) }

As shown above, the normalization scheduler is not used, because the persistent buffer, sized for DataType::Float, is too large. We would expect the scheduler to use the bfloat16 input as the persistent buffer, but that doesn't happen here since the cast op is forwarded and thus hidden from the scheduler.

This isn't an issue when the segmentation step is avoided: if SEG=1 is omitted, the fusion is indeed scheduled as an inner persistent kernel without segmentation.
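To confirm, run the same command without SEG=1, e.g. S=$((64 * 1024)) NVFUSER_DUMP=segmented_fusion ./bin/nvfuser_tests --gtest_filter='*FowardingMiss*'; the dump then shows no segmentation, and the inner persistent scheduler is used.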

This seems like a fundamental issue with the forwarding approach since it hides actual ops from the schedulers.

The privatization approach introduced in #3776 should not have this problem and, if extended, should be able to provide the same benefits.
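For illustration, with the cast privatized, the reduction segment would own its own copy of the forwarded op, roughly as sketched below (this illustrates the intent only, not the actual #3776 mechanism; the tensor names are made up):

// The reduction segment as the scheduler would see it with the cast
// privatized into it: the segment's input is the original bf16 tensor,
// so the persistent buffer can be projected back to it.
auto seg_in = makeSymbolicTensor(2, DataType::BFloat16);
fusion.addInput(seg_in);
auto c = castOp(DataType::Float, seg_in); // privatized copy of the cast
auto s = set(c);
auto r = sum(s, {1});
auto b = broadcast(r, {false, true});
// Outputs consumed by the downstream transpose segment
fusion.addOutput(b);
fusion.addOutput(s);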
