Open
@divyanshk has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98209055.
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 26, 2026
Summary:
What's different in this diff? Nothing. It sits on top of MTIA's changes, which add the right symbols
to their `cuda_occupancy.h` shim: D97952003, D98009788.
The MTIA test failures occurred because these symbols did not exist in the shim earlier.
-----------------------------------------------------------------------------------------------------
`cudaOccResult` from `cuda_occupancy.h` carries additional information that should help interpret Kineto's occupancy estimates.
```
struct cudaOccResult {
  int activeBlocksPerMultiprocessor;  // Occupancy
  unsigned int limitingFactors;       // Factors that limited occupancy. A bit
                                      // field that counts the limiting
                                      // factors, see cudaOccLimitingFactor
  int blockLimitRegs;                 // Occupancy due to register
                                      // usage, INT_MAX if the kernel does not
                                      // use any register.
  int blockLimitSharedMem;            // Occupancy due to shared memory
                                      // usage, INT_MAX if the kernel does not
                                      // use shared memory.
  int blockLimitWarps;                // Occupancy due to block size limit
  int blockLimitBlocks;               // Occupancy due to maximum number of blocks
                                      // manageable per SM
  int blockLimitBarriers;             // Occupancy due to block barrier usage
  int allocatedRegistersPerBlock;     // Actual number of registers allocated per
                                      // block
  size_t allocatedSharedMemPerBlock;  // Actual size of shared memory allocated
                                      // per block
  cudaOccPartitionedGCConfig partitionedGCConfig;
                                      // Report if partitioned global caching
                                      // is actually enabled.
};
```
The `limitingFactors` field is a bitmask indicating which resource(s) constrained occupancy:
```
enum cudaOccLimitingFactor {
  OCC_LIMIT_WARPS    = 0x01,  // Block size (threads)
  OCC_LIMIT_REGS     = 0x02,  // Register usage
  OCC_LIMIT_SMEM     = 0x04,  // Shared memory
  OCC_LIMIT_BLOCKS   = 0x08,  // Max blocks per SM
  OCC_LIMIT_BARRIERS = 0x10   // Barrier usage
};
```
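As a rough illustration of how the bitmask could be turned into something readable during post-processing, here is a minimal sketch. It is not the code in this diff: the `LimitingFactor` enum is a local mirror of `cudaOccLimitingFactor` (so the snippet compiles without the CUDA toolkit), and `describeLimitingFactors` and the label strings are hypothetical names chosen for the example.

```cpp
#include <string>

// Local mirror of cudaOccLimitingFactor; values copied from the enum above.
// Assumption: real code would include cuda_occupancy.h and use its enum.
enum LimitingFactor : unsigned int {
  kLimitWarps = 0x01,
  kLimitRegs = 0x02,
  kLimitSmem = 0x04,
  kLimitBlocks = 0x08,
  kLimitBarriers = 0x10,
};

// Hypothetical helper: returns a '+'-joined list of the factors whose bits
// are set in `mask`, or "none" when no bit is set.
std::string describeLimitingFactors(unsigned int mask) {
  struct Entry {
    unsigned int bit;
    const char* name;
  };
  static const Entry kEntries[] = {
      {kLimitWarps, "warps"},
      {kLimitRegs, "registers"},
      {kLimitSmem, "shared_mem"},
      {kLimitBlocks, "blocks"},
      {kLimitBarriers, "barriers"},
  };
  std::string out;
  for (const Entry& e : kEntries) {
    if (mask & e.bit) {
      if (!out.empty()) {
        out += "+";
      }
      out += e.name;
    }
  }
  return out.empty() ? "none" : out;
}
```

Calling `describeLimitingFactors(result.limitingFactors)` on a populated `cudaOccResult` would give a label that can be attached to the kernel's trace metadata.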
Each `blockLimitXXX` field gives the maximum number of blocks per SM that would fit if XXX were the only constraint.
`activeBlocksPerMultiprocessor` is the minimum of all the `blockLimitXXX` values. For example, a kernel with:
```
limitingFactors: 3 (binary 0011 = warps + registers)
blockLimitWarps: 4
blockLimitRegs: 2
activeBlocksPerMultiprocessor: 2
```
This means occupancy is register-limited (2 blocks instead of a potential 4).
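The relationship between the per-resource limits and the final occupancy can be sketched as below. This is only an illustration under assumptions: `BlockLimits` is a hypothetical stand-in for the `blockLimitXXX` fields of `cudaOccResult`, and the limits not shown in the example above are set to `INT_MAX` (the value the header uses when a resource is not a constraint).

```cpp
#include <algorithm>
#include <climits>

// Hypothetical stand-in for the blockLimitXXX fields of cudaOccResult.
struct BlockLimits {
  int regs;
  int sharedMem;
  int warps;
  int blocks;
  int barriers;
};

// The tightest per-resource limit determines how many blocks fit on one SM.
int activeBlocksPerSM(const BlockLimits& limits) {
  return std::min({limits.regs, limits.sharedMem, limits.warps, limits.blocks,
                   limits.barriers});
}

// Values from the example above; the remaining limits are assumed to be
// unconstrained (INT_MAX).
BlockLimits exampleLimits() {
  return BlockLimits{/*regs=*/2, /*sharedMem=*/INT_MAX, /*warps=*/4,
                     /*blocks=*/INT_MAX, /*barriers=*/INT_MAX};
}
```

With these values `activeBlocksPerSM(exampleLimits())` evaluates to 2, matching `activeBlocksPerMultiprocessor` in the example: registers are the binding constraint, and relaxing them would allow up to 4 blocks before the warp limit applies.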
Note: this runs during post-processing, not on the profiling hot path, so it should add no profiling overhead. We already populate `occ_result`; we just weren't using its other attributes.
Reviewed By: scotts
Differential Revision: D98209055
b19db88 to
faa14c1
Compare
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 26, 2026
faa14c1 to
97af03a
Compare
divyanshk
added a commit
that referenced
this pull request
Mar 26, 2026
Summary: Pull Request resolved: #1330
97af03a to
b8a6d27
Compare
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 26, 2026
b8a6d27 to
98c4d7b
Compare
divyanshk
added a commit
that referenced
this pull request
Mar 26, 2026
98c4d7b to
b99ecc5
Compare
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
Reviewed By: fenypatel99, scotts
Differential Revision: D98209055
b99ecc5 to
dbaabdd
Compare
meta-codesync bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
dbaabdd to
b053a38
Compare
divyanshk
added a commit
that referenced
this pull request
Mar 27, 2026
b053a38 to
8ad34a8
Compare
8ad34a8 to
6d907b5
Compare