Expose occupany limiting factors (#1330) by divyanshk · Pull Request #1330 · pytorch/kineto

divyanshk · 2026-03-25T21:58:42Z

Summary:

What's different in this diff? Nothing. It sits on top of MTIA's changes that add the required symbols
to their `cuda_occupancy.h` shim: D97952003, D98009788.
The earlier MTIA test failures occurred because these symbols did not exist in the shim.

`cudaOccResult` from `cuda_occupancy.h` carries information that should help interpret Kineto's occupancy estimates.

struct cudaOccResult {
    int activeBlocksPerMultiprocessor; // Occupancy
    unsigned int limitingFactors;      // Factors that limited occupancy. A bit
                                       // field that counts the limiting
                                       // factors, see cudaOccLimitingFactor
    int blockLimitRegs;                // Occupancy due to register
                                       // usage, INT_MAX if the kernel does not
                                       // use any register.
    int blockLimitSharedMem;           // Occupancy due to shared memory
                                       // usage, INT_MAX if the kernel does not
                                       // use shared memory.
    int blockLimitWarps;               // Occupancy due to block size limit
    int blockLimitBlocks;              // Occupancy due to maximum number of blocks
                                       // managable per SM
    int blockLimitBarriers;            // Occupancy due to block barrier usage
    int allocatedRegistersPerBlock;    // Actual number of registers allocated per
                                       // block
    size_t allocatedSharedMemPerBlock; // Actual size of shared memory allocated
                                       // per block
    cudaOccPartitionedGCConfig partitionedGCConfig;
                                       // Report if partitioned global caching
                                       // is actually enabled.
};

The limitingFactors field is a bitmask indicating which resource(s) constrained occupancy:

enum cudaOccLimitingFactor {
  OCC_LIMIT_WARPS    = 0x01,  // Block size (threads)
  OCC_LIMIT_REGS     = 0x02,  // Register usage
  OCC_LIMIT_SMEM     = 0x04,  // Shared memory
  OCC_LIMIT_BLOCKS   = 0x08,  // Max blocks per SM
  OCC_LIMIT_BARRIERS = 0x10   // Barrier usage
};
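
To make the bitmask concrete, here is a minimal sketch of decoding `limitingFactors` into readable names. The enum is redefined locally with hypothetical `kLimit*` names so the snippet compiles without `cuda_occupancy.h`, and `describeLimitingFactors` is an illustrative helper, not something this PR or Kineto is known to provide.

```cpp
#include <cassert>
#include <string>

// Bit values copied from the cudaOccLimitingFactor enum quoted above.
// Redefined under hypothetical names so this sketch is self-contained.
enum LimitingFactorBit : unsigned int {
  kLimitWarps = 0x01,
  kLimitRegs = 0x02,
  kLimitSmem = 0x04,
  kLimitBlocks = 0x08,
  kLimitBarriers = 0x10,
};

// Hypothetical helper: joins the name of every set bit with '+'.
// Returns "none" when no known bit is set.
std::string describeLimitingFactors(unsigned int mask) {
  struct Entry {
    unsigned int bit;
    const char* name;
  };
  static const Entry kEntries[] = {
      {kLimitWarps, "warps"},
      {kLimitRegs, "registers"},
      {kLimitSmem, "shared memory"},
      {kLimitBlocks, "blocks"},
      {kLimitBarriers, "barriers"},
  };

  std::string out;
  for (const Entry& e : kEntries) {
    if ((mask & e.bit) != 0) {
      if (!out.empty()) {
        out += "+";
      }
      out += e.name;
    }
  }
  if (out.empty()) {
    return "none";
  }
  return out;
}
```

With these definitions, `describeLimitingFactors(3)` yields `"warps+registers"` and `describeLimitingFactors(0x10)` yields `"barriers"`.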

Each blockLimitXXX field gives the maximum number of blocks per SM that could be resident if XXX were the only constraint.

activeBlocksPerMultiprocessor is the minimum of all the blockLimitXXX values. For example, consider a kernel with:

limitingFactors: 3 (binary 0011 = warps + registers)
blockLimitWarps: 4
blockLimitRegs: 2
activeBlocksPerMultiprocessor: 2

This means occupancy is register-limited: 2 blocks per SM instead of the 4 that the warp limit alone would allow.
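
The "minimum of all limits" relationship can be sketched as follows. `BlockLimits` and `summarize` are hypothetical names used only for illustration; the struct is a plain stand-in for the `blockLimit*` fields, not the real `cudaOccResult`, and any limit not listed in the example is assumed to be `INT_MAX` (unconstrained).

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the blockLimit* fields of cudaOccResult.
// Limits default to INT_MAX, i.e. "this resource does not constrain the kernel".
struct BlockLimits {
  int warps = INT_MAX;
  int regs = INT_MAX;
  int sharedMem = INT_MAX;
  int blocks = INT_MAX;
  int barriers = INT_MAX;
};

struct OccupancySummary {
  int activeBlocks;                  // minimum over all per-resource limits
  std::vector<std::string> binding;  // resources whose limit equals that minimum
};

// Hypothetical helper: computes the minimum limit and names the resources that attain it.
// If every limit is INT_MAX, all resources are reported as binding.
OccupancySummary summarize(const BlockLimits& l) {
  const std::pair<const char*, int> entries[] = {
      {"warps", l.warps},
      {"registers", l.regs},
      {"shared memory", l.sharedMem},
      {"blocks", l.blocks},
      {"barriers", l.barriers},
  };

  int minLimit = INT_MAX;
  for (const auto& e : entries) {
    minLimit = std::min(minLimit, e.second);
  }

  OccupancySummary s{minLimit, {}};
  for (const auto& e : entries) {
    if (e.second == minLimit) {
      s.binding.emplace_back(e.first);
    }
  }
  return s;
}
```

For the example above (`warps = 4`, `regs = 2`, everything else left at `INT_MAX`), `summarize` returns `activeBlocks == 2` with `"registers"` as the only binding resource.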

Note: this runs during post-processing, not on the profiling hot path, so it should add no profiling overhead. We already populate occ_result; its other fields were simply unused until now.

Reviewed By: fenypatel99, scotts

Differential Revision: D98209055

meta-codesync · 2026-03-25T21:58:56Z

@divyanshk has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98209055.


meta-cla bot added the cla signed label Mar 25, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 25, 2026

meta-codesync bot changed the title ~~Expose occupany limiting factors~~ Expose occupany limiting factors (#1330) Mar 26, 2026

meta-codesync bot force-pushed the export-D98209055 branch from b19db88 to faa14c1 Compare March 26, 2026 17:32

meta-codesync bot force-pushed the export-D98209055 branch from faa14c1 to 97af03a Compare March 26, 2026 17:36

divyanshk force-pushed the export-D98209055 branch from 97af03a to b8a6d27 Compare March 26, 2026 17:36

meta-codesync bot force-pushed the export-D98209055 branch from b8a6d27 to 98c4d7b Compare March 26, 2026 17:41

divyanshk force-pushed the export-D98209055 branch from 98c4d7b to b99ecc5 Compare March 26, 2026 17:45

meta-codesync bot force-pushed the export-D98209055 branch from b99ecc5 to dbaabdd Compare March 27, 2026 01:21

meta-codesync bot force-pushed the export-D98209055 branch from dbaabdd to b053a38 Compare March 27, 2026 01:23

divyanshk force-pushed the export-D98209055 branch from b053a38 to 8ad34a8 Compare March 27, 2026 01:24

divyanshk force-pushed the export-D98209055 branch from 8ad34a8 to 6d907b5 Compare March 27, 2026 01:31

divyanshk wants to merge 1 commit into main from export-D98209055
