Expose occupancy limiting factors (#1330)

Open
divyanshk wants to merge 1 commit into main from export-D98209055
@divyanshk divyanshk commented Mar 25, 2026

Summary:

What's different in this diff? Nothing. It sits on top of MTIA's changes that add the required symbols
to their cuda_occupancy.h shim: D97952003, D98009788.
The MTIA test failures occurred because these symbols did not exist earlier.

cudaOccResult from cuda_occupancy.h carries information that should help interpret Kineto's occupancy estimates.

struct cudaOccResult {
    int activeBlocksPerMultiprocessor; // Occupancy
    unsigned int limitingFactors;      // Factors that limited occupancy. A bit
                                       // field that counts the limiting
                                       // factors, see cudaOccLimitingFactor
    int blockLimitRegs;                // Occupancy due to register
                                       // usage, INT_MAX if the kernel does not
                                       // use any register.
    int blockLimitSharedMem;           // Occupancy due to shared memory
                                       // usage, INT_MAX if the kernel does not
                                       // use shared memory.
    int blockLimitWarps;               // Occupancy due to block size limit
    int blockLimitBlocks;              // Occupancy due to maximum number of blocks
                                       // managable per SM
    int blockLimitBarriers;            // Occupancy due to block barrier usage
    int allocatedRegistersPerBlock;    // Actual number of registers allocated per
                                       // block
    size_t allocatedSharedMemPerBlock; // Actual size of shared memory allocated
                                       // per block
    cudaOccPartitionedGCConfig partitionedGCConfig;
                                       // Report if partitioned global caching
                                       // is actually enabled.
};
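
The comments above note that blockLimitRegs and blockLimitSharedMem are INT_MAX when the kernel does not use that resource, so printing the raw value can be misleading. A minimal sketch of a formatting helper follows. The name `formatBlockLimit` and the "unlimited" label are hypothetical choices for illustration, not part of cuda_occupancy.h or Kineto.

```cpp
#include <climits>
#include <string>

// Hypothetical helper (not part of cuda_occupancy.h or Kineto).
// INT_MAX is the sentinel the struct comments describe for "resource not
// used", so it is reported as "unlimited" instead of a huge block count.
std::string formatBlockLimit(int blockLimit) {
  return blockLimit == INT_MAX ? "unlimited" : std::to_string(blockLimit);
}
```

With this sketch, a kernel that uses no shared memory would show "unlimited" for its shared-memory limit rather than 2147483647.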

The limitingFactors field is a bitmask indicating which resource(s) constrained occupancy:

enum cudaOccLimitingFactor {
  OCC_LIMIT_WARPS    = 0x01,  // Block size (threads)
  OCC_LIMIT_REGS     = 0x02,  // Register usage
  OCC_LIMIT_SMEM     = 0x04,  // Shared memory
  OCC_LIMIT_BLOCKS   = 0x08,  // Max blocks per SM
  OCC_LIMIT_BARRIERS = 0x10   // Barrier usage
};
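
As a rough illustration of reading the bitmask, the sketch below tests each bit and collects a readable name for every factor that is set. The enum is copied locally so the snippet compiles without the CUDA toolkit. The helper name `limitingFactorNames` and the label strings are assumptions made for this example, not what the diff emits.

```cpp
#include <string>
#include <vector>

// Local copy of the enum quoted above, so the sketch builds without CUDA.
enum cudaOccLimitingFactor {
  OCC_LIMIT_WARPS    = 0x01,
  OCC_LIMIT_REGS     = 0x02,
  OCC_LIMIT_SMEM     = 0x04,
  OCC_LIMIT_BLOCKS   = 0x08,
  OCC_LIMIT_BARRIERS = 0x10
};

// Hypothetical helper: returns one label per bit set in limitingFactors,
// in ascending bit order. The labels are illustrative only.
std::vector<std::string> limitingFactorNames(unsigned int limitingFactors) {
  struct Entry {
    unsigned int bit;
    const char* name;
  };
  static const Entry kEntries[] = {
      {OCC_LIMIT_WARPS, "warps"},
      {OCC_LIMIT_REGS, "registers"},
      {OCC_LIMIT_SMEM, "shared_memory"},
      {OCC_LIMIT_BLOCKS, "blocks"},
      {OCC_LIMIT_BARRIERS, "barriers"},
  };
  std::vector<std::string> names;
  for (const Entry& entry : kEntries) {
    if (limitingFactors & entry.bit) {
      names.emplace_back(entry.name);
    }
  }
  return names;
}
```

Because the field is a bitmask, more than one label can come back when several resources bind at the same block count.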

Each blockLimitXXX field gives the maximum number of blocks per SM that would fit if XXX were the only constraint.

activeBlocksPerMultiprocessor is the minimum of all the blockLimitXXX values. For example, consider a kernel with:

limitingFactors: 2 (binary 0010 = registers)
blockLimitWarps: 4
blockLimitRegs: 2
activeBlocksPerMultiprocessor: 2

This means occupancy is register-limited (2 blocks instead of a potential 4).
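
The relationship between the per-resource limits and the final occupancy can be sketched as below, assuming (as stated above) that activeBlocksPerMultiprocessor is the minimum of the blockLimitXXX values. `BlockLimits`, `OccupancySummary`, and `summarize` are hypothetical names for this example. The struct only mirrors the relevant fields of cudaOccResult and defaults each limit to INT_MAX, meaning "no limit from this resource".

```cpp
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

// Hypothetical stand-in for the blockLimitXXX fields of cudaOccResult.
struct BlockLimits {
  int warps = INT_MAX;
  int regs = INT_MAX;
  int sharedMem = INT_MAX;
  int blocks = INT_MAX;
  int barriers = INT_MAX;
};

struct OccupancySummary {
  int activeBlocks;                        // minimum over all limits
  std::vector<std::string> bindingLimits;  // limits equal to that minimum
};

// Takes the minimum of all limits and records which ones attain it.
OccupancySummary summarize(const BlockLimits& l) {
  const int active =
      std::min({l.warps, l.regs, l.sharedMem, l.blocks, l.barriers});
  OccupancySummary summary{active, {}};
  if (l.warps == active) summary.bindingLimits.push_back("warps");
  if (l.regs == active) summary.bindingLimits.push_back("registers");
  if (l.sharedMem == active) summary.bindingLimits.push_back("shared_memory");
  if (l.blocks == active) summary.bindingLimits.push_back("blocks");
  if (l.barriers == active) summary.bindingLimits.push_back("barriers");
  return summary;
}
```

With warps = 4 and regs = 2 (all other limits left at INT_MAX), the sketch reports 2 active blocks with registers as the only binding limit, which matches the register-limited reading of the example.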

Note: this runs during post-processing, not on the profiling hot path, so it should add no profiling overhead. We already populate occ_result; its other fields simply were not being used.

Reviewed By: fenypatel99, scotts

Differential Revision: D98209055

@meta-cla meta-cla bot added the cla signed label Mar 25, 2026
meta-codesync bot commented Mar 25, 2026

@divyanshk has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98209055.

@meta-codesync meta-codesync bot changed the title from "Expose occupancy limiting factors" to "Expose occupancy limiting factors (#1330)" Mar 26, 2026
meta-codesync bot pushed a commit that referenced this pull request Mar 26, 2026
meta-codesync bot pushed a commit that referenced this pull request Mar 26, 2026
divyanshk added a commit that referenced this pull request Mar 26, 2026
meta-codesync bot pushed a commit that referenced this pull request Mar 26, 2026
divyanshk added a commit that referenced this pull request Mar 26, 2026
meta-codesync bot pushed a commit that referenced this pull request Mar 27, 2026
meta-codesync bot pushed a commit that referenced this pull request Mar 27, 2026
divyanshk added a commit that referenced this pull request Mar 27, 2026