@@ -2482,6 +2482,8 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>Rotate using interleaved pattern. Default value is 0 (False).</dd>
<dt><tt>scale</tt> : float</dt>
<dd>Custom scale will be used if specified. Default value is 1/sqrt(head_size)</dd>
+ <dt><tt>smooth_softmax</tt> : int</dt>
+ <dd>Use a smooth factor in softmax.</dd>
</dl>
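
The attribute text for `scale` and `smooth_softmax` is terse, so a small sketch may help. The default scale follows directly from the description (1/sqrt(head_size)). The smooth variant is an assumption: the text only says a "smooth factor" is used, and the sketch below assumes this means adding 1 to the softmax denominator (equivalent to one extra logit fixed at 0). The function names are hypothetical and are not part of the operator.

```python
import math


def default_scale(head_size: int) -> float:
    """Documented default for `scale`: 1/sqrt(head_size)."""
    return 1.0 / math.sqrt(head_size)


def softmax(logits: list[float], smooth: bool = False) -> list[float]:
    """Numerically stable softmax over one row of attention logits.

    ASSUMPTION (not stated in the attribute description): with smooth=True
    the denominator is 1 + sum(exp(x_j)), i.e. an implicit extra logit of 0.
    """
    shift = max(logits)
    if smooth:
        # The implicit extra logit is 0, so it takes part in the max shift.
        shift = max(shift, 0.0)
    exps = [math.exp(x - shift) for x in logits]
    denom = sum(exps)
    if smooth:
        # exp(0 - shift): the extra "1" term after shifting by `shift`.
        denom += math.exp(-shift)
    return [e / denom for e in exps]


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0]
    print(default_scale(64))
    print(sum(softmax(row)))
    print(sum(softmax(row, smooth=True)))
```

Under this assumption the smooth outputs of a row sum to less than 1, so a row can assign low weight to every position.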

#### Inputs (7 - 9)
@@ -3022,6 +3024,8 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>Number of top experts to select from expert pool</dd>
<dt><tt>normalize_routing_weights</tt> : int</dt>
<dd>Whether to normalize routing weights</dd>
+ <dt><tt>use_sparse_mixer</tt> : int</dt>
+ <dd>Whether to use sparse mixer</dd>
</dl>
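
To illustrate how `k` and `normalize_routing_weights` could interact, here is a minimal sketch for a single row of routing probabilities. It assumes that normalizing means dividing the selected weights by their sum, which the attribute description does not spell out. `use_sparse_mixer` is not modeled because the text gives no detail on it. `route_top_k` is a hypothetical helper, not an operator API.

```python
def route_top_k(router_probs_row, k, normalize_routing_weights=False):
    """Select the k highest-probability experts for one row (one token).

    Returns (expert_indices, routing_weights).
    ASSUMPTION: normalization rescales the k selected weights to sum to 1.
    """
    if not 0 < k <= len(router_probs_row):
        raise ValueError("k must be between 1 and the number of experts")
    # Rank expert indices by probability, highest first, and keep the top k.
    ranked = sorted(
        range(len(router_probs_row)),
        key=lambda i: router_probs_row[i],
        reverse=True,
    )[:k]
    weights = [router_probs_row[i] for i in ranked]
    if normalize_routing_weights:
        total = sum(weights)
        weights = [w / total for w in weights]
    return ranked, weights


if __name__ == "__main__":
    row = [0.10, 0.50, 0.15, 0.25]
    print(route_top_k(row, k=2))
    print(route_top_k(row, k=2, normalize_routing_weights=True))
```

With `k=2` this row selects experts 1 and 3. Without normalization their weights stay at 0.5 and 0.25. With it they are rescaled so the pair sums to 1.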

#### Inputs (5 - 8)
@@ -4337,7 +4341,7 @@ This version of the operator has been available since version 1 of the 'com.micr

### <a name="com.microsoft.QMoE"></a><a name="com.microsoft.qmoe">**com.microsoft.QMoE**</a>

- Int4 MoE
+ Quantized MoE

#### Version

@@ -4348,10 +4352,14 @@ This version of the operator has been available since version 1 of the 'com.micr
<dl>
<dt><tt>activation_type</tt> : string</dt>
<dd>Activation function to use. Choose from relu, gelu, silu and identity. Default is relu</dd>
+ <dt><tt>expert_weight_bits</tt> : int</dt>
+ <dd>Number of bits used in quantized weights. Default is 4 bits</dd>
<dt><tt>k</tt> : int</dt>
<dd>Number of top experts to select from expert pool</dd>
<dt><tt>normalize_routing_weights</tt> : int</dt>
<dd>Whether to normalize routing weights</dd>
+ <dt><tt>use_sparse_mixer</tt> : int</dt>
+ <dd>Whether to use sparse mixer</dd>
</dl>
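
The new `expert_weight_bits` attribute appears to explain why the expert weight inputs of QMoE list two possible shapes. A plausible reading, assumed here and not stated in the text, is that 4-bit weights are stored two per 8-bit element (halving the last dimension) while 8-bit weights use one element each. The sketch below works through that arithmetic. The helper names and the low-nibble-first order are illustrative assumptions, and the layout the kernel actually expects is not described here.

```python
def packed_last_dim(logical_size: int, expert_weight_bits: int = 4) -> int:
    """Last-dimension length of a packed weight tensor.

    ASSUMPTION: weights are stored in 8-bit elements, so 4-bit values pack
    two per element and 8-bit values pack one per element.
    """
    if expert_weight_bits not in (4, 8):
        raise ValueError("this sketch only covers 4-bit and 8-bit weights")
    values_per_element = 8 // expert_weight_bits
    if logical_size % values_per_element:
        raise ValueError("logical_size must be divisible by values per element")
    return logical_size // values_per_element


def pack_int4(values: list[int]) -> list[int]:
    """Pack unsigned 4-bit values pairwise, low nibble first (assumed order)."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return [values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2)]


def unpack_int4(packed: list[int]) -> list[int]:
    """Inverse of pack_int4: split each byte into its low and high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


if __name__ == "__main__":
    inter_size = 8
    print(packed_last_dim(inter_size, 4))  # corresponds to inter_size / 2
    print(packed_last_dim(inter_size, 8))  # corresponds to inter_size
    print(unpack_int4(pack_int4([1, 15, 0, 7])))
```

Under these assumptions, `fc1_experts_weights` with `expert_weight_bits=4` would have shape (num_experts, hidden_size, inter_size / 2), and with 8 bits it would have shape (num_experts, hidden_size, inter_size).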

#### Inputs (7 - 11)
@@ -4362,19 +4370,19 @@ This version of the operator has been available since version 1 of the 'com.micr
<dt><tt>router_probs</tt> : T</dt>
<dd>2D input tensor with shape (num_rows, num_experts)</dd>
<dt><tt>fc1_experts_weights</tt> : T1</dt>
- <dd>3D input tensor with shape (num_experts, hidden_size, inter_size / 2)</dd>
+ <dd>3D input tensor with shape (num_experts, hidden_size, inter_size) or (num_experts, hidden_size, inter_size / 2)</dd>
<dt><tt>fc1_scales</tt> : T</dt>
<dd>2D input tensor with shape (num_experts, inter_size)</dd>
<dt><tt>fc1_experts_bias</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, inter_size)</dd>
<dt><tt>fc2_experts_weights</tt> : T1</dt>
- <dd>3D input tensor with shape (num_experts, inter_size, hidden_size / 2)</dd>
+ <dd>3D input tensor with shape (num_experts, inter_size, hidden_size) or (num_experts, inter_size, hidden_size / 2)</dd>
<dt><tt>fc2_scales</tt> : T</dt>
<dd>2D input tensor with shape (num_experts, hidden_size)</dd>
<dt><tt>fc2_experts_bias</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, hidden_size)</dd>
<dt><tt>fc3_experts_weights</tt> (optional) : T1</dt>
- <dd>3D optional input tensor with shape (num_experts, hidden_size, inter_size / 2)</dd>
+ <dd>3D optional input tensor with shape (num_experts, hidden_size, inter_size) or (num_experts, hidden_size, inter_size / 2)</dd>
<dt><tt>fc3_scales</tt> (optional) : T</dt>
<dd>2D optional input tensor with shape (num_experts, inter_size)</dd>
<dt><tt>fc3_experts_bias</tt> (optional) : T</dt>