@@ -47,22 +47,23 @@ spec:
                 properties:
                   args:
                     description: |-
-                      Args represents the specified arguments of the backendRuntime,
-                      will be append to the backendRuntime.spec.Args.
-                    properties:
-                      flags:
-                        description: |-
-                          Flags represents all the preset configurations.
-                          Flag around with {{ .CONFIG }} is a configuration waiting for render.
-                        items:
-                          type: string
-                        type: array
-                      name:
-                        default: default
-                        description: Name represents the identifier of the backendRuntime
-                          argument.
-                        type: string
-                    type: object
+                      Args represents all the arguments for the command.
+                      Argument around with {{ .CONFIG }} is a configuration waiting for render.
+                      Args defined here will "append" the args in the recommendedConfig.
+                    items:
+                      type: string
+                    type: array
+                  backendName:
+                    default: vllm
+                    description: BackendName represents the inference backend under
+                      the hood, e.g. vLLM.
+                    type: string
+                  configName:
+                    description: |-
+                      ConfigName represents the recommended configuration name for the backend,
+                      It will be inferred from the models in the runtime if not specified, e.g. default,
+                      speculative-decoding or model-parallelism.
+                    type: string
                   envs:
                     description: Envs represents the environments set to the container.
                     items:
@@ -183,16 +184,12 @@ spec:
                       - name
                       type: object
                     type: array
-                  name:
-                    default: vllm
-                    description: Name represents the inference backend under the hood,
-                      e.g. vLLM.
-                    type: string
                   resources:
                     description: |-
                       Resources represents the resource requirements for backend, like cpu/mem,
                       accelerators like GPU should not be defined here, but at the model flavors,
                       or the values here will be overwritten.
+                      Resources defined here will "overwrite" the resources in the recommendedConfig.
                     properties:
                       limits:
                         additionalProperties:
@@ -219,38 +216,11 @@ spec:
                           More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
                         type: object
                     type: object
-                  version:
-                    description: |-
-                      Version represents the backend version if you want a different one
-                      from the default version.
-                    type: string
-                type: object
-              elasticConfig:
-                description: |-
-                  ElasticConfig defines the configuration for elastic usage,
-                  e.g. the max/min replicas.
-                  Note: this requires to install the HPA first or will report error.
-                properties:
-                  maxReplicas:
-                    description: |-
-                      MaxReplicas indicates the maximum number of inference workloads based on the traffic.
-                      Default to nil means there's no limit for the instance number.
-                    format: int32
-                    type: integer
-                  minReplicas:
-                    default: 1
-                    description: |-
-                      MinReplicas indicates the minimum number of inference workloads based on the traffic.
-                      Default to 1.
-                      MinReplicas couldn't be 0 now, will support serverless in the future.
-                    format: int32
-                    type: integer
                   scaleTrigger:
                     description: |-
-                      ScaleTrigger defines a set of triggers to scale the workloads.
-                      If not defined, trigger configured in backendRuntime will be used,
-                      otherwise, trigger defined here will overwrite the defaulted ones.
-                      ScaleTriggerRef and ScaleTrigger can't be set at the same time.
+                      ScaleTrigger defines the rules to scale the workloads.
+                      Only one trigger could work at a time, mostly used in Playground.
+                      ScaleTrigger defined here will "overwrite" the scaleTrigger in the recommendedConfig.
                     properties:
                       hpa:
                         description: HPA represents the trigger configuration of the
@@ -859,19 +829,41 @@ spec:
                             type: array
                         type: object
                     type: object
-                  scaleTriggerRef:
+                  sharedMemorySize:
+                    anyOf:
+                    - type: integer
+                    - type: string
                     description: |-
-                      ScaleTriggerRef refers to the configured scaleTrigger in the backendRuntime
-                      with tuned target value.
-                      ScaleTriggerRef and ScaleTrigger can't be set at the same time.
-                    properties:
-                      name:
-                        description: Name represents the scale trigger name defined
-                          in the backendRuntime.scaleTriggers.
-                        type: string
-                    required:
-                    - name
-                    type: object
+                      SharedMemorySize represents the size of /dev/shm required in the runtime of
+                      inference workload.
+                      SharedMemorySize defined here will "overwrite" the sharedMemorySize in the recommendedConfig.
+                    pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                    x-kubernetes-int-or-string: true
+                  version:
+                    description: |-
+                      Version represents the backend version if you want a different one
+                      from the default version.
+                    type: string
+                type: object
+              elasticConfig:
+                description: |-
+                  ElasticConfig defines the configuration for elastic usage,
+                  e.g. the max/min replicas.
+                properties:
+                  maxReplicas:
+                    description: |-
+                      MaxReplicas indicates the maximum number of inference workloads based on the traffic.
+                      Default to nil means there's no limit for the instance number.
+                    format: int32
+                    type: integer
+                  minReplicas:
+                    default: 1
+                    description: |-
+                      MinReplicas indicates the minimum number of inference workloads based on the traffic.
+                      Default to 1.
+                      MinReplicas couldn't be 0 now, will support serverless in the future.
+                    format: int32
+                    type: integer
                 type: object
               modelClaim:
                 description: |-
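
For orientation, a spec fragment that would fit the schema after this change might look roughly like the sketch below. It is illustrative only: the parent key `backendRuntimeConfig` is an assumption (the hunks do not show the parent's name), and the argument and size values are made up for the example. Only the field names `args`, `backendName`, `configName`, `sharedMemorySize`, `version`, `minReplicas`, and `maxReplicas` come from the diff.

```yaml
# Hypothetical fragment of a resource spec; not taken from the PR.
# `backendRuntimeConfig` is an assumed parent key name.
backendRuntimeConfig:
  backendName: vllm          # schema default per the diff
  configName: default        # optional; inferred from the models if omitted
  args:                      # now a flat string array, appended to the recommendedConfig args
    - --max-model-len        # illustrative argument, chosen for the example
    - "4096"
  sharedMemorySize: 2Gi      # int-or-string quantity for /dev/shm; overwrites the recommendedConfig value
elasticConfig:               # now sits beside the backend config instead of nesting the triggers
  minReplicas: 1             # schema default; 0 is not supported yet
  maxReplicas: 3             # illustrative; leaving it unset means no upper limit
```

Note that `scaleTrigger` appears to move out of `elasticConfig` and into the backend config object in this diff, while `scaleTriggerRef` is removed; the trigger body is omitted here because the hunks do not show its full schema.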