@@ -54,13 +54,31 @@ spec:
      description: |-
        Flavors represents the accelerator requirements to serve the model.
        Flavors are fungible following the priority represented by the slice order.
+       This is used both in Playground and Inference Service.
      items:
        description: |-
          Flavor defines the accelerator requirements for a model and the necessary parameters
          in autoscaling. Right now, it will be used in two places:
          - Pod scheduling with node selectors specified.
          - Cluster autoscaling with essential parameters provided.
        properties:
+         limits:
+           additionalProperties:
+             anyOf:
+             - type: integer
+             - type: string
+             pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+             x-kubernetes-int-or-string: true
+           description: |-
+             Limits defines the required accelerators to serve the model for each replica,
+             like <nvidia.com/gpu: 8>. For multi-host cases, the limits here indicate
+             the resource requirements for each replica, usually equal to the TP size.
+             It is not recommended to set the CPU and memory usage here:
+             - if using Playground, you can define the cpu/mem usage at backendConfig.
+             - if using Inference Service, you can define the cpu/mem at the container resources.
+             However, if you define the same accelerator resources at playground/service as well,
+             they will be overwritten by the flavor limits here.
+           type: object
          name:
            description: Name represents the flavor name, which will
              be used in model claim.
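For context, the new `limits` field could be used like the fragment below (hypothetical values; the flavor names `a100` and `t4` and the GPU counts are illustrative assumptions, not taken from the diff):

```yaml
# Hypothetical fragment: two fungible flavors, tried in slice order.
flavors:
- name: a100              # preferred flavor
  limits:
    nvidia.com/gpu: 8     # accelerators per replica, usually the TP size
- name: t4                # fallback if the first flavor is unschedulable
  limits:
    nvidia.com/gpu: 16
```

Because the schema uses `x-kubernetes-int-or-string`, each value may be written either as a bare integer (`8`) or as a quoted quantity string matching the pattern above.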
@@ -83,23 +101,6 @@ spec:
          with <INSTANCE-TYPE: p4d.24xlarge> for AWS.
          Preset parameters: TP, PP, INSTANCE-TYPE.
        type: object
-       requests:
-         additionalProperties:
-           anyOf:
-           - type: integer
-           - type: string
-           pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
-           x-kubernetes-int-or-string: true
-         description: |-
-           Requests defines the required accelerators to serve the model for each replica,
-           like <nvidia.com/gpu: 8>. For multi-hosts cases, the requests here indicates
-           the resource requirements for each replica, usually equals to the TP size.
-           Not recommended to set the cpu and memory usage here:
-           - if using playground, you can define the cpu/mem usage at backendConfig.
-           - if using inference service, you can define the cpu/mem at the container resources.
-           However, if you define the same accelerator requests at playground/service as well,
-           the requests will be overwritten by the flavor requests.
-         type: object
      required:
      - name
      type: object
@@ -112,6 +113,8 @@ spec:
      description: |-
        SharedMemorySize represents the size of /dev/shm required in the runtime of
        inference workload.
+       This is only used in Playground. Inference Service can configure the shared memory
+       directly in PodSpec.
      pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
      x-kubernetes-int-or-string: true
      type: object
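A minimal sketch of how this field might be set in a Playground spec (the `2Gi` value is an assumption; any quantity matching the pattern above, such as `500Mi` or `1Gi`, would be accepted):

```yaml
# Hypothetical fragment: request a 2Gi /dev/shm for the inference runtime.
# The value must match the Kubernetes quantity pattern in the schema above.
sharedMemorySize: 2Gi
```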