---
title: llmaz inference API
content_type: tool-reference
package: inference.llmaz.io/v1alpha1
auto_generated: true
description: Generated API reference documentation for inference.llmaz.io/v1alpha1.
---

## Resource Types

- [Playground](#inference-llmaz-io-v1alpha1-Playground)
- [Service](#inference-llmaz-io-v1alpha1-Service)

## Playground {#inference-llmaz-io-v1alpha1-Playground}

Playground is the Schema for the playgrounds API.

| Field | Type | Description |
| --- | --- | --- |
| `apiVersion` | `string` | `inference.llmaz.io/v1alpha1` |
| `kind` | `string` | `Playground` |
| `spec` [Required] | [`PlaygroundSpec`](#inference-llmaz-io-v1alpha1-PlaygroundSpec) | No description provided. |
| `status` [Required] | [`PlaygroundStatus`](#inference-llmaz-io-v1alpha1-PlaygroundStatus) | No description provided. |
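For orientation, a minimal Playground manifest might look like the sketch below; the model name is hypothetical and assumes the model has been registered separately (the ModelClaim type is defined in the core llmaz API):

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: llama3-8b          # hypothetical name
spec:
  replicas: 1
  modelClaim:
    modelName: llama3-8b   # assumes a model with this name is registered separately
```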

## Service {#inference-llmaz-io-v1alpha1-Service}

Service is the Schema for the services API.

| Field | Type | Description |
| --- | --- | --- |
| `apiVersion` | `string` | `inference.llmaz.io/v1alpha1` |
| `kind` | `string` | `Service` |
| `spec` [Required] | [`ServiceSpec`](#inference-llmaz-io-v1alpha1-ServiceSpec) | No description provided. |
| `status` [Required] | [`ServiceStatus`](#inference-llmaz-io-v1alpha1-ServiceStatus) | No description provided. |

## BackendName {#inference-llmaz-io-v1alpha1-BackendName}

(Alias of `string`)

Appears in:

- [BackendRuntimeConfig](#inference-llmaz-io-v1alpha1-BackendRuntimeConfig)

## BackendRuntime {#inference-llmaz-io-v1alpha1-BackendRuntime}

BackendRuntime is the Schema for the backendRuntime API.

| Field | Type | Description |
| --- | --- | --- |
| `spec` [Required] | [`BackendRuntimeSpec`](#inference-llmaz-io-v1alpha1-BackendRuntimeSpec) | No description provided. |
| `status` [Required] | [`BackendRuntimeStatus`](#inference-llmaz-io-v1alpha1-BackendRuntimeStatus) | No description provided. |

## BackendRuntimeConfig {#inference-llmaz-io-v1alpha1-BackendRuntimeConfig}

Appears in:

- [PlaygroundSpec](#inference-llmaz-io-v1alpha1-PlaygroundSpec)

| Field | Type | Description |
| --- | --- | --- |
| `backendName` | [`BackendName`](#inference-llmaz-io-v1alpha1-BackendName) | BackendName represents the inference backend under the hood, e.g. vLLM. |
| `version` | `string` | Version represents the backend version if you want a different one from the default version. |
| `envs` | `[]k8s.io/api/core/v1.EnvVar` | Envs represents the environment variables set on the container. |
| `configName` [Required] | `string` | ConfigName represents the recommended configuration name for the backend. It will be inferred from the models in the runtime if not specified, e.g. `default` or `speculative-decoding`. |
| `args` | `[]string` | Args defined here will be appended to the args defined in the recommendedConfig, whether explicitly configured via configName or inferred at runtime. |
| `resources` | [`ResourceRequirements`](#inference-llmaz-io-v1alpha1-ResourceRequirements) | Resources represents the resource requirements for the backend, such as CPU and memory. Accelerators like GPUs should not be defined here but in the model flavors, or the values here will be overwritten. Resources defined here overwrite the resources in the recommendedConfig. |
| `sharedMemorySize` | `k8s.io/apimachinery/pkg/api/resource.Quantity` | SharedMemorySize represents the size of /dev/shm required by the inference workload at runtime. The value defined here overwrites the sharedMemorySize in the recommendedConfig. |
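Within a Playground, a backendRuntimeConfig override might look like the following sketch; the version, environment variable, argument, and resource values are illustrative assumptions:

```yaml
spec:
  modelClaim:
    modelName: llama3-8b          # hypothetical model name
  backendRuntimeConfig:
    backendName: vllm
    version: v0.7.0               # assumption: override the default image tag
    envs:
      - name: VLLM_LOGGING_LEVEL  # assumption: any container env var
        value: INFO
    args:
      - --max-model-len=8192      # appended to the recommendedConfig args
    resources:                    # overwrites the recommendedConfig resources
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "4"
        memory: 16Gi
    sharedMemorySize: 2Gi         # overwrites the recommendedConfig sharedMemorySize
```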

## BackendRuntimeSpec {#inference-llmaz-io-v1alpha1-BackendRuntimeSpec}

Appears in:

- [BackendRuntime](#inference-llmaz-io-v1alpha1-BackendRuntime)

BackendRuntimeSpec defines the desired state of BackendRuntime.

| Field | Type | Description |
| --- | --- | --- |
| `command` | `[]string` | Command represents the default command for the backendRuntime. |
| `image` [Required] | `string` | Image represents the default image registry of the backendRuntime. It works together with version to make up the full image reference. |
| `version` [Required] | `string` | Version represents the default version of the backendRuntime. It will be appended to the image as a tag. |
| `envs` | `[]k8s.io/api/core/v1.EnvVar` | Envs represents the environment variables set on the container. |
| `lifecycle` | `k8s.io/api/core/v1.Lifecycle` | Lifecycle represents hooks executed during the lifecycle of the container. |
| `livenessProbe` | `k8s.io/api/core/v1.Probe` | Periodic probe of backend liveness. The backend will be restarted if the probe fails. Cannot be updated. |
| `readinessProbe` | `k8s.io/api/core/v1.Probe` | Periodic probe of backend readiness. The backend will be removed from service endpoints if the probe fails. |
| `startupProbe` | `k8s.io/api/core/v1.Probe` | StartupProbe indicates that the backend has successfully initialized. If specified, no other probes are executed until this completes successfully. If this probe fails, the backend will be restarted, just as if the livenessProbe failed. This can be used to provide different probe parameters at the beginning of a backend's lifecycle, when it might take a long time to load data or warm a cache, than during steady-state operation. |
| `recommendedConfigs` | [`[]RecommendedConfig`](#inference-llmaz-io-v1alpha1-RecommendedConfig) | RecommendedConfigs represents the recommended configurations for the backendRuntime. |
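A BackendRuntime manifest composed from these fields might look like the sketch below; the image, tag, health endpoint, and the `{{ .ModelPath }}` placeholder are assumptions for illustration:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  name: vllm
spec:
  command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
  image: vllm/vllm-openai      # final image becomes vllm/vllm-openai:v0.7.0
  version: v0.7.0              # assumption: default tag appended to the image
  startupProbe:
    httpGet:
      path: /health            # assumption: the backend's health endpoint
      port: 8080
    failureThreshold: 30       # allow slow model loading before liveness kicks in
    periodSeconds: 10
  recommendedConfigs:
    - name: default
      args:
        - --model
        - "{{ .ModelPath }}"   # hypothetical placeholder rendered at runtime
        - --port
        - "8080"
```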

## BackendRuntimeStatus {#inference-llmaz-io-v1alpha1-BackendRuntimeStatus}

Appears in:

- [BackendRuntime](#inference-llmaz-io-v1alpha1-BackendRuntime)

BackendRuntimeStatus defines the observed state of BackendRuntime.

| Field | Type | Description |
| --- | --- | --- |
| `conditions` [Required] | `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition` | Conditions represents the Inference condition. |

## ElasticConfig {#inference-llmaz-io-v1alpha1-ElasticConfig}

Appears in:

- [PlaygroundSpec](#inference-llmaz-io-v1alpha1-PlaygroundSpec)

| Field | Type | Description |
| --- | --- | --- |
| `minReplicas` | `int32` | MinReplicas indicates the minimum number of inference workloads based on the traffic. Defaults to 1. MinReplicas cannot be 0 for now; serverless will be supported in the future. |
| `maxReplicas` | `int32` | MaxReplicas indicates the maximum number of inference workloads based on the traffic. Defaults to nil, which means there's no limit on the instance number. |
| `scaleTrigger` | [`ScaleTrigger`](#inference-llmaz-io-v1alpha1-ScaleTrigger) | ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time, mostly used in Playground. The scaleTrigger defined here overwrites the scaleTrigger in the recommendedConfig. |

## HPATrigger {#inference-llmaz-io-v1alpha1-HPATrigger}

Appears in:

- [ScaleTrigger](#inference-llmaz-io-v1alpha1-ScaleTrigger)

HPATrigger represents the configuration of the HorizontalPodAutoscaler. Inspired by kubernetes.io/pkg/apis/autoscaling/types.go#HorizontalPodAutoscalerSpec. Note: the HPA component must be installed beforehand.

| Field | Type | Description |
| --- | --- | --- |
| `metrics` | `[]k8s.io/api/autoscaling/v2.MetricSpec` | Metrics contains the specifications used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated by multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice versa. See the individual metric source types for more information about how each type of metric must respond. |
| `behavior` | `k8s.io/api/autoscaling/v2.HorizontalPodAutoscalerBehavior` | Behavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively). If not set, the default HPAScalingRules for scale up and scale down are used. |
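Putting ElasticConfig and HPATrigger together, an autoscaling setup inside a Playground spec might look like this sketch; the CPU-utilization target is an illustrative assumption, while the metric structure follows the standard autoscaling/v2 API:

```yaml
spec:
  elasticConfig:
    minReplicas: 1        # cannot be 0 yet
    maxReplicas: 3
    scaleTrigger:
      hpa:
        metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 80   # assumption: scale out above 80% mean CPU usage
```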

## PlaygroundSpec {#inference-llmaz-io-v1alpha1-PlaygroundSpec}

Appears in:

- [Playground](#inference-llmaz-io-v1alpha1-Playground)

PlaygroundSpec defines the desired state of Playground.

| Field | Type | Description |
| --- | --- | --- |
| `replicas` | `int32` | Replicas represents the replica number of inference workloads. |
| `modelClaim` | `ModelClaim` | ModelClaim represents a claim for one model; it's a simplified use case of modelClaims. Most of the time, modelClaim is enough. ModelClaim and modelClaims are mutually exclusive. |
| `modelClaims` | `ModelClaims` | ModelClaims represents claims for multiple models, for more complicated use cases like speculative decoding. ModelClaims and modelClaim are mutually exclusive. |
| `backendRuntimeConfig` | [`BackendRuntimeConfig`](#inference-llmaz-io-v1alpha1-BackendRuntimeConfig) | BackendRuntimeConfig represents the configuration of the inference backendRuntime under the hood, e.g. vLLM, which is the default backendRuntime. |
| `elasticConfig` [Required] | [`ElasticConfig`](#inference-llmaz-io-v1alpha1-ElasticConfig) | ElasticConfig defines the configuration for elastic usage, e.g. the max/min replicas. |
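For the multi-model case, a speculative-decoding claim might look like the sketch below. The ModelClaims type lives in the core llmaz API, so the structure shown (a models list with main/draft roles) and the model names are assumptions for illustration:

```yaml
spec:
  modelClaims:
    models:
      - name: llama3-8b   # hypothetical target model
        role: main
      - name: llama3-1b   # hypothetical draft model used for speculation
        role: draft
```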

## PlaygroundStatus {#inference-llmaz-io-v1alpha1-PlaygroundStatus}

Appears in:

- [Playground](#inference-llmaz-io-v1alpha1-Playground)

PlaygroundStatus defines the observed state of Playground.

| Field | Type | Description |
| --- | --- | --- |
| `conditions` [Required] | `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition` | Conditions represents the Inference condition. |
| `replicas` [Required] | `int32` | Replicas tracks the replicas that have been created, whether ready or not. |
| `selector` [Required] | `string` | Selector points to the string form of a label selector which will be used by the HPA. |

## RecommendedConfig {#inference-llmaz-io-v1alpha1-RecommendedConfig}

Appears in:

- [BackendRuntimeSpec](#inference-llmaz-io-v1alpha1-BackendRuntimeSpec)

RecommendedConfig represents the recommended configurations for the backendRuntime; users can choose one of them to apply.

| Field | Type | Description |
| --- | --- | --- |
| `name` [Required] | `string` | Name represents the identifier of the config. |
| `args` | `[]string` | Args represents all the arguments for the command. An argument wrapped with `{{ .CONFIG }}` is a configuration waiting to be rendered. |
| `resources` | [`ResourceRequirements`](#inference-llmaz-io-v1alpha1-ResourceRequirements) | Resources represents the resource requirements for the backend, such as CPU and memory. Accelerators like GPUs should not be defined here but in the model flavors, or the values here will be overwritten. |
| `sharedMemorySize` | `k8s.io/apimachinery/pkg/api/resource.Quantity` | SharedMemorySize represents the size of /dev/shm required by the inference workload at runtime. |
| `scaleTrigger` | [`ScaleTrigger`](#inference-llmaz-io-v1alpha1-ScaleTrigger) | ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time. |
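Tying these fields together, a recommendedConfigs list with templated arguments might look like this sketch; the config names mirror the `default`/`speculative-decoding` examples above, while the placeholder names, backend flags, and resource values are assumptions:

```yaml
recommendedConfigs:
  - name: default
    args:
      - --model
      - "{{ .ModelPath }}"       # hypothetical placeholder, rendered before the pod starts
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
    sharedMemorySize: 2Gi
  - name: speculative-decoding
    args:
      - --model
      - "{{ .ModelPath }}"
      - --speculative-model
      - "{{ .DraftModelPath }}"  # hypothetical placeholder for the draft model
```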

## ResourceRequirements {#inference-llmaz-io-v1alpha1-ResourceRequirements}

Appears in:

- [BackendRuntimeConfig](#inference-llmaz-io-v1alpha1-BackendRuntimeConfig)
- [RecommendedConfig](#inference-llmaz-io-v1alpha1-RecommendedConfig)

TODO: DRA is not supported yet; we can support it once needed.

| Field | Type | Description |
| --- | --- | --- |
| `limits` | `k8s.io/api/core/v1.ResourceList` | Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
| `requests` | `k8s.io/api/core/v1.ResourceList` | Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |

## ScaleTrigger {#inference-llmaz-io-v1alpha1-ScaleTrigger}

Appears in:

- [ElasticConfig](#inference-llmaz-io-v1alpha1-ElasticConfig)
- [RecommendedConfig](#inference-llmaz-io-v1alpha1-RecommendedConfig)

ScaleTrigger defines the rules to scale the workloads. Only one trigger can work at a time, mostly used in Playground.

| Field | Type | Description |
| --- | --- | --- |
| `hpa` [Required] | [`HPATrigger`](#inference-llmaz-io-v1alpha1-HPATrigger) | HPA represents the trigger configuration of the HorizontalPodAutoscaler. |

## ServiceSpec {#inference-llmaz-io-v1alpha1-ServiceSpec}

Appears in:

- [Service](#inference-llmaz-io-v1alpha1-Service)

ServiceSpec defines the desired state of Service. The Service controller maintains multiple flavors of workloads with different accelerators for cost or performance considerations.

| Field | Type | Description |
| --- | --- | --- |
| `modelClaims` [Required] | `ModelClaims` | ModelClaims represents multiple claims for different models. |
| `replicas` | `int32` | Replicas represents the replica number of inference workloads. |
| `workloadTemplate` [Required] | `sigs.k8s.io/lws/api/leaderworkerset/v1.LeaderWorkerTemplate` | WorkloadTemplate defines the template for the leader/worker pods. |
| `rolloutStrategy` | `sigs.k8s.io/lws/api/leaderworkerset/v1.RolloutStrategy` | RolloutStrategy defines the strategy that will be applied to update replicas when a revision is made to the leaderWorkerTemplate. |
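A Service manifest wiring these fields together might look like the following sketch; the model claim, container name, image, and LeaderWorkerTemplate values are illustrative assumptions:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Service
metadata:
  name: llama3-8b-service        # hypothetical name
spec:
  replicas: 1
  modelClaims:
    models:
      - name: llama3-8b          # hypothetical registered model
        role: main
  workloadTemplate:              # LWS LeaderWorkerTemplate
    size: 1                      # pods per replica group
    workerTemplate:
      spec:
        containers:
          - name: model-runner   # assumption: container name
            image: vllm/vllm-openai:v0.7.0   # assumption: serving image
            ports:
              - containerPort: 8080
```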

## ServiceStatus {#inference-llmaz-io-v1alpha1-ServiceStatus}

Appears in:

- [Service](#inference-llmaz-io-v1alpha1-Service)

ServiceStatus defines the observed state of Service.

| Field | Type | Description |
| --- | --- | --- |
| `conditions` [Required] | `[]k8s.io/apimachinery/pkg/apis/meta/v1.Condition` | Conditions represents the Inference condition. |
| `replicas` [Required] | `int32` | Replicas tracks the replicas that have been created, whether ready or not. |
| `selector` [Required] | `string` | Selector points to the string form of a label selector that enables the HPA to autoscale the resource. |