Description
Feature Description
TensorFusion has a built-in lightweight GPU scheduler and requires no Kubernetes device-plugin. However, this means Kubernetes built-in LimitRanges and ResourceQuotas won't take effect on vGPU resources.
To limit namespace-level total vGPU capacity and enforce min/max constraints on tflops/vram/gpu-replicas per workload, we need a similar mechanism to what LimitRanges and ResourceQuotas provide for standard Kubernetes resources.
Motivation
- Multi-tenancy support: Enable fair resource sharing across teams and namespaces
- Cost control: Prevent runaway GPU usage and associated costs
- Resource planning: Ensure predictable resource allocation and prevent resource starvation
- Compliance: Meet organizational policies for resource governance
- Performance isolation: Prevent single workloads from monopolizing cluster resources
Design Proposal
Resource quotas should be simpler than their Kubernetes native counterparts and support only namespace-level limits. The quota CR's name must match the namespace it limits.
GPUResourceQuota CRD
apiVersion: tensor-fusion.ai/v1
kind: GPUResourceQuota
metadata:
  name: name-of-namespace
spec:
  # Total namespace limits (similar to ResourceQuotas)
  total:
    requests.tflops: "1000"
    requests.vram: "20Gi"
    limits.tflops: "1000"
    limits.vram: "20Gi"
    workers: "100"
    alertThresholdPercent: 95
  # Per-workload limits (similar to LimitRanges)
  single:
    max:
      tflops: "200"
      vram: "4Gi"
      workers: "10"
    min:
      tflops: "10m"
      vram: "256Mi"
      workers: "1"
    default:
      tflops: "100m"
      vram: "1Gi"
    defaultRequest:
      tflops: "50m"
      vram: "512Mi"
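The per-workload rules in `single` could be applied roughly as follows: fill in missing limits from `default`, then reject any value outside the `min`/`max` range. This is a minimal Python sketch, not TensorFusion's implementation. It assumes Kubernetes-style quantity suffixes (`m`, `Mi`, `Gi`), and `parse_quantity` and `apply_single_limits` are hypothetical names. By analogy with LimitRanges, `defaultRequest` would presumably be applied to requests in the same way.

```python
from fractions import Fraction

# Assumed suffix table: a small subset of Kubernetes quantity notation.
_SUFFIXES = {"m": Fraction(1, 1000), "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text):
    """Convert a quantity string such as "10m" or "4Gi" to an exact number."""
    # Try two-letter suffixes before the one-letter "m".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(text)


def apply_single_limits(single, limits):
    """Return limits with defaults filled in; raise ValueError on a violation."""
    # Values given by the workload override the quota's defaults.
    resolved = {**single.get("default", {}), **limits}
    for name, value in resolved.items():
        amount = parse_quantity(value)
        low = single.get("min", {}).get(name)
        high = single.get("max", {}).get(name)
        if low is not None and amount < parse_quantity(low):
            raise ValueError(f"{name}={value} is below the minimum {low}")
        if high is not None and amount > parse_quantity(high):
            raise ValueError(f"{name}={value} is above the maximum {high}")
    return resolved


# Hypothetical usage with the "single" section of the CRD example.
single = {
    "max": {"tflops": "200", "vram": "4Gi", "workers": "10"},
    "min": {"tflops": "10m", "vram": "256Mi", "workers": "1"},
    "default": {"tflops": "100m", "vram": "1Gi"},
}
print(apply_single_limits(single, {"tflops": "150"}))
# prints {'tflops': '150', 'vram': '1Gi'}
```

Exact fractions are used instead of floats so that milli-unit values such as `10m` compare without rounding error.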
GPUResourceQuotaStatus
status:
  used:
    requests.tflops: "450"
    requests.vram: "8Gi"
    limits.tflops: "600"
    limits.vram: "12Gi"
    workers: "25"
  availablePercent:
    requests.tflops: 30
    requests.vram: 10
    limits.tflops: 10
    limits.vram: 20
    workers: 20
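One plausible reading of `availablePercent` is the share of each namespace total that is not yet used, rounded down to an integer. The Python sketch below works under that assumption; the proposal does not define the formula, and the sample status values above look illustrative rather than derived from it. The function name `available_percent` is hypothetical, and quantities are assumed to be already converted to plain numbers in a common unit.

```python
def available_percent(total, used):
    """Return, per resource, the integer percentage of the total still unused."""
    result = {}
    for name, limit in total.items():
        if limit <= 0:
            # Nothing can be available when the total is zero.
            result[name] = 0
            continue
        remaining = max(0, limit - used.get(name, 0))
        result[name] = remaining * 100 // limit
    return result


# Hypothetical usage: tflops as plain numbers, vram in GiB, workers as a count.
total = {"requests.tflops": 1000, "requests.vram": 20, "workers": 100}
used = {"requests.tflops": 450, "requests.vram": 8, "workers": 25}
print(available_percent(total, used))
# prints {'requests.tflops': 55, 'requests.vram': 60, 'workers': 75}
```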
Acceptance Criteria
Core Functionality
- GPUResourceQuota CRD: Define and implement GPUResourceQuota custom resource
- Admission Controller: Enhance current mutating webhook to enforce quotas on workload creation/updates
- Resource Tracking: Track current usage across all workloads in the namespace. When availablePercent for any resource drops below (100 - alertThresholdPercent), emit an event so the cluster admin can review the namespace's resource usage.
- Quota Enforcement: Block workload creation when quotas would be exceeded
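The tracking and enforcement criteria above can be sketched as two small checks: one that lists resources a new workload would push past the namespace total, and one that lists resources whose available share has fallen below `100 - alertThresholdPercent`. This Python sketch is only an illustration; the function names are hypothetical, quantities are assumed to be plain numbers in a common unit, and a request that lands exactly on the total is assumed to be allowed.

```python
def exceeded_resources(total, used, requested):
    """List resources whose namespace total this workload would exceed."""
    return [
        name
        for name, limit in total.items()
        if used.get(name, 0) + requested.get(name, 0) > limit
    ]


def resources_to_alert(total, used, alert_threshold_percent):
    """List resources whose available share is below 100 - threshold percent."""
    floor = 100 - alert_threshold_percent
    return [
        name
        for name, limit in total.items()
        # Integer comparison of available/limit < floor/100, without division.
        if limit > 0 and (limit - used.get(name, 0)) * 100 < floor * limit
    ]


# Hypothetical usage: 960 of 1000 tflops are already requested in the namespace.
total = {"requests.tflops": 1000, "workers": 100}
used = {"requests.tflops": 960, "workers": 25}

print(exceeded_resources(total, used, {"requests.tflops": 50, "workers": 1}))
# prints ['requests.tflops'], so the webhook would block this workload
print(resources_to_alert(total, used, 95))
# prints ['requests.tflops'], since only 4% is available (below 5%)
```

In an admission webhook, a non-empty result from the first check would map to a denied request, while the second would map to an event on the quota object.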