Configure Scaling for Inference Services
Introduction
This document provides a step-by-step guide to configuring automatic scale-up and scale-down for inference services. With these settings, you can optimize resource usage, keep the service available under high load, and release resources when load is low.
About the Autoscaler
Knative Serving supports two autoscalers: Knative Pod Autoscaler (KPA) and Kubernetes' Horizontal Pod Autoscaler (HPA). By default, our services use the Knative Pod Autoscaler (KPA).
KPA is designed for serverless workloads and can quickly scale up based on concurrent requests or RPS (requests per second), and can scale services to zero replicas to save costs. HPA is more general and typically scales based on metrics like CPU or memory usage. This guide primarily focuses on configuring services via the Knative Pod Autoscaler (KPA).
Steps
Autoscaling Down Configuration
This section describes how to configure inference services to automatically scale down to zero replicas when there is no traffic, or to maintain a minimum number of replicas.
Enable/Disable Scale to Zero
You can configure whether to allow the inference service to scale down to zero replicas when there is no traffic. By default, this value is true, which allows scaling to zero.
Using InferenceService Resource Parameters
In the spec.predictor field of the InferenceService, set the minReplicas parameter.
- minReplicas: 0: Allows scaling down to zero replicas (see the sketch below).
- minReplicas: 1: Disables scaling down to zero replicas, keeping at least one replica.
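For example, a minimal sketch of a predictor that may scale to zero, assuming the KServe v1beta1 API; the service name and model details are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1   # assumed KServe API version
kind: InferenceService
metadata:
  name: sklearn-demo                    # placeholder name
spec:
  predictor:
    minReplicas: 0                      # allow scaling to zero; set to 1 to keep a warm replica
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/model   # placeholder model location
```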
Platform-wide Disable of Scale to Zero
You can modify the global ConfigMap to disable the platform's scale-to-zero feature. This configuration has the highest priority and overrides the settings in all individual InferenceService resources: once scale to zero is disabled platform-wide, the minReplicas: 0 setting on every service is ignored.
In the config-autoscaler ConfigMap in the knative-serving namespace, set the value of enable-scale-to-zero to "false".
- Please ensure this key exists, otherwise the configuration will revert to the default value.
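A sketch of the edited ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "false"   # disable scale to zero platform-wide
```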
Configure Pod Retention Period After Scaling to Zero
This setting determines the minimum time the last Pod remains active after the autoscaler decides to scale to zero. This helps the service respond quickly when it starts receiving traffic again. The default value is 0s.
You can choose to configure this for a single service or modify the global ConfigMap to make this setting effective for all services.
Method 1: Using InferenceService Annotations
In the spec.predictor.annotations of the InferenceService, add the autoscaling.knative.dev/scale-to-zero-pod-retention-period annotation.
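A sketch of the relevant predictor fragment; the value uses the same non-negative duration format as the global setting:

```yaml
spec:
  predictor:
    minReplicas: 0
    annotations:
      autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"   # keep the last Pod for at least 1m5s
```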
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of scale-to-zero-pod-retention-period to a non-negative duration string, such as "1m5s".
- Please ensure this key exists, otherwise the configuration will revert to the default value.
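Only the data entry differs from the ConfigMap sketch shown earlier:

```yaml
data:
  scale-to-zero-pod-retention-period: "1m5s"   # retain the last Pod for at least 1m5s
```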
Configure the Grace Period for Scaling to Zero
This setting adds a delay before removing the last replica after traffic stops, ensuring the activator/routing path is ready and preventing request loss during the transition to zero.
Adjust this value only if you encounter lost requests caused by services scaling to zero. It does not affect how long the last replica is retained after traffic stops, nor does it guarantee that the replica will be kept for this entire period.
Method: Using a Global ConfigMap
In the config-autoscaler ConfigMap in the knative-serving namespace, modify the value of scale-to-zero-grace-period to a duration string, such as "40s".
- Please ensure this key exists, otherwise the configuration will revert to the default value.
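Again, only the data entry changes:

```yaml
data:
  scale-to-zero-grace-period: "40s"   # upper bound on the delay before the last replica is removed
```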
Autoscaling Up Configuration
This section describes how to configure the inference service to automatically scale up in response to increased traffic.
Configure Concurrency Thresholds
Concurrency determines the number of requests that each application replica can handle simultaneously. You can set concurrency with a soft limit or a hard limit.
- Soft Limit: A target limit that can be temporarily exceeded during a traffic surge, but which triggers autoscaling to maintain the target value. The default value is 100.
- Hard Limit: A strict upper bound. When concurrency reaches this value, excess requests are buffered and queued for processing. The default value is 0, which means unlimited.
If both a soft and a hard limit are specified, the smaller of the two values is used; otherwise the autoscaler could be given a target that the hard limit does not permit.
You can choose to configure this for a single service or modify the global ConfigMap to make this setting effective for all services.
Method 1: Using InferenceService Resource Parameters
- Soft Limit: In spec.predictor, set scaleTarget and set scaleMetric to concurrency.
- Hard Limit: In spec.predictor, set containerConcurrency (see the sketch below).
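A sketch of a predictor that sets both limits; the numbers are illustrative. Since both are specified here, the autoscaler targets the smaller value (10):

```yaml
spec:
  predictor:
    scaleMetric: concurrency
    scaleTarget: 10            # soft limit: target ~10 concurrent requests per replica
    containerConcurrency: 20   # hard limit: requests beyond 20 per replica are queued
```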
Method 2: Using a Global ConfigMap
- Soft Limit: In the config-autoscaler ConfigMap, set container-concurrency-target-default (see the sketch below).
- Hard Limit: There is no global setting for the hard limit, as it affects request buffering and queuing.
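Only the data entry changes in the global ConfigMap:

```yaml
data:
  container-concurrency-target-default: "100"   # platform-wide default soft limit
```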
Target Utilization Percentage
This value specifies the percentage of the concurrency target that the autoscaler actually aims for, so that replicas are scaled up proactively before the limit is reached. The default value is 70. It does not apply when the scaling metric is RPS.
Method 1: Using InferenceService Annotations
In the spec.predictor.annotations of the InferenceService, add the autoscaling.knative.dev/target-utilization-percentage annotation.
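A sketch of the annotation on a predictor:

```yaml
spec:
  predictor:
    annotations:
      autoscaling.knative.dev/target-utilization-percentage: "70"   # scale up once 70% of the target is reached
```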
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap, set container-concurrency-target-percentage.
- Please ensure this key exists, otherwise the configuration will revert to the default value.
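The corresponding data entry in the global ConfigMap:

```yaml
data:
  container-concurrency-target-percentage: "70"
```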
Configure Requests Per Second (RPS) Target
You can change the scaling metric from concurrency to requests per second (RPS). The default RPS target is 200.
Note: In RPS mode, the concurrency target-percentage setting is not used.
Method 1: Using InferenceService Resource Parameters
In spec.predictor, set scaleTarget and set scaleMetric to rps.
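A sketch of a predictor scaling on RPS; the target value is illustrative:

```yaml
spec:
  predictor:
    scaleMetric: rps
    scaleTarget: 150   # target ~150 requests per second per replica
```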
Method 2: Using a Global ConfigMap
In the config-autoscaler ConfigMap, set requests-per-second-target-default.
- Please ensure this key exists, otherwise the configuration will revert to the default value.
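The corresponding data entry in the global ConfigMap:

```yaml
data:
  requests-per-second-target-default: "200"
```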