Scheduling

1: Scheduling Policies
2: Scheduler Configuration

1 - Scheduling Policies

A scheduling Policy can be used to specify the predicates and priorities that the kube-scheduler runs to filter and score nodes, respectively.

You can set a scheduling policy by running kube-scheduler --policy-config-file <filename> or kube-scheduler --policy-configmap <ConfigMap> and using the Policy type.

Predicates

The following predicates implement filtering:

PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
PodFitsResources: Checks if the Node has free resources (eg, CPU and Memory) to meet the requirement of the Pod.
MatchNodeSelector: Checks if a Pod's Node Selector matches the Node's label(s).
NoVolumeZoneConflict: Evaluate if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that's over a configured limit.
PodToleratesNodeTaints: checks if a Pod's tolerations can tolerate the Node's taints.
CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.

Priorities

The following priorities implement scoring:

SelectorSpreadPriority: Spreads Pods across hosts, considering Pods that belong to the same Service, StatefulSet or ReplicaSet.
InterPodAffinityPriority: Implements preferred inter pod affininity and antiaffinity.
LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
BalancedResourceAllocation: Favors nodes with balanced resource usage.
NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn't run on the same Node.
NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node's rank taking that list into account.
ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don't have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
EqualPriority: Gives an equal weight of one to all nodes.
EvenPodsSpreadPriority: Implements preferred pod topology spread constraints.

What's next

Learn about scheduling
Learn about kube-scheduler Configuration
Read the kube-scheduler configuration reference (v1beta2)
Read the kube-scheduler Policy reference (v1)

2 - Scheduler Configuration

FEATURE STATE: Kubernetes v1.19 [beta]

You can customize the behavior of the kube-scheduler by writing a configuration file and passing its path as a command line argument.

A scheduling Profile allows you to configure the different stages of scheduling in the kube-scheduler. Each stage is exposed in an extension point. Plugins provide scheduling behaviors by implementing one or more of these extension points.

You can specify scheduling profiles by running kube-scheduler --config <filename>, using the KubeSchedulerConfiguration (v1beta1 or v1beta2) struct.

A minimal configuration looks as follows:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/srv/kubernetes/kube-scheduler/kubeconfig

Profiles

You can configure a single instance of kube-scheduler to run multiple profiles.

Extension points

Scheduling happens in a series of stages that are exposed through the following extension points:

queueSort: These plugins provide an ordering function that is used to sort pending Pods in the scheduling queue. Exactly one queue sort plugin may be enabled at a time.
preFilter: These plugins are used to pre-process or check information about a Pod or the cluster before filtering. They can mark a pod as unschedulable.
filter: These plugins are the equivalent of Predicates in a scheduling Policy and are used to filter out nodes that can not run the Pod. Filters are called in the configured order. A pod is marked as unschedulable if no nodes pass all the filters.
postFilter: These plugins are called in their configured order when no feasible nodes were found for the pod. If any postFilter plugin marks the Pod schedulable, the remaining plugins are not called.
preScore: This is an informational extension point that can be used for doing pre-scoring work.
score: These plugins provide a score to each node that has passed the filtering phase. The scheduler will then select the node with the highest weighted scores sum.
reserve: This is an informational extension point that notifies plugins when resources have been reserved for a given Pod. Plugins also implement an Unreserve call that gets called in the case of failure during or after Reserve.
permit: These plugins can prevent or delay the binding of a Pod.
preBind: These plugins perform any work required before a Pod is bound.
bind: The plugins bind a Pod to a Node. bind plugins are called in order and once one has done the binding, the remaining plugins are skipped. At least one bind plugin is required.
postBind: This is an informational extension point that is called after a Pod has been bound.

For each extension point, you could disable specific default plugins or enable your own. For example:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - plugins:
      score:
        disabled:
        - name: PodTopologySpread
        enabled:
        - name: MyCustomPluginA
          weight: 2
        - name: MyCustomPluginB
          weight: 1

You can use * as name in the disabled array to disable all default plugins for that extension point. This can also be used to rearrange plugins order, if desired.

Scheduling plugins

The following plugins, enabled by default, implement one or more of these extension points:

ImageLocality: Favors nodes that already have the container images that the Pod runs. Extension points: score.
TaintToleration: Implements taints and tolerations. Implements extension points: filter, preScore, score.
NodeName: Checks if a Pod spec node name matches the current node. Extension points: filter.
NodePorts: Checks if a node has free ports for the requested Pod ports. Extension points: preFilter, filter.
NodeAffinity: Implements node selectors and node affinity. Extension points: filter, score.
PodTopologySpread: Implements Pod topology spread. Extension points: preFilter, filter, preScore, score.
NodeUnschedulable: Filters out nodes that have .spec.unschedulable set to true. Extension points: filter.
NodeResourcesFit: Checks if the node has all the resources that the Pod is requesting. The score can use one of three strategies: LeastAllocated (default), MostAllocated and RequestedToCapacityRatio. Extension points: preFilter, filter, score.
NodeResourcesBalancedAllocation: Favors nodes that would obtain a more balanced resource usage if the Pod is scheduled there. Extension points: score.
VolumeBinding: Checks if the node has or if it can bind the requested volumes. Extension points: preFilter, filter, reserve, preBind, score.
Note: score extension point is enabled when VolumeCapacityPriority feature is enabled. It prioritizes the smallest PVs that can fit the requested volume size.
VolumeRestrictions: Checks that volumes mounted in the node satisfy restrictions that are specific to the volume provider. Extension points: filter.
VolumeZone: Checks that volumes requested satisfy any zone requirements they might have. Extension points: filter.
NodeVolumeLimits: Checks that CSI volume limits can be satisfied for the node. Extension points: filter.
EBSLimits: Checks that AWS EBS volume limits can be satisfied for the node. Extension points: filter.
GCEPDLimits: Checks that GCP-PD volume limits can be satisfied for the node. Extension points: filter.
AzureDiskLimits: Checks that Azure disk volume limits can be satisfied for the node. Extension points: filter.
InterPodAffinity: Implements inter-Pod affinity and anti-affinity. Extension points: preFilter, filter, preScore, score.
PrioritySort: Provides the default priority based sorting. Extension points: queueSort.
DefaultBinder: Provides the default binding mechanism. Extension points: bind.
DefaultPreemption: Provides the default preemption mechanism. Extension points: postFilter.

You can also enable the following plugins, through the component config APIs, that are not enabled by default:

SelectorSpread: Favors spreading across nodes for Pods that belong to Services, ReplicaSets and StatefulSets. Extension points: preScore, score.
CinderLimits: Checks that OpenStack Cinder volume limits can be satisfied for the node. Extension points: filter.

The following plugins are deprecated and can only be enabled in a v1beta1 configuration:

NodeResourcesLeastAllocated: Favors nodes that have a low allocation of resources. Extension points: score.
NodeResourcesMostAllocated: Favors nodes that have a high allocation of resources. Extension points: score.
RequestedToCapacityRatio: Favor nodes according to a configured function of the allocated resources. Extension points: score.
NodeLabel: Filters and / or scores a node according to configured label(s). Extension points: filter, score.
ServiceAffinity: Checks that Pods that belong to a Service fit in a set of nodes defined by configured labels. This plugin also favors spreading the Pods belonging to a Service across nodes. Extension points: preFilter, filter, score.
NodePreferAvoidPods: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. Extension points: score.

Multiple profiles

You can configure kube-scheduler to run more than one profile. Each profile has an associated scheduler name and can have a different set of plugins configured in its extension points.

With the following sample configuration, the scheduler will run with two profiles: one with the default plugins and one with all scoring plugins disabled.

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler
    plugins:
      preScore:
        disabled:
        - name: '*'
      score:
        disabled:
        - name: '*'

Pods that want to be scheduled according to a specific profile can include the corresponding scheduler name in its .spec.schedulerName.

By default, one profile with the scheduler name default-scheduler is created. This profile includes the default plugins described above. When declaring more than one profile, a unique scheduler name for each of them is required.

If a Pod doesn't specify a scheduler name, kube-apiserver will set it to default-scheduler. Therefore, a profile with this scheduler name should exist to get those pods scheduled.

Note: Pod's scheduling events have .spec.schedulerName as the ReportingController. Events for leader election use the scheduler name of the first profile in the list.

Note: All profiles must use the same plugin in the queueSort extension point and have the same configuration parameters (if applicable). This is because the scheduler only has one pending pods queue.

Scheduler configuration migrations

v1beta1 → v1beta2

With the v1beta2 configuration version, you can use a new score extension for the NodeResourcesFit plugin. The new extension combines the functionalities of the NodeResourcesLeastAllocated, NodeResourcesMostAllocated and RequestedToCapacityRatio plugins. For example, if you previously used the NodeResourcesMostAllocated plugin, you would instead use NodeResourcesFit (enabled by default) and add a pluginConfig with a scoreStrategy that is similar to:
```
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        type: MostAllocated
    name: NodeResourcesFit
```
The scheduler plugin NodeLabel is deprecated; instead, use the NodeAffinity plugin (enabled by default) to achieve similar behavior.
The scheduler plugin ServiceAffinity is deprecated; instead, use the InterPodAffinity plugin (enabled by default) to achieve similar behavior.
The scheduler plugin NodePreferAvoidPods is deprecated; instead, use node taints to achieve similar behavior.
A plugin enabled in a v1beta2 configuration file takes precedence over the default configuration for that plugin.
Invalid host or port configured for scheduler healthz and metrics bind address will cause validation failure.

What's next

Read the kube-scheduler reference
Learn about scheduling
Read the kube-scheduler configuration (v1beta1) reference
Read the kube-scheduler configuration (v1beta2) reference