You are the Kubernetes scheduler and cluster operator.
Pods stream into the Pending queue and you must bind each one to a node that
satisfies all of its constraints — then keep the cluster cheap, busy, and responsive.
Scheduling a pod
Click a pending pod to select it. Feasible nodes glow green; infeasible nodes are dimmed and show the blocking reason.
Click a green node (or drag the pod onto it) to bind it.
Hit ⚡ on a pod to auto-place just that one on its best node.
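The game does not say how ⚡ ranks nodes, so the sketch below is only one plausible reading of "best node": among the feasible nodes, pick the tightest fit so that large gaps stay free for large pods. The dictionary fields and the leftover metric (CPU and memory simply added together) are assumptions made for illustration.

```python
def best_node(pod, nodes):
    """Pick the feasible node with the least leftover capacity (tightest fit)."""
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]
    ]
    if not feasible:
        return None  # nothing fits, so the pod stays Pending
    # Hypothetical metric: leftover CPU plus leftover memory after placement.
    return min(
        feasible,
        key=lambda n: (n["free_cpu"] - pod["cpu"]) + (n["free_mem"] - pod["mem"]),
    )


nodes = [
    {"name": "node-a", "free_cpu": 4.0, "free_mem": 8.0},
    {"name": "node-b", "free_cpu": 2.0, "free_mem": 4.0},
    {"name": "node-c", "free_cpu": 0.5, "free_mem": 1.0},
]
pod = {"cpu": 1.0, "mem": 2.0}
choice = best_node(pod, nodes)
print(choice["name"])  # node-b: it fits, and leaves less spare room than node-a
```

A spread-out policy (pick the emptiest node) would be the opposite choice; tight packing is used here because the score rewards utilization.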
The constraints (just like real k8s)
Resources — a pod's CPU/memory/GPU requests must fit in the node's free capacity.
nodeSelector — the node must carry the required label (e.g. disktype=ssd).
Taints & tolerations — a node's NoSchedule taint (e.g. nvidia.com/gpu, spot) blocks pods that don't tolerate it.
Pod anti-affinity — two replicas of the same app can't share a node.
Cordon — cordoned/booting nodes won't accept new pods.
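The five checks above can be sketched as a single feasibility function that returns the blocking reason shown on a dimmed node. This is a minimal illustration, not the game's actual code: the `Pod` and `Node` classes, their field names, and the reason strings are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    app: str
    cpu: float = 0.0
    mem: float = 0.0
    gpu: int = 0
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)  # taint keys the pod tolerates


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem: float
    free_gpu: int = 0
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)  # NoSchedule taint keys
    schedulable: bool = True                  # False while cordoned or booting
    pods: list = field(default_factory=list)


def blocking_reason(pod, node):
    """Return None if the pod may bind to the node, else a short reason."""
    if not node.schedulable:
        return "cordoned or booting"
    if pod.cpu > node.free_cpu or pod.mem > node.free_mem or pod.gpu > node.free_gpu:
        return "insufficient resources"
    for key, value in pod.node_selector.items():
        if node.labels.get(key) != value:
            return f"missing label {key}={value}"
    untolerated = node.taints - pod.tolerations
    if untolerated:
        return f"untolerated taint {sorted(untolerated)[0]}"
    if any(p.app == pod.app for p in node.pods):
        return "anti-affinity: replica already on node"
    return None


gpu_node = Node("gpu-1", free_cpu=8, free_mem=32, free_gpu=1,
                taints={"nvidia.com/gpu"})
web = Pod("web", cpu=1, mem=2)
trainer = Pod("trainer", cpu=4, mem=16, gpu=1, tolerations={"nvidia.com/gpu"})

print(blocking_reason(web, gpu_node))      # untolerated taint nvidia.com/gpu
print(blocking_reason(trainer, gpu_node))  # None
```

Returning the first failing reason, rather than a plain boolean, matches how infeasible nodes show why they are blocked.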
Operating the cluster
+ Add node — provision capacity. New nodes take time to boot and are billed per hour from the moment they are added, boot time included.
Cordon — mark a node unschedulable without disturbing its pods.
Drain — cordon the node and gracefully evict its workload pods back to the queue (small penalty). DaemonSet pods stay, just like kubectl drain --ignore-daemonsets.
Delete — terminate a node; any pods still on it are force-killed (big penalty). Drain first!
⤴ Upgrade — appears on out-of-date nodes: drains, then reboots the node onto the new version.
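As a rough model of what Drain does, the sketch below cordons a node and moves every non-DaemonSet pod back to the pending queue. The dictionary layout and the `daemonset` flag are hypothetical, and the penalty is left out.

```python
def drain(node, pending_queue):
    """Cordon the node, then move its workload pods back to the queue."""
    node["schedulable"] = False  # cordon first so nothing new lands here
    staying, evicted = [], []
    for pod in node["pods"]:
        # DaemonSet pods are per-node agents and are never evicted.
        (staying if pod["daemonset"] else evicted).append(pod)
    node["pods"] = staying
    pending_queue.extend(evicted)
    return len(evicted)


node = {
    "name": "node-a",
    "schedulable": True,
    "pods": [
        {"name": "kube-proxy", "daemonset": True},
        {"name": "web-1", "daemonset": False},
        {"name": "batch-7", "daemonset": False},
    ],
}
queue = []
evicted_count = drain(node, queue)
print(evicted_count, [p["name"] for p in queue])
```

Deleting the node after this call would force-kill nothing, which is why draining first avoids the big penalty.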
Curveballs
DaemonSets ⚙ — node agents (logging, metrics, kube-proxy, GPU plugin) that the controller runs on every node automatically. They're per-node overhead you can't move, so a few large nodes waste less than many small ones.
Spot nodes ⚡ — much cheaper, but billed at a fluctuating spot price and reclaimed at short notice. When a reclaim starts you get a brief Reclaiming countdown; drain the node before it ends to save its pods. Run only fault-tolerant work (batch) on spot.
Cluster upgrades — every so often the control plane jumps a version and every node falls behind (amber v1.x). Upgrade them a few at a time, so workloads always have somewhere to land. Outdated nodes leak score until the rollout completes; finishing it pays a bonus.
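The DaemonSet point is easy to check with arithmetic. The sketch below compares two fleets with the same raw CPU; the node sizes and the 0.5 CPU of per-node agent overhead are made-up numbers chosen only to illustrate the effect.

```python
def usable_cpu(node_count, cpu_per_node, daemonset_cpu_per_node):
    """CPU left for workload pods once every node runs its DaemonSet pods."""
    return node_count * (cpu_per_node - daemonset_cpu_per_node)


# Both fleets have 16 CPUs in total.
many_small = usable_cpu(node_count=8, cpu_per_node=2.0, daemonset_cpu_per_node=0.5)
few_large = usable_cpu(node_count=2, cpu_per_node=8.0, daemonset_cpu_per_node=0.5)

print(many_small)  # 12.0 -> 4.0 CPUs lost to agents
print(few_large)   # 15.0 -> 1.0 CPU lost to agents
```

The trade-off is that a large node takes more pods with it when it is reclaimed or deleted.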
Score
Each tick you earn points for cluster utilization and lose points for
pending pods (scheduling latency) and node cost (spot nodes are charged at the live price).
Finishing jobs pays a bonus, and completing a version rollout pays a bigger one. SLA breaches, force-kills,
spot losses, and lingering out-of-date nodes all cost you. Pack tightly, schedule fast, run only the nodes you need,
and keep the fleet patched.
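To make the per-tick trade-off concrete, here is a minimal sketch of a score update. The game does not publish its formula, so the linear shape and the three weights are assumptions; bonuses and one-off penalties are omitted.

```python
def tick_score(used_cpu, total_cpu, pending_pods, node_prices,
               w_util=10.0, w_pending=1.0, w_cost=1.0):
    """Points gained or lost in one tick under hypothetical weights.

    node_prices holds each running node's current $/hr; for spot nodes
    that is the live spot price.
    """
    utilization = used_cpu / total_cpu if total_cpu else 0.0
    reward = w_util * utilization
    penalty = w_pending * pending_pods + w_cost * sum(node_prices)
    return reward - penalty


# 6 of 8 CPUs busy, 2 pods waiting, three nodes running.
score = tick_score(6.0, 8.0, pending_pods=2, node_prices=[0.5, 0.5, 0.25])
print(score)  # 4.25
```

Under these weights an idle extra node costs points every tick, while a pending pod costs points until it is bound, which is the tension the autoscaler has to balance.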
Tip: turn on Auto-schedule / Cluster autoscaler to watch a baseline policy play, then try to beat it by hand.