Kubernetes Deep Dive

As of June 2026, talk in terms of Kubernetes 1.35/1.36 in production and 1.37 as upcoming. Do not anchor your answers to old assumptions like PodSecurityPolicy, beta Ingress-only traffic management, or hand-managed GPU daemonsets.

Modern Kubernetes vocabulary for this role:

  • Gateway API for expressive traffic management.
  • Device plugins and Dynamic Resource Allocation concepts for specialized hardware.
  • GPU Operator instead of bespoke driver/toolkit installs.
  • Kueue for quota-aware AI/ML batch and queued workloads.
  • Karpenter or cloud-native node autoscaling for right-sized node provisioning.
  • Cilium/eBPF for networking, security, and packet-level visibility.
  • OpenTelemetry plus Prometheus-compatible metrics for portable observability.
  • GitOps with Argo CD or Flux for declarative cluster state.
  • Policy engines: Kyverno, OPA Gatekeeper, ValidatingAdmissionPolicy.
flowchart TD
  Desired[Desired state in API server] --> Scheduler[Scheduler]
  Scheduler --> Node[Kubelet on selected node]
  Node --> Runtime[Container runtime]
  Runtime --> Pod[Pod containers]
  Controller[Controllers] --> Desired
  CNI[CNI] --> Pod
  CSI[CSI] --> Pod
  Device[Device plugin] --> Node
  Observability[Events and metrics] -. feedback .-> Controller

Know the pieces and failure implications:

| Component | What it does | Interview failure mode |
| --- | --- | --- |
| API server | Front door for cluster state. | Latency/errors block deploys and controllers, but existing data plane may keep serving. |
| etcd | Durable cluster state. | Quorum, disk latency, compaction, snapshot restore, backup validation. |
| scheduler | Assigns pods to nodes. | Pending pods, topology constraints, resource fragmentation, stale assumptions. |
| controller manager | Reconciliation controllers. | Deployments, nodes, endpoints, jobs stop converging. |
| cloud controller manager | Cloud provider integration. | Load balancers, routes, node lifecycle integration. |

Strong line:

I separate control-plane availability from data-plane serving. A control-plane issue can prevent changes while existing inference pods continue serving, so mitigation depends on whether the user path or the operations path is impaired.

The API server is not only a REST endpoint. It is the admission, validation, persistence, and watch boundary for the cluster.

Request path:

  1. Authentication.
  2. Authorization.
  3. Mutating admission.
  4. Object schema/defaulting/validation.
  5. Validating admission.
  6. Persistence to etcd.
  7. Watch notification to controllers, schedulers, kubelets, and clients.
sequenceDiagram
  participant User as Client or controller
  participant API as API server
  participant Auth as Authn/Authz
  participant Admit as Admission chain
  participant Etcd as etcd
  participant Watch as Watch cache
  User->>API: create or update object
  API->>Auth: authenticate and authorize
  API->>Admit: mutate then validate
  API->>Etcd: persist object
  Etcd-->>API: committed revision
  API->>Watch: publish watch event
  Watch-->>User: observed state changes
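
The request path above can be sketched as an ordered chain of stages in which any stage can reject the request before it reaches storage. This is a toy model, not the API server's implementation: the `Request` type, the stage functions, the status strings, and the two example rules (only "admin" may delete, privileged objects are denied) are assumptions made for illustration, and the watch step is omitted.

```python
from dataclasses import dataclass, field


class Rejected(Exception):
    """Raised by a stage to stop the request before it is persisted."""


@dataclass
class Request:
    user: str
    verb: str
    obj: dict
    trace: list = field(default_factory=list)


def authenticate(req):
    if not req.user:
        raise Rejected("401 Unauthorized")


def authorize(req):
    # Hypothetical rule for illustration: only "admin" may delete.
    if req.verb == "delete" and req.user != "admin":
        raise Rejected("403 Forbidden")


def mutating_admission(req):
    # Mutation runs before validation, so validators see the defaulted object.
    req.obj.setdefault("labels", {}).setdefault("owner", req.user)


def schema_validation(req):
    if "name" not in req.obj:
        raise Rejected("422 name is required")


def validating_admission(req):
    # Hypothetical policy for illustration: privileged objects are denied.
    if req.obj.get("privileged"):
        raise Rejected("403 denied by policy")


STAGES = [
    ("authentication", authenticate),
    ("authorization", authorize),
    ("mutating admission", mutating_admission),
    ("schema validation", schema_validation),
    ("validating admission", validating_admission),
]


def handle(req, store):
    """Run the stages in order; persist only if every stage passes."""
    for name, stage in STAGES:
        try:
            stage(req)
        except Rejected as exc:
            return f"rejected at {name}: {exc}"
        req.trace.append(name)
    store[req.obj["name"]] = req.obj  # stands in for the etcd write
    return "persisted"


store = {}
print(handle(Request("alice", "create", {"name": "web"}), store))
print(handle(Request("alice", "create", {"name": "debug", "privileged": True}), store))
```

The second request fails at validating admission and never reaches the store, which is why an admission problem looks like a write outage even when storage is healthy.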

Failure modes to speak to:

  • API server read latency: kubectl and controllers slow down, but existing pods may keep serving.
  • API server write failure: rollouts, scaling, leader election renewals, and status updates can fail.
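  • Admission webhook timeout: with failurePolicy Fail, a slow or unreachable webhook rejects matching creates and updates, which causes a deploy outage even when workloads are otherwise healthy.
  • Watch cache pressure: controllers lag, endpoints go stale, and status appears inconsistent.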
  • Request priority and fairness misconfiguration: noisy controllers can starve human or critical control traffic.

Strong answer:

I distinguish read path, write path, admission path, and watch path. “The API server is slow” is not precise enough; the mitigation differs if admission is blocking writes, etcd is slow, or controllers are simply behind on watches.

Etcd is the durable source of Kubernetes state. The main interview points are quorum, latency, compaction, and restore discipline.

Know:

  • Etcd needs quorum; losing quorum makes writes unavailable.
  • Disk latency matters because every committed write goes through durable storage.
  • Large objects and high object churn increase pressure.
  • Watch history depends on revisions and compaction: a watch that resumes from an already-compacted revision fails, and the client has to relist.
  • Defragmentation and compaction are operational tasks, not trivia.
  • Snapshots are only useful if restore is tested.
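
The quorum point is easiest to defend with the arithmetic. A minimal sketch, assuming the usual majority rule for a Raft-based store; the function names are illustrative:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerates={failure_tolerance(n)}")
```

A four-member cluster tolerates one failure, the same as a three-member cluster, which is why odd member counts are the norm.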

Incident drill:

Deploys are failing and API server logs show etcd request timeouts.

Senior flow:

  1. Stop nonessential controllers or deploy automation if they are amplifying writes.
  2. Check etcd quorum, leader changes, disk latency, database size, and network between control-plane nodes.
  3. Determine whether data-plane serving is still healthy.
  4. Avoid mass restarts of control-plane components without evidence.
  5. If a restore is needed, follow the tested snapshot restore process and validate cluster revision behavior.
  6. After recovery, reduce object churn, tune compaction/defrag, and add alerts on latency before timeout.

Key layers:

  • Kubelet: pod lifecycle, volume mount, device allocation, node status.
  • Container runtime: containerd/CRI-O, image pulls, runtime hooks.
  • CNI: pod networking, service routing, network policy.
  • CSI: volume provisioning and mount.
  • Device plugins: GPUs, NICs, FPGAs, other specialized resources.
  • Node OS/kernel/driver stack.

Kubelet is the node-side reconciler. It observes assigned pods, prepares dependencies, asks CRI to run containers, reports status, and executes probes.

Startup path:

flowchart TD
  Assigned[Pod assigned to node] --> Devices[Admit pod and allocate devices]
  Devices --> Volumes[Mount volumes and projected secrets]
  Volumes --> Sandbox[Create pod sandbox with runtime class and CNI]
  Sandbox --> Pull[Pull images]
  Pull --> Containers[Start init and app containers]
  Containers --> Probes[Startup and readiness probes]
  Probes --> Ready[Ready condition and EndpointSlice update]

Failure interpretation:

  • ContainerCreating: image pull, volume mount, CNI, sandbox, or device allocation.
  • CrashLoopBackOff: process exits after container starts; look at previous logs and exit code.
  • Ready false: app/probe/sidecar/model readiness, not necessarily Kubernetes failure.
  • Running but no endpoint: readiness gate, selector mismatch, EndpointSlice controller lag, or route binding issue.
  • GPU allocated but invisible: device plugin allocation, container runtime hook, driver/toolkit, or security context.

The scheduler generally works through queue, filter, score, reserve, permit, bind, then post-bind behavior. For interviews, use those phases to explain pending pods precisely.

flowchart LR
  Pending[Pending pod] --> Queue[Scheduling queue]
  Queue --> Filter[Filter infeasible nodes]
  Filter --> Score[Score feasible nodes]
  Score --> Reserve[Reserve resources]
  Reserve --> Permit[Permit or wait]
  Permit --> Bind[Bind pod to node]
  Bind --> Kubelet[Kubelet starts pod]

Common GPU scheduling traps:

  • A pod requests nvidia.com/gpu: 4; four free GPUs across four nodes do not help.
  • A topology constraint can make a resource look unavailable even when allocatable is nonzero.
  • A stale device plugin can leave allocatable wrong until kubelet/plugin state converges.
  • Preemption can find a victim but still fail if PDBs, affinity, or resource shape prevent fit.
  • Autoscaler can add nodes that are not usable until GPU Operator operands validate the stack.
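
The first trap in the list can be shown with a toy filter-then-score pass. This is a sketch, not the real scheduler: the `Node` type, the `pool` label, and the bin-packing score (prefer the node left with the fewest spare GPUs) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int
    labels: dict


def feasible_nodes(nodes, gpus_requested, node_selector=None):
    """Filter phase: keep nodes that match the selector and fit the whole request."""
    selector = node_selector or {}
    kept = []
    for node in nodes:
        if any(node.labels.get(k) != v for k, v in selector.items()):
            continue
        if node.free_gpus < gpus_requested:
            continue  # GPUs on other nodes cannot be combined into one pod
        kept.append(node)
    return kept


def pick_node(nodes, gpus_requested, node_selector=None):
    """Score phase (assumed bin-packing policy): fewest leftover GPUs wins."""
    candidates = feasible_nodes(nodes, gpus_requested, node_selector)
    if not candidates:
        return None
    best = min(candidates, key=lambda n: (n.free_gpus - gpus_requested, n.name))
    return best.name


fleet = [Node(f"gpu-{i}", 1, {"pool": "a"}) for i in range(4)]
print(sum(n.free_gpus for n in fleet))  # 4 free GPUs in total
print(pick_node(fleet, 4))              # None: no single node fits the request
```

Fleet-level capacity says four GPUs are free, yet the pod stays Pending because the filter phase evaluates one node at a time.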

| API | Use |
| --- | --- |
| Deployment | Stateless services, model servers with normal rollout semantics. |
| StatefulSet | Stable identity/storage, less common for stateless inference but relevant for stateful control components. |
| DaemonSet | Node agents: GPU Operator operands, log collectors, CNI, node exporters. |
| Job/CronJob | Batch tasks, validation, maintenance, model prewarming jobs. |
| HPA/VPA | Pod scaling by metrics or resource recommendations. |
| PDB | Keeps voluntary disruptions from draining too much capacity. |
| PriorityClass | Protects critical serving workloads from lower-priority work. |

When a pod is pending, inspect:

  • Requested resources: CPU, memory, nvidia.com/gpu, ephemeral storage.
  • Node selectors and node affinity.
  • Taints/tolerations.
  • Pod anti-affinity and topology spread constraints.
  • PVC binding.
  • RuntimeClass.
  • Image architecture.
  • Quotas and LimitRanges.
  • Not PDBs: a PDB only limits voluntary eviction and never blocks initial scheduling.
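
For the taints/tolerations item, it helps to be able to state the matching rule exactly. The sketch below models it with plain dataclasses. The `Taint` and `Toleration` types and the `blocking_taints` helper are illustrative stand-ins for the API objects, and the example taint key and value are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with Exists matches every key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(tol, taint):
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def blocking_taints(taints, tolerations):
    """Taints that keep the pod off the node; PreferNoSchedule is only a preference."""
    hard = ("NoSchedule", "NoExecute")
    return [
        t for t in taints
        if t.effect in hard and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = Taint("nvidia.com/gpu", "NoSchedule", "present")  # example values
print(blocking_taints([gpu_taint], []))
print(blocking_taints([gpu_taint], [Toleration("nvidia.com/gpu", "Exists")]))
```

A pod with no toleration is filtered out of the tainted GPU pool regardless of how many GPUs are free there.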

For GPUs, add:

  • Device plugin healthy and advertising resources.
  • Node labels for SKU, driver, MIG mode, topology.
  • GPU fragmentation: total fleet capacity is not the same as fit on one eligible node.
  • Workload asks for whole GPUs unless MIG/time-slicing/MPS is explicitly configured.

| Scope | Examples | Move |
| --- | --- | --- |
| One pod | bad config, image, secret, readiness. | describe, logs, events, config diff. |
| One node | runtime, kubelet, disk, driver, GPU error. | cordon, inspect node, compare healthy peer. |
| One pool/SKU | driver image, taint, GPU Operator operand, hardware class. | halt rollout, compare pool labels and daemonsets. |
| One cluster | control plane, CNI, DNS, admission, quota. | check apiserver, controllers, CoreDNS, webhooks. |
| Global | artifact store, registry, identity, upstream service. | failover, dependency status, rate limits. |

flowchart TD
  Symptom[Symptom] --> OnePod{One pod}
  OnePod -- yes --> Spec[Spec, image, env, probes, resources]
  OnePod -- no --> OneNode{One node}
  OneNode -- yes --> Node[Runtime, kubelet, CNI, device plugin]
  OneNode -- no --> OnePool{One pool or zone}
  OnePool -- yes --> Capacity[Capacity, labels, taints, topology]
  OnePool -- no --> Cluster[API server, admission, DNS, CNI, dependency]
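
The scoping decision can be sketched as a function that finds the narrowest scope containing every failing pod. The `FailingPod` record and its fields are hypothetical. A real triage would also compare against healthy peers in the same scope before drawing a conclusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailingPod:
    name: str
    node: str
    pool: str
    cluster: str


def blast_radius(failing):
    """Return the narrowest scope that contains all failing pods."""
    if not failing:
        return "none"
    if len({p.cluster for p in failing}) > 1:
        return "global"
    if len({p.pool for p in failing}) > 1:
        return "cluster"
    if len({p.node for p in failing}) > 1:
        return "pool"
    if len({p.name for p in failing}) > 1:
        return "node"
    return "pod"


pods = [
    FailingPod("infer-1", "node-a", "pool-1", "prod-east"),
    FailingPod("infer-2", "node-b", "pool-1", "prod-east"),
]
print(blast_radius(pods))  # pool
```

Two failing pods on different nodes of the same pool point at pool-level causes such as a driver image or taint, not at either pod's spec.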

Modern platform teams enforce production contracts at admission:

  • Required labels/owners.
  • Resource requests and limits.
  • Allowed registries and signed images.
  • No privileged pods except approved system namespaces.
  • Required readiness/liveness/startup probes.
  • GPU workloads must declare resource envelope and pool selector.
  • Runtime image and model artifact must be versioned.
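
A few of these contracts can be sketched as a plain validation function over a pod manifest loaded as a dictionary. This is not Kyverno, Gatekeeper, or ValidatingAdmissionPolicy syntax. The registry prefix, the required label, and the system namespace set are hypothetical values chosen for the example.

```python
ALLOWED_REGISTRIES = ("registry.example.com/",)  # hypothetical registry prefix
REQUIRED_LABELS = ("owner",)                     # hypothetical label contract
SYSTEM_NAMESPACES = {"kube-system"}              # assumed approved namespaces


def violations(pod):
    """Return human-readable reasons this pod would be rejected."""
    found = []
    meta = pod.get("metadata", {})
    labels = meta.get("labels", {})
    namespace = meta.get("namespace", "default")
    for label in REQUIRED_LABELS:
        if label not in labels:
            found.append(f"missing label: {label}")
    for c in pod.get("spec", {}).get("containers", []):
        name = c.get("name", "?")
        if not c.get("image", "").startswith(ALLOWED_REGISTRIES):
            found.append(f"{name}: image is not from an allowed registry")
        resources = c.get("resources", {})
        if not resources.get("requests") or not resources.get("limits"):
            found.append(f"{name}: requests and limits are required")
        privileged = c.get("securityContext", {}).get("privileged", False)
        if privileged and namespace not in SYSTEM_NAMESPACES:
            found.append(f"{name}: privileged is not allowed in {namespace}")
        if "readinessProbe" not in c:
            found.append(f"{name}: readinessProbe is required")
    return found


pod = {
    "metadata": {"namespace": "serving", "labels": {}},
    "spec": {"containers": [{"name": "model", "image": "docker.io/acme/model:latest"}]},
}
for reason in violations(pod):
    print(reason)
```

Returning every violation at once, not only the first, gives the workload owner a single actionable rejection message.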

Weak:

I would check Kubernetes events and restart the pod.

Strong:

I would first identify whether this is a scheduling, startup, readiness, or serving failure. Then I would scope it to pod, node, pool, cluster, or dependency. For a GPU workload, I would verify scheduler constraints and the device plugin, then drop below Kubernetes to check kubelet, container runtime, driver/toolkit compatibility, DCGM, and kernel GPU errors.

Question: “A deployment is stuck at 50 percent rollout.”

Answer:

  1. Check rollout status and unavailable replicas.
  2. Inspect new ReplicaSet pods: pending, image pull, crash, readiness, or failed scheduling.
  3. Compare old vs new config: image, env, probes, resources, node selectors, secrets, service account.
  4. Check PDB and maxUnavailable/maxSurge.
  5. If GPU-specific: resource requests, device plugin, node pool labels, model warmup, GPU memory.
  6. Mitigate: pause the rollout, roll back if users are impacted, or adjust the rollout strategy.
  7. Prevent: canary gate, startup probe, representative synthetic request, resource-envelope validation.
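
For step 4, it helps to compute the bounds that maxSurge and maxUnavailable actually impose. The sketch below follows the documented rounding for percentages (maxSurge rounds up, maxUnavailable rounds down). The function names and the returned dictionary shape are illustrative.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute count or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Both resolving to zero would deadlock the rollout, so one pod may go down.
        unavailable = 1
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
print(rollout_bounds(4, max_surge=1, max_unavailable=0))
```

With `max_unavailable=0`, the rollout can only progress by surging. On a GPU pool with no spare capacity the surge pod stays Pending and the rollout stalls, even though nothing is wrong with the new image.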