Kubernetes Deep Dive

2026 Baseline

As of June 2026, talk in terms of Kubernetes 1.35/1.36 in production and 1.37 as upcoming. Do not anchor your answers to old assumptions like PodSecurityPolicy, beta Ingress-only traffic management, or hand-managed GPU daemonsets.

Modern Kubernetes vocabulary for this role:

Gateway API for expressive traffic management.
Device plugins and Dynamic Resource Allocation concepts for specialized hardware.
GPU Operator instead of bespoke driver/toolkit installs.
Kueue for quota-aware AI/ML batch and queued workloads.
Karpenter or cloud-native node autoscaling for right-sized node provisioning.
Cilium/eBPF for networking, security, and packet-level visibility.
OpenTelemetry plus Prometheus-compatible metrics for portable observability.
GitOps with Argo CD or Flux for declarative cluster state.
Policy engines: Kyverno, OPA Gatekeeper, ValidatingAdmissionPolicy.

flowchart TD
  Desired[Desired state in API server] --> Scheduler[Scheduler]
  Scheduler --> Node[Kubelet on selected node]
  Node --> Runtime[Container runtime]
  Runtime --> Pod[Pod containers]
  Controller[Controllers] --> Desired
  CNI[CNI] --> Pod
  CSI[CSI] --> Pod
  Device[Device plugin] --> Node
  Observability[Events and metrics] -. feedback .-> Controller

Control Plane

Know the pieces and failure implications:

Component	What it does	Interview failure mode
API server	Front door for cluster state.	Latency/errors block deploys and controllers, but existing data plane may keep serving.
etcd	Durable cluster state.	Quorum loss, disk latency, compaction or defragmentation backlog, and snapshot restores that were never validated.
scheduler	Assigns pods to nodes.	Pending pods, topology constraints, resource fragmentation, stale assumptions.
controller manager	Reconciliation controllers.	Deployments, nodes, endpoints, jobs stop converging.
cloud controller manager	Cloud provider integration.	Load balancers, routes, and node lifecycle stop reconciling with the cloud provider.

Strong line:

I separate control-plane availability from data-plane serving. A control-plane issue can prevent changes while existing inference pods continue serving, so mitigation depends on whether the user path or the operations path is impaired.

API Server Internals

The API server is not only a REST endpoint. It is the admission, validation, persistence, and watch boundary for the cluster.

Request path:

Authentication.
Authorization.
Mutating admission.
Object schema/defaulting/validation.
Validating admission.
Persistence to etcd.
Watch notification to controllers, schedulers, kubelets, and clients.
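
The ordering above can be sketched as a short-circuiting pipeline. This is a toy model, not API server code: the stage functions, the allowed users, and the `owner` label rule are all hypothetical, and the dictionary `store` only stands in for the etcd write.

```python
from typing import Callable, Optional

# A stage inspects (and may mutate) the request. It returns None to continue
# or a rejection reason to stop the pipeline. All stage logic is hypothetical.
Stage = Callable[[dict], Optional[str]]


def authenticate(req: dict) -> Optional[str]:
    return None if req.get("user") else "401 Unauthorized"


def authorize(req: dict) -> Optional[str]:
    allowed = req.get("user") in {"deployer", "system:controller"}
    return None if allowed else "403 Forbidden"


def mutate(req: dict) -> Optional[str]:
    # Mutating admission may change the object, here by injecting a label.
    req["object"].setdefault("labels", {}).setdefault("owner", "unknown")
    return None


def validate_schema(req: dict) -> Optional[str]:
    return None if "name" in req["object"] else "422 Invalid: name is required"


def validate_policy(req: dict) -> Optional[str]:
    # Validating admission sees the object after mutation and cannot change it.
    if req["object"]["labels"]["owner"] == "unknown":
        return "admission denied: owner label required"
    return None


PIPELINE: list[tuple[str, Stage]] = [
    ("authentication", authenticate),
    ("authorization", authorize),
    ("mutating admission", mutate),
    ("schema validation", validate_schema),
    ("validating admission", validate_policy),
]


def handle(req: dict, store: dict) -> str:
    """Run every stage in order; persist only if no stage rejects."""
    for name, stage in PIPELINE:
        reason = stage(req)
        if reason is not None:
            return f"rejected at {name}: {reason}"
    store[req["object"]["name"]] = req["object"]  # stand-in for the etcd write
    return "persisted"
```

The point to take into an interview is that a rejection at any stage means nothing reaches etcd, so no watch event is ever published for that request.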

sequenceDiagram
  participant User as Client or controller
  participant API as API server
  participant Auth as Authn/Authz
  participant Admit as Admission chain
  participant Etcd as etcd
  participant Watch as Watch cache
  User->>API: create or update object
  API->>Auth: authenticate and authorize
  API->>Admit: mutate then validate
  API->>Etcd: persist object
  Etcd-->>API: committed revision
  API->>Watch: publish watch event
  Watch-->>User: observed state changes

Failure modes to speak to:

API server read latency: kubectl and controllers slow down, but existing pods may keep serving.
API server write failure: rollouts, scaling, leader election renewals, and status updates can fail.
Admission webhook timeout: with failurePolicy: Fail, matching creates and updates are rejected, which causes a deploy outage even when workloads are otherwise healthy.
Watch cache pressure: controllers lag, endpoints go stale, and status appears inconsistent.
Request priority and fairness misconfiguration: noisy controllers can starve human or critical control traffic.
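
For the admission webhook case, the outcome of a write depends on the webhook's `failurePolicy`, which Kubernetes defines as either `Fail` or `Ignore`. The function below is a minimal sketch of that decision, assuming a single webhook; the function name and the returned strings are illustrative only.

```python
def write_outcome(webhook_reachable: bool, webhook_allows: bool, failure_policy: str) -> str:
    """Model what happens to one write that matches one admission webhook."""
    if failure_policy not in ("Fail", "Ignore"):
        raise ValueError(f"unknown failurePolicy: {failure_policy}")
    if webhook_reachable:
        # The webhook answered in time, so its verdict decides the outcome.
        return "allowed" if webhook_allows else "denied by webhook"
    # The call timed out or errored, so failurePolicy decides the outcome.
    if failure_policy == "Fail":
        return "rejected: webhook call failed"
    return "allowed: webhook skipped"
```

With `Fail`, an unreachable webhook turns into rejected writes for everything it matches; with `Ignore`, writes proceed but the policy is silently not enforced. Which trade-off is acceptable depends on what the webhook protects.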

Strong answer:

I distinguish read path, write path, admission path, and watch path. “The API server is slow” is not precise enough; the mitigation differs if admission is blocking writes, etcd is slow, or controllers are simply behind on watches.

Etcd Realities

Etcd is the durable source of Kubernetes state. The main interview points are quorum, latency, compaction, and restore discipline.

Know:

Etcd needs quorum; losing quorum makes writes unavailable.
Disk latency matters because every committed write goes through durable storage.
Large objects and high object churn increase pressure.
Watch history depends on revisions and compaction.
Defragmentation and compaction are operational tasks, not trivia.
Snapshots are only useful if restore is tested.
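
The quorum point is simple majority arithmetic, and it helps to be able to produce it on a whiteboard. A small sketch (the function names are mine, not etcd's):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
```

A four-member cluster tolerates one failure, the same as a three-member cluster, which is why odd member counts are the usual recommendation.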

Incident drill:

Deploys are failing and API server logs show etcd request timeouts.

Senior flow:

Stop nonessential controllers or deploy automation if they are amplifying writes.
Check etcd quorum, leader changes, disk latency, database size, and network between control-plane nodes.
Determine whether data-plane serving is still healthy.
Avoid mass restarts of control-plane components without evidence.
If a restore is needed, follow the tested snapshot restore process and validate cluster revision behavior.
After recovery, reduce object churn, tune compaction/defrag, and add alerts on latency before timeout.

Data Plane

Key layers:

Kubelet: pod lifecycle, volume mount, device allocation, node status.
Container runtime: containerd/CRI-O, image pulls, runtime hooks.
CNI: pod networking, service routing, network policy.
CSI: volume provisioning and mount.
Device plugins: GPUs, NICs, FPGAs, other specialized resources.
Node OS/kernel/driver stack.

Kubelet And Pod Startup Internals

Kubelet is the node-side reconciler. It observes assigned pods, prepares dependencies, asks CRI to run containers, reports status, and executes probes.

Startup path:

flowchart TD
  Assigned[Pod assigned to node] --> Devices[Admit pod and allocate devices]
  Devices --> Volumes[Mount volumes and projected secrets]
  Volumes --> Sandbox[Create pod sandbox and network with runtime class]
  Sandbox --> Pull[Pull images]
  Pull --> Containers[Start init and app containers]
  Containers --> Probes[Startup and readiness probes]
  Probes --> Ready[Ready condition and EndpointSlice update]

Failure interpretation:

ContainerCreating: image pull, volume mount, CNI, sandbox, or device allocation.
CrashLoopBackOff: process exits after container starts; look at previous logs and exit code.
Ready false: app/probe/sidecar/model readiness, not necessarily a Kubernetes failure.
Running but no endpoint: readiness gate, selector mismatch, EndpointSlice controller lag, or route binding issue.
GPU allocated but invisible: device plugin allocation, container runtime hook, driver/toolkit, or security context.
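
The interpretation list above can be turned into a first-pass classifier over a pod's status. The sketch below assumes the status has the same shape as the `status` field of a Pod object (`phase`, `containerStatuses[].state.waiting.reason`, `conditions`); the category names and the `NEXT_CHECKS` table are my own shorthand for the list above.

```python
# Category -> what to look at next, paraphrasing the interpretation list.
NEXT_CHECKS = {
    "startup": "image pull, volume mount, CNI, sandbox, or device allocation",
    "crash": "previous container logs and exit code",
    "readiness": "app, probe, sidecar, or model readiness",
    "serving": "readiness gates, Service selector, EndpointSlice lag, route binding",
    "unknown": "pod events and node conditions",
}


def classify(status: dict) -> str:
    """Map a pod status dict to one coarse failure category."""
    for container in status.get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason == "ContainerCreating":
            return "startup"
        if reason == "CrashLoopBackOff":
            return "crash"
    if status.get("phase") == "Running":
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        # A Ready pod that still gets no traffic is a serving-path question.
        return "serving" if ready else "readiness"
    return "unknown"
```

It deliberately ignores the GPU case, because "allocated but invisible" cannot be seen from pod status alone and needs a look inside the container and at the node.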

Scheduler Internals

The scheduler generally works through queue, filter, score, reserve, permit, bind, then post-bind behavior. For interviews, use those phases to explain pending pods precisely.

flowchart LR
  Pending[Pending pod] --> Queue[Scheduling queue]
  Queue --> Filter[Filter infeasible nodes]
  Filter --> Score[Score feasible nodes]
  Score --> Reserve[Reserve resources]
  Reserve --> Permit[Permit or wait]
  Permit --> Bind[Bind pod to node]
  Bind --> Kubelet[Kubelet starts pod]
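
The filter and score phases are the two that most often explain a pending pod, so a toy version is worth having in mind. This is a sketch under simplifying assumptions, not the real scheduler: taints are reduced to plain strings, only CPU and GPU are considered, and the bin-packing score is one hypothetical policy among many.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float
    free_gpu: int
    taints: frozenset = frozenset()


@dataclass
class Pod:
    cpu: float
    gpu: int = 0
    tolerations: frozenset = frozenset()


def feasible(pod: Pod, node: Node) -> bool:
    """Filter phase: the node must fit the request and every taint must be tolerated."""
    return (
        node.free_cpu >= pod.cpu
        and node.free_gpu >= pod.gpu
        and node.taints <= pod.tolerations
    )


def score(pod: Pod, node: Node) -> tuple:
    """Score phase: prefer the node left with the fewest spare GPUs, then least spare CPU."""
    return (-(node.free_gpu - pod.gpu), -(node.free_cpu - pod.cpu))


def schedule(pod: Pod, nodes: list) -> "str | None":
    """Return the chosen node name, or None if the pod would stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name
```

When `schedule` returns `None`, the useful question is which filter predicate removed each node, which is the same information the scheduler puts into a FailedScheduling event.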

Common GPU scheduling traps:

A pod that requests nvidia.com/gpu: 4 needs four free GPUs on one node; four free GPUs spread across four nodes do not help.
A topology constraint can make a resource look unavailable even when allocatable is nonzero.
A stale device plugin can leave allocatable wrong until kubelet/plugin state converges.
Preemption can find a victim but still fail if PDBs, affinity, or resource shape prevent fit.
Autoscaler can add nodes that are not usable until GPU Operator operands validate the stack.
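
The first trap is easy to show with numbers. The sketch below compares fleet-wide free GPUs with per-node fit; the node names and the shape of the input dictionary are assumptions for illustration.

```python
def gpu_fit(free_gpus_by_node: dict, request: int) -> dict:
    """Contrast total free GPUs with what a single node can actually offer."""
    fitting = sorted(
        name for name, free in free_gpus_by_node.items() if free >= request
    )
    return {
        "fleet_free": sum(free_gpus_by_node.values()),
        "largest_single_node": max(free_gpus_by_node.values(), default=0),
        "fitting_nodes": fitting,
    }


fragmented = {"n1": 1, "n2": 1, "n3": 1, "n4": 1}
print(gpu_fit(fragmented, 4))
```

Here the fleet has four free GPUs but no node can host a four-GPU pod, so a capacity dashboard that reports only the fleet total would look healthy while the pod stays Pending.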

Workload APIs

API	Use
Deployment	Stateless services, model servers with normal rollout semantics.
StatefulSet	Stable identity and storage; less common for stateless inference but relevant for stateful control components.
DaemonSet	Node agents: GPU Operator operands, log collectors, CNI, node exporters.
Job/CronJob	Batch tasks, validation, maintenance, model prewarming jobs.
HPA/VPA	Pod scaling by metrics or resource recommendations.
PDB	Keeps voluntary disruptions from draining too much capacity.
PriorityClass	Protects critical serving workloads from lower-priority work.

Scheduling Anatomy

When a pod is pending, inspect:

Requested resources: CPU, memory, nvidia.com/gpu, ephemeral storage.
Node selectors and node affinity.
Taints/tolerations.
Pod anti-affinity and topology spread constraints.
PVC binding.
RuntimeClass.
Image architecture.
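Quotas and LimitRanges: a quota violation rejects pod creation at admission, so check for FailedCreate events on the owning ReplicaSet or Job rather than a Pending pod.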
PodDisruptionBudgets can be ruled out: a PDB only limits voluntary evictions and does not affect initial scheduling.

For GPUs, add:

Device plugin healthy and advertising resources.
Node labels for SKU, driver, MIG mode, topology.
GPU fragmentation: total fleet capacity is not the same as fit on one eligible node.
Workload asks for whole GPUs unless MIG/time-slicing/MPS is explicitly configured.

Debugging By Scope

Scope	Examples	Move
One pod	bad config, image, secret, readiness.	`describe`, logs, events, config diff.
One node	runtime, kubelet, disk, driver, GPU error.	cordon, inspect node, compare healthy peer.
One pool/SKU	driver image, taint, GPU Operator operand, hardware class.	halt rollout, compare pool labels and daemonsets.
One cluster	control plane, CNI, DNS, admission, quota.	check apiserver, controllers, CoreDNS, webhooks.
Global	artifact store, registry, identity, upstream service.	failover, dependency status, rate limits.

flowchart TD
  Symptom[Symptom] --> OnePod{One pod}
  OnePod -- yes --> Spec[Spec, image, env, probes, resources]
  OnePod -- no --> OneNode{One node}
  OneNode -- yes --> Node[Runtime, kubelet, CNI, device plugin]
  OneNode -- no --> OnePool{One pool or zone}
  OnePool -- yes --> Capacity[Capacity, labels, taints, topology]
  OnePool -- no --> Cluster[API server, admission, DNS, CNI, dependency]
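
The decision tree above amounts to asking what the failing pods have in common. A minimal sketch, assuming each failing pod is described by hypothetical `node` and `pool` keys collected from your own inventory:

```python
def failure_scope(failing_pods: list) -> str:
    """Return the narrowest scope shared by all failing pods."""
    if not failing_pods:
        return "none"
    if len(failing_pods) == 1:
        return "pod"
    nodes = {p["node"] for p in failing_pods}
    if len(nodes) == 1:
        return "node"
    pools = {p["pool"] for p in failing_pods}
    if len(pools) == 1:
        return "pool"
    return "cluster"
```

Treat the result as a first guess only. A shared node is much stronger evidence when healthy pods from the same workload are running on other nodes, which this sketch does not check.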

Admission And Policy

Modern platform teams enforce production contracts at admission:

Required labels/owners.
Resource requests and limits.
Allowed registries and signed images.
No privileged pods except approved system namespaces.
Required readiness/liveness/startup probes.
GPU workloads must declare resource envelope and pool selector.
Runtime image and model artifact must be versioned.
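
Several of these contracts are simple checks over the pod manifest. The sketch below shows the kind of logic a policy engine evaluates, written as plain Python over a pod dictionary. The registry prefix, the required label, and the exempt namespace are hypothetical values, and a real deployment would express this in Kyverno, Gatekeeper, or ValidatingAdmissionPolicy rather than in a script.

```python
ALLOWED_REGISTRIES = ("registry.example.com/",)  # hypothetical registry prefix
REQUIRED_LABELS = ("owner",)                     # hypothetical required label
SYSTEM_NAMESPACES = {"kube-system"}              # namespaces allowed to run privileged


def violations(pod: dict) -> list:
    """Return human-readable policy violations for one pod manifest."""
    found = []
    meta = pod.get("metadata", {})
    labels = meta.get("labels", {})
    for label in REQUIRED_LABELS:
        if label not in labels:
            found.append(f"missing label: {label}")
    privileged_ok = meta.get("namespace") in SYSTEM_NAMESPACES
    for c in pod.get("spec", {}).get("containers", []):
        name = c.get("name", "?")
        if not c.get("image", "").startswith(ALLOWED_REGISTRIES):
            found.append(f"{name}: image not from an allowed registry")
        resources = c.get("resources", {})
        if not resources.get("requests") or not resources.get("limits"):
            found.append(f"{name}: requests and limits are required")
        if c.get("securityContext", {}).get("privileged") and not privileged_ok:
            found.append(f"{name}: privileged not allowed in this namespace")
        if "readinessProbe" not in c:
            found.append(f"{name}: readinessProbe is required")
    return found
```

Returning every violation at once, rather than stopping at the first, gives the workload owner one round trip to fix a manifest instead of several.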

Answers That Sound Senior

Weak:

I would check Kubernetes events and restart the pod.

Strong:

I would first identify whether this is a scheduling, startup, readiness, or serving failure. Then I would scope it to pod, node, pool, cluster, or dependency. For a GPU workload, I would verify scheduler constraints and the device plugin, then drop below Kubernetes to check kubelet, container runtime, driver/toolkit compatibility, DCGM, and kernel GPU errors.

Whiteboard Drill

Question: “A deployment is stuck at 50 percent rollout.”

Answer:

Check rollout status and unavailable replicas.
Inspect new ReplicaSet pods: pending, image pull, crash, readiness, or failed scheduling.
Compare old vs new config: image, env, probes, resources, node selectors, secrets, service account.
Check PDB and maxUnavailable/maxSurge.
If GPU-specific: resource requests, device plugin, node pool labels, model warmup, GPU memory.
Mitigate: pause the rollout, roll back if users are impacted, or adjust the rollout strategy.
Prevent: canary gate, startup probe, representative synthetic request, resource-envelope validation.
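
The maxUnavailable/maxSurge step is worth backing with arithmetic, because it explains why a rollout stalls instead of failing outright. The sketch below follows the Deployment rounding rules (percentage maxSurge rounds up, percentage maxUnavailable rounds down, and both cannot resolve to zero at once); the function names and the returned keys are my own.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute count or a percentage string such as '25%'."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-replicas * percent // 100)  # integer ceiling
        return replicas * percent // 100          # integer floor
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Compute the pod-count envelope a rolling update must stay inside."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        unavailable = 1  # otherwise the rollout could never make progress
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
```

With 10 replicas and the default 25 percent settings, the controller may run at most 13 pods and must keep at least 8 available. If new pods never become ready, old pods cannot be scaled down past that floor, so the rollout sits partway through until the new ReplicaSet is fixed or the rollout is rolled back.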