Kubernetes Deep Dive
2026 Baseline
As of June 2026, talk in terms of Kubernetes 1.35/1.36 in production and 1.37 as upcoming. Do not anchor your answers to old assumptions like PodSecurityPolicy, beta Ingress-only traffic management, or hand-managed GPU daemonsets.
Modern Kubernetes vocabulary for this role:
- Gateway API for expressive traffic management.
- Device plugins and Dynamic Resource Allocation concepts for specialized hardware.
- GPU Operator instead of bespoke driver/toolkit installs.
- Kueue for quota-aware AI/ML batch and queued workloads.
- Karpenter or cloud-native node autoscaling for right-sized node provisioning.
- Cilium/eBPF for networking, security, and packet-level visibility.
- OpenTelemetry plus Prometheus-compatible metrics for portable observability.
- GitOps with Argo CD or Flux for declarative cluster state.
- Policy engines: Kyverno, OPA Gatekeeper, ValidatingAdmissionPolicy.
flowchart TD
Desired[Desired state in API server] --> Scheduler[Scheduler]
Scheduler --> Node[Kubelet on selected node]
Node --> Runtime[Container runtime]
Runtime --> Pod[Pod containers]
Controller[Controllers] --> Desired
CNI[CNI] --> Pod
CSI[CSI] --> Pod
Device[Device plugin] --> Node
Observability[Events and metrics] -. feedback .-> Controller
Control Plane
Know the pieces and failure implications:
| Component | What it does | Interview failure mode |
|---|---|---|
| API server | Front door for cluster state. | Latency/errors block deploys and controllers, but existing data plane may keep serving. |
| etcd | Durable cluster state. | Quorum, disk latency, compaction, snapshot restore, backup validation. |
| scheduler | Assigns pods to nodes. | Pending pods, topology constraints, resource fragmentation, stale assumptions. |
| controller manager | Reconciliation controllers. | Deployments, nodes, endpoints, jobs stop converging. |
| cloud controller manager | Cloud provider integration. | Load balancers, routes, node lifecycle integration. |
Strong line:
I separate control-plane availability from data-plane serving. A control-plane issue can prevent changes while existing inference pods continue serving, so mitigation depends on whether the user path or the operations path is impaired.
API Server Internals
The API server is not only a REST endpoint. It is the admission, validation, persistence, and watch boundary for the cluster.
Request path:
- Authentication.
- Authorization.
- Mutating admission.
- Object schema/defaulting/validation.
- Validating admission.
- Persistence to etcd.
- Watch notification to controllers, schedulers, kubelets, and clients.
sequenceDiagram
participant User as Client or controller
participant API as API server
participant Auth as Authn/Authz
participant Admit as Admission chain
participant Etcd as etcd
participant Watch as Watch cache
User->>API: create or update object
API->>Auth: authenticate and authorize
API->>Admit: mutate then validate
API->>Etcd: persist object
Etcd-->>API: committed revision
API->>Watch: publish watch event
Watch-->>User: observed state changes
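The request path above can be sketched as an ordered pipeline in which the first failing stage stops the write. This is a toy model rather than API server code: the stage names follow the list above, while the `Request` class, its `failing_stage` field, and `process` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stage order mirrors the request path listed above.
STAGES = [
    "authentication",
    "authorization",
    "mutating_admission",
    "schema_validation",
    "validating_admission",
    "persist_to_etcd",
    "watch_notification",
]


@dataclass
class Request:
    """Hypothetical request; `failing_stage` simulates where it breaks."""
    name: str
    failing_stage: Optional[str] = None
    completed: List[str] = field(default_factory=list)


def process(request: Request) -> str:
    """Run stages in order and stop at the first one that fails."""
    for stage in STAGES:
        if stage == request.failing_stage:
            return f"rejected at {stage}"
        request.completed.append(stage)
    return "committed"


healthy = Request("deploy-a")
blocked = Request("deploy-b", failing_stage="mutating_admission")
print(process(healthy))   # committed
print(process(blocked))   # rejected at mutating_admission
print(blocked.completed)  # ['authentication', 'authorization']
```

The blocked request never reaches persistence, which is why a timing-out admission webhook looks like a deploy outage even though etcd and the workloads are healthy.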
Failure modes to speak to:
- API server read latency: `kubectl` and controllers slow down, but existing pods may keep serving.
- API server write failure: rollouts, scaling, leader election renewals, and status updates can fail.
- Admission webhook timeout: creates a deploy outage even when workloads are otherwise healthy.
- Watch cache pressure: controllers lag, endpoints stale, and status appears inconsistent.
- Request priority and fairness misconfiguration: noisy controllers can starve human or critical control traffic.
Strong answer:
I distinguish read path, write path, admission path, and watch path. “The API server is slow” is not precise enough; the mitigation differs if admission is blocking writes, etcd is slow, or controllers are simply behind on watches.
Etcd Realities
Etcd is the durable source of Kubernetes state. The main interview points are quorum, latency, compaction, and restore discipline.
Know:
- Etcd needs quorum; losing quorum makes writes unavailable.
- Disk latency matters because every committed write goes through durable storage.
- Large objects and high object churn increase pressure.
- Watch history depends on revisions and compaction.
- Defragmentation and compaction are operational tasks, not trivia.
- Snapshots are only useful if restore is tested.
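The quorum point is worth being able to compute on a whiteboard. A minimal sketch of the majority arithmetic, assuming every member is a voting member (the function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while writes stay available."""
    return members - quorum(members)


for size in (1, 3, 4, 5):
    print(size, quorum(size), failure_tolerance(size))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-member cluster tolerates the same single failure as a three-member one, which is why odd sizes are the usual choice.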
Incident drill:
Deploys are failing and API server logs show etcd request timeouts.
Senior flow:
- Stop nonessential controllers or deploy automation if they are amplifying writes.
- Check etcd quorum, leader changes, disk latency, database size, and network between control-plane nodes.
- Determine whether data-plane serving is still healthy.
- Avoid mass restarts of control-plane components without evidence.
- If restore is needed, follow the tested snapshot restore process and validate cluster revision behavior.
- After recovery, reduce object churn, tune compaction/defrag, and add alerts on latency before timeout.
Data Plane
Key layers:
- Kubelet: pod lifecycle, volume mount, device allocation, node status.
- Container runtime: containerd/CRI-O, image pulls, runtime hooks.
- CNI: pod networking, service routing, network policy.
- CSI: volume provisioning and mount.
- Device plugins: GPUs, NICs, FPGAs, other specialized resources.
- Node OS/kernel/driver stack.
Kubelet And Pod Startup Internals
Kubelet is the node-side reconciler. It observes assigned pods, prepares dependencies, asks CRI to run containers, reports status, and executes probes.
Startup path:
flowchart TD
Assigned[Pod assigned to node] --> Pull[Pull images]
Pull --> Volumes[Mount volumes and projected secrets]
Volumes --> Devices[Allocate devices and runtime class]
Devices --> Sandbox[Create pod sandbox]
Sandbox --> Containers[Start init and app containers]
Containers --> Probes[Startup and readiness probes]
Probes --> Ready[Ready condition and EndpointSlice update]
Failure interpretation:
- `ContainerCreating`: image pull, volume mount, CNI, sandbox, or device allocation.
- `CrashLoopBackOff`: process exits after container starts; look at previous logs and exit code.
- Ready false: app/probe/sidecar/model readiness, not necessarily Kubernetes failure.
- Running but no endpoint: readiness gate, selector mismatch, EndpointSlice controller lag, or route binding issue.
- GPU allocated but invisible: device plugin allocation, container runtime hook, driver/toolkit, or security context.
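The first three interpretations above can be sketched as a small classifier over a Pod object decoded into a dict. The field names (`status.phase`, `containerStatuses[].state.waiting.reason`, `conditions[].type`) follow the Pod API; the category labels and the `classify` function are hypothetical, and a real triage would also read events and logs.

```python
def classify(pod: dict) -> str:
    """Map a pod's status to a coarse failure category."""
    status = pod.get("status", {})

    # Waiting reasons identify startup and crash-loop failures.
    for container in status.get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting")
        if not waiting:
            continue
        reason = waiting.get("reason", "")
        if reason == "ContainerCreating":
            return "startup"   # image pull, volume, CNI, sandbox, device
        if reason == "CrashLoopBackOff":
            return "crash"     # previous logs and exit code

    ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    if status.get("phase") == "Running" and not ready:
        return "readiness"     # app, probe, sidecar, or model readiness
    if ready:
        return "ready"         # if traffic fails, check selectors/EndpointSlices
    return "unknown"


creating = {"status": {"phase": "Pending", "containerStatuses": [
    {"state": {"waiting": {"reason": "ContainerCreating"}}}]}}
not_ready = {"status": {"phase": "Running",
                        "containerStatuses": [{"state": {"running": {}}}],
                        "conditions": [{"type": "Ready", "status": "False"}]}}
print(classify(creating))   # startup
print(classify(not_ready))  # readiness
```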
Scheduler Internals
The scheduler generally works through queue, filter, score, reserve, permit, bind, then post-bind behavior. For interviews, use those phases to explain pending pods precisely.
flowchart LR
Pending[Pending pod] --> Queue[Scheduling queue]
Queue --> Filter[Filter infeasible nodes]
Filter --> Score[Score feasible nodes]
Score --> Reserve[Reserve resources]
Reserve --> Permit[Permit or wait]
Permit --> Bind[Bind pod to node]
Bind --> Kubelet[Kubelet starts pod]
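The filter and score phases are the two that explain most pending pods, so a toy version helps. This sketch is not the real scheduler: the `Node` and `PodRequest` types are hypothetical, taints are reduced to a boolean, and the bin-packing score is an invented policy chosen only to show that scoring runs after filtering.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float
    free_gpus: int
    tainted: bool = False


@dataclass
class PodRequest:
    cpu: float
    gpus: int
    tolerates_taint: bool = False


def filter_nodes(pod: PodRequest, nodes: List[Node]) -> List[Node]:
    """Filter phase: drop nodes where the pod cannot run at all."""
    return [
        n for n in nodes
        if n.free_cpu >= pod.cpu
        and n.free_gpus >= pod.gpus
        and (not n.tainted or pod.tolerates_taint)
    ]


def score(pod: PodRequest, node: Node) -> int:
    """Score phase (toy policy): prefer the node left with fewest free GPUs."""
    return -(node.free_gpus - pod.gpus)


def schedule(pod: PodRequest, nodes: List[Node]) -> Optional[str]:
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    return max(feasible, key=lambda n: score(pod, n)).name


nodes = [Node("a", 8, 8), Node("b", 8, 2), Node("c", 1, 4)]
print(schedule(PodRequest(cpu=2, gpus=2), nodes))   # b
print(schedule(PodRequest(cpu=2, gpus=16), nodes))  # None
```

When the result is `None`, the useful interview answer names which filter removed each node, not just that the pod is pending.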
Common GPU scheduling traps:
- A pod requests `nvidia.com/gpu: 4`; four free GPUs spread across four nodes do not help, because the pod must fit on a single node.
- A topology constraint can make a resource look unavailable even when allocatable is nonzero.
- A stale device plugin can leave allocatable wrong until kubelet/plugin state converges.
- Preemption can find a victim but still fail if PDBs, affinity, or resource shape prevent fit.
- Autoscaler can add nodes that are not usable until GPU Operator operands validate the stack.
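The first trap, fleet capacity versus per-node fit, comes down to a sum versus a per-node comparison. A minimal sketch, assuming a hypothetical map of free GPUs per node:

```python
from typing import Dict, List


def fleet_free(free_by_node: Dict[str, int]) -> int:
    """Total free GPUs across the fleet; what a capacity dashboard shows."""
    return sum(free_by_node.values())


def nodes_that_fit(request: int, free_by_node: Dict[str, int]) -> List[str]:
    """Nodes with enough free GPUs for one pod; what the scheduler needs."""
    return sorted(name for name, free in free_by_node.items() if free >= request)


free = {"node-a": 1, "node-b": 1, "node-c": 1, "node-d": 1}
print(fleet_free(free))         # 4
print(nodes_that_fit(4, free))  # []
print(nodes_that_fit(1, free))  # ['node-a', 'node-b', 'node-c', 'node-d']
```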
Workload APIs
| API | Use |
|---|---|
| Deployment | Stateless services, model servers with normal rollout semantics. |
| StatefulSet | Stable identity/storage, less common for stateless inference but relevant for stateful control components. |
| DaemonSet | Node agents: GPU Operator operands, log collectors, CNI, node exporters. |
| Job/CronJob | Batch tasks, validation, maintenance, model prewarming jobs. |
| HPA/VPA | Pod scaling by metrics or resource recommendations. |
| PDB | Keeps voluntary disruptions from draining too much capacity. |
| PriorityClass | Protects critical serving workloads from lower-priority work. |
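For the HPA row, it helps to know the replica calculation the Kubernetes documentation describes: desired replicas are the current replicas scaled by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below covers only that core formula; the function name and the `min_replicas`/`max_replicas` defaults are illustrative, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(
    current: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Core HPA formula: ceil(current * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
print(desired_replicas(4, current_metric=63, target_metric=60))  # 4
```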
Scheduling Anatomy
When a pod is pending, inspect:
- Requested resources: CPU, memory, `nvidia.com/gpu`, ephemeral storage.
- Node selectors and node affinity.
- Taints/tolerations.
- Pod anti-affinity and topology spread constraints.
- PVC binding.
- RuntimeClass.
- Image architecture.
- Quotas and LimitRanges.
- PDBs can be ruled out: a PDB only affects eviction, not initial scheduling.
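Taints and tolerations are the item on this list that people most often reason about incorrectly, so here is a sketch of the matching rule. The dict keys mirror the `key`, `operator`, `value`, and `effect` fields of the Node and Pod APIs; the function names are hypothetical, and treating an empty key with `Equal` as a non-match is a simplification (the API rejects that combination at validation).

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """True when one toleration matches one taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False  # an empty key (with Exists) matches every key
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return bool(key) and toleration.get("value", "") == taint.get("value", "")


def blocking_taints(taints: List[dict], tolerations: List[dict]) -> List[dict]:
    """Hard taints that no toleration covers; PreferNoSchedule is only a preference."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
exists = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
print(blocking_taints([gpu_taint], []))        # [gpu_taint]: pod stays Pending
print(blocking_taints([gpu_taint], [exists]))  # []
```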
For GPUs, add:
- Device plugin healthy and advertising resources.
- Node labels for SKU, driver, MIG mode, topology.
- GPU fragmentation: total fleet capacity is not the same as fit on one eligible node.
- Workload asks for whole GPUs unless MIG/time-slicing/MPS is explicitly configured.
Debugging By Scope
| Scope | Examples | Move |
|---|---|---|
| One pod | bad config, image, secret, readiness. | describe, logs, events, config diff. |
| One node | runtime, kubelet, disk, driver, GPU error. | cordon, inspect node, compare healthy peer. |
| One pool/SKU | driver image, taint, GPU Operator operand, hardware class. | halt rollout, compare pool labels and daemonsets. |
| One cluster | control plane, CNI, DNS, admission, quota. | check apiserver, controllers, CoreDNS, webhooks. |
| Global | artifact store, registry, identity, upstream service. | failover, dependency status, rate limits. |
flowchart TD
Symptom[Symptom] --> OnePod{One pod}
OnePod -- yes --> Spec[Spec, image, env, probes, resources]
OnePod -- no --> OneNode{One node}
OneNode -- yes --> Node[Runtime, kubelet, CNI, device plugin]
OneNode -- no --> OnePool{One pool or zone}
OnePool -- yes --> Capacity[Capacity, labels, taints, topology]
OnePool -- no --> Cluster[API server, admission, DNS, CNI, dependency]
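The decision tree above can be sketched as a function over the set of failing pods. The `Failure` tuple and `classify_scope` are hypothetical, and the result is only a first guess: it should be confirmed by comparing against healthy peers in the same node, pool, or cluster.

```python
from typing import List, NamedTuple


class Failure(NamedTuple):
    pod: str
    node: str
    pool: str


def classify_scope(failures: List[Failure]) -> str:
    """Return the narrowest scope that contains every failing pod."""
    if not failures:
        return "none"
    if len({f.pod for f in failures}) == 1:
        return "pod"      # spec, image, env, probes, resources
    if len({f.node for f in failures}) == 1:
        return "node"     # runtime, kubelet, CNI, device plugin
    if len({f.pool for f in failures}) == 1:
        return "pool"     # capacity, labels, taints, topology
    return "cluster or dependency"


failures = [
    Failure("infer-1", "node-a", "gpu-pool-1"),
    Failure("infer-2", "node-b", "gpu-pool-1"),
]
print(classify_scope(failures))  # pool
```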
Admission And Policy
Modern platform teams enforce production contracts at admission:
- Required labels/owners.
- Resource requests and limits.
- Allowed registries and signed images.
- No privileged pods except approved system namespaces.
- Required readiness/liveness/startup probes.
- GPU workloads must declare resource envelope and pool selector.
- Runtime image and model artifact must be versioned.
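Several of these contracts can be illustrated as checks over a Pod manifest decoded into a dict. In a real cluster they would be expressed in Kyverno, Gatekeeper, or ValidatingAdmissionPolicy; this Python sketch only shows the shape of the rules. The registry `registry.example.com/`, the `owner` label, and the `violations` function are assumptions made for the example.

```python
from typing import List

ALLOWED_REGISTRIES = ("registry.example.com/",)  # hypothetical registry
REQUIRED_LABELS = ("owner",)                     # hypothetical label contract
SYSTEM_NAMESPACES = ("kube-system",)


def is_pinned(image: str) -> bool:
    """True when the image has a digest or a tag other than `latest`."""
    last = image.rsplit("/", 1)[-1]
    if "@" in last:
        return True
    return ":" in last and not last.endswith(":latest")


def violations(pod: dict) -> List[str]:
    """Collect every contract the pod breaks instead of stopping at the first."""
    found = []
    meta = pod.get("metadata", {})
    namespace = meta.get("namespace", "default")
    for label in REQUIRED_LABELS:
        if label not in meta.get("labels", {}):
            found.append(f"missing label: {label}")
    for c in pod.get("spec", {}).get("containers", []):
        name = c.get("name", "unnamed")
        image = c.get("image", "")
        if not image.startswith(ALLOWED_REGISTRIES):
            found.append(f"{name}: registry not allowed")
        if not is_pinned(image):
            found.append(f"{name}: image not versioned")
        if not c.get("resources", {}).get("requests"):
            found.append(f"{name}: missing resource requests")
        privileged = c.get("securityContext", {}).get("privileged", False)
        if privileged and namespace not in SYSTEM_NAMESPACES:
            found.append(f"{name}: privileged outside system namespaces")
        if "readinessProbe" not in c:
            found.append(f"{name}: missing readiness probe")
    return found


bad = {"metadata": {"namespace": "ml", "labels": {}},
       "spec": {"containers": [{"name": "server",
                                "image": "docker.io/library/nginx:latest",
                                "securityContext": {"privileged": True}}]}}
for problem in violations(bad):
    print(problem)
```

Returning every violation at once keeps the rejection message actionable, which matters when the same policy blocks many teams' deploys.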
Answers That Sound Senior
Weak:
I would check Kubernetes events and restart the pod.
Strong:
I would first identify whether this is a scheduling, startup, readiness, or serving failure. Then I would scope it to pod, node, pool, cluster, or dependency. For a GPU workload, I would verify scheduler constraints and the device plugin, then drop below Kubernetes to check kubelet, container runtime, driver/toolkit compatibility, DCGM, and kernel GPU errors.
Whiteboard Drill
Question: “A deployment is stuck at 50 percent rollout.”
Answer:
- Check rollout status and unavailable replicas.
- Inspect new ReplicaSet pods: pending, image pull, crash, readiness, or failed scheduling.
- Compare old vs new config: image, env, probes, resources, node selectors, secrets, service account.
- Check PDB and maxUnavailable/maxSurge.
- If GPU-specific: resource requests, device plugin, node pool labels, model warmup, GPU memory.
- Mitigate: pause the rollout, roll back if users are impacted, or adjust the rollout strategy.
- Prevent: canary gate, startup probe, representative synthetic request, resource-envelope validation.
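The maxUnavailable/maxSurge check is easier to reason about with numbers. For a rolling update, percentage values are resolved against the replica count, with maxSurge rounded up and maxUnavailable rounded down; both default to 25%. A minimal sketch of that arithmetic (the function names are illustrative, and the special case where both resolve to zero is not handled):

```python
from typing import Dict, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a 'NN%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer math avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(
    replicas: int,
    max_surge: IntOrPercent = "25%",
    max_unavailable: IntOrPercent = "25%",
) -> Dict[str, int]:
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
# {'max_total_pods': 13, 'min_available_pods': 8}
print(rollout_bounds(10, max_surge=0, max_unavailable="50%"))
# {'max_total_pods': 10, 'min_available_pods': 5}
```

If new pods never become Ready, the controller cannot take old pods below `min_available_pods` or create more than `max_total_pods`, so the rollout stalls at a point these two numbers predict.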