Linux, Networking, Storage

Senior infrastructure interviews often test whether you can debug beneath Kubernetes. For GPU inference, many failures surface as pod symptoms but originate in the host, network, storage, or runtime.

flowchart TD
  SLO[SLO symptom] --> App[Application and Triton]
  App --> Container[Container runtime]
  Container --> Kernel[Linux kernel]
  Kernel --> Network[Network stack]
  Kernel --> Storage[Storage and filesystem]
  Kernel --> Device[GPU device and driver]
  Network --> NetEvidence[tcpdump, ss, conntrack, DNS]
  Storage --> StorageEvidence[iostat, mount, cache, object store]
  Device --> DeviceEvidence[nvidia-smi, DCGM, dmesg]
| Area | Commands and signals |
| --- | --- |
| CPU/memory | top, htop, vmstat, free, cgroup stats, OOM events. |
| Disk | df -h, du, iostat, inode exhaustion, container image garbage collection. |
| Kernel | dmesg, journal logs, OOM killer, driver messages, GPU Xid events. |
| Process | ps, strace, file descriptors, zombie processes, stuck syscalls. |
| Runtime | containerd/docker logs, kubelet logs, image pull failures, runtime hooks. |
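
As one illustration of the "cgroup stats, OOM events" signal, the sketch below parses the cgroup v2 `memory.events` file, which holds one `key value` pair per line. The function names and the three summary labels are assumptions made for this example, and the path in the usage note assumes a cgroup v2 host viewed from inside the container.

```python
def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse a cgroup v2 'flat keyed' file: one 'key value' pair per line."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            values[parts[0]] = int(parts[1])
    return values


def oom_summary(events_text: str) -> str:
    """Classify memory.events content (labels are illustrative, not standard)."""
    events = parse_flat_keyed(events_text)
    if events.get("oom_kill", 0) > 0:
        return "oom_kill"          # a process in this cgroup was OOM-killed
    if events.get("max", 0) > 0:
        return "hit_limit"         # usage reached memory.max at least once
    return "no_memory_events"
```

A typical use would be `oom_summary(open("/sys/fs/cgroup/memory.events").read())` from inside the pod, compared against `dmesg` on the node to separate a cgroup-limit kill from a node-level OOM.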

Senior-level debugging usually means knowing when Kubernetes abstractions have ended.

| Symptom | Substrate issue | Evidence |
| --- | --- | --- |
| Pod killed with no app stack | cgroup memory limit, node OOM, kernel OOM killer, or eviction. | kubectl describe pod, kubelet logs, dmesg, cgroup memory events. |
| Latency spikes under CPU “idle” | CPU throttling, steal time, IRQ pressure, noisy host processes, NUMA locality. | cgroup throttling metrics, mpstat, pidstat, scheduler stats, node exporter metrics. |
| Fast local test, slow in cluster | DNS search path, MTU, conntrack, proxy/mesh, TLS handshakes, or network policy. | packet capture, Hubble/flow logs, ss, DNS metrics, gateway spans. |
| Image/model load stalls | Disk pressure, inode exhaustion, slow overlayfs, registry/object-store latency. | kubelet events, container runtime logs, iostat, registry/object-store metrics. |
| GPU app hangs | Driver reset, Xid, blocked syscall, runtime hook, PCIe/NVLink issue. | dmesg, DCGM, nvidia-smi, process state, driver/runtime logs. |
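
The "cgroup throttling metrics" evidence can be made concrete with a small sketch. It assumes cgroup v2, where `cpu.stat` exposes `nr_periods` and `nr_throttled` counters when the CPU controller is enabled, and it compares two snapshots taken some seconds apart. The function names are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 cpu.stat: one 'key value' pair per line."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before_text: str, after_text: str) -> float:
    """Fraction of CFS enforcement periods that were throttled between snapshots."""
    before = parse_cpu_stat(before_text)
    after = parse_cpu_stat(after_text)
    periods = after.get("nr_periods", 0) - before.get("nr_periods", 0)
    throttled = after.get("nr_throttled", 0) - before.get("nr_throttled", 0)
    return throttled / periods if periods > 0 else 0.0
```

A high ratio while node-level CPU looks idle points at the container's CPU limit rather than host saturation, which is the pattern in the second table row.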

Interview phrase:

I move downward only with a hypothesis. Kubernetes tells me where the symptom surfaced; Linux tells me what the node actually did.

Topics to know:

  • DNS resolution path and caching.
  • Load balancer health checks.
  • TCP connection states.
  • MTU mismatch and packet fragmentation.
  • Conntrack exhaustion.
  • TLS handshake failures.
  • Network policy / security group / firewall.
  • CNI plugin behavior.
  • Cross-zone or cross-region latency.
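
For the TCP connection states item, a minimal sketch that tallies states from the text of `/proc/net/tcp` (IPv4 sockets only) is shown below. The state codes follow the kernel's TCP state numbering; the function name is an assumption. It gives roughly the same view as grouping `ss -tan` output by state.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str) -> Counter:
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts: Counter = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts
```

Many `CLOSE_WAIT` sockets usually mean the local application is not closing connections; many `TIME_WAIT` sockets usually mean high churn of short-lived outbound connections.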

Useful commands:

ss -tanp
ip addr
ip route
dig service.namespace.svc.cluster.local
curl -v --resolve host:443:ip https://host/health
tcpdump -i any host <ip>
conntrack -S
  • Conntrack exhaustion can look like random connection failures or intermittent 503s.
  • DNS search path amplification can create high CoreDNS load: with a search list and a high ndots value, one short hostname lookup fans out into several queries.
  • NodeLocal DNS cache can reduce latency but is another component to monitor.
  • MTU mismatch often appears only on larger payloads or cross-overlay paths.
  • Retries can multiply connections and make conntrack/gateway saturation worse.
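
The search path amplification point can be sketched as a small model of the glibc-style rule described in resolv.conf(5): a name with fewer dots than `ndots` is tried against each search domain before being tried as an absolute name. The function name is hypothetical, and other resolvers (musl, for example) behave differently, so treat this as an approximation.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list expansion
    absolute = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try as-is first
    return expanded + [absolute]      # too few dots: search list first
```

With a typical pod search list of three cluster domains and `ndots:5`, `candidate_queries("api.example.com", search, ndots=5)` yields four names, and each may be queried for both A and AAAA records. A trailing dot on the hostname avoids the expansion.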

Hard drill:

A small health check succeeds, but real inference requests intermittently time out across zones.

Answer flow:

  1. Compare payload sizes and paths.
  2. Check gateway and client timeouts.
  3. Inspect MTU and fragmentation symptoms.
  4. Inspect conntrack saturation and connection states.
  5. Test pod-to-pod, pod-to-service, and external-to-gateway paths.
  6. Use packet capture or Cilium/Hubble if available.
  7. Mitigate with route shift, retry budget reduction, or path-specific rollback.
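
Steps 1 and 3 can be reasoned about with simple arithmetic. The sketch below assumes IPv4 and TCP headers of 20 bytes each with no options, and a hypothetical path whose effective MTU is 1450 (for example, a 1500-byte underlay minus 50 bytes of VXLAN overhead). It ignores MSS negotiation and path MTU discovery, which normally correct the mismatch unless the relevant ICMP messages are dropped.

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, no TCP options


def mss(mtu: int) -> int:
    """Largest TCP payload per packet for a given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits_path(payload_bytes: int, sender_mtu: int, path_mtu: int) -> bool:
    """True if the largest packet the sender emits fits the path MTU."""
    largest_segment = min(payload_bytes, mss(sender_mtu))
    packet_size = largest_segment + IPV4_HEADER + TCP_HEADER
    return packet_size <= path_mtu
```

Under these assumptions, `fits_path(200, 1500, 1450)` is true and `fits_path(1_000_000, 1500, 1450)` is false: the health check fits in one small packet, while a large inference payload produces full-size segments that exceed the path MTU.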

Inference depends on model artifacts. Slow or unreliable artifact paths create production problems:

  • Model downloads delay scale-out.
  • Partial or corrupt artifacts break load.
  • Shared storage can become a bottleneck.
  • Cache invalidation can serve old models.
  • Artifact permissions and encryption can fail after role changes.

Strong answer:

I would treat model artifacts as immutable versioned releases with checksums, provenance, and preflight validation. Scale-out should not depend on every new pod pulling a huge artifact from a fragile path at the same time.

Artifact and volume failures are often state failures:

  • A mutable “latest” model path can make rollback lie.
  • A partially downloaded model can pass existence checks but fail load or correctness.
  • Multi-object publishes are not atomic, so object-store eventual consistency or caching can expose new config before all artifacts are available.
  • A shared PVC can become a thundering-herd bottleneck during scale-out.
  • Node-local cache can serve stale artifacts without checksum validation.
  • CSI mount latency can look like Kubernetes scheduling delay.

Staff answer:

I want artifact state to be content-addressed or version-addressed, checksum-validated, prewarmed where latency matters, and observable as a first-class deployment dependency.
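
One possible shape for content-addressed state is a node-local cache keyed by digest, written with a temp file and an atomic rename so readers never see a partial artifact. This is a sketch under simplifying assumptions: the artifact is passed as in-memory bytes (a real model would be streamed), and `install_artifact` and the `sha256-<digest>` naming are hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def install_artifact(data: bytes, expected_sha256: str, cache_dir: Path) -> Path:
    """Store data under a digest-derived name; refuse content that does not match."""
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"digest mismatch: got {digest}")
    cache_dir.mkdir(parents=True, exist_ok=True)
    final_path = cache_dir / f"sha256-{digest}"
    if final_path.exists():
        return final_path  # same digest means same content: reuse the cached copy
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, final_path)  # atomic within one filesystem
    return final_path
```

Because the path encodes the content, a rollback to an earlier version resolves to a different file rather than to whatever a mutable "latest" path currently holds.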

Debug Drill: New Pods Time Out During Scale-Out

Hypotheses:

  • Model artifact download is slow.
  • Image pull is slow.
  • GPU nodes are available but warmup takes longer than autoscaler assumes.
  • DNS or service discovery is delayed.
  • Readiness probe is too shallow or too aggressive.
  • CPU or disk bottleneck during model load.

Evidence:

  • Pod event timeline.
  • Image pull duration.
  • Model load metrics.
  • Disk/network throughput.
  • Readiness transition time.
  • Autoscaler decision logs.
  • Request queue and timeout data.
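
The evidence above is easiest to compare once it is on one timeline. The sketch below turns a list of timestamped milestones into per-phase durations. The milestone names in the usage note are assumptions: `Scheduled`, `Pulled`, and `Started` mirror common pod event reasons, while `ModelLoaded` stands in for an application-level marker you would have to emit yourself.

```python
from datetime import datetime


def phase_durations(events: list[tuple[str, str]]) -> dict[str, float]:
    """Seconds between consecutive milestones given (name, ISO 8601 timestamp) pairs."""
    ordered = sorted((datetime.fromisoformat(ts), name) for name, ts in events)
    durations: dict[str, float] = {}
    for (start, start_name), (end, end_name) in zip(ordered, ordered[1:]):
        durations[f"{start_name}->{end_name}"] = (end - start).total_seconds()
    return durations


def slowest_phase(events: list[tuple[str, str]]) -> str:
    """Name of the phase that took the longest."""
    durations = phase_durations(events)
    return max(durations, key=durations.get)
```

If `Started->ModelLoaded` dominates, the bottleneck is artifact download or model load rather than scheduling or image pull, and the autoscaler's warmup assumption is the thing to revisit.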

Do not just list commands. Explain how you narrow layers:

  1. Is the failure request-specific, pod-specific, node-specific, pool-specific, or global?
  2. Did it correlate with deploy, capacity event, node update, traffic shift, or dependency change?
  3. Can I reproduce from inside the pod, on the node, and from outside the cluster?
  4. Which observation would falsify my leading hypothesis?