
Networking and Traffic

Know these:

  • Service, EndpointSlice, kube-proxy or eBPF replacement.
  • Ingress for common and legacy HTTP routing.
  • Gateway API for role-oriented L4/L7 routing.
  • Service mesh when mTLS, retries, traffic splitting, and deep telemetry are needed.
  • Cilium/eBPF for networking, policy, observability, and sometimes kube-proxy replacement.
  • NetworkPolicy and CiliumNetworkPolicy.
  • CoreDNS and node-local DNS cache patterns.

Request path (Mermaid source):

flowchart LR
  Client[Client] --> LB[Load balancer]
  LB --> Gateway[Gateway API]
  Gateway --> Mesh[Mesh or eBPF datapath]
  Mesh --> Service[Kubernetes Service]
  Service --> Endpoint[EndpointSlice]
  Endpoint --> Pod[Inference pod]
  Pod --> Triton[Triton]

Gateway API is the modern Kubernetes project for extensible traffic routing. It separates roles:

  • Platform team owns GatewayClass and Gateway.
  • App team owns HTTPRoute, GRPCRoute, TLSRoute, or related routes.

Interview use:

I would prefer Gateway API over one-off ingress annotations when platform and application teams need a stable contract for routing, TLS, traffic splitting, and ownership boundaries.
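
To make the ownership split concrete, here is a minimal sketch of the app-team side: an HTTPRoute that attaches to a platform-owned Gateway and splits traffic by weight. It is written as a Python dict so it can be dumped to JSON and applied with kubectl. The route, namespace, Gateway, and Service names and the port are all hypothetical; only the field names (`parentRefs`, `rules`, `backendRefs`, `weight`) come from the Gateway API HTTPRoute resource.

```python
import json


def build_http_route(name, namespace, gateway_name, gateway_namespace,
                     backends, port=8000):
    """Build an HTTPRoute manifest that splits traffic across Services.

    `backends` maps a Service name to an integer weight. All names passed
    in are illustrative assumptions, not values from a real cluster.
    """
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The Gateway is owned by the platform team; the route only
            # references it.
            "parentRefs": [
                {"name": gateway_name, "namespace": gateway_namespace}
            ],
            "rules": [
                {
                    "backendRefs": [
                        {"name": service, "port": port, "weight": weight}
                        for service, weight in backends.items()
                    ]
                }
            ],
        },
    }


# Hypothetical 90/10 canary between two model versions.
route = build_http_route(
    name="inference-route",
    namespace="ml-serving",
    gateway_name="shared-gateway",
    gateway_namespace="platform",
    backends={"triton-v1": 90, "triton-v2": 10},
)
print(json.dumps(route, indent=2))
```

Because the route lives in a different namespace from the Gateway in this sketch, the Gateway's listener would also need to allow routes from that namespace (`allowedRoutes`), which is the platform team's side of the contract.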

Why Cilium matters:

  • eBPF-based datapath.
  • Network policy enforcement.
  • Observability into flows.
  • Potential kube-proxy replacement.
  • Hubble for network visibility.
  • Useful when debugging service-to-service connectivity and policy drops.

Staff caution:

eBPF gives powerful visibility and performance, but it does not remove the need to understand DNS, routing, MTU, conntrack-like limits, load balancing, and policy semantics.

Use a mesh when:

  • You need mTLS everywhere.
  • You need fine-grained traffic splitting.
  • You need retries/timeouts/circuit breaking as policy.
  • You need service-to-service telemetry.

Be careful when:

  • Latency overhead matters.
  • GPU inference responses are streaming or long-lived.
  • Retry storms can amplify load.
  • Sidecars or ambient mesh behavior complicates debugging.

Traffic controls for inference serving:

  • Canary by model version.
  • Shadow traffic for validation.
  • Weighted routing.
  • Header/tenant-based routing.
  • Circuit breakers and concurrency limits.
  • Request deadlines.
  • Retry budgets, not unlimited retries.
  • Load shedding for low-priority tenants.
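
The "retry budgets, not unlimited retries" item can be sketched as a small sliding-window counter: retries are allowed only up to a fraction of recent request volume, so a failing backend cannot be hit with amplified load. This is an illustrative, single-threaded sketch, not how any particular mesh or gateway implements it; the class name, the default ratio, and the window length are assumptions.

```python
import time
from collections import deque


class RetryBudget:
    """Permit retries up to `ratio` of requests seen in a sliding window."""

    def __init__(self, ratio=0.1, min_retries=3, window_seconds=10.0,
                 clock=time.monotonic):
        self.ratio = ratio
        self.min_retries = min_retries
        self.window_seconds = window_seconds
        self._clock = clock
        self._requests = deque()  # timestamps of first attempts
        self._retries = deque()   # timestamps of granted retries

    def _expire(self, now):
        # Drop timestamps that have fallen out of the window.
        cutoff = now - self.window_seconds
        for timestamps in (self._requests, self._retries):
            while timestamps and timestamps[0] <= cutoff:
                timestamps.popleft()

    def record_request(self):
        """Call once per original (non-retry) request."""
        now = self._clock()
        self._expire(now)
        self._requests.append(now)

    def try_acquire_retry(self):
        """Return True if a retry is allowed right now, else False."""
        now = self._clock()
        self._expire(now)
        allowed = self.min_retries + int(self.ratio * len(self._requests))
        if len(self._retries) < allowed:
            self._retries.append(now)
            return True
        return False
```

A caller would invoke `record_request()` for every first attempt and check `try_acquire_retry()` before each retry, failing fast when it returns False. Pairing this with a per-request deadline keeps a denied retry from turning into an unbounded wait.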

Layered checks:

  1. Client-side timeout and retry behavior.
  2. Gateway/ingress logs and route config.
  3. Service endpoints and EndpointSlices.
  4. Pod readiness transitions.
  5. Network policy drops.
  6. DNS resolution and TTL behavior.
  7. Connection pool exhaustion.
  8. Node-level networking or CNI issues.
  9. Upstream model server saturation.
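
For checks 3 and 4, one quick way to see whether a Service has any ready backends is to summarize its EndpointSlices. The sketch below assumes the input is the parsed JSON from something like `kubectl get endpointslices -l kubernetes.io/service-name=<service> -o json`; the function name and the sample data are hypothetical.

```python
def summarize_endpoints(endpointslice_list):
    """Split endpoint addresses into ready and not-ready lists.

    Expects a parsed EndpointSliceList (a dict with an "items" key).
    """
    ready, not_ready = [], []
    for endpoint_slice in endpointslice_list.get("items", []):
        for endpoint in endpoint_slice.get("endpoints") or []:
            conditions = endpoint.get("conditions") or {}
            # An unset "ready" condition means unknown, which consumers
            # are generally expected to treat as ready.
            is_ready = conditions.get("ready") is not False
            bucket = ready if is_ready else not_ready
            bucket.extend(endpoint.get("addresses", []))
    return {"ready": ready, "not_ready": not_ready}


# Hypothetical sample: one ready pod, one failing its readiness probe.
sample = {
    "items": [
        {
            "endpoints": [
                {"addresses": ["10.0.1.5"], "conditions": {"ready": True}},
                {"addresses": ["10.0.2.9"], "conditions": {"ready": False}},
            ]
        }
    ]
}
summary = summarize_endpoints(sample)
print(summary)
```

Running this repeatedly during a rollout shows whether 503s line up with windows where the ready list is empty or shrinking.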

Triage flow for an intermittent 503 (Mermaid source):

flowchart TD
  Error[Intermittent 503] --> Endpoints{Healthy endpoints exist}
  Endpoints -- no --> Readiness[Readiness, rollout, EndpointSlice]
  Endpoints -- yes --> Policy{Policy or mTLS change}
  Policy -- yes --> PolicyFix[NetworkPolicy, mesh auth, certs]
  Policy -- no --> Overload{Retries or overload}
  Overload -- yes --> Retry[Retry budget, circuit breaking, queue]
  Overload -- no --> Routing[Gateway route weights and DNS]

DNS evidence to collect:

  • CoreDNS errors and latency.
  • Node-local DNS cache health.
  • Query volume and NXDOMAIN rate.
  • Pod /etc/resolv.conf.
  • Search path amplification.
  • External DNS dependency.
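
Search path amplification is easier to reason about with a worked example. The sketch below approximates how a stub resolver expands a name using the `search` and `ndots` settings from the pod's /etc/resolv.conf: names with fewer dots than `ndots` are tried against every search domain before being tried as written. The function name is hypothetical, and the search list and `ndots:5` mirror what a pod in a namespace called `inference` would typically get under the default cluster DNS policy.

```python
def candidate_queries(name, search, ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as written first.
        return [absolute] + searched
    # Too few dots: walk the search list first, then the name as written.
    return searched + [absolute]


search = [
    "inference.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]
for query in candidate_queries("api.example.com", search):
    print(query)
```

Here an external name with two dots produces three lookups that are expected to return NXDOMAIN before the real one, and resolvers commonly ask for both A and AAAA records per candidate. Writing the name with a trailing dot (`api.example.com.`) collapses this to a single candidate.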

Mitigation:

  • Cache where safe.
  • Reduce excessive DNS queries.
  • Pin critical dependencies through stable service discovery.
  • Add DNS SLOs and synthetic checks.
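
A synthetic DNS check can be as small as timing a lookup and comparing it with a latency target. The sketch below uses `socket.getaddrinfo` from the standard library; the function name, the 100 ms threshold, and the result fields are assumptions for illustration. The resolver and clock are injectable so the probe can be exercised without a network.

```python
import socket
import time


def probe_dns(hostname, slo_seconds=0.1, resolver=socket.getaddrinfo,
              clock=time.monotonic):
    """Resolve `hostname` once and report latency against an SLO target."""
    start = clock()
    try:
        results = resolver(hostname, None)
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address.
        addresses = sorted({result[4][0] for result in results})
        error = None
    except socket.gaierror as exc:
        addresses, error = [], str(exc)
    latency = clock() - start
    return {
        "hostname": hostname,
        "ok": error is None,
        "within_slo": error is None and latency <= slo_seconds,
        "latency_seconds": latency,
        "addresses": addresses,
        "error": error,
    }


if __name__ == "__main__":
    print(probe_dns("localhost"))
```

Note that `getaddrinfo` takes no timeout argument, so a production probe would need to bound the call some other way, for example by running it in a worker with a deadline. Exporting `ok` and `latency_seconds` as metrics from inside a pod gives a view of DNS from the workload's perspective rather than from CoreDNS alone.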