Networking and Traffic
Modern Kubernetes Traffic Vocabulary
Know these:
- Service, EndpointSlice, kube-proxy or eBPF replacement.
- Ingress for legacy/common HTTP ingress.
- Gateway API for role-oriented L4/L7 routing.
- Service mesh when mTLS, retries, traffic splitting, and deep telemetry are needed.
- Cilium/eBPF for networking, policy, observability, and sometimes kube-proxy replacement.
- NetworkPolicy and CiliumNetworkPolicy.
- CoreDNS and node-local DNS cache patterns.
flowchart LR
Client[Client] --> LB[Load balancer]
LB --> Gateway[Gateway API]
Gateway --> Mesh[Mesh or eBPF datapath]
Mesh --> Service[Kubernetes Service]
Service --> Endpoint[EndpointSlice]
Endpoint --> Pod[Inference pod]
Pod --> Triton[Triton]
Gateway API
Gateway API is the modern Kubernetes project for extensible traffic routing. It separates roles:
- Platform team owns `GatewayClass` and `Gateway`.
- App team owns `HTTPRoute`, `GRPCRoute`, `TLSRoute`, or related routes.
Interview use:
I would prefer Gateway API over one-off ingress annotations when platform and application teams need a stable contract for routing, TLS, traffic splitting, and ownership boundaries.
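To make the ownership split and traffic splitting concrete, here is a minimal sketch that builds an `HTTPRoute` manifest as a Python dict. The `apiVersion`, `kind`, `parentRefs`, and weighted `backendRefs` fields follow the Gateway API resource shape; the route, Gateway, and Service names and the port are hypothetical placeholders, not values from any real cluster.

```python
import json


def build_canary_route(name, gateway, stable_svc, canary_svc, canary_weight, port=8000):
    """Return an HTTPRoute manifest (as a dict) that splits traffic by weight."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            # The platform team owns the Gateway; the app team only references it.
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    "backendRefs": [
                        {"name": stable_svc, "port": port, "weight": 100 - canary_weight},
                        {"name": canary_svc, "port": port, "weight": canary_weight},
                    ]
                }
            ],
        },
    }


# Hypothetical names: send 10% of traffic to the canary model version.
route = build_canary_route("triton-route", "inference-gateway", "triton-v1", "triton-v2", 10)
print(json.dumps(route, indent=2))
```

The app team can change weights in its own route without touching the platform-owned `Gateway`, which is the contract the interview answer describes.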
Cilium/eBPF
Why it matters:
- eBPF-based datapath.
- Network policy enforcement.
- Observability into flows.
- Potential kube-proxy replacement.
- Hubble for network visibility.
- Useful when debugging service-to-service connectivity and policy drops.
Staff caution:
eBPF gives powerful visibility and performance, but it does not remove the need to understand DNS, routing, MTU, conntrack-like limits, load balancing, and policy semantics.
Service Mesh Tradeoffs
Use a mesh when:
- You need mTLS everywhere.
- You need fine-grained traffic splitting.
- You need retries/timeouts/circuit breaking as policy.
- You need service-to-service telemetry.
Be careful when:
- Latency overhead matters.
- GPU inference responses are streaming or long-lived.
- Retry storms can amplify load.
- Sidecars or ambient mesh behavior complicates debugging.
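The retry-storm caution can be quantified with a short worked example. This is a worst-case sketch that assumes every attempt fails and every layer retries independently; the three layers and their retry counts are illustrative, not measured values.

```python
def worst_case_attempts(retries_per_layer):
    """Worst-case attempts reaching the last hop when every attempt fails.

    Each layer multiplies the attempts it receives by (1 + its retry count).
    """
    attempts = 1
    for retries in retries_per_layer:
        attempts *= 1 + retries
    return attempts


# Hypothetical stack: client, gateway, and mesh proxy each retry twice.
print(worst_case_attempts([2, 2, 2]))  # 27
```

One user request can become 27 requests against an already saturated model server, which is why retries are usually enabled at a single layer and capped there.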
Inference Traffic Controls
- Canary by model version.
- Shadow traffic for validation.
- Weighted routing.
- Header/tenant-based routing.
- Circuit breakers and concurrency limits.
- Request deadlines.
- Retry budgets, not unlimited retries.
- Load shedding for low-priority tenants.
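As an illustration of "retry budgets, not unlimited retries", the sketch below caps retries at a percentage of the original requests seen. The class name, the 20% default, and the lifetime counters are assumptions made for brevity; production proxies typically track this over a sliding time window.

```python
class RetryBudget:
    """Allow retries only while they stay under a percentage of original requests."""

    def __init__(self, retry_percent=20, min_retries=1):
        self.retry_percent = retry_percent
        self.min_retries = min_retries  # floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_retry(self):
        """Return True and spend budget if a retry is allowed, else False."""
        allowed = max(self.min_retries, self.requests * self.retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


budget = RetryBudget(retry_percent=20)
for _ in range(10):
    budget.record_request()
results = [budget.try_retry() for _ in range(3)]
print(results)  # [True, True, False]
```

With 10 requests and a 20% budget, only two retries are admitted; the third is rejected and the caller should fail fast instead of adding load.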
Debug: Intermittent 503s
Layered checks:
- Client-side timeout and retry behavior.
- Gateway/ingress logs and route config.
- Service endpoints and EndpointSlices.
- Pod readiness transitions.
- Network policy drops.
- DNS resolution and TTL behavior.
- Connection pool exhaustion.
- Node-level networking or CNI issues.
- Upstream model server saturation.
flowchart TD
Error[Intermittent 503] --> Endpoints{Healthy endpoints exist}
Endpoints -- no --> Readiness[Readiness, rollout, EndpointSlice]
Endpoints -- yes --> Policy{Policy or mTLS change}
Policy -- yes --> PolicyFix[NetworkPolicy, mesh auth, certs]
Policy -- no --> Overload{Retries or overload}
Overload -- yes --> Retry[Retry budget, circuit breaking, queue]
Overload -- no --> Routing[Gateway route weights and DNS]
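The decision order in the flowchart can be written down as a small triage helper, which is handy in a runbook or an incident bot. This is only a sketch: the function name and boolean inputs are hypothetical, and the answers still have to come from real evidence such as EndpointSlices, policy change logs, and retry metrics.

```python
def triage_503(healthy_endpoints, policy_or_mtls_change, retries_or_overload):
    """Walk the flowchart's decision order and return the next area to inspect."""
    if not healthy_endpoints:
        return "Readiness, rollout, EndpointSlice"
    if policy_or_mtls_change:
        return "NetworkPolicy, mesh auth, certs"
    if retries_or_overload:
        return "Retry budget, circuit breaking, queue"
    return "Gateway route weights and DNS"


# Endpoints are healthy and nothing changed in policy, but retries are spiking.
print(triage_503(True, False, True))  # Retry budget, circuit breaking, queue
```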
Debug: DNS Incident
Evidence:
- CoreDNS errors and latency.
- Node-local DNS cache health.
- Query volume and NXDOMAIN rate.
- Pod `/etc/resolv.conf`.
- Search path amplification.
- External DNS dependency.
Mitigation:
- Cache where safe.
- Reduce excessive DNS queries.
- Pin critical dependencies through stable service discovery.
- Add DNS SLOs and synthetic checks.
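Search path amplification, listed under Evidence above, is easier to reason about with an example. The sketch below approximates how a glibc-style stub resolver expands a name using the `search` and `ndots` options from `/etc/resolv.conf`; other resolvers (musl, for instance) behave differently, and the `default` namespace in the search list is an assumption. `ndots:5` with a three-entry search list is the usual setup for pods using cluster DNS.

```python
def candidate_queries(name, search, ndots=5):
    """List the FQDNs a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified; no search expansion
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [as_is] + expanded
    # Too few dots: walk the search list before trying the name as given.
    return expanded + [as_is]


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in candidate_queries("api.example.com", search):
    print(query)
```

Under these assumptions an external name such as `api.example.com` is tried against all three search domains before the real name, so one lookup becomes four queries (more if both A and AAAA records are requested). Writing the name with a trailing dot, or lowering `ndots` for the workload, avoids the extra NXDOMAIN traffic.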