Networking and Traffic

Modern Kubernetes Traffic Vocabulary

Know these:

Service, EndpointSlice, and kube-proxy (or an eBPF-based replacement).
Ingress for the older, feature-frozen but still widely deployed HTTP(S) routing API.
Gateway API for role-oriented L4/L7 routing.
Service mesh when mTLS, retries, traffic splitting, and deep telemetry are needed.
Cilium/eBPF for networking, policy, observability, and sometimes kube-proxy replacement.
NetworkPolicy and CiliumNetworkPolicy.
CoreDNS and node-local DNS cache patterns.

flowchart LR
  Client[Client] --> LB[Load balancer]
  LB --> Gateway[Gateway API]
  Gateway --> Mesh[Mesh or eBPF datapath]
  Mesh --> Service[Kubernetes Service]
  Service --> Endpoint[EndpointSlice]
  Endpoint --> Pod[Inference pod]
  Pod --> Triton[Triton]

Gateway API

Gateway API is the modern Kubernetes project for extensible traffic routing. It separates roles:

Platform team owns GatewayClass and Gateway.
App team owns HTTPRoute, GRPCRoute, TLSRoute, or related routes.

Interview use:

I would prefer Gateway API over one-off ingress annotations when platform and application teams need a stable contract for routing, TLS, traffic splitting, and ownership boundaries.
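
As a rough illustration of that contract, the sketch below builds an HTTPRoute manifest with weighted backendRefs, the part an app team would typically own. The route, Gateway, and Service names and the port are hypothetical placeholders; only the field layout follows the Gateway API HTTPRoute schema.

```python
import json


def weighted_http_route(name, gateway, stable, canary, canary_percent, port=8000):
    """Build an HTTPRoute manifest that splits traffic between two Services."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            # parentRefs attaches the route to a Gateway owned by the platform team.
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    # Weights are relative; here they are chosen to sum to 100.
                    "backendRefs": [
                        {"name": stable, "port": port, "weight": 100 - canary_percent},
                        {"name": canary, "port": port, "weight": canary_percent},
                    ]
                }
            ],
        },
    }


# Hypothetical names: send 10% of traffic to a new model version.
route = weighted_http_route(
    "triton-route", "inference-gateway", "triton-v1", "triton-v2", canary_percent=10
)
print(json.dumps(route, indent=2))
```

The printed JSON can be applied like any other manifest. The point of the split is that changing the weights touches only the route, not the Gateway.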

Cilium/eBPF

Why it matters:

eBPF-based datapath.
Network policy enforcement.
Observability into flows.
Potential kube-proxy replacement.
Hubble for network visibility.
Useful when debugging service-to-service connectivity and policy drops.

Staff caution:

eBPF gives powerful visibility and performance, but it does not remove the need to understand DNS, routing, MTU, connection-tracking table limits, load balancing, and policy semantics.

Service Mesh Tradeoffs

Use a mesh when:

You need mTLS everywhere.
You need fine-grained traffic splitting.
You need retries/timeouts/circuit breaking as policy.
You need service-to-service telemetry.

Be careful when:

Latency overhead matters.
GPU inference responses are streaming or long-lived.
Retry storms can amplify load.
Sidecar or ambient-mesh behavior complicates debugging.

Inference Traffic Controls

Canary by model version.
Shadow traffic for validation.
Weighted routing.
Header/tenant-based routing.
Circuit breakers and concurrency limits.
Request deadlines.
Retry budgets, not unlimited retries.
Load shedding for low-priority tenants.
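
To make "retry budgets, not unlimited retries" concrete, here is a minimal sketch of a budget that caps retries at a percentage of observed requests. The class name and parameters are hypothetical, and the counters never reset; a production implementation (for example in a mesh proxy or client library) would normally track a sliding time window.

```python
class RetryBudget:
    """Allow retries only up to a fixed percentage of observed requests."""

    def __init__(self, retry_percent=10, min_retries=1):
        self.retry_percent = retry_percent  # retries allowed per 100 requests
        self.min_retries = min_retries      # floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_retry(self):
        """Return True and consume budget if a retry is still allowed."""
        allowed = max(self.min_retries, self.requests * self.retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


budget = RetryBudget(retry_percent=10)
for _ in range(50):
    budget.record_request()

# 20 failures want a retry, but only 10% of 50 requests may be retried.
granted = sum(budget.try_retry() for _ in range(20))
print(granted)  # prints 5
```

Once the budget is spent, callers fail fast instead of adding load to an already saturated model server.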

Debug: Intermittent 503s

Layered checks:

Client-side timeout and retry behavior.
Gateway/ingress logs and route config.
Service endpoints and EndpointSlices.
Pod readiness transitions.
Network policy drops.
DNS resolution and TTL behavior.
Connection pool exhaustion.
Node-level networking or CNI issues.
Upstream model server saturation.

flowchart TD
  Error[Intermittent 503] --> Endpoints{Healthy endpoints exist}
  Endpoints -- no --> Readiness[Readiness, rollout, EndpointSlice]
  Endpoints -- yes --> Policy{Policy or mTLS change}
  Policy -- yes --> PolicyFix[NetworkPolicy, mesh auth, certs]
  Policy -- no --> Overload{Retries or overload}
  Overload -- yes --> Retry[Retry budget, circuit breaking, queue]
  Overload -- no --> Routing[Gateway route weights and DNS]
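
The same decision tree can be written as a small function, which is handy as a runbook helper or a checklist in code. The function and argument names are hypothetical; the returned strings are the leaf nodes of the flowchart above.

```python
def triage_503(healthy_endpoints, policy_or_mtls_changed, retries_or_overload):
    """Return the next area to investigate for an intermittent 503."""
    if not healthy_endpoints:
        return "Readiness, rollout, EndpointSlice"
    if policy_or_mtls_changed:
        return "NetworkPolicy, mesh auth, certs"
    if retries_or_overload:
        return "Retry budget, circuit breaking, queue"
    return "Gateway route weights and DNS"


# Endpoints are healthy and nothing changed in policy, but retries are spiking.
print(triage_503(True, False, True))  # prints Retry budget, circuit breaking, queue
```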

Debug: DNS Incident

Evidence:

CoreDNS errors and latency.
Node-local DNS cache health.
Query volume and NXDOMAIN rate.
Pod /etc/resolv.conf.
Search path amplification (ndots and search domains multiplying the queries per lookup).
External DNS dependency.
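
Search path amplification is easier to reason about with a worked example. The sketch below lists the names a stub resolver may try for one lookup, assuming glibc-style handling of the search and ndots options in /etc/resolv.conf. The search list mirrors a typical pod in a hypothetical namespace called inference; real resolvers stop at the first successful answer and usually issue both A and AAAA queries per candidate.

```python
def candidate_queries(name, search, ndots=5):
    """List the fully qualified names a resolver may try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search-path expansion.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search domains.
        return [absolute] + expanded
    # Too few dots: walk the search domains first, then the name as-is.
    return expanded + [absolute]


# Assumed pod search list for a namespace named "inference".
K8S_SEARCH = ["inference.svc.cluster.local", "svc.cluster.local", "cluster.local"]

queries = candidate_queries("api.example.com", K8S_SEARCH)
for q in queries:
    print(q)
print(len(queries))  # prints 4
```

Under these assumptions an external name costs up to four lookups instead of one, and the first three are expected to return NXDOMAIN, which is why NXDOMAIN rate and query volume appear together in the evidence list.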

Mitigation:

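Cache where safe, for example with a node-local DNS cache, and respect record TTLs.
Reduce excessive DNS queries, for example by using fully qualified names or tuning ndots to limit search-path expansion.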
Pin critical dependencies through stable service discovery.
Add DNS SLOs and synthetic checks.
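
A synthetic check can be as small as a timed lookup compared against a latency budget. The sketch below is one possible shape, not a standard tool: the function name, the 100 ms budget, and the result fields are assumptions. The resolver is injectable so the example runs offline; in a real probe the default socket.getaddrinfo would be used.

```python
import socket
import time


def dns_probe(hostname, budget_ms=100.0, resolve=socket.getaddrinfo):
    """Time one DNS lookup and compare it against a latency budget."""
    start = time.perf_counter()
    try:
        resolve(hostname, None)
        resolved = True
    except socket.gaierror:
        resolved = False
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "hostname": hostname,
        "resolved": resolved,
        "latency_ms": latency_ms,
        "within_slo": resolved and latency_ms <= budget_ms,
    }


def stub_resolver(host, port):
    """Offline stand-in for socket.getaddrinfo used only in this example."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.1", 0))]


result = dns_probe("kubernetes.default.svc.cluster.local", resolve=stub_resolver)
print(result["resolved"], result["within_slo"])
```

Running such a probe from several nodes on a schedule and exporting the results gives the raw data for a DNS availability and latency SLO.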