Networking and Incidents Q&A

151. DNS resolves but HTTP fails. What does that prove?

Only that name resolution works. Routing, TCP connectivity, network policy, service endpoints, TLS, application readiness, and upstream saturation can each still fail.

152. HTTP works by pod IP but not service name. Where do you look?

Service selector, EndpointSlice, kube-proxy/eBPF datapath, DNS, service port/targetPort, network policy, and readiness gates.

153. Service has endpoints but traffic still fails. Why?

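The endpoints may be stale or point at the wrong port or protocol, traffic may be blocked by network policy, the app may pass readiness checks without actually being healthy, TLS settings may not match, or the node datapath may be broken.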

154. What is the danger of unlimited service-mesh retries?

They amplify load, hide root cause, increase tail latency, and can overload GPU inference backends with duplicate expensive work.
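The amplification can be estimated with simple arithmetic. A rough worst-case sketch, assuming every layer in the call chain retries independently and every attempt fails (the function name and the three-layer example are illustrative):

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on backend attempts caused by one user request when
    each layer retries independently and every attempt fails."""
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("attempts_per_layer must be >= 1 and layers >= 0")
    return attempts_per_layer ** layers


# 3 attempts (1 try + 2 retries) at the client, the mesh sidecar, and a gateway:
print(worst_case_attempts(3, 3))  # 27
```

For GPU inference, each of those attempts may be a full forward pass, which is why the multiplier matters more than it does for cheap requests.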

155. How do you set retry policy for inference?

Use deadlines, retry budgets, a narrow set of retryable status classes, and jittered backoff. Do not retry when the backend is already saturated or when the client deadline has already passed, because the retry only adds work nobody is waiting for.
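A minimal sketch of how these pieces could fit together on the client side. The class and function names, the retryable status set, and the default values are assumptions made for illustration, not any particular mesh's API. A production budget would also track requests over a sliding window rather than forever.

```python
import random

# Assumption: only these statuses are treated as transient in this sketch.
RETRYABLE_STATUSES = frozenset({502, 503, 504})


class RetryBudget:
    """Caps retries at a fraction of observed requests (plus a small floor)."""

    def __init__(self, ratio=0.1, min_retries=3):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Consume one retry token if the budget allows it."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_with_jitter(attempt, base_s=0.05, cap_s=1.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def should_retry(status, remaining_deadline_s, min_useful_s, budget):
    """Retry only for transient statuses, with enough deadline left,
    and only while the shared retry budget has room."""
    if status not in RETRYABLE_STATUSES:
        return False
    if remaining_deadline_s < min_useful_s:
        return False  # the client would time out before the retry finished
    return budget.try_spend()
```

The budget check comes last so that a token is spent only when the retry will actually be sent.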

156. What is the difference between Ingress and Gateway API?

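Ingress is the older API: it covers HTTP(S) host and path routing, and most other behavior depends on controller-specific annotations. Gateway API is role-oriented, extensible, protocol-aware, and better suited to platform/app ownership boundaries.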

157. What is a ReferenceGrant in Gateway API for?

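It permits cross-namespace references. The owner of the target namespace creates the ReferenceGrant, so a route in one namespace can reference a resource (for example a Service) in another namespace only with that owner's explicit consent.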

158. Why can traffic splitting be dangerous for model canaries?

Request cost and tenant mix may not split evenly by percentage. A small percentage can still include expensive or high-risk traffic.

159. How do you canary by risk instead of only percentage?

Route by tenant, request class, model version, region, hardware pool, or low-risk cohorts before broad weighted rollout.
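A small sketch of cohort-first routing, assuming a router that sees the tenant, request class, and region of each request. The cohort names, the `Request` type, and `pick_backend` are hypothetical. The weighted step applies only inside the low-risk cohort and is keyed on tenant, so a given tenant stays on one side.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tenant: str
    request_class: str  # e.g. "batch" or "interactive"
    region: str


# Hypothetical low-risk cohorts chosen by the platform team.
CANARY_REGIONS = frozenset({"us-west"})
CANARY_TENANTS = frozenset({"internal-dogfood"})
LOW_RISK_CLASSES = frozenset({"batch"})


def stable_bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_backend(req: Request, weighted_percent: int = 0) -> str:
    """Return "canary" or "stable" for a request."""
    if req.region not in CANARY_REGIONS:
        return "stable"  # keep the blast radius to one region
    if req.tenant in CANARY_TENANTS:
        return "canary"  # opted-in tenants always see the canary
    if (req.request_class in LOW_RISK_CLASSES
            and stable_bucket(req.tenant) < weighted_percent):
        return "canary"  # weighted rollout only within low-risk traffic
    return "stable"
```

Raising `weighted_percent` widens the rollout without ever exposing interactive traffic from non-canary tenants.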

160. What is conntrack exhaustion?

The node’s connection tracking table fills, causing new connections to fail or behave intermittently, often seen as random network timeouts.
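One way to check this on a Linux node is to compare the current entry count with the table limit. The sketch below reads the standard `nf_conntrack_count` and `nf_conntrack_max` sysctl files. It assumes the `nf_conntrack` module is loaded and returns `None` otherwise. The 80% warning threshold is an arbitrary example value.

```python
from pathlib import Path

COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count: int, maximum: int) -> float:
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def read_conntrack():
    """Return (count, max), or None if the files are missing or unreadable."""
    try:
        return int(COUNT_PATH.read_text()), int(MAX_PATH.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    values = read_conntrack()
    if values is None:
        print("conntrack counters not available on this host")
    else:
        used = conntrack_utilization(*values)
        flag = "WARN" if used >= 0.8 else "ok"  # example threshold only
        print(f"{flag}: conntrack {values[0]}/{values[1]} ({used:.0%})")
```

When the table is actually full, the kernel log usually contains "nf_conntrack: table full, dropping packet", which is worth checking alongside the counters.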

161. Why might MTU mismatch appear as application flakiness?

Small packets get through while larger payloads hang or fragment badly, because encapsulation on overlays and VPNs lowers the effective MTU. The result looks like intermittent timeouts that depend on payload size.

162. How do you debug suspected MTU issues?

Test path MTU with controlled packet sizes, inspect CNI config, overlay settings, cloud networking, and compare failing versus healthy paths.
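The controlled-size test can be automated as a binary search. This is a sketch under stated assumptions: IPv4, ICMP echo not filtered on the path, and a probe that succeeds for every size at or below the path MTU. `ping_df` wraps Linux iputils `ping` with the don't-fragment option (`-M do`). `find_path_mtu` is a hypothetical helper that accepts any probe function, so it can be exercised without a network.

```python
import subprocess

IPV4_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_df(host: str, payload: int) -> bool:
    """Send one ICMP echo with the DF bit set (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_path_mtu(probe, low: int = 576, high: int = 9000):
    """Largest MTU in [low, high] whose payload the probe accepts,
    or None if even the smallest size fails."""
    if not probe(low - IPV4_ICMP_OVERHEAD):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid - IPV4_ICMP_OVERHEAD):
            low = mid        # mid fits, search larger sizes
        else:
            high = mid - 1   # mid is too big, search smaller sizes
    return low
```

Usage on a node would look like `find_path_mtu(lambda size: ping_df("10.0.0.12", size))`, where the address is a placeholder. Running it from both a failing and a healthy source makes the difference between paths visible.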

163. What is a DNS search path amplification issue?

Short names cause multiple DNS queries through search domains, increasing CoreDNS load and latency.
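The expansion is easy to reproduce. The sketch below mirrors common resolver behavior for `search` and `ndots` in resolv.conf: names with fewer dots than `ndots` go through the search list first, and a trailing dot disables the search list. The function name is made up for illustration. The search list shown is the usual one for a pod in the `default` namespace with `ndots:5`.

```python
def candidate_queries(name: str, search: list, ndots: int = 5) -> list:
    """Names a stub resolver may try, in order, until one resolves."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


POD_SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("api.example.com", POD_SEARCH):
    print(query)
# api.example.com.default.svc.cluster.local.
# api.example.com.svc.cluster.local.
# api.example.com.cluster.local.
# api.example.com.
```

The first three lookups return NXDOMAIN before the real name is tried, and clients that request both A and AAAA records typically double the count.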

164. How do you reduce DNS load in clusters?

Use fully qualified names for hot paths, node-local DNS cache, tune clients, cache safely, and monitor CoreDNS latency/error/NXDOMAIN rates.

165. What is the trap in saying "Cilium replaces iptables"?

Cilium can use eBPF datapaths and kube-proxy replacement modes, but you still must understand policy, routing, DNS, load balancing, and kernel state.

166. What does Hubble help with?

Flow visibility: which workloads talked, whether policy allowed/dropped traffic, DNS context, and service communication patterns.

167. What is the first question in every incident?

Is there user impact, what is the scope, and who is incident commander?

168. Why separate mitigation from root cause?

Users need recovery before a perfect explanation. Root-cause analysis can continue after the service is stabilized, provided evidence is preserved.

169. What is a bad incident update?

“Still investigating.” It lacks impact, current hypothesis, action underway, owner, and next update time.

170. What is a good incident update?

“p99 is elevated for model X in us-west GPU pool A since 10:12. Rollout paused; shifting traffic to pool B. Next update in 10 minutes.”

171. What is error budget burn?

The rate at which a service consumes its allowed unreliability. Fast-burn alerts catch severe incidents before the error budget for the SLO window (for example, a month) is exhausted.
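As a worked example, burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO target). A burn rate of 1 spends the budget exactly over the SLO window. The helper names and the 30-day window below are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return float("inf") if rate <= 0 else window_hours / rate


# A 99.9% SLO with 1.44% of requests failing burns budget about 14.4x
# faster than sustainable, emptying a 30-day budget in roughly 50 hours.
rate = burn_rate(0.0144, 0.999)
print(round(rate, 1), round(hours_to_exhaustion(rate)))
```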

172. Why alert on burn rate instead of raw errors only?

It ties alerts to SLO impact and reduces noise from low-impact fluctuations.

173. What is the difference between symptom and cause alerts?

Symptom alerts page on user impact. Cause alerts help diagnosis. Pages should usually be symptom/SLO-driven.

174. Why are GPU Xid alerts not always pages?

Some may be isolated or self-recovered. Page when correlated with workload failure, capacity risk, repeated node issues, or SLO impact.

175. What is alert inhibition?

Suppressing downstream/noisy alerts when a higher-level root alert is firing, reducing duplicate pages.

176. What is cardinality risk in Prometheus?

Too many unique label combinations overload storage/query systems. Tenant, request ID, pod UID, and unbounded model labels can explode cardinality.

177. How do you control metrics cardinality?

Bound labels, aggregate where appropriate, avoid request IDs/user IDs, budget cardinality, and review high-cardinality series.
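A quick way to budget cardinality is to multiply the number of distinct values per label, which gives the worst-case series count for one metric. The label names and counts below are invented for illustration.

```python
from math import prod
from typing import Mapping


def worst_case_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


bounded = {"model": 20, "version": 5, "region": 4, "status": 6}
print(worst_case_series(bounded))                          # 2400
print(worst_case_series({**bounded, "tenant": 5000}))      # 12000000
```

Adding a single tenant label turns a few thousand series into millions, which is why per-tenant detail usually belongs in logs or traces rather than metric labels.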

178. Why can traces be less useful for GPU kernel time?

They show request path timing but may not expose low-level GPU execution details unless integrated with model/runtime telemetry.

179. When are traces highly useful?

Multi-service latency, dependency calls, retries, queueing across services, and finding which hop dominates p99.

180. What is structured logging worth during incidents?

It enables correlation by model, version, tenant, node, GPU ID, request class, rollout, and error code without fragile text parsing.
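A minimal sketch using only the Python standard library: a formatter that emits one JSON object per log line and copies a fixed set of correlation fields passed through `extra`. The field list and the `build_logger` helper are assumptions for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with known correlation fields."""

    FIELDS = ("model", "version", "tenant", "node", "gpu_id",
              "request_class", "rollout", "error_code")

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):  # set via logger.<level>(..., extra={...})
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(stream, name="inference"):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `log.error("inference failed", extra={"model": "m1", "gpu_id": 3, "error_code": "OOM"})` then produces a line that can be filtered by field instead of by regular expression.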

181. Why can logs become an incident amplifier?

Excessive logging under error conditions increases CPU, disk, network, and backend load, worsening the outage.

182. What is a postmortem action item smell?

“Be more careful.” Good actions change systems: validation, automation, alerting, ownership, rollback, or capacity.

183. What is a strong postmortem action?

“Block model rollout unless synthetic GPU inference passes on target SKU and p99 canary remains below threshold for 30 minutes.”
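An action like this can be encoded as a gate in the rollout pipeline. The sketch below assumes one p99 sample per minute from the canary and a boolean result from the synthetic inference check. The function and parameter names are hypothetical.

```python
def rollout_allowed(synthetic_passed: bool, p99_samples_ms: list,
                    threshold_ms: float, required_minutes: int = 30,
                    sample_interval_minutes: int = 1) -> bool:
    """Allow promotion only if the synthetic check passed and every
    canary p99 sample in the most recent window is below the threshold."""
    needed = required_minutes // sample_interval_minutes
    if not synthetic_passed or len(p99_samples_ms) < needed:
        return False  # missing evidence blocks the rollout
    return all(sample < threshold_ms for sample in p99_samples_ms[-needed:])
```

Treating missing samples as a failure is deliberate: a gate that passes when telemetry is absent protects nothing.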

184. What is MTTR misleading about?

It can hide incident frequency, customer severity, detection delay, and repeated small incidents. Read it alongside recurrence and error budget impact.

185. What is an incident commander responsible for?

Coordination, scope, communication, decision cadence, role assignment, and keeping mitigation moving without doing every technical task.

186. What should you freeze during major incidents?

Risky deploys, broad automation, autoscaler policy changes, or unrelated infra work that can add noise, unless needed for mitigation.

187. Why preserve failed pods?

They contain logs, exit status, mounted config, runtime state, and evidence that may disappear after restarts.

188. When is rollback not the right answer?

When the issue is capacity, a dependency, a traffic change, or data corruption, or when rollback would carry greater risk than a targeted mitigation.

189. What is a rollback readiness requirement?

Known-good artifact, compatible config/schema, traffic routing control, data compatibility, and validation that rollback actually restores SLO.

190. What is a synthetic canary blind spot?

It may be too small, too cheap, too cached, not tenant-realistic, or not exercising the GPU/model path.

191. What makes a dashboard useful?

It answers operational questions quickly: impact, scope, layer, change correlation, mitigation effect, and owner.

192. What makes a dashboard decorative?

Pretty graphs without thresholds, labels, SLO context, failure-domain breakdown, or actionability.

193. How do you scope a latency incident?

By tenant, model, version, region, cluster, node pool, GPU SKU, route, request class, and recent change.

194. What is change correlation?

Mapping incidents to deploys, config changes, node rollouts, traffic shifts, policy changes, or dependency events.

195. Why is "no deploy happened" not enough?

Traffic, data, dependencies, certificates, cloud/provider behavior, autoscaling, hardware, and background jobs can change without app deploys.

196. What is the danger of dashboards without deploy markers?

You lose the fastest path to correlate regressions with change and may waste time on unrelated hypotheses.

197. What is brownout mode?

A degraded operating mode that disables noncritical features or lower-priority traffic to preserve core service SLOs.
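A sketch of priority-based admission for brownout, assuming requests carry a priority (0 = critical, 1 = normal, 2 = best effort) and the service can observe a utilization signal between 0 and 1. The thresholds and the function name are illustrative.

```python
def admit(priority: int, utilization: float,
          brownout_at: float = 0.85, critical_only_at: float = 0.95) -> bool:
    """Decide whether to serve a request under the current load."""
    if utilization >= critical_only_at:
        return priority == 0   # shed everything except critical traffic
    if utilization >= brownout_at:
        return priority <= 1   # shed best-effort traffic first
    return True                # normal operation: serve everything
```

The same shape works for feature flags: the noncritical feature checks the brownout level and skips its work instead of rejecting the request.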

198. When should you declare an incident?

When user impact or credible risk needs coordinated response. Declaring early is cheaper than chaotic late escalation.

199. What is the staff-level incident answer?

Stabilize users, scope impact, assign roles, test hypotheses by layer, communicate on cadence, mitigate safely, verify recovery, and remove the failure class.

200. What is the best incident prevention philosophy?

Turn repeat pain into contracts, validation, automation, capacity policy, and observability that catches the next failure before customers do.