Final 24 Hours
One-Minute Pitch
I am a systems engineer focused on reliable production infrastructure and automation. My best work is in the space between orchestration, host-level debugging, deployment safety, and operational tooling. For this NVIDIA role, the fit is AI inference operations: making GPU-backed services deploy safely, recover quickly, and operate with clear signals instead of manual guesswork.
Five Phrases To Use
- “I would first separate user impact from root cause.”
- “I want a dry-run plan before any destructive action.”
- “The scheduler view is necessary but not sufficient; I would verify the node and GPU layer.”
- “Rollback needs to be faster and safer than rollout.”
- “The durable fix is to remove the failure class, not only repair the instance.”
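If an interviewer asks what a dry-run plan looks like in practice, a small sketch can help. The code below is illustrative only: `Action`, `Plan`, and `execute` are hypothetical names, not part of any real tool, and the node name in the usage note is made up. It shows the idea of rendering every step before running anything, and refusing destructive steps without explicit approval.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One planned step; `destructive` marks steps that remove or restart something."""
    description: str
    destructive: bool = False


@dataclass
class Plan:
    """Ordered list of actions (hypothetical structure for illustration)."""
    actions: list = field(default_factory=list)

    def add(self, description: str, destructive: bool = False) -> None:
        self.actions.append(Action(description, destructive))

    def render(self) -> list:
        """Return the plan as human-readable lines without executing anything."""
        lines = []
        for i, action in enumerate(self.actions, start=1):
            tag = "DESTRUCTIVE" if action.destructive else "safe"
            lines.append(f"{i}. [{tag}] {action.description}")
        return lines


def execute(plan: Plan, dry_run: bool = True, approved: bool = False) -> list:
    """Default to dry-run; refuse destructive steps unless explicitly approved."""
    if dry_run:
        return plan.render()
    if any(a.destructive for a in plan.actions) and not approved:
        raise PermissionError("destructive steps require explicit approval")
    # A real tool would perform each action here; this sketch only records them.
    return [f"executed: {a.description}" for a in plan.actions]
```

For example, building a plan with `plan.add("cordon node gpu-node-7")` and `plan.add("drain node gpu-node-7", destructive=True)` and calling `execute(plan)` returns the rendered steps without touching anything. Making dry-run the default means the safe path needs no extra flag.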
Five Things Not To Say
- “Just restart it.”
- “Kubernetes handles that.”
- “We can automate all of it immediately.”
- “I would add a dashboard” without saying what decision it supports.
- “I have not worked on GPUs, so…” without pivoting to transferable systems depth.
Last Review Topics
- Kubernetes GPU device plugin, GPU Operator, DCGM, MIG.
- Inference p99, batching, warmup, model artifact flow.
- IaC drift, state, policy, rollback.
- Incident command script.
- Staff narratives.
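For the inference p99 topic, it may help to have the definition ready as code. This is a minimal sketch of the nearest-rank percentile method, one of several common definitions; monitoring systems often use histogram-based estimates instead. The function name `percentile` and the sample latencies are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to avoid floating-point drift in the rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [900] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

With these made-up samples the median stays at the fast value while p99 lands on a slow request, which is why tail latency, not the average, is the usual SLO signal for inference.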
Whiteboard Skeleton
- Requirements: SLO, scale, tenants, hardware, regions
- Architecture: control plane, data plane, registry, scheduler, node pools
- Operations: deploy, observe, alert, repair, rollback
- Failure modes: model, runtime, node, GPU, network, storage, scheduler
- Tradeoffs: utilization vs isolation, automation vs control, central platform vs flexibility
Your Closing Question
What is the highest-leverage operational problem this team wants the person in this role to solve in the first six months?