Mock Loops
Loop A: Infrastructure Deep Dive
- A GPU workload is pending even though the fleet has free GPUs. Debug it.
- A node image rollout caused model servers to crash. What evidence do you collect?
- Design labels, taints, and pool segmentation for mixed GPU hardware.
- How would you safely roll out a driver update?
- What metrics would page you?
Scoring:
- Pass: mentions scheduler, device plugin, node labels/taints, fragmentation, driver/runtime.
- Strong: explains validation, rollback, fleet segmentation, and blast-radius control.
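One answer to the first prompt is fragmentation: the fleet has enough free GPUs in aggregate, but no single eligible node has enough for the pod. The sketch below illustrates that reasoning under simplified rules. `Node`, `PodRequest`, and `explain_pending` are hypothetical names for illustration only, not a real scheduler or device-plugin API, and real scheduling considers many more constraints.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical record; field names are illustrative only.
    name: str
    free_gpus: int
    labels: dict = field(default_factory=dict)
    taints: frozenset = frozenset()


@dataclass
class PodRequest:
    gpus: int
    node_selector: dict = field(default_factory=dict)
    tolerations: frozenset = frozenset()


def eligible(node, pod):
    """True if labels match the selector and every taint is tolerated."""
    labels_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = node.taints <= pod.tolerations
    return labels_ok and taints_ok


def explain_pending(nodes, pod):
    """Return a coarse reason the pod is pending, or 'schedulable'."""
    candidates = [n for n in nodes if eligible(n, pod)]
    if any(n.free_gpus >= pod.gpus for n in candidates):
        return "schedulable"
    if sum(n.free_gpus for n in candidates) >= pod.gpus:
        # Enough GPUs across eligible nodes, but not on any one node.
        return "fragmented"
    if sum(n.free_gpus for n in nodes) >= pod.gpus:
        # Capacity exists, but labels or taints exclude it.
        return "blocked_by_labels_or_taints"
    return "insufficient_capacity"
```

For example, two nodes with 2 free GPUs each cannot host a 4-GPU pod, so the sketch reports `fragmented` even though the fleet total is 4.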
Loop B: Automation Design
- Build an automated GPU node repair workflow.
- What should run in dry-run mode?
- How do you avoid draining too much capacity?
- How do you test the workflow?
- When should automation escalate to humans?
Scoring:
- Pass: has a workflow.
- Strong: has state machine, confidence, guardrails, audit, metrics, and gradual autonomy.
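A "Strong" answer here might look like the sketch below: an explicit state machine, a capacity guardrail on drains, a dry-run mode, and an audit trail. The state names, the 10% unavailability budget, and the `advance` helper are all assumptions chosen for illustration, not a prescribed design.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    CORDONED = "cordoned"
    DRAINING = "draining"
    REPAIRING = "repairing"
    VALIDATING = "validating"
    ESCALATED = "escalated"  # handed to a human


# Allowed transitions; anything else is rejected.
TRANSITIONS = {
    State.HEALTHY: {State.SUSPECT},
    State.SUSPECT: {State.HEALTHY, State.CORDONED, State.ESCALATED},
    State.CORDONED: {State.DRAINING, State.ESCALATED},
    State.DRAINING: {State.REPAIRING, State.ESCALATED},
    State.REPAIRING: {State.VALIDATING, State.ESCALATED},
    State.VALIDATING: {State.HEALTHY, State.ESCALATED},
    State.ESCALATED: set(),
}


def can_drain(total_nodes, unavailable_nodes, max_unavailable_fraction=0.1):
    """Guardrail: allow one more drain only if the pool stays within budget."""
    if total_nodes <= 0:
        return False
    return (unavailable_nodes + 1) / total_nodes <= max_unavailable_fraction


def advance(node, current, proposed, *, total_nodes, unavailable_nodes,
            dry_run, audit_log):
    """Return the node's next state and append one audit entry."""
    if proposed not in TRANSITIONS[current]:
        raise ValueError(f"invalid transition {current.value} -> {proposed.value}")
    if proposed is State.DRAINING and not can_drain(total_nodes, unavailable_nodes):
        audit_log.append((node, current.value, proposed.value, "blocked: capacity budget"))
        return State.ESCALATED
    if dry_run:
        # Dry-run records the decision but changes nothing.
        audit_log.append((node, current.value, proposed.value, "dry-run: no change"))
        return current
    audit_log.append((node, current.value, proposed.value, "applied"))
    return proposed
```

Escalating when the drain budget is exhausted is one possible choice; waiting and retrying later would be an equally reasonable answer to discuss.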
Loop C: Coding
- Given node health records, classify nodes into healthy/suspect/quarantine.
- Add max concurrency and protected pool exclusions.
- Add retry with timeout for repair calls.
- Write tests for empty input and partial failure.
- Explain how you would deploy this tool.
Scoring:
- Pass: correct logic.
- Strong: clean separation of pure planning and side-effect execution.
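As a rough illustration of the separation the "Strong" bar describes, the sketch below keeps classification and planning pure and isolates side effects in the execution functions. The `HealthRecord` fields, the thresholds, and the `repair(node, timeout_s)` callable are hypothetical; a real exercise would supply its own signals and repair API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthRecord:
    node: str
    pool: str
    gpu_errors: int     # hypothetical signal: count of GPU error events
    failed_checks: int  # hypothetical signal: failed health probes


def classify(record):
    """Pure function: map a record to healthy/suspect/quarantine."""
    if record.gpu_errors >= 3 or record.failed_checks >= 5:
        return "quarantine"
    if record.gpu_errors > 0 or record.failed_checks > 0:
        return "suspect"
    return "healthy"


def plan_repairs(records, max_concurrent, protected_pools=frozenset()):
    """Pure planning: choose nodes to repair, with no side effects."""
    candidates = [
        r.node for r in records
        if classify(r) == "quarantine" and r.pool not in protected_pools
    ]
    return sorted(candidates)[:max(max_concurrent, 0)]


def call_with_retry(repair, node, attempts=3, timeout_s=5.0,
                    backoff_s=0.5, sleep=time.sleep):
    """Side effect: call repair(node, timeout_s), retrying on any exception."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return repair(node, timeout_s)
        except Exception as exc:  # sketch only; real code would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(backoff_s * attempt)
    raise RuntimeError(f"repair failed for {node} after {attempts} attempts") from last_error


def execute(plan, repair, **retry_kwargs):
    """Run the plan and report per-node outcomes so partial failure is visible."""
    results = {}
    for node in plan:
        try:
            call_with_retry(repair, node, **retry_kwargs)
            results[node] = "repaired"
        except RuntimeError:
            results[node] = "failed"
    return results
```

Because `plan_repairs` takes plain data and returns a list, the empty-input and protected-pool cases can be tested without mocks; only `execute` needs a fake `repair` callable to exercise partial failure.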
Loop D: System Design
- Design a GPU-backed inference platform.
- How do you deploy a new model safely?
- How do you scale under burst traffic?
- How do you handle model artifact storage?
- How do you isolate tenants?
Scoring:
- Pass: reasonable architecture.
- Strong: inference-specific constraints, p99, warmup, batching, GPU memory, rollout gates.
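To make "rollout gates" and "p99" concrete, here is a minimal sketch of a canary gate that compares the canary's p99 latency against the baseline. The 10% regression allowance, the minimum sample count, and the function names are assumptions for illustration; a real gate would also look at error rate, GPU memory, and warmup state.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def canary_passes(baseline_ms, canary_ms, max_regression=1.10, min_samples=100):
    """Gate: promote only with enough samples and a bounded p99 regression."""
    if len(baseline_ms) < min_samples or len(canary_ms) < min_samples:
        return False  # too little evidence to promote
    return p99(canary_ms) <= p99(baseline_ms) * max_regression
```

Failing closed on too few samples is a deliberate choice here: a new model that has not finished warmup or received traffic should not be promoted by default.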
Loop E: Behavioral
- Tell me about yourself.
- Tell me about a production incident.
- Tell me about automation you built.
- Tell me about a disagreement.
- Why NVIDIA and why this team?
Scoring:
- Pass: clear stories.
- Strong: quantified results and staff-level reflection.