Mock Loops

Loop A: Infrastructure Deep Dive

A GPU workload is pending even though the fleet has free GPUs. Debug it.
A node image rollout caused model servers to crash. What evidence do you collect?
Design labels, taints, and pool segmentation for mixed GPU hardware.
How would you safely roll out a driver update?
What metrics would page you?

Scoring:

Pass: mentions scheduler, device plugin, node labels/taints, fragmentation, driver/runtime.
Strong: explains validation, rollback, fleet segmentation, and blast-radius control.
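For the pending-workload question, it helps to show why "free GPUs in the fleet" does not mean "schedulable". A pod's GPU request must be satisfied by a single node that also matches its node selector and carries no taint the pod does not tolerate. The sketch below is a simplified illustration in Python, not the real scheduler logic; the `Node` type, `explain_pending`, and the label and taint names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical, simplified view of a GPU node for illustration only.
    name: str
    free_gpus: int
    labels: dict = field(default_factory=dict)
    taints: frozenset = frozenset()


def explain_pending(nodes, gpus_requested, node_selector, tolerations):
    """Return, per node, why a pod does or does not fit there."""
    reasons = {}
    for n in nodes:
        if any(n.labels.get(k) != v for k, v in node_selector.items()):
            reasons[n.name] = "label mismatch"
        elif n.taints - tolerations:
            reasons[n.name] = "untolerated taint"
        elif n.free_gpus < gpus_requested:
            # Fragmentation: free GPUs exist, but not enough on this one node.
            reasons[n.name] = "not enough free GPUs on this node"
        else:
            reasons[n.name] = "fits"
    return reasons
```

In a fleet where every node reports a reason other than "fits", the workload stays pending no matter how many GPUs are free in total, which is the fragmentation and segmentation point the rubric looks for.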

Loop B: Automation Design

Build an automated GPU node repair workflow.
What should run in dry-run mode?
How do you avoid draining too much capacity?
How do you test the workflow?
When should automation escalate to humans?

Scoring:

Pass: describes an end-to-end repair workflow that addresses dry-run, drain limits, testing, and escalation.
Strong: has state machine, confidence, guardrails, audit, metrics, and gradual autonomy.
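A strong answer can be made concrete with a small state machine plus a capacity guardrail. The following is a minimal sketch under stated assumptions: the state names, the transition table, the 10% default budget, and the function names `advance` and `within_drain_budget` are all hypothetical choices for illustration, not a prescribed design.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    CORDONED = "cordoned"
    DRAINING = "draining"
    REPAIRING = "repairing"
    VALIDATING = "validating"
    ESCALATED = "escalated"  # terminal here: a human takes over


# Allowed transitions (assumed for this sketch). Anything else is rejected.
TRANSITIONS = {
    State.HEALTHY: {State.SUSPECT},
    State.SUSPECT: {State.HEALTHY, State.CORDONED},
    State.CORDONED: {State.DRAINING, State.ESCALATED},
    State.DRAINING: {State.REPAIRING, State.ESCALATED},
    State.REPAIRING: {State.VALIDATING, State.ESCALATED},
    State.VALIDATING: {State.HEALTHY, State.ESCALATED},
    State.ESCALATED: set(),
}


def advance(current, target, audit_log, dry_run=True):
    """Validate and record a transition; in dry-run mode the state is unchanged."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    mode = "dry-run" if dry_run else "applied"
    audit_log.append((current.value, target.value, mode))
    return current if dry_run else target


def within_drain_budget(nodes_out, pool_size, max_unavailable_percent=10):
    """True if taking one more node out keeps the pool within its budget."""
    # Integer arithmetic avoids float rounding surprises at the boundary.
    return (nodes_out + 1) * 100 <= pool_size * max_unavailable_percent
```

Dry-run mode exercises the same validation and audit path as a real run, so the log shows what the automation would have done before it is trusted to act.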

Loop C: Coding

Given node health records, classify nodes into healthy/suspect/quarantine.
Add max concurrency and protected pool exclusions.
Add retry with timeout for repair calls.
Write tests for empty input and partial failure.
Explain how you would deploy this tool.

Scoring:

Pass: correct classification logic, with tests for empty input and partial failure.
Strong: clean separation of pure planning and side-effect execution.
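One possible shape for the coding exercise, showing the planning/execution split, is sketched below. Everything here is an assumption for illustration: the `HealthRecord` fields, the thresholds in `classify`, and the helper names are hypothetical. The timeout is an overall deadline checked between attempts; it does not interrupt a call that hangs.

```python
import time
from dataclasses import dataclass


class RepairError(Exception):
    """Raised when a repair call does not succeed within its retry budget."""


@dataclass(frozen=True)
class HealthRecord:
    node: str
    pool: str
    gpu_errors: int      # hypothetical: GPU error events in the window
    failed_checks: int   # hypothetical: failed health probes in the window


def classify(rec):
    # Thresholds are placeholders; real ones would come from fleet data.
    if rec.gpu_errors >= 3 or rec.failed_checks >= 5:
        return "quarantine"
    if rec.gpu_errors > 0 or rec.failed_checks > 0:
        return "suspect"
    return "healthy"


def plan_repairs(records, max_concurrent, protected_pools=frozenset()):
    """Pure planning: choose nodes to repair, with no side effects."""
    candidates = [
        r.node
        for r in records
        if classify(r) == "quarantine" and r.pool not in protected_pools
    ]
    return sorted(candidates)[:max_concurrent]


def call_with_retry(fn, attempts=3, timeout_s=5.0, backoff_s=0.0):
    """Retry fn until it succeeds, attempts run out, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    last_error = None
    for attempt in range(attempts):
        if time.monotonic() >= deadline:
            break
        try:
            return fn()
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(backoff_s * (2 ** attempt))
    raise RepairError(f"repair call failed: {last_error!r}") from last_error


def execute(plan, repair_fn):
    """Side effects: run each repair and report partial failure per node."""
    results = {}
    for node in plan:
        try:
            call_with_retry(lambda: repair_fn(node))
            results[node] = "ok"
        except RepairError:
            results[node] = "failed"
    return results
```

Because `plan_repairs` is pure, it can be tested with plain lists and also serves as the dry-run output; only `execute` needs a fake `repair_fn` in tests.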

Loop D: System Design

Design a GPU-backed inference platform.
How do you deploy a new model safely?
How do you scale under burst traffic?
How do you handle model artifact storage?
How do you isolate tenants?

Scoring:

Pass: reasonable architecture covering deployment, scaling, artifact storage, and tenant isolation.
Strong: inference-specific constraints, p99, warmup, batching, GPU memory, rollout gates.
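The "rollout gates" and "p99" items can be illustrated with a canary decision function. This is a sketch under assumptions, not a recommended policy: the 10% latency regression budget, 1% error budget, minimum sample count, and the names `p99` and `canary_gate` are hypothetical. The percentile uses the nearest-rank method.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def canary_gate(
    baseline_ms,
    canary_ms,
    canary_errors,
    canary_requests,
    max_p99_regression=0.10,
    max_error_rate=0.01,
    min_samples=100,
):
    """Return 'promote', 'rollback', or 'hold' for a canary model version."""
    # Too little data: neither promote nor roll back yet.
    if not baseline_ms or len(canary_ms) < min_samples or canary_requests < min_samples:
        return "hold"
    if canary_errors / canary_requests > max_error_rate:
        return "rollback"
    if p99(canary_ms) > p99(baseline_ms) * (1 + max_p99_regression):
        return "rollback"
    return "promote"
```

Latency samples taken before warmup completes would skew the canary's p99, so a real gate would likely exclude them; that filtering is left out here.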

Loop E: Behavioral

Tell me about yourself.
Tell me about a production incident.
Tell me about automation you built.
Tell me about a disagreement.
Why NVIDIA and why this team?

Scoring:

Pass: clear stories.
Strong: quantified results and staff-level reflection.