Mock Loops

  1. A GPU workload is pending even though the fleet has free GPUs. Debug it.
  2. A node image rollout caused model servers to crash. What evidence do you collect?
  3. Design labels, taints, and pool segmentation for mixed GPU hardware.
  4. How would you safely roll out a driver update?
  5. What metrics would page you?

Scoring:

  • Pass: mentions scheduler, device plugin, node labels/taints, fragmentation, driver/runtime.
  • Strong: explains validation, rollback, fleet segmentation, and blast-radius control.
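One way to rehearse the first question is to model why a workload can stay pending while the fleet total looks healthy. The sketch below is a deliberately simplified stand-in for a real scheduler: the `Node` and `Workload` types, the label keys, and the taint names are all hypothetical, and device plugin and driver/runtime failures are not modelled. It only shows the three placement checks a candidate would be expected to name (labels, taints, per-node free GPUs).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_gpus: int
    labels: dict = field(default_factory=dict)   # hypothetical label keys
    taints: set = field(default_factory=set)     # hypothetical taint names


@dataclass
class Workload:
    gpus: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def explain_pending(workload, nodes):
    """Return one verdict per node: a rejection reason, or 'fits'."""
    reasons = {}
    for node in nodes:
        if any(node.labels.get(k) != v for k, v in workload.node_selector.items()):
            reasons[node.name] = "label mismatch"
        elif node.taints - workload.tolerations:
            reasons[node.name] = "untolerated taint"
        elif node.free_gpus < workload.gpus:
            # Fragmentation: GPUs are free fleet-wide but not on one node.
            reasons[node.name] = "insufficient free GPUs on node"
        else:
            reasons[node.name] = "fits"
    return reasons


nodes = [
    Node("a", 2, {"gpu": "h100"}),
    Node("b", 2, {"gpu": "h100"}),
    Node("c", 4, {"gpu": "a100"}),
    Node("d", 4, {"gpu": "h100"}, {"maintenance"}),
]
job = Workload(gpus=4, node_selector={"gpu": "h100"})
result = explain_pending(job, nodes)
fleet_free = sum(n.free_gpus for n in nodes)
```

Here the fleet has 12 free GPUs, yet no node accepts the 4-GPU job: two nodes are fragmented, one has the wrong hardware label, and one carries a taint the job does not tolerate.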

  1. Build an automated GPU node repair workflow.
  2. What should run in dry-run mode?
  3. How do you avoid draining too much capacity?
  4. How do you test the workflow?
  5. When should automation escalate to humans?

Scoring:

  • Pass: describes an end-to-end repair workflow.
  • Strong: has state machine, confidence, guardrails, audit, metrics, and gradual autonomy.
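The "Strong" criteria above (state machine, confidence, guardrails, audit, dry-run) can be sketched in a few lines. This is an illustrative outline, not a reference design: the state names, the transition table, the `0.8` confidence threshold, and the audit record shape are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    QUARANTINED = "quarantined"
    REPAIRING = "repairing"
    VALIDATING = "validating"
    ESCALATED = "escalated"   # handed to a human; terminal for automation


# Hypothetical transition table: anything not listed is rejected.
ALLOWED = {
    State.HEALTHY: {State.SUSPECT},
    State.SUSPECT: {State.HEALTHY, State.QUARANTINED},
    State.QUARANTINED: {State.REPAIRING},
    State.REPAIRING: {State.VALIDATING},
    State.VALIDATING: {State.HEALTHY},
    State.ESCALATED: set(),
}


def transition(current, target, confidence, audit, min_confidence=0.8, dry_run=True):
    """Decide the next state, record the decision, and apply it unless dry_run."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    if confidence < min_confidence:
        # Guardrail: low-confidence decisions go to a human instead.
        decision, proposed = "escalate", State.ESCALATED
    else:
        decision, proposed = "apply", target
    applied = not dry_run
    audit.append({
        "from": current.name,
        "proposed": proposed.name,
        "decision": decision,
        "applied": applied,
    })
    # In dry-run mode the decision is logged but the state does not change.
    return proposed if applied else current
```

Defaulting `dry_run` to `True` is one way to express gradual autonomy: the workflow logs what it would do until operators trust the audit trail enough to switch individual transitions to live.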

  1. Given node health records, classify nodes into healthy/suspect/quarantine.
  2. Add max concurrency and protected pool exclusions.
  3. Add retry with timeout for repair calls.
  4. Write tests for empty input and partial failure.
  5. Explain how you would deploy this tool.

Scoring:

  • Pass: correct classification, concurrency, and retry logic.
  • Strong: clean separation of pure planning and side-effect execution.
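A minimal sketch of the separation the "Strong" answer describes: `classify` and `plan_repairs` are pure functions that are easy to unit test, while `execute` is the only place with side effects. The record fields, the thresholds, and the `repair(name, timeout_s)` callable are hypothetical; the callable is assumed to enforce the timeout itself and to raise `TimeoutError` or `ConnectionError` on failure.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeHealth:
    name: str
    pool: str
    gpu_errors: int      # hypothetical: GPU errors seen in the window
    failed_probes: int   # hypothetical: consecutive failed health probes


def classify(record):
    """Pure: map one health record to healthy / suspect / quarantine."""
    if record.failed_probes >= 3 or record.gpu_errors >= 5:
        return "quarantine"
    if record.failed_probes >= 1 or record.gpu_errors >= 1:
        return "suspect"
    return "healthy"


def plan_repairs(records, max_concurrent, protected_pools=frozenset()):
    """Pure: pick at most max_concurrent quarantined nodes outside protected pools."""
    candidates = [
        r.name for r in records
        if classify(r) == "quarantine" and r.pool not in protected_pools
    ]
    return sorted(candidates)[:max_concurrent]


def execute(plan, repair, retries=3, timeout_s=30.0, backoff_s=0.0):
    """Side effects only: call repair() per node, retrying on transient errors.

    A failure on one node never aborts the rest of the plan.
    """
    results = {}
    for name in plan:
        results[name] = "failed"
        for attempt in range(retries):
            try:
                repair(name, timeout_s)
                results[name] = "repaired"
                break
            except (TimeoutError, ConnectionError):
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return results
```

Because the plan is plain data, the tests for empty input and for the concurrency cap need no mocks, and the partial-failure test only needs a fake `repair` callable.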

  1. Design a GPU-backed inference platform.
  2. How do you deploy a new model safely?
  3. How do you scale under burst traffic?
  4. How do you handle model artifact storage?
  5. How do you isolate tenants?

Scoring:

  • Pass: reasonable architecture.
  • Strong: inference-specific constraints, p99, warmup, batching, GPU memory, rollout gates.
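To make "p99" and "rollout gates" concrete, here is a rough sketch of a canary gate that compares tail latency between the current model and a new one. The function names, the 10% regression budget, the 1% error budget, and the minimum sample count are assumptions chosen for illustration; a real gate would also need to exclude warmup requests and account for batching effects.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile; raises on empty input."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def canary_gate(baseline_ms, canary_ms, error_rate,
                max_p99_regression=1.10, max_error_rate=0.01, min_samples=100):
    """Return 'hold', 'rollback', or 'promote' for a canary deployment."""
    if len(canary_ms) < min_samples:
        return "hold"        # not enough traffic to judge the tail
    if error_rate > max_error_rate:
        return "rollback"
    if p99(canary_ms) > max_p99_regression * p99(baseline_ms):
        return "rollback"    # tail latency regressed beyond the budget
    return "promote"
```

Gating on p99 rather than the mean matters for inference because a small fraction of slow requests (cold caches, oversized batches, GPU memory pressure) can violate the latency target while the average still looks fine.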

  1. Tell me about yourself.
  2. Tell me about a production incident.
  3. Tell me about automation you built.
  4. Tell me about a disagreement.
  5. Why NVIDIA and why this team?

Scoring:

  • Pass: clear stories.
  • Strong: quantified results and staff-level reflection.