This is not a buzzword list. Use it to recognize interview prompts and place tools in the architecture.

| Area | Tools/Concepts | What to say |
|---|---|---|
| Cluster lifecycle | Cluster API, managed K8s APIs | Declarative lifecycle for clusters where applicable. |
| Node autoscaling | Karpenter, Cluster Autoscaler | Match unschedulable pods to capacity; include warmup and GPU readiness. |
| Workload queueing | Kueue, Volcano, YuniKorn | Quota, fair sharing, gang/batch workloads, preemption. |
| Traffic | Gateway API, Envoy Gateway, Istio, Linkerd | Route ownership, traffic splitting, mTLS, retries, p99 impact. |
| Networking | Cilium, Calico, kube-proxy/IPVS/eBPF | Policy, datapath, observability, MTU, DNS, service routing. |
| Storage | CSI, object stores, artifact cache | Model artifacts must be immutable, cached, and checksummed. |
| GPU | NVIDIA GPU Operator, device plugin, DCGM, MIG | Manage the full GPU stack, telemetry, and partitioning. |
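
The storage row says model artifacts must be immutable, cached, and checksummed. A minimal sketch of the checksum part, assuming the expected SHA-256 digest is pinned somewhere outside the cache (for example in a release manifest). The function names and the `model.bin` file are hypothetical, chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large model weights never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_cached_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the cached file exists and matches the pinned digest."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()


def demo() -> tuple[bool, bool]:
    """Check a hypothetical artifact before and after its bytes change."""
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "model.bin"  # hypothetical artifact name
        artifact.write_bytes(b"weights-v1")
        pinned = hashlib.sha256(b"weights-v1").hexdigest()
        intact = verify_cached_artifact(artifact, pinned)
        artifact.write_bytes(b"weights-v1-tampered")
        tampered = verify_cached_artifact(artifact, pinned)
    return intact, tampered
```

In an interview, the point to make is that the digest is the artifact's identity: a cache hit is only trusted after the bytes on disk match the pinned value.
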

| Area | Tools/Concepts | What to say |
|---|---|---|
| IaC | Terraform, OpenTofu, Pulumi, Crossplane | Declarative provisioning, state, drift, policy, promotion. |
| Config | Helm, Kustomize, Jsonnet, CUE | Rendered diff and validation matter more than template preference. |
| GitOps | Argo CD, Flux | Desired state in Git, controller reconciliation, drift management. |
| Progressive delivery | Argo Rollouts, Flagger | Canary, blue/green, metric gates, traffic shifting. |
| CI | GitHub Actions, GitLab CI, Buildkite, Jenkins | Build/test/sign/render/promote. |
| Supply chain | Sigstore, cosign, SBOM, SLSA, Trivy/Grype | Trusted artifacts and provenance. |
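
The config row argues that the rendered diff matters more than the templating tool. A minimal sketch of that idea, assuming the live and proposed manifests are already rendered to plain text by whichever tool is in use. The function names and sample manifests are hypothetical.

```python
import difflib


def rendered_diff(old: str, new: str, name: str = "manifest.yaml") -> list[str]:
    """Unified diff between two rendered manifests, independent of the templating tool."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"live/{name}",
            tofile=f"proposed/{name}",
            lineterm="",
        )
    )


def changed_lines(diff: list[str]) -> list[str]:
    """Keep only added/removed lines; the first two entries are the file headers."""
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical rendered output from any of Helm, Kustomize, Jsonnet, or CUE.
LIVE = "replicas: 2\nimage: server:1.0\n"
PROPOSED = "replicas: 3\nimage: server:1.0\n"

if __name__ == "__main__":
    for line in rendered_diff(LIVE, PROPOSED):
        print(line)
```

A reviewer, or a CI gate, can then reason about `changed_lines(...)` directly instead of reading template logic.
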

| Area | Tools/Concepts | What to say |
|---|---|---|
| Metrics | Prometheus, Mimir, Thanos, VictoriaMetrics | Cardinality, retention, SLOs, alert quality. |
| Dashboards | Grafana | Dashboards should support decisions. |
| Logs | Loki, OpenSearch, Elastic | Correlate with deploys and incidents. |
| Traces | OpenTelemetry, Jaeger, Tempo | Useful for request path and dependency latency. |
| GPU telemetry | DCGM exporter | GPU health, utilization, memory, ECC, Xid. |
| Incident | PagerDuty/Opsgenie, incident docs | On-call hygiene, escalation, postmortems. |
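
The metrics row mentions SLOs and alert quality. One common way to connect the two is error-budget burn rate, sketched below. The function names are hypothetical, and the 14.4 default threshold is an assumption borrowed from commonly cited multi-window alerting guidance, not a value from these notes.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(short_window: float, long_window: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, which filters out brief spikes."""
    return short_window >= threshold and long_window >= threshold


if __name__ == "__main__":
    # 5 failures in 1000 requests against a 99.9% SLO burns budget about 5x too fast.
    print(burn_rate(errors=5, total=1000, slo_target=0.999))
```

Requiring both windows to exceed the threshold is one way to improve alert quality: the long window shows the problem is sustained, the short window shows it is still happening.
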

| Language | Strong use |
|---|---|
| Python | Operational tools, APIs, data processing, fast automation. |
| Go | Kubernetes controllers, CLIs, high-concurrency platform services. |
| Bash | Glue and debugging, but keep production automation bounded. |
| YAML/HCL | Kubernetes/IaC declarations; validate aggressively. |

Be ready to discuss hybrid environments:
- NVIDIA likely spans cloud and data center patterns.
- Cloud gives managed control planes and autoscaling APIs.
- On-prem/HPC gives hardware locality, storage/network topology, and stronger fleet lifecycle concerns.
- The transferable skill is reasoning across scheduler, node, GPU, network, storage, and deployment contracts.

| JD skill family | Prep pages |
|---|---|
| Kubernetes | Kubernetes Deep Dive, Kubernetes and GPUs, Scheduling and Autoscaling. |
| DevOps/CI/CD | GitOps and Delivery, Automation and IaC. |
| Infrastructure automation | Automation and IaC, Python and Reliability Code, Break-Fix Automation. |
| Linux systems | Linux, Networking, Storage. |
| Networking | Networking and Traffic, Linux, Networking, Storage. |
| Observability | Observability and Incidents, Modern Tool Matrix. |
| GPU/AI inference | Inference Operations, GPU Operator Runbook, GPU Inference Platform. |
| Security | Security and Policy. |
| Senior leadership | Staff-Level Narratives, 30-60-90 Plan. |