| SLO | Service Level Objective. Target reliability or latency level. |
| SLA | Service Level Agreement. External commitment, often contractual. |
| SLI | Service Level Indicator. Metric used to measure performance against an SLO. |
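| MTTR | Mean Time To Recovery (or Repair). Average time to restore service after a failure. |
| MTTD | Mean Time To Detect. Average time from the start of an incident to its detection. |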
| IaC | Infrastructure as Code. |
| PDB | Pod Disruption Budget. Kubernetes object that limits how many pods of an application can be down at once due to voluntary disruptions. |
| CNI | Container Network Interface. Kubernetes networking plugin interface. |
| CSI | Container Storage Interface. Kubernetes storage plugin interface. |
| DCGM | NVIDIA Data Center GPU Manager for GPU telemetry and health. |
| MIG | Multi-Instance GPU. Partitions a supported NVIDIA GPU into isolated GPU instances. |
| CUDA | NVIDIA GPU programming platform/runtime. |
| TensorRT | NVIDIA inference optimization/runtime stack. |
| Triton | NVIDIA inference server. |
| Xid | NVIDIA GPU driver error event identifier. |
| Canary | Limited rollout to detect regressions before broad exposure. |
| Shadow traffic | Mirroring production-like requests to a new version without serving user responses from it. |
| Dynamic batching | Combining requests to improve accelerator utilization while controlling wait time. |
| Tail latency | High-percentile latency, usually p95 or p99. Often more important than the average, which can hide the slowest requests. |
| Drift | Difference between declared infrastructure/config and actual state. |
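| Cordon | Mark a Kubernetes node unschedulable so that no new pods are placed on it. Existing pods keep running. |
| Drain | Evict workloads from a Kubernetes node, typically after cordoning it and subject to any PDBs. |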
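
The SLI, SLO, and tail latency entries above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not a prescribed implementation: the function names `percentile` and `error_budget_remaining`, the nearest-rank percentile method, and all sample numbers are assumptions chosen for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent; a negative value means the SLO is missed.

    slo_target is the SLO as a fraction (e.g. 0.999); good_events / total_events is the SLI.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical latency samples in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900] * 2

mean_ms = statistics.mean(latencies_ms)  # average is pulled up only slightly by the slow requests
p50_ms = percentile(latencies_ms, 50)    # median request is fast
p99_ms = percentile(latencies_ms, 99)    # tail exposes the slow requests

# Hypothetical availability SLO of 99.9% over 1,000,000 requests, 500 of which failed.
budget_left = error_budget_remaining(0.999, 999_500, 1_000_000)
```

With these made-up samples the mean stays below 40 ms while p99 is 900 ms, which is why tail percentiles are usually tracked alongside or instead of averages. In the availability example, 500 failures against an allowance of roughly 1,000 leave about half of the error budget.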