Acronyms and Glossary

Term	Meaning
SLO	Service Level Objective. Target reliability or latency level.
SLA	Service Level Agreement. External commitment, often contractual.
SLI	Service Level Indicator. Metric used to measure performance against an SLO.
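MTTR	Mean Time To Recovery (or Repair). Average time to restore service after an incident.
MTTD	Mean Time To Detect. Average time between the start of an incident and its detection.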
IaC	Infrastructure as Code.
PDB	PodDisruptionBudget. Kubernetes object that limits how many pods of an application can be down at once due to voluntary disruptions.
CNI	Container Network Interface. Plugin specification Kubernetes uses for pod networking.
CSI	Container Storage Interface. Plugin specification Kubernetes uses for storage.
DCGM	NVIDIA Data Center GPU Manager for GPU telemetry and health.
MIG	Multi-Instance GPU. Partitions a supported NVIDIA GPU into isolated instances.
CUDA	NVIDIA GPU programming platform/runtime.
TensorRT	NVIDIA inference optimization/runtime stack.
Triton	NVIDIA inference server.
Xid	NVIDIA GPU driver error event identifier.
Canary	Limited rollout to detect regressions before broad exposure.
Shadow traffic	Mirroring production-like requests to a new version without serving user responses from it.
Dynamic batching	Combining requests to improve accelerator utilization while controlling wait time.
Tail latency	High-percentile latency, usually p95 or p99. Often more important than the average.
Drift	Difference between declared infrastructure/config and actual state.
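Cordon	Mark a Kubernetes node unschedulable so no new pods are placed on it. Existing pods keep running.
Drain	Cordon a Kubernetes node and evict its workloads, typically before maintenance.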
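
The tail latency entry above can be illustrated with a small sketch. It uses the nearest-rank percentile method, which is one common convention and not necessarily what a given monitoring system uses. The `percentile` helper and the latency values are hypothetical, chosen only to show how an average can hide slow requests.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds:
# 95 fast requests and 5 slow ones.
latencies_ms = [20] * 95 + [900] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With this made-up data the mean is 64 ms and the median is 20 ms, while p99 is 900 ms. The slowest requests are visible only in the high percentile.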