
Acronyms and Glossary

| Term | Meaning |
| --- | --- |
| SLO | Service Level Objective. Target reliability or latency level. |
| SLA | Service Level Agreement. External commitment, often contractual. |
| SLI | Service Level Indicator. Metric used to measure SLO. |
| MTTR | Mean Time To Recovery or Repair. |
| MTTD | Mean Time To Detect. |
| IaC | Infrastructure as Code. |
| PDB | Pod Disruption Budget. Kubernetes object limiting voluntary disruption. |
| CNI | Container Network Interface. Kubernetes networking plugin interface. |
| CSI | Container Storage Interface. Kubernetes storage plugin interface. |
| DCGM | NVIDIA Data Center GPU Manager for GPU telemetry and health. |
| MIG | Multi-Instance GPU, partitioning supported NVIDIA GPUs. |
| CUDA | NVIDIA GPU programming platform/runtime. |
| TensorRT | NVIDIA inference optimization/runtime stack. |
| Triton | NVIDIA inference server. |
| Xid | NVIDIA GPU driver error event identifier. |
| Canary | Limited rollout to detect regressions before broad exposure. |
| Shadow traffic | Mirroring production-like requests to a new version without serving user responses from it. |
| Dynamic batching | Combining requests to improve accelerator utilization while controlling wait time. |
| Tail latency | High percentile latency, usually p95/p99, often more important than average. |
| Drift | Difference between declared infrastructure/config and actual state. |
| Cordon | Mark Kubernetes node unschedulable. |
| Drain | Evict workloads from a Kubernetes node. |
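
As a small illustration of the "Tail latency" entry, the sketch below compares median, mean, and high-percentile latency on a made-up sample. The latency values and the `percentile` helper are assumptions introduced here for illustration (a simple nearest-rank method); monitoring systems may compute percentiles differently.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 90 fast requests, 10 slow ones.
latencies_ms = [20] * 90 + [900] * 10

median_ms = statistics.median(latencies_ms)
mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"median={median_ms} ms, mean={mean_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

In this made-up sample the median and the mean both sit far below what the slowest tenth of requests experience, which is why p95/p99 are usually tracked alongside, or instead of, the average.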