
Security and Policy

For a senior DevOps/infrastructure automation role, security is part of reliability. A tool that can drain nodes, change routing, or update GPU drivers must have least privilege, auditability, and safe defaults.

flowchart TD
  Request[Automation request] --> Identity[Service identity]
  Identity --> RBAC[Scoped RBAC]
  RBAC --> Policy[Admission and policy checks]
  Policy --> Guardrails[Blast radius guardrails]
  Guardrails --> Action[Approved action]
  Action --> Audit[Audit event and metrics]
  Policy -- denied --> Deny[Reject with reason]
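
The flow above can be sketched as a chain of checks in which every stage leaves an audit record. This is a minimal illustration, not a real authorization system: the `gpu-repair-bot` identity, the `gpu-a100` pool, the "reason required" policy, and the in-flight limit are all assumptions made up for the example, and unlike the diagram the sketch rejects with a reason at every stage, not only at the policy check.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    identity: str   # service identity making the request
    verb: str       # e.g. "cordon" or "drain"
    node_pool: str  # target node pool
    reason: str     # why the automation wants to act
    in_flight: int  # similar actions already running


# Hypothetical grants: identity -> allowed (verb, node pool) pairs.
RBAC_GRANTS = {"gpu-repair-bot": {("cordon", "gpu-a100"), ("drain", "gpu-a100")}}
# Hypothetical blast radius limit: one action at a time.
MAX_IN_FLIGHT = 1


@dataclass
class Decision:
    allowed: bool
    reason: str
    audit: list = field(default_factory=list)


def evaluate(req: Request) -> Decision:
    audit = []

    def check(stage: str, ok: bool) -> bool:
        # Record every stage, whether it passed or failed.
        audit.append({"stage": stage, "ok": ok, "identity": req.identity})
        return ok

    if not check("identity", req.identity in RBAC_GRANTS):
        return Decision(False, "unknown service identity", audit)
    if not check("rbac", (req.verb, req.node_pool) in RBAC_GRANTS[req.identity]):
        return Decision(False, f"{req.identity} may not {req.verb} pool {req.node_pool}", audit)
    if not check("policy", bool(req.reason.strip())):
        return Decision(False, "policy requires a stated reason", audit)
    if not check("guardrails", req.in_flight < MAX_IN_FLIGHT):
        return Decision(False, "blast radius limit reached", audit)
    check("action", True)
    return Decision(True, "approved", audit)
```

Returning the audit trail with the decision means a denied request is as visible as an approved one, which is what makes the "reject with reason" branch useful during an incident.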

| Area | What to know |
| --- | --- |
| RBAC | Least privilege roles for humans, automation, and controllers. |
| Admission control | Enforce production contracts before workloads land. |
| Pod Security Standards | Restricted/baseline/privileged posture by namespace and workload class. |
| Secrets | External secrets, KMS, rotation, no secrets in Git. |
| Network policy | Default deny where practical, explicit service communication. |
| Image security | Signed images, SBOMs, vulnerability scanning, trusted registries. |
| Runtime security | Seccomp, AppArmor/SELinux, non-root, read-only filesystems where possible. |
| Audit logs | Who changed what, when, and through which automation. |
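
To make the admission control, image security, and runtime security rows concrete, here is a rough sketch of the kind of check a policy engine performs on a pod spec. It is a simplified stand-in, not the actual Pod Security Standards rule set, and the trusted registry prefix `registry.example.internal/` is a hypothetical value.

```python
# Hypothetical trusted registry prefix; replace with your own.
TRUSTED_REGISTRIES = ("registry.example.internal/",)


def violations(pod_spec: dict) -> list[str]:
    """Return human-readable policy violations for a pod spec dict."""
    found = []
    if pod_spec.get("hostNetwork") or pod_spec.get("hostPID"):
        found.append("host namespaces are not allowed")
    pod_sc = pod_spec.get("securityContext", {})
    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext", {})
        if sc.get("privileged"):
            found.append(f"{name}: privileged container")
        # The container-level setting wins; fall back to the pod-level one.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            found.append(f"{name}: runAsNonRoot must be true")
        if sc.get("readOnlyRootFilesystem") is not True:
            found.append(f"{name}: root filesystem must be read-only")
        if not c.get("image", "").startswith(TRUSTED_REGISTRIES):
            found.append(f"{name}: image is not from a trusted registry")
    return found
```

In practice these rules would live in an admission policy rather than in application code. The sketch only shows the shape of the contract: a workload either passes with no violations or is rejected with specific reasons.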

Tooling options:

  • Native admission policies (such as ValidatingAdmissionPolicy) where sufficient.
  • Kyverno for Kubernetes-native policy authoring, with mutation and validation.
  • OPA Gatekeeper for Rego-based policy.
  • Sigstore/cosign for image signing and verification workflows.
  • SOPS or External Secrets Operator for GitOps-friendly secret management.

For GPU workloads:

  • Privileged access for GPU stack components should be isolated to system namespaces.
  • Workloads should not get broad host access just because they need GPUs.
  • MIG can help isolate tenants on supported hardware.
  • Time-slicing and MPS provide weaker isolation than MIG; be explicit about which tenants are trusted to share a GPU.
  • Debug containers and nvidia-smi access should respect RBAC and the incident workflow.

For repair automation:

  • Read nodes, pods, events, metrics.
  • Cordon/drain only scoped node pools.
  • Create events and audit records.
  • Update labels/taints only in approved prefixes.
  • No broad cluster-admin unless truly unavoidable.
  • Separate recommendation mode from execution mode.
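
A few of the rules above (scoped node pools, approved label prefixes, and the split between recommendation and execution) can be sketched as follows. The prefix `repair.example.com/`, the pool name `gpu-a100`, and the `apply_fn` callback are hypothetical placeholders. A real tool would call the Kubernetes API inside `apply_fn` and would still rely on RBAC as the actual enforcement layer.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    RECOMMEND = "recommend"  # report what would change, touch nothing
    EXECUTE = "execute"      # actually apply the change


# Hypothetical scope for this automation.
APPROVED_LABEL_PREFIXES = ("repair.example.com/",)
ALLOWED_NODE_POOLS = {"gpu-a100"}


def update_labels(node_pool: str, labels: dict, mode: Mode,
                  apply_fn: Callable[[dict], None]) -> dict:
    if node_pool not in ALLOWED_NODE_POOLS:
        return {"status": "denied", "reason": f"node pool {node_pool!r} is out of scope"}
    rejected = sorted(k for k in labels if not k.startswith(APPROVED_LABEL_PREFIXES))
    if rejected:
        return {"status": "denied", "reason": f"labels outside approved prefixes: {rejected}"}
    if mode is Mode.RECOMMEND:
        # Same validation as execution, but no side effects.
        return {"status": "recommended", "labels": labels}
    apply_fn(labels)
    return {"status": "executed", "labels": labels}
```

Running the same validation in both modes keeps the recommendation honest: anything the tool recommends is something it would also be permitted to execute.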

Question: “Your automation needs permissions to reboot or drain GPU nodes. How do you secure it?”

Answer:

  1. Scope RBAC to specific node pools and verbs.
  2. Require dry-run plan and approval for high-risk actions.
  3. Use service identity with short-lived credentials where possible.
  4. Emit audit events for plan, approval, action, and result.
  5. Add concurrency and capacity guardrails.
  6. Keep break-glass separate from normal automation.
  7. Test permissions in staging and verify denied paths.
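
Steps 2, 4, and 5 of the answer can be illustrated with a small sketch: a concurrency and capacity guardrail, plus audit events for plan, approval, action, and result. The thresholds (one concurrent drain, 80% of nodes kept available), the `PoolState` fields, and the `drain_fn` callback are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PoolState:
    total_nodes: int
    unavailable_nodes: int  # cordoned, draining, or NotReady
    draining_nodes: int


def can_drain_one_more(state: PoolState, max_concurrent_drains: int = 1,
                       min_available_fraction: float = 0.8) -> tuple[bool, str]:
    if state.draining_nodes >= max_concurrent_drains:
        return False, "concurrency limit reached"
    available_after = state.total_nodes - state.unavailable_nodes - 1
    if available_after < min_available_fraction * state.total_nodes:
        return False, "capacity floor would be breached"
    return True, "ok"


def drain_with_audit(node: str, state: PoolState, approved_by: Optional[str],
                     drain_fn: Callable[[str], None]) -> list[dict]:
    events = [{"event": "plan", "node": node}]
    ok, reason = can_drain_one_more(state)
    if not ok:
        events.append({"event": "result", "node": node, "outcome": f"denied: {reason}"})
        return events
    if approved_by is None:
        events.append({"event": "result", "node": node, "outcome": "denied: approval missing"})
        return events
    events.append({"event": "approval", "node": node, "by": approved_by})
    events.append({"event": "action", "node": node})
    drain_fn(node)
    events.append({"event": "result", "node": node, "outcome": "drained"})
    return events
```

Denied paths emit a result event too, which supports step 7: a staging test can assert that an unapproved or over-limit request produces a denial record and never reaches `drain_fn`.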