Security and Policy

Baseline

For a senior DevOps/infrastructure automation role, security is part of reliability. A tool that can drain nodes, change routing, or update GPU drivers must run with least privilege, leave an audit trail, and default to safe behavior.

flowchart TD
  Request[Automation request] --> Identity[Service identity]
  Identity --> RBAC[Scoped RBAC]
  RBAC --> Policy[Admission and policy checks]
  Policy --> Guardrails[Blast radius guardrails]
  Guardrails --> Action[Approved action]
  Action --> Audit[Audit event and metrics]
  Policy -- denied --> Deny[Reject with reason]
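
The flow above can be sketched as an ordered chain of checks that stops at the first denial and records an audit event either way. This is only an illustration under stated assumptions: `Request`, `GRANTS`, `APPROVED_ACTIONS`, `AUDIT_LOG`, and the check functions are hypothetical names, the audit sink is an in-memory list, and unlike the diagram (which shows denial only at the policy step) every stage here is allowed to deny.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Request:
    identity: str   # service identity making the request
    action: str     # e.g. "cordon" or "drain"
    node_pool: str  # target node pool


# A check returns None to allow, or a reason string to deny.
Check = Callable[[Request], Optional[str]]

# Hypothetical data: which identity may do what, and where.
GRANTS: Dict[str, Set[Tuple[str, str]]] = {
    "gpu-repair-bot": {("drain", "gpu-pool-a")},
}
APPROVED_ACTIONS = {"cordon", "drain"}
AUDIT_LOG: List[dict] = []  # stand-in for a real audit sink


def check_rbac(req: Request) -> Optional[str]:
    if (req.action, req.node_pool) not in GRANTS.get(req.identity, set()):
        return "rbac: identity is not granted this action on this pool"
    return None


def check_policy(req: Request) -> Optional[str]:
    if req.action not in APPROVED_ACTIONS:
        return "policy: action is not on the approved list"
    return None


def make_guardrail(in_flight: int, max_in_flight: int) -> Check:
    """Build a blast-radius check from the current in-flight count."""
    def check(req: Request) -> Optional[str]:
        if in_flight >= max_in_flight:
            return "guardrail: too many actions already in flight"
        return None
    return check


def handle(req: Request, checks: List[Check]) -> bool:
    """Run checks in order; audit the outcome whether approved or denied."""
    for check in checks:
        reason = check(req)
        if reason is not None:
            AUDIT_LOG.append({"identity": req.identity, "action": req.action,
                              "node_pool": req.node_pool, "result": "denied",
                              "reason": reason})
            return False
    AUDIT_LOG.append({"identity": req.identity, "action": req.action,
                      "node_pool": req.node_pool, "result": "approved",
                      "reason": ""})
    return True
```

Returning the denial reason as data, rather than raising, keeps the "reject with reason" path as easy to audit and test as the approved path.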

Kubernetes Security Areas

Area	What to know
RBAC	Least privilege roles for humans, automation, and controllers.
Admission control	Enforce production contracts before workloads land.
Pod Security Standards	Restricted/baseline/privileged posture by namespace and workload class.
Secrets	External secrets, KMS, rotation, no plaintext secrets in Git.
Network policy	Default deny where practical, explicit service communication.
Image security	Signed images, SBOMs, vulnerability scanning, trusted registries.
Runtime security	Seccomp, AppArmor/SELinux, non-root, read-only filesystems where possible.
Audit logs	Who changed what, when, and through which automation.

Policy Tools

Native admission policies (for example ValidatingAdmissionPolicy and Pod Security Admission) where sufficient.
Kyverno for Kubernetes-native policy authoring and mutation/validation.
OPA Gatekeeper for Rego-based policy.
Sigstore/cosign for image signing workflows.
SOPS/external-secrets for GitOps-friendly secret management.

GPU-Specific Security

Privileged access for GPU stack components should be isolated to system namespaces.
Workloads should not get broad host access just because they need GPUs.
MIG can help isolate tenants on supported hardware.
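Time-slicing and MPS share one GPU without MIG's hardware partitioning, so isolation between tenants is weaker; be explicit about which tenants trust each other before co-locating them.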
Debug containers and nvidia-smi access should respect RBAC and incident workflow.
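
As a rough illustration of "no broad host access just because a workload needs GPUs", the rule can be written as a check over a pod spec. In a real cluster this would normally live in an admission policy rather than in Python; the function name `gpu_pod_violations` is hypothetical, the sketch assumes the NVIDIA device plugin's `nvidia.com/gpu` resource name, and it inspects only `containers` (not init or ephemeral containers).

```python
from typing import List


def gpu_pod_violations(pod_spec: dict) -> List[str]:
    """Return reasons a GPU-requesting pod spec asks for broad host access."""
    violations: List[str] = []
    containers = pod_spec.get("containers", [])

    # Only pods that request a GPU are in scope for this rule.
    requests_gpu = any(
        "nvidia.com/gpu" in c.get("resources", {}).get("limits", {})
        for c in containers
    )
    if not requests_gpu:
        return violations

    # Host namespaces widen access far beyond the GPU device.
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(field):
            violations.append(f"{field} is enabled")

    # hostPath volumes expose the node filesystem.
    for vol in pod_spec.get("volumes", []):
        if "hostPath" in vol:
            violations.append(f"volume {vol.get('name')} uses hostPath")

    # Privileged containers bypass most runtime restrictions.
    for c in containers:
        if c.get("securityContext", {}).get("privileged"):
            violations.append(f"container {c.get('name')} is privileged")

    return violations
```

A spec that only sets a `nvidia.com/gpu` limit returns an empty list; one that also sets `hostPID: true` or mounts a `hostPath` volume returns one entry per problem, which maps directly onto a "reject with reason" response.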

Automation Permissions

For repair automation:

Read nodes, pods, events, metrics.
Cordon/drain only scoped node pools.
Create events and audit records.
Update labels/taints only in approved prefixes.
No broad cluster-admin unless truly unavoidable.
Separate recommendation mode from execution mode.
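
The permissions above might translate into something like the following sketch, which builds a ClusterRole as a plain dictionary and adds a label-prefix check plus a recommend/execute switch. Several things here are assumptions: the function names, the `repair.example.com/` prefix, and the choice to scope by `resourceNames`. Kubernetes RBAC cannot select nodes by label, so pool scoping is approximated by listing node names; a real drain may also need read access not shown here (for example to DaemonSets or the metrics API).

```python
from typing import Dict, List

# Hypothetical label/taint key prefix owned by the repair automation.
APPROVED_PREFIXES = ("repair.example.com/",)


def repair_cluster_role(name: str, node_names: List[str]) -> dict:
    """Build a least-privilege ClusterRole manifest for repair automation."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            # Read-only visibility into nodes, pods, and events.
            {"apiGroups": [""], "resources": ["nodes", "pods", "events"],
             "verbs": ["get", "list", "watch"]},
            # Cordon and label/taint changes are patches, limited to named nodes.
            {"apiGroups": [""], "resources": ["nodes"],
             "resourceNames": node_names, "verbs": ["patch"]},
            # Drain evicts pods through the eviction subresource.
            {"apiGroups": [""], "resources": ["pods/eviction"],
             "verbs": ["create"]},
            # Allow the automation to record what it did.
            {"apiGroups": [""], "resources": ["events"], "verbs": ["create"]},
        ],
    }


def disallowed_label_keys(labels: Dict[str, str]) -> List[str]:
    """Return label keys that fall outside the approved prefixes."""
    return sorted(k for k in labels if not k.startswith(APPROVED_PREFIXES))


def plan_label_update(node: str, labels: Dict[str, str],
                      execute: bool = False) -> dict:
    """Produce a plan; the default is recommendation mode, not execution."""
    rejected = disallowed_label_keys(labels)
    if rejected:
        return {"node": node, "status": "rejected", "keys": rejected}
    status = "execute" if execute else "recommend"
    return {"node": node, "status": status, "labels": labels}
```

Because RBAC alone cannot enforce the label-prefix rule, a check like `disallowed_label_keys` would need to run in the automation itself or in an admission policy. Defaulting `execute` to `False` means a caller has to opt in to making changes.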

Interview Drill

Question: “Your automation needs permissions to reboot or drain GPU nodes. How do you secure it?”

Answer:

Scope RBAC to specific node pools and verbs.
Require a dry-run plan and approval for high-risk actions.
Use a service identity with short-lived credentials where possible.
Emit audit events for plan, approval, action, and result.
Add concurrency and capacity guardrails.
Keep break-glass separate from normal automation.
Test permissions in staging and verify denied paths.
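
The concurrency and capacity guardrails in the answer above could look roughly like this. It is a sketch under assumed thresholds: `PoolState`, `may_drain`, the limit of one concurrent drain, and the 80% availability floor are illustrative choices, not recommendations for any particular cluster.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PoolState:
    total_nodes: int
    unavailable_nodes: int  # already cordoned, draining, or unhealthy
    drains_in_flight: int


def may_drain(state: PoolState, max_concurrent: int = 1,
              min_available_percent: int = 80) -> Tuple[bool, str]:
    """Decide whether draining one more node stays within the guardrails."""
    if state.total_nodes <= 0:
        return False, "pool is empty"
    if state.drains_in_flight >= max_concurrent:
        return False, "concurrency limit reached"

    # Capacity left if this drain goes ahead; integer math avoids float edge cases.
    available_after = state.total_nodes - state.unavailable_nodes - 1
    if available_after * 100 < min_available_percent * state.total_nodes:
        return False, "capacity floor would be breached"
    return True, "ok"
```

With a 10-node pool and one node already unavailable, `may_drain(PoolState(10, 1, 0))` allows the drain because 8 of 10 nodes would remain; with two already unavailable it denies, since only 7 would remain. The denied cases are as worth exercising in staging as the allowed one.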