Security and Policy
Baseline
For a senior DevOps/infrastructure automation role, security is part of reliability. A tool that can drain nodes, change routing, or update GPU drivers must have least privilege, auditability, and safe defaults.
flowchart TD
  Request[Automation request] --> Identity[Service identity]
  Identity --> RBAC[Scoped RBAC]
  RBAC --> Policy[Admission and policy checks]
  Policy --> Guardrails[Blast radius guardrails]
  Guardrails --> Action[Approved action]
  Action --> Audit[Audit event and metrics]
  Policy -- denied --> Deny[Reject with reason]
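
The flow above can be sketched as an ordered chain of checks. This is a minimal illustration, not a real authorization system: the `GRANTS` table, the `change_ref` policy input, and the `MAX_NODES_PER_ACTION` limit are hypothetical names invented for the example. One deliberate choice in the sketch is that denials are audited as well as approvals.

```python
from dataclasses import dataclass


@dataclass
class Request:
    identity: str    # service identity presenting the request
    verb: str        # e.g. "drain" or "cordon"
    node_pool: str   # target node pool
    node_count: int  # how many nodes the action would touch
    change_ref: str  # ticket or change reference (hypothetical policy input)


# Hypothetical grants: identity -> allowed (verb, node_pool) pairs.
GRANTS = {"gpu-repair-bot": {("cordon", "gpu-a100"), ("drain", "gpu-a100")}}
MAX_NODES_PER_ACTION = 2  # hypothetical blast radius limit


def evaluate(req: Request, audit_log: list) -> tuple:
    """Walk the checks in flowchart order; return (allowed, reason)."""
    grants = GRANTS.get(req.identity)
    if grants is None:
        result = (False, "identity: unknown service identity")
    elif (req.verb, req.node_pool) not in grants:
        result = (False, f"rbac: {req.verb} on {req.node_pool} is not granted")
    elif not req.change_ref:
        result = (False, "policy: missing change reference")
    elif req.node_count > MAX_NODES_PER_ACTION:
        result = (False, f"guardrail: {req.node_count} nodes exceeds limit of {MAX_NODES_PER_ACTION}")
    else:
        result = (True, "approved")
    # Every decision is audited, including denials.
    audit_log.append({
        "identity": req.identity,
        "verb": req.verb,
        "node_pool": req.node_pool,
        "allowed": result[0],
        "reason": result[1],
    })
    return result
```

Returning the reason alongside the decision keeps the "reject with reason" branch explicit, so a denied caller can see which layer stopped the request.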
Kubernetes Security Areas
| Area | What to know |
|---|---|
| RBAC | Least privilege roles for humans, automation, and controllers. |
| Admission control | Enforce production contracts before workloads land. |
| Pod Security Standards | Restricted/baseline/privileged posture by namespace and workload class. |
| Secrets | External secrets, KMS, rotation, no secrets in Git. |
| Network policy | Default deny where practical, explicit service communication. |
| Image security | Signed images, SBOMs, vulnerability scanning, trusted registries. |
| Runtime security | Seccomp, AppArmor/SELinux, non-root, read-only filesystems where possible. |
| Audit logs | Who changed what, when, and through which automation. |
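
To make the Pod Security Standards and runtime security rows concrete, here is a rough sketch that checks a pod spec (as a plain dict) against a subset of the Restricted profile. It is illustrative only and not a substitute for Pod Security Admission: it covers a handful of rules and ignores ephemeral containers. The function name `restricted_violations` is made up for this example.

```python
def restricted_violations(pod_spec: dict) -> list:
    """Return readable violations for a subset of Restricted-profile rules."""
    violations = []
    pod_sc = pod_spec.get("securityContext") or {}
    containers = (pod_spec.get("containers") or []) + (pod_spec.get("initContainers") or [])
    for c in containers:
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext") or {}
        if sc.get("privileged"):
            violations.append(f"{name}: privileged must not be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        # A container-level runAsNonRoot overrides the pod-level value.
        run_as_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        if run_as_non_root is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        dropped = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            violations.append(f"{name}: capabilities.drop must include ALL")
        # A container-level seccompProfile takes precedence over the pod-level one.
        profile = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if profile.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return violations
```

An empty list means the spec passed these particular checks, not that it is fully compliant.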
Policy Tools
- Native admission policies where sufficient.
- Kyverno for Kubernetes-native policy authoring and mutation/validation.
- OPA Gatekeeper for Rego-based policy.
- Sigstore/cosign for image signing workflows.
- SOPS/external-secrets for GitOps-friendly secret management.
GPU-Specific Security
- Privileged access for GPU stack components should be isolated to system namespaces.
- Workloads should not get broad host access just because they need GPUs.
- MIG can help isolate tenants on supported hardware.
- Time-slicing/MPS may have weaker isolation; be explicit about tenant trust.
- Debug containers and `nvidia-smi` access should respect RBAC and incident workflow.
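
The first two points can be expressed as a simple check: a pod that requests GPUs outside a system namespace should not also carry broad host access. The sketch below assumes the NVIDIA device plugin resource name `nvidia.com/gpu`; the `SYSTEM_NAMESPACES` allowlist and the function names are hypothetical and would differ per cluster.

```python
GPU_RESOURCE = "nvidia.com/gpu"
SYSTEM_NAMESPACES = {"kube-system", "gpu-operator"}  # hypothetical allowlist


def requests_gpu(pod_spec: dict) -> bool:
    """True if any container sets a limit for the GPU resource."""
    for c in pod_spec.get("containers") or []:
        limits = (c.get("resources") or {}).get("limits") or {}
        if GPU_RESOURCE in limits:
            return True
    return False


def host_access_findings(namespace: str, pod_spec: dict) -> list:
    """List host-access settings on a GPU workload outside system namespaces."""
    if namespace in SYSTEM_NAMESPACES or not requests_gpu(pod_spec):
        return []
    findings = []
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(flag):
            findings.append(f"{flag} is enabled")
    for v in pod_spec.get("volumes") or []:
        if "hostPath" in v:
            findings.append(f"volume {v.get('name', '<unnamed>')} uses hostPath")
    for c in pod_spec.get("containers") or []:
        if (c.get("securityContext") or {}).get("privileged"):
            findings.append(f"container {c.get('name', '<unnamed>')} is privileged")
    return findings
```

In practice a check like this would more likely live in an admission policy than in application code; the Python form is only meant to show the rule.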
Automation Permissions
For repair automation:
- Read nodes, pods, events, metrics.
- Cordon/drain only scoped node pools.
- Create events and audit records.
- Update labels/taints only in approved prefixes.
- No broad cluster-admin unless truly unavoidable.
- Separate recommendation mode from execution mode.
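
A minimal sketch of how scoped pools, approved label prefixes, and the recommendation/execution split might fit together. The label key `example.com/node-pool`, the pool name, the prefix `repair.example.com/`, and the `apply` callback are all assumptions for illustration; in a real cluster, Kubernetes RBAC and admission policy should still enforce the same limits server-side.

```python
from enum import Enum


class Mode(Enum):
    RECOMMEND = "recommend"  # produce a plan only
    EXECUTE = "execute"      # apply the change


POOL_LABEL = "example.com/node-pool"               # hypothetical pool label key
ALLOWED_POOLS = {"gpu-a100"}                       # hypothetical scoped pools
ALLOWED_LABEL_PREFIXES = ("repair.example.com/",)  # hypothetical approved prefixes


def set_node_label(node: dict, key: str, value: str, mode: Mode, apply) -> dict:
    """Validate scope, then either recommend or execute a label update.

    `apply` is a caller-supplied function (node_name, key, value) that performs
    the real change; it is only called in EXECUTE mode after all checks pass.
    """
    pool = (node.get("labels") or {}).get(POOL_LABEL)
    if pool not in ALLOWED_POOLS:
        return {"status": "denied", "reason": f"node pool {pool!r} is out of scope"}
    if not key.startswith(ALLOWED_LABEL_PREFIXES):
        return {"status": "denied", "reason": f"label key {key!r} is outside approved prefixes"}
    plan = {"node": node["name"], "set_label": {key: value}}
    if mode is Mode.RECOMMEND:
        return {"status": "recommended", "plan": plan}
    apply(node["name"], key, value)
    return {"status": "executed", "plan": plan}
```

Because both modes share the same validation path, a recommendation shows exactly what execution would do, which makes the plan reviewable before any permission to mutate is exercised.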
Interview Drill
Question: “Your automation needs permissions to reboot or drain GPU nodes. How do you secure it?”
Answer:
- Scope RBAC to specific node pools and verbs.
- Require a dry-run plan and approval for high-risk actions.
- Use service identity with short-lived credentials where possible.
- Emit audit events for plan, approval, action, and result.
- Add concurrency and capacity guardrails.
- Keep break-glass separate from normal automation.
- Test permissions in staging and verify denied paths.
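
The concurrency and capacity guardrails in the answer can be illustrated with a small dry-run planner. This is a sketch under stated assumptions: the function name and parameters are hypothetical, `healthy` is assumed to count only schedulable nodes (nodes already draining are excluded), and every candidate is assumed to be currently healthy.

```python
def plan_drains(candidates, draining, healthy, total,
                max_concurrent=1, min_healthy_percent=80):
    """Return the candidate nodes that may be drained right now (dry-run plan).

    draining: nodes currently being drained by this automation
    healthy:  schedulable nodes in the pool (draining nodes not counted)
    total:    all nodes in the pool
    """
    # Concurrency guardrail: free drain slots.
    slots = max_concurrent - draining
    # Capacity guardrail: minimum healthy nodes, rounded up (ceiling division).
    required = (total * min_healthy_percent + 99) // 100
    budget = healthy - required
    allowed = max(0, min(slots, budget))
    return list(candidates)[:allowed]
```

For example, with 10 healthy nodes out of 10, `max_concurrent=2`, and an 80% floor, the planner admits two drains; once the pool is already at 8 healthy nodes it admits none. The returned list is only a plan, so it can be logged and sent for approval before anything is executed.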