Automation and IaC Q&A
101. What is the difference between a script and production automation?
Production automation has ownership, tests, dry run, audit logs, metrics, permissions, rollback, error handling, and documented operational boundaries.
102. What does idempotency mean in infrastructure automation?
Running the same operation repeatedly converges to the same intended state without duplicate side effects or escalating risk.
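A minimal sketch of that property, assuming a hypothetical in-memory `FakeCloud` in place of a real provider API. The names `FakeCloud`, `set_tags`, and `ensure_tags` are invented for illustration. The point is that the second call finds the state already converged and performs no write.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a provider API."""
    tags: dict[str, dict[str, str]] = field(default_factory=dict)
    write_calls: int = 0

    def set_tags(self, resource_id: str, tags: dict[str, str]) -> None:
        self.write_calls += 1
        self.tags[resource_id] = dict(tags)


def ensure_tags(cloud: FakeCloud, resource_id: str, desired: dict[str, str]) -> bool:
    """Converge tags to `desired`; return True only if a write was needed."""
    current = cloud.tags.get(resource_id, {})
    if current == desired:
        return False  # already converged: no side effect
    cloud.set_tags(resource_id, desired)
    return True


cloud = FakeCloud()
first = ensure_tags(cloud, "node-1", {"owner": "infra"})   # performs one write
second = ensure_tags(cloud, "node-1", {"owner": "infra"})  # no-op
```

Checking observed state before writing is only one way to get idempotency; server-side idempotency keys are another.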
103. What is a fake dry run?
A dry run that skips key API calls, validation, permission checks, capacity checks, or diff generation, so it gives false confidence.
104. What should dry run output include?
Target resources, planned changes, risk level, skipped resources, permissions needed, capacity impact, rollback path, and reasons for each action.
105. Why separate planning from execution?
Planning should be side-effect free, repeatable, and reviewable. Execution has side effects. Separating them improves testing, auditability, approvals, and rollback decisions.
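One way to sketch the split, assuming resources can be modeled as a set of names. `Action`, `plan`, and `execute` are hypothetical names. `plan` is a pure function that can be unit-tested and printed for review; only `execute` mutates anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str      # "create" or "delete"
    name: str
    reason: str  # recorded so reviewers see why each action exists


def plan(desired: set[str], observed: set[str]) -> list[Action]:
    """Pure function: the same inputs always give the same ordered plan."""
    creates = [Action("create", n, "declared but missing") for n in sorted(desired - observed)]
    deletes = [Action("delete", n, "present but not declared") for n in sorted(observed - desired)]
    return creates + deletes


def execute(actions: list[Action], live: set[str]) -> None:
    """Side-effecting step: mutates `live`, so it is kept out of plan()."""
    for a in actions:
        if a.op == "create":
            live.add(a.name)
        elif a.op == "delete":
            live.discard(a.name)


live = {"db-1", "old-cache"}
actions = plan({"db-1", "web-1"}, live)
for a in actions:
    print(f"{a.op:6} {a.name}: {a.reason}")  # reviewable dry-run output
execute(actions, live)
```

Because the plan is plain data, it can be stored, approved, and compared against what was actually executed.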
106. What is the danger of retrying every failed operation?
Retries can duplicate non-idempotent actions, overload dependencies, hide real failures, and make partial state harder to reason about.
107. How do you design safe retries?
Retry only known transient errors, use timeouts, exponential backoff with jitter, idempotency keys, bounded attempts, and visible failure state.
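A hedged sketch of bounded retries with full-jitter exponential backoff. `TransientError` and `call_with_retry` are hypothetical names, and which errors count as transient is an assumption that depends on the API being called. Timeouts and idempotency keys are left out to keep the example short.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors known to be safe to retry."""


def call_with_retry(op, *, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry `op` only on TransientError, with full-jitter exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == attempts:
                raise  # bounded attempts: surface the failure, do not hide it
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # jitter spreads out retry storms


calls = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("throttled")
    return "ok"


result = call_with_retry(flaky, sleep=lambda s: None)  # no real sleeping in the demo
```

Any exception other than `TransientError` propagates on the first attempt, which keeps unknown failures visible.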
108. What is a reconciliation loop?
A controller pattern that compares desired and observed state, computes a diff, applies safe changes, verifies results, and repeats until convergence.
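The loop can be sketched as follows, assuming state fits in a flat dictionary. `reconcile`, `observe`, and `apply` are hypothetical names, and a real controller would add cooldowns, error handling, and per-field ownership.

```python
def reconcile(desired: dict[str, int], observe, apply, max_rounds: int = 5) -> bool:
    """Compare desired vs observed, apply the diff, then re-observe.

    Returns True once an observation shows no diff, or False if convergence
    was not verified within `max_rounds`.
    """
    for _ in range(max_rounds):
        observed = observe()
        diff = {k: v for k, v in desired.items() if observed.get(k) != v}
        if not diff:
            return True  # verified convergence
        for key, value in diff.items():
            apply(key, value)
    return False  # bound reached: escalate instead of looping forever


state = {"replicas": 1}
converged = reconcile(
    {"replicas": 3, "max_surge": 1},
    observe=lambda: dict(state),   # fresh snapshot each round
    apply=state.__setitem__,
)
```

The `max_rounds` bound is a design choice: a loop that cannot converge should surface as a failure rather than keep mutating state.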
109. What is reconciliation thrash?
Automation repeatedly changes state back and forth due to stale observations, conflicting controllers, bad desired state, or missing cooldowns.
110. How do you avoid conflicting controllers?
Clear ownership boundaries, field managers, labels/annotations, single source of truth, controller-specific resources, and documented break-glass paths.
111. What is drift?
The live environment differs from declared desired state due to manual changes, failed applies, external mutation, or provider behavior.
112. Is all drift bad?
No. Emergency drift can be necessary. The problem is unmanaged drift without audit, expiry, or reconciliation back to source of truth.
113. What is the GitOps trap during incidents?
Manually hotfixing live state while the reconciler reverts it. Pause intentionally or commit the fix to source of truth.
114. How should break-glass work in GitOps?
Audited pause, explicit owner, time-bounded exception, incident link, post-incident commit or revert, and reconciliation verification.
115. What is Terraform/OpenTofu state risk?
State can contain sensitive data, drift from reality, be locked by failed runs, or be corrupted or mis-scoped, causing destructive plans.
116. How do you protect IaC state?
Remote encrypted backend, locking, access control, backups, state separation by blast radius, plan review, and cautious import/move workflows.
117. Why can a Terraform plan be misleading?
Provider bugs, unknown computed values, external drift, data source changes, lifecycle ignores, and apply-time API behavior can all make the applied result differ from the plan.
117A. What is the most dangerous Terraform state mistake in production?
Using the wrong backend or workspace for the environment. A prod configuration pointed at dev state, or dev credentials pointed at prod resources, can produce a plan that looks coherent while targeting the wrong blast radius.
117B. When is `terraform force-unlock` appropriate?
Only after proving no apply is still running and the lock is stale. Treat it like an incident action: identify the lock owner, timestamp, backend, and active CI jobs; get approval; leave an audit note; then refresh and re-plan before any apply.
117C. Why can `count` be dangerous for long-lived resources?
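Resources created with `count` are addressed by list index, so index changes shift addresses. Removing an element from the middle of a list makes Terraform treat every later element as a different resource, which can plan unnecessary destroys or replacements. Prefer stable `for_each` keys for persistent infrastructure.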
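The index-shift effect can be illustrated with a toy model in Python. This is not Terraform's planner; `addresses_by_index`, `addresses_by_key`, and `changed` are hypothetical helpers that only mimic how index-based and key-based addresses bind to values.

```python
def addresses_by_index(names: list[str]) -> dict[str, str]:
    """Mimic count-style addressing: the address is the list position."""
    return {f"bucket[{i}]": name for i, name in enumerate(names)}


def addresses_by_key(names: list[str]) -> dict[str, str]:
    """Mimic for_each-style addressing: the address is a stable key."""
    return {f'bucket["{name}"]': name for name in names}


def changed(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Addresses whose bound value differs or which disappear."""
    return sorted(a for a in before if after.get(a) != before[a])


old = ["logs", "media", "backups"]
new = ["logs", "backups"]  # "media" removed from the middle

by_index = changed(addresses_by_index(old), addresses_by_index(new))
by_key = changed(addresses_by_key(old), addresses_by_key(new))
print(by_index)  # ['bucket[1]', 'bucket[2]']
print(by_key)    # ['bucket["media"]']
```

With index addressing, removing one element touches two addresses, including the untouched "backups" entry. With key addressing, only the removed element is affected.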
117D. What problem do Terraform `moved` blocks solve?
They tell Terraform that a resource address changed because of a code refactor, not because infrastructure should be destroyed and recreated. They make refactors reviewable and declarative.
117E. What is the trap with `ignore_changes`?
It tells Terraform to ignore changes to the listed attributes when planning updates, which takes those fields out of Terraform's ownership. That is useful for externally managed fields but dangerous when teams forget those fields are ignored and assume plans cover all drift.
118. What is a dangerous IaC module abstraction?
One that hides critical lifecycle, security, or dependency behavior behind defaults users do not understand.
119. What makes a good IaC module?
Clear inputs/outputs, sane defaults, versioning, policy compliance, examples, tests, minimal hidden behavior, and explicit lifecycle constraints.
120. What is the difference between Terraform/OpenTofu and Crossplane?
Terraform/OpenTofu usually runs plans/applies from CI or operators. Crossplane exposes cloud resources as Kubernetes-style APIs reconciled continuously.
121. When is Crossplane risky?
When teams do not understand continuous reconciliation, provider permissions, drift ownership, or the blast radius of cluster-based cloud control planes.
122. What is the senior answer to "Ansible or Terraform"?
Terraform/OpenTofu for declarative infrastructure lifecycle; Ansible for configuration/procedural tasks. Avoid using either where a managed controller or immutable image is cleaner.
123. Why can config management fight immutable infrastructure?
If nodes are mutated after boot, image provenance and reproducibility weaken. Prefer baked images plus small bootstraps for fleet consistency.
124. How do you automate GPU node repair safely?
Detect, correlate, check capacity, cordon, drain respecting PDBs, repair with bounded action, validate GPU stack/workload, uncordon or escalate, and audit.
125. What is the first version of risky automation you should ship?
Recommendation mode that produces repair plans and confidence scores, before automatic execution for proven low-risk cases.
126. What metrics prove automation is working?
MTTR reduction, toil hours reduced, success rate, false positive rate, manual override rate, rollback rate, and incident recurrence reduction.
127. What is a hidden risk of automation success metrics?
High success rate may hide low coverage, overly conservative triggers, or actions that succeed technically but do not improve user impact.
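A small arithmetic sketch of how that happens. The numbers are invented for illustration, and `automation_report` and its field names are hypothetical.

```python
def automation_report(incidents: int, attempted: int, succeeded: int, user_impact_fixed: int) -> dict[str, float]:
    """Separate 'did the action succeed' from 'did it help, and how often was it tried'."""
    return {
        "coverage": attempted / incidents,               # share of incidents automation acted on
        "success_rate": succeeded / attempted,           # share of attempts that completed
        "effective_rate": user_impact_fixed / incidents, # share of incidents actually resolved
    }


# Illustrative numbers only: 200 incidents, automation attempted 20.
report = automation_report(incidents=200, attempted=20, succeeded=19, user_impact_fixed=12)
print(report)
```

In this made-up case the success rate is 95%, but automation touched 10% of incidents and resolved user impact in 6% of them.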
128. What is a dangerous CLI default?
Executing destructive actions by default. Safer CLIs default to dry run, require explicit target scope, and display planned impact.
129. How do you design destructive CLI flags?
Require explicit --execute, target selectors, max concurrency, confirmation/approval token, and visible plan ID.
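A minimal sketch using Python's `argparse`, assuming a hypothetical tool called `nodectl-drain`. The flag names `--selector`, `--max-concurrency`, and `--plan-id` are illustrative choices, not taken from any real CLI. Dry run is the default, and `--execute` is refused without a plan ID.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="nodectl-drain")  # hypothetical tool name
    p.add_argument("--selector", required=True, help="explicit target scope, e.g. pool=gpu-a")
    p.add_argument("--max-concurrency", type=int, default=1)
    p.add_argument("--execute", action="store_true", help="without this flag, only print the plan")
    p.add_argument("--plan-id", help="ID of the reviewed plan; required with --execute")
    return p


def main(argv: list[str]) -> str:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.execute and not args.plan_id:
        # parser.error prints usage and exits with status 2
        parser.error("--execute requires --plan-id from a reviewed dry run")
    mode = "EXECUTE" if args.execute else "DRY-RUN"
    return f"{mode} selector={args.selector} max_concurrency={args.max_concurrency}"


print(main(["--selector", "pool=gpu-a"]))
```

Making the selector required means there is no implicit "all targets" default to fall into by accident.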
130. What is the risk of string parsing command output?
CLI output changes and localization can break automation. Prefer structured APIs, JSON output, or client libraries.
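A short sketch of consuming structured output instead of scraping columns. The JSON shape and the field names `items`, `name`, and `ready` are assumptions made for the example, not the output of any specific tool.

```python
import json

# Hypothetical JSON output from a tool's machine-readable mode.
json_output = '{"items": [{"name": "node-1", "ready": true}, {"name": "node-2", "ready": false}]}'


def not_ready(raw: str) -> list[str]:
    """Read typed fields by name instead of guessing column positions."""
    data = json.loads(raw)
    return [item["name"] for item in data["items"] if not item["ready"]]


unhealthy = not_ready(json_output)
```

If a field is renamed, this fails loudly with a `KeyError` rather than silently misreading a shifted column.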
131. How do you test infrastructure automation?
Unit-test pure planning logic, integration-test API interactions, run in staging/canary pools, simulate failures, and verify rollback and audit outputs.
132. What is chaos testing useful for here?
Testing detection, isolation, repair, alerting, and human workflows under controlled failure, not randomly breaking production for theater.
133. What should never be hidden by automation?
Risk, ownership, target scope, skipped checks, manual approvals, failed validations, and customer-impacting side effects.
134. What is progressive delivery for infrastructure?
Roll out changes in small segments with health gates, observability, rollback, and promotion criteria rather than applying fleet-wide at once.
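A hedged sketch of wave-based rollout with a health gate after each wave. `progressive_rollout` and its callback parameters are hypothetical, and the health check here is a stub; a real gate would wait for metrics to settle before promoting.

```python
def progressive_rollout(waves, apply, healthy, rollback):
    """Apply wave by wave; on a failed gate, roll back everything applied so far."""
    applied = []
    for wave in waves:
        for target in wave:
            apply(target)
            applied.append(target)
        if not all(healthy(t) for t in wave):
            for target in reversed(applied):
                rollback(target)
            return {"promoted": False, "rolled_back": applied}
    return {"promoted": True, "rolled_back": []}


version = {}
waves = [["canary-1"], ["pool-a-1", "pool-a-2"], ["pool-b-1", "pool-b-2", "pool-b-3"]]
result = progressive_rollout(
    waves,
    apply=lambda t: version.__setitem__(t, "v2"),
    healthy=lambda t: t != "pool-a-2",  # simulate one unhealthy node in wave 2
    rollback=lambda t: version.__setitem__(t, "v1"),
)
```

In this run the third wave is never touched, because the second wave fails its gate.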
135. Why separate driver, node image, and model rollout?
Changing all at once confuses root cause, multiplies blast radius, and makes rollback ambiguous.
136. What is an artifact promotion chain?
A controlled path from build to test to staging/canary to production, preserving artifact identity and validation evidence.
137. Why sign images?
To ensure workloads run trusted artifacts from controlled pipelines, reducing supply-chain and accidental-image risk.
138. What is the value of an SBOM for infra teams?
It supports vulnerability response, provenance, compliance, and understanding what is running when incidents or CVEs hit.
139. What is policy-as-code good for?
Encoding repeatable guardrails for security, reliability, cost, and ownership before changes reach production.
140. What is policy-as-code bad at?
Replacing judgment. Policies can block emergencies, encode stale assumptions, or become noisy if not owned.
141. How do you roll out a new policy safely?
Audit mode first, measure violations, fix high-volume legitimate cases, communicate, then enforce in phases with an escape hatch.
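A minimal sketch of the audit-then-enforce idea, assuming a single hypothetical rule ("owner label required") and a hypothetical `policy-exempt` label as the escape hatch. Real policy engines have their own rule languages; this only shows the mode switch.

```python
def evaluate(resource: dict, mode: str) -> dict:
    """Check a hypothetical 'owner label required' rule in audit or enforce mode."""
    labels = resource.get("labels", {})
    violations = []
    if not labels.get("owner"):
        violations.append("missing owner label")
    exempt = labels.get("policy-exempt") == "true"  # hypothetical escape hatch
    blocked = mode == "enforce" and bool(violations) and not exempt
    # Violations are always reported, so audit mode can measure them before enforcing.
    return {"allowed": not blocked, "violations": violations}


unlabeled = {"name": "vm-1", "labels": {}}
audit_result = evaluate(unlabeled, mode="audit")
enforce_result = evaluate(unlabeled, mode="enforce")
```

The same rule runs in both modes; only the decision to block changes, which keeps audit data comparable with later enforcement.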
142. What is an automation kill switch?
A fast, documented way to stop automated actions globally or by scope when behavior is unsafe.
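A sketch of a scoped kill switch checked before every automated action. `KillSwitch` and `run_action` are hypothetical, and the in-memory set is an assumption; a real switch would live in a shared store that operators can flip quickly.

```python
class KillSwitch:
    """Hypothetical in-memory switch; "*" means stop everything."""

    def __init__(self) -> None:
        self._stopped: set[str] = set()

    def stop(self, scope: str = "*") -> None:
        self._stopped.add(scope)

    def resume(self, scope: str = "*") -> None:
        self._stopped.discard(scope)

    def allows(self, scope: str) -> bool:
        return "*" not in self._stopped and scope not in self._stopped


def run_action(switch: KillSwitch, scope: str, action) -> str:
    """Check the switch immediately before acting, not once at startup."""
    if not switch.allows(scope):
        return f"skipped: kill switch active for {scope}"
    action()
    return "executed"


switch = KillSwitch()
switch.stop("pool=gpu-a")
blocked = run_action(switch, "pool=gpu-a", lambda: None)
allowed = run_action(switch, "pool=gpu-b", lambda: None)
```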
143. What is a runbook launcher?
Tooling that maps detected incident class to the right checks, dashboards, commands, owners, and safe automation entry points.
144. What makes runbooks rot?
Unowned docs, changing systems, missing validation, incident-only usage, no links to automation, and no post-incident updates.
145. How do you keep runbooks fresh?
Tie them to alerts, test them in drills, version them with code, require post-incident updates, and remove steps replaced by automation.
146. What is a hidden danger of auto-remediation?
It can hide chronic failures, erase evidence, or repeatedly mask a capacity issue until it becomes a larger outage.
147. How do you preserve evidence during remediation?
Snapshot relevant logs/events/metrics/config, annotate incident timeline, retain failed pods where possible, and record automation decisions.
148. What is blast-radius limiting?
Constraining how many resources, tenants, regions, pools, or failure domains an action can affect at once.
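One way to sketch this is a batcher that caps each batch by fleet fraction and by failure domain. `bounded_batches` and its parameters are hypothetical names, and the limits shown are illustrative rather than recommended values.

```python
import math


def bounded_batches(targets: list[tuple[str, str]], max_fraction: float, max_per_domain: int):
    """Yield batches capped by fleet fraction and per-failure-domain count.

    `targets` is a list of (name, failure_domain) pairs.
    """
    if max_per_domain < 1:
        raise ValueError("max_per_domain must be at least 1")
    limit = max(1, math.floor(len(targets) * max_fraction))
    remaining = list(targets)
    while remaining:
        batch, per_domain, deferred = [], {}, []
        for name, domain in remaining:
            if len(batch) < limit and per_domain.get(domain, 0) < max_per_domain:
                batch.append(name)
                per_domain[domain] = per_domain.get(domain, 0) + 1
            else:
                deferred.append((name, domain))  # wait for a later batch
        yield batch
        remaining = deferred


fleet = [("n1", "az-a"), ("n2", "az-a"), ("n3", "az-b"), ("n4", "az-b")]
batches = list(bounded_batches(fleet, max_fraction=0.5, max_per_domain=1))
print(batches)  # [['n1', 'n3'], ['n2', 'n4']]
```

Each batch touches at most half the fleet and at most one node per zone, so a bad action cannot take out a whole failure domain in one step.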
149. What is the staff answer to "automate everything"?
Automate repeated, measurable, well-understood workflows with safety controls. Leave rare, ambiguous, high-risk decisions as recommendation-assisted until confidence is proven.
150. What is the strongest automation interview close?
Automation should make the safe path the easy path: plan, validate, execute narrowly, verify, audit, and learn from every exception.