Automation and IaC

Automation Philosophy

Good automation is not “a script that works once.” It is a controlled operational workflow with the following properties:

Idempotent
Observable
Authenticated and authorized
Auditable
Dry-run capable
Blast-radius limited
Rollback-aware
Tested against real failure cases
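Two of these properties, idempotence and dry-run support, are easy to show in a few lines. The following is a minimal sketch, not a production tool: `ensure_label` and `LabelResult` are hypothetical names, and a plain dictionary stands in for a real node or resource API.

```python
from dataclasses import dataclass


@dataclass
class LabelResult:
    changed: bool   # True only if this call mutated the target
    dry_run: bool   # True if the call was not allowed to mutate
    message: str    # Human-readable record suitable for an audit log


def ensure_label(labels: dict, key: str, value: str, dry_run: bool = True) -> LabelResult:
    """Converge `labels` so that labels[key] == value.

    Idempotent: calling it again after success reports no change.
    Dry-run capable: with dry_run=True it only reports what it would do.
    """
    if labels.get(key) == value:
        return LabelResult(False, dry_run, f"{key}={value} already set; nothing to do")
    if dry_run:
        return LabelResult(False, True, f"would set {key}={value}")
    labels[key] = value
    return LabelResult(True, False, f"set {key}={value}")
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: the caller has to opt in to mutation, which keeps an accidental invocation harmless.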

Automation Maturity Ladder

Level	Description
Manual runbook	Human executes commands. Useful for discovery, risky at scale.
Script	Repeatable but often lacks safety, state, and audit.
Tool	Has inputs, validation, logs, errors, dry run, and tests.
Workflow	Orchestrates multiple steps with gates, retries, and approvals.
Platform primitive	Owned, documented, observable, integrated into normal operations.

flowchart TD
  Manual[Manual runbook] --> Script[Scripted task]
  Script --> Idempotent[Idempotent tool]
  Idempotent --> Workflow[Workflow with gates]
  Workflow --> Primitive[Platform primitive]
  Primitive --> Learning[Incident learning and prevention]

IaC Interview Points

Discuss:

State management and locking.
Drift detection.
Immutable artifacts and version pinning.
Module boundaries.
Environment promotion.
Secret handling.
Policy-as-code.
Rollback and import strategy for existing resources.
Testing: lint, plan review, integration, post-apply validation.

Terraform/OpenTofu State Quirks

State is not bookkeeping trivia. It is the authority Terraform/OpenTofu uses to map configuration addresses to real infrastructure IDs. Senior interviewers will expect you to know where state lies, where it protects you, and where it can betray you.

Quirk	Why it matters	Senior response
Stale state	A plan can be based on outdated observations if refresh is skipped, blocked, or racing external changes.	Refresh intentionally, inspect provider/API health, and avoid approving high-risk plans from stale CI.
Lock stuck after failed apply	The backend may still hold a lock after a crashed job.	Confirm no active run, inspect backend metadata, then force-unlock only with audit and owner approval.
Sensitive values in state	Providers often store generated passwords, tokens, connection strings, or private endpoints in state.	Encrypt backend, restrict access, avoid broad CI logs/artifacts, and rotate if state exposure occurs.
`count` index churn	Removing an item from the middle of a counted list can shift addresses and propose destructive replacements.	Prefer `for_each` with stable keys for long-lived resources.
Provider schema change	Upgrading providers can reinterpret defaults or computed fields and create noisy or dangerous plans.	Pin providers, review changelogs, test upgrade plans in lower environments, and upgrade deliberately.
Import mismatch	Imported resources can have config that does not fully match live defaults, causing immediate drift.	Import, then reconcile config until plan is empty or intentionally documented.
Moved resource addresses	Refactors without `moved` blocks look like destroy/create even when infrastructure should be preserved.	Use `moved` blocks for code refactors and review address changes as production-risk changes.
`ignore_changes` abuse	It can hide real drift and make Terraform stop owning fields people assume it owns.	Use narrowly, document owner of ignored fields, and alert on ignored-field drift if it affects safety.
Data source instability	A data source result can change between plan and apply, especially for “latest” images or broad tag selectors.	Pin immutable IDs for production or fail the plan if selected artifacts are not approved.
Backend mis-scope	Accidentally pointing prod config at dev state or vice versa can create catastrophic plans.	Isolate credentials/backends, print workspace/backend in CI, and require environment-specific review.
Partial apply	Some resources may be changed before a later API call fails.	Treat failed apply as an incident: refresh state, inspect real resources, and converge deliberately.
Manual state surgery	`state rm`, `state mv`, and direct state edits can sever ownership or corrupt dependency assumptions.	Prefer declarative `moved` and import flows; when manual surgery is unavoidable, require peer review, a state backup, and a rollback plan.
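The `count` index churn row is easier to reason about with a small simulation. The sketch below does not call Terraform or reproduce its planner. It only models how positional addresses and key-based addresses bind to items, using a hypothetical resource named `example_pool.node`.

```python
def count_addresses(name, items):
    """Model `count`: each item is addressed by its position in the list."""
    return {f"{name}[{i}]": item for i, item in enumerate(items)}


def for_each_addresses(name, items):
    """Model `for_each`: each item is addressed by its own stable key."""
    return {f'{name}["{item}"]': item for item in items}


def address_diff(before, after):
    """Return (rebound, removed) addresses between two address maps.

    rebound: address still exists but now points at a different item.
    removed: address no longer exists at all.
    """
    rebound = sorted(a for a in before if a in after and before[a] != after[a])
    removed = sorted(a for a in before if a not in after)
    return rebound, removed


before = ["a100-east", "a100-west", "h100-east"]
after = ["a100-east", "h100-east"]  # the middle item was removed

count_result = address_diff(
    count_addresses("example_pool.node", before),
    count_addresses("example_pool.node", after),
)
for_each_result = address_diff(
    for_each_addresses("example_pool.node", before),
    for_each_addresses("example_pool.node", after),
)
```

With positional addressing, removing the middle item rebinds `example_pool.node[1]` to a different item and removes `example_pool.node[2]`, so two addresses are touched for one intended removal. With key-based addressing, only `example_pool.node["a100-west"]` disappears.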

flowchart TD
  Change[HCL change] --> Init[Init backend and providers]
  Init --> Refresh[Refresh observed state]
  Refresh --> Plan[Plan graph diff]
  Plan --> Review{Plan matches intent}
  Review -- no --> Fix[Fix config, import, moved block, or drift]
  Review -- yes --> Lock[Acquire state lock]
  Lock --> Apply[Apply actions]
  Apply --> Persist[Persist new state]
  Persist --> Validate[Post-apply validation]
  Validate --> Done[Release lock and record evidence]

State Incident Drills

Prompt: “A Terraform plan wants to replace the production GPU node pool after a harmless module refactor.”

Strong answer:

Stop the pipeline before apply.
Identify whether resource addresses changed because of a module path, `count` index, `for_each` key, provider alias, or workspace/backend mismatch.
Compare current state addresses with intended config addresses.
Use `moved` blocks or `terraform state mv` only after review and a state backup.
Re-plan until the diff expresses only intended changes.
Add a CI gate that fails plans with unexpected destroy/replace actions in protected environments.
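The CI gate in the last step can be sketched against the JSON plan representation produced by `terraform show -json <planfile>`, where each entry in `resource_changes` carries an `address` and a `change.actions` list. The function name, the allowlist mechanism, and the resource addresses in the usage note are assumptions for illustration.

```python
def destructive_changes(plan: dict, allowlist=frozenset()):
    """Return (address, kind) for every planned destroy or replace.

    `plan` is the parsed JSON plan. An action list containing "delete"
    is a destroy; "delete" together with "create" is a replacement.
    Addresses in `allowlist` are treated as reviewed and intentional.
    """
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" not in actions or rc["address"] in allowlist:
            continue
        kind = "replace" if "create" in actions else "destroy"
        findings.append((rc["address"], kind))
    return findings


def gate(plan: dict, allowlist=frozenset()) -> int:
    """Return a process exit code: 0 to proceed, 1 to block the pipeline."""
    findings = destructive_changes(plan, allowlist)
    for address, kind in findings:
        print(f"BLOCKED: unexpected {kind} of {address}")
    return 1 if findings else 0
```

In a pipeline this would typically load the plan with `json.load`, call `gate`, and pass the result to `sys.exit`. Keeping the allowlist in version control means an intentional replacement still needs a reviewed change to get through.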

Prompt: “State lock is stuck during an incident rollback.”

Strong answer:

Confirm no apply is still running in CI, local machines, or an automation controller.
Inspect backend lock metadata and timestamps.
Decide if the incident risk justifies force-unlock.
Force-unlock only with incident commander approval and audit note.
Immediately refresh and inspect state before any apply.
Capture prevention: job timeout cleanup, clearer lock ownership, and runbook for backend-specific locks.
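The first four steps amount to a checklist that can be encoded so the decision is explicit and logged. This is a hedged sketch: how the lock creation time and the list of active runs are obtained is backend-specific and not shown, and the 30-minute minimum age is an arbitrary placeholder, not a recommendation.

```python
from datetime import datetime, timedelta, timezone


def force_unlock_decision(lock_created, now, active_runs, approver,
                          min_age=timedelta(minutes=30)):
    """Return (allowed, reasons) for a proposed force-unlock.

    lock_created / now: timezone-aware datetimes.
    active_runs: identifiers of runs that may still hold the lock.
    approver: name of the incident commander who approved, or None.
    """
    reasons = []
    if active_runs:
        reasons.append(f"active runs still present: {sorted(active_runs)}")
    if now - lock_created < min_age:
        reasons.append("lock is younger than the minimum age")
    if not approver:
        reasons.append("no incident commander approval recorded")
    return (not reasons, reasons)
```

Returning the list of blocking reasons, rather than a bare boolean, gives the audit note its content for free.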

Prompt: “A provider upgrade creates a huge plan that looks like a no-op, full of default tag and metadata changes.”

Strong answer:

Treat provider upgrades as infrastructure changes, not dependency chores.
Read provider release notes for schema/default changes.
Test in lower environment with production-like resources.
Separate provider upgrade from functional infrastructure changes.
Add plan filters or policy checks for replacement/destruction.
Roll out by environment and keep the previous lockfile/provider version available.

Safe Repair Workflow

Example: automated GPU node quarantine and repair.

Detect unhealthy condition through trusted signals: DCGM, kubelet, pod failures, kernel logs.
Correlate signals to avoid false positives.
Cordon node and mark reason.
Respect PDBs and workload priority before drain.
Drain with timeout and exception handling.
Run repair action: restart runtime, reload driver if safe, reboot, or move to hardware diagnostics.
Validate node: driver, device plugin, DCGM, test pod, workload canary.
Uncordon only after validation.
Emit event, metrics, audit log, and incident annotation.
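The ordering constraints in these steps, especially "uncordon only after validation", can be sketched as a small orchestration function. The individual actions are injected as callables, so nothing here touches a real cluster. `repair_node` and its parameters are hypothetical names, and signal correlation, PDB checks, and metrics are left out for brevity.

```python
from typing import Callable, List


def repair_node(node: str,
                cordon: Callable[[str], None],
                drain: Callable[[str], bool],
                repair: Callable[[str], None],
                validate: Callable[[str], bool],
                uncordon: Callable[[str], None],
                audit: List[str]) -> bool:
    """Run cordon -> drain -> repair -> validate, uncordoning only on success.

    On any failure the node stays cordoned so it cannot take new work,
    and the audit trail records where the workflow stopped.
    """
    cordon(node)
    audit.append(f"cordoned {node}")
    if not drain(node):
        audit.append(f"drain failed for {node}; left cordoned for human review")
        return False
    audit.append(f"drained {node}")
    repair(node)
    audit.append(f"repair action run on {node}")
    if not validate(node):
        audit.append(f"validation failed for {node}; left cordoned")
        return False
    uncordon(node)
    audit.append(f"uncordoned {node}")
    return True
```

Injecting the actions also makes the failure paths testable without hardware: a test can pass a `validate` that returns `False` and assert that `uncordon` was never called.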

stateDiagram-v2
  [*] --> Detect
  Detect --> Diagnose
  Diagnose --> Plan
  Plan --> GuardrailCheck
  GuardrailCheck --> Execute: approved
  GuardrailCheck --> HumanReview: risky
  Execute --> Verify
  Verify --> Record
  HumanReview --> Execute: approved
  HumanReview --> Record: rejected
  Record --> [*]
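The state diagram above can be encoded as an explicit transition table, which makes illegal jumps (for example, Detect straight to Execute) fail loudly. This is one possible encoding: the event name `"next"` for the unlabeled edges is an assumption of this sketch, while `approved`, `risky`, and `rejected` come from the diagram.

```python
TRANSITIONS = {
    ("Detect", "next"): "Diagnose",
    ("Diagnose", "next"): "Plan",
    ("Plan", "next"): "GuardrailCheck",
    ("GuardrailCheck", "approved"): "Execute",
    ("GuardrailCheck", "risky"): "HumanReview",
    ("HumanReview", "approved"): "Execute",
    ("HumanReview", "rejected"): "Record",
    ("Execute", "next"): "Verify",
    ("Verify", "next"): "Record",
}


def run(events, state="Detect"):
    """Replay events from `state` and return the list of states visited.

    Raises ValueError on any transition the diagram does not allow.
    """
    path = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition from {state} on {event!r}")
        state = TRANSITIONS[key]
        path.append(state)
    return path
```

Note that every path, including a rejected human review, ends in `Record`, so a declined action still leaves an audit entry.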

Guardrails

Max concurrent nodes per pool.
Exclude critical pools unless an explicit flag is set.
Require approval for destructive or customer-impacting actions.
Precheck SLO and capacity headroom before drain.
Kill switch that halts all automated actions.
Reconciliation loop should converge, not thrash.
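Most of these guardrails reduce to a precheck that runs before any drain. The sketch below is illustrative only: `PoolStatus`, its fields, and the default thresholds (one concurrent node, 20% headroom) are assumptions, and the approval and convergence guardrails are not modeled.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    name: str
    critical: bool
    nodes_in_repair: int
    spare_capacity_fraction: float  # 0.0 to 1.0, share of capacity currently idle


def guardrail_violations(pool: PoolStatus, *, kill_switch_on: bool,
                         allow_critical: bool = False,
                         max_concurrent: int = 1,
                         min_headroom: float = 0.2):
    """Return a list of reasons the repair must not start; empty means go."""
    violations = []
    if kill_switch_on:
        violations.append("kill switch is engaged")
    if pool.critical and not allow_critical:
        violations.append(f"pool {pool.name} is critical and no explicit flag was set")
    if pool.nodes_in_repair >= max_concurrent:
        violations.append(f"pool {pool.name} already has {pool.nodes_in_repair} node(s) in repair")
    if pool.spare_capacity_fraction < min_headroom:
        violations.append(f"pool {pool.name} lacks capacity headroom")
    return violations
```

Collecting every violation instead of stopping at the first gives the operator the full picture in one log line.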

CI/CD For Infrastructure

Pipeline stages:

Static validation: formatting, lint, schema, policy.
Unit tests for modules and scripts.
Plan/diff generation.
Human review for high-risk changes.
Apply to dev/staging/canary pool.
Post-apply health checks.
Progressive promotion.
Automatic rollback or halt on health gate failure.
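The last three stages, progressive promotion with a halt on health gate failure, can be sketched as a loop over environments. `promote`, `apply`, and `healthy` are hypothetical names, and the sketch halts rather than rolls back, leaving rollback to the caller.

```python
def promote(stages, apply, healthy):
    """Apply a change stage by stage, halting at the first failed health gate.

    stages: ordered environment names, lowest risk first.
    apply: callable that applies the change to one stage.
    healthy: callable that returns True if post-apply checks pass.

    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in stages:
        apply(stage)
        if not healthy(stage):
            # Stop here: later stages never receive the change.
            return completed, stage
        completed.append(stage)
    return completed, None
```

For example, with stages `["dev", "staging", "canary", "prod"]` and a health check that fails on `canary`, the function returns `(["dev", "staging"], "canary")` and `prod` is never touched.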

Senior Answer Frame

Prompt: “How do you decide what to automate?”

Answer:

I prioritize workflows that are frequent, high-risk, time-sensitive, or require precision under incident pressure. I start by measuring toil and incident contribution, then build a small workflow with prechecks and observability. If it proves its value, I turn it into a supported path with tests, docs, ownership, and adoption metrics.