Automation and IaC

Good automation is not “a script that works once.” It is a controlled operational workflow:

  • Idempotent
  • Observable
  • Authenticated and authorized
  • Auditable
  • Dry-run capable
  • Blast-radius limited
  • Rollback-aware
  • Tested against real failure cases
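
Two of the properties above, idempotency and dry-run support, can be shown in a few lines. This is a minimal sketch rather than a real node API: `Node` and `ensure_label` are hypothetical names, and the node is an in-memory stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a real node object."""
    name: str
    labels: dict = field(default_factory=dict)


def ensure_label(node: Node, key: str, value: str, dry_run: bool = True) -> str:
    """Converge one label to the desired value and report what happened."""
    if node.labels.get(key) == value:
        return "unchanged"      # idempotent: already in the desired state
    if dry_run:
        return "would-change"   # dry run: report intent, mutate nothing
    node.labels[key] = value
    return "changed"


node = Node("gpu-node-1")
print(ensure_label(node, "quarantine", "true"))                 # would-change
print(ensure_label(node, "quarantine", "true", dry_run=False))  # changed
print(ensure_label(node, "quarantine", "true", dry_run=False))  # unchanged
```

Defaulting `dry_run` to `True` is a design choice in this sketch: the caller has to opt in to mutation, which makes an accidental run harmless.
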

| Level | Description |
| --- | --- |
| Manual runbook | Human executes commands. Useful for discovery, risky at scale. |
| Script | Repeatable but often lacks safety, state, and audit. |
| Tool | Has inputs, validation, logs, errors, dry run, and tests. |
| Workflow | Orchestrates multiple steps with gates, retries, and approvals. |
| Platform primitive | Owned, documented, observable, integrated into normal operations. |

flowchart TD
  Manual[Manual runbook] --> Script[Scripted task]
  Script --> Idempotent[Idempotent tool]
  Idempotent --> Workflow[Workflow with gates]
  Workflow --> Primitive[Platform primitive]
  Primitive --> Learning[Incident learning and prevention]

Discuss:

  • State management and locking.
  • Drift detection.
  • Immutable artifacts and version pinning.
  • Module boundaries.
  • Environment promotion.
  • Secret handling.
  • Policy-as-code.
  • Rollback and import strategy for existing resources.
  • Testing: lint, plan review, integration, post-apply validation.
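
For the drift detection item, one common approach is a scheduled plan run with `-detailed-exitcode`, which reports whether the diff is empty through the process exit code. The sketch below assumes the run happens on a branch with no unapplied config changes, so a non-empty diff points to drift. The helper names and the `binary` parameter are illustrative assumptions.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANING = {
    0: "in-sync",  # plan succeeded and the diff is empty
    1: "error",    # plan failed
    2: "changes",  # plan succeeded and the diff is non-empty
}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status; unknown codes are treated as errors."""
    return PLAN_EXIT_MEANING.get(code, "error")


def detect_drift(workdir: str, binary: str = "terraform") -> str:
    """Run a plan in `workdir` and classify the result (hypothetical helper)."""
    result = subprocess.run(
        [binary, "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


print(classify_plan_exit(2))  # changes
```

Treating unknown exit codes as errors keeps the check fail-closed: an unexpected result triggers investigation instead of being read as "no drift".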

State is not bookkeeping trivia. It is the authority Terraform/OpenTofu uses to map configuration addresses to real infrastructure IDs. Senior interviewers will expect you to know where state lies, where it protects you, and where it can betray you.

| Quirk | Why it matters | Senior response |
| --- | --- | --- |
| Stale state | A plan can be based on outdated observations if refresh is skipped, blocked, or racing external changes. | Refresh intentionally, inspect provider/API health, and avoid approving high-risk plans from stale CI. |
| Lock stuck after failed apply | The backend may still hold a lock after a crashed job. | Confirm no active run, inspect backend metadata, then force-unlock only with audit and owner approval. |
| Sensitive values in state | Providers often store generated passwords, tokens, connection strings, or private endpoints in state. | Encrypt backend, restrict access, avoid broad CI logs/artifacts, and rotate if state exposure occurs. |
| count index churn | Removing an item from the middle of a counted list can shift addresses and propose destructive replacements. | Prefer for_each with stable keys for long-lived resources. |
| Provider schema change | Upgrading providers can reinterpret defaults or computed fields and create noisy or dangerous plans. | Pin providers, review changelogs, test upgrade plans in lower environments, and upgrade deliberately. |
| Import mismatch | Imported resources can have config that does not fully match live defaults, causing immediate drift. | Import, then reconcile config until plan is empty or intentionally documented. |
| Moved resource addresses | Refactors without moved blocks look like destroy/create even when infrastructure should be preserved. | Use moved blocks for code refactors and review address changes as production-risk changes. |
| ignore_changes abuse | It can hide real drift and make Terraform stop owning fields people assume it owns. | Use narrowly, document owner of ignored fields, and alert on ignored-field drift if it affects safety. |
| Data source instability | A data source result can change between plan and apply, especially for “latest” images or broad tag selectors. | Pin immutable IDs for production or fail the plan if selected artifacts are not approved. |
| Backend mis-scope | Accidentally pointing prod config at dev state or vice versa can create catastrophic plans. | Isolate credentials/backends, print workspace/backend in CI, and require environment-specific review. |
| Partial apply | Some resources may be changed before a later API call fails. | Treat failed apply as an incident: refresh state, inspect real resources, and converge deliberately. |
| Manual state surgery | state rm, state mv, and direct state edits can sever ownership or corrupt dependency assumptions. | Prefer declarative moved and import flows; require peer review, backup, and rollback plan for state. |

flowchart TD
  Change[HCL change] --> Init[Init backend and providers]
  Init --> Refresh[Refresh observed state]
  Refresh --> Plan[Plan graph diff]
  Plan --> Review{Plan matches intent}
  Review -- no --> Fix[Fix config, import, moved block, or drift]
  Review -- yes --> Lock[Acquire state lock]
  Lock --> Apply[Apply actions]
  Apply --> Persist[Persist new state]
  Persist --> Validate[Post-apply validation]
  Validate --> Done[Release lock and record evidence]
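
The count index churn quirk from the table is easier to see with a toy model of resource addressing. The sketch below is not Terraform itself. It only mimics how position-based and key-based addresses behave when an item is removed from the middle of a list, and the pool names and helper functions are made up for illustration.

```python
def count_addresses(names):
    """count-style addressing: identity is the list position."""
    return {f"node_pool[{i}]": name for i, name in enumerate(names)}


def for_each_addresses(names):
    """for_each-style addressing: identity is a stable key."""
    return {f'node_pool["{name}"]': name for name in names}


def changed_addresses(before, after):
    """Addresses that now point at a different object or have disappeared."""
    return sorted(addr for addr in before if after.get(addr) != before[addr])


pools_before = ["a100", "h100", "l4"]
pools_after = ["a100", "l4"]  # "h100" removed from the middle

print(changed_addresses(count_addresses(pools_before), count_addresses(pools_after)))
# ['node_pool[1]', 'node_pool[2]']
print(changed_addresses(for_each_addresses(pools_before), for_each_addresses(pools_after)))
# ['node_pool["h100"]']
```

With positional addressing, removing one pool disturbs two addresses, including the untouched "l4" pool. With keyed addressing, only the removed pool is affected.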

Prompt: “A Terraform plan wants to replace the production GPU node pool after a harmless module refactor.”

Strong answer:

  1. Stop the pipeline before apply.
  2. Identify whether resource addresses changed because of module path, count index, for_each key, provider alias, or workspace/backend mismatch.
  3. Compare current state addresses with intended config addresses.
  4. Use moved blocks or terraform state mv only after review and backup.
  5. Re-plan until the diff expresses only intended changes.
  6. Add a CI gate that fails plans with unexpected destroy/replace actions in protected environments.
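
The CI gate in step 6 can be sketched against the machine-readable plan produced by `terraform show -json <planfile>`, where each entry in `resource_changes` carries a `change.actions` list and a replacement shows up as a delete combined with a create. The function names, the `allowed` parameter, and the sample plan below are illustrative assumptions, not a standard tool.

```python
def destructive_changes(plan: dict) -> list:
    """Addresses whose planned actions include a delete (destroy or replace)."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


def gate(plan: dict, allowed: frozenset = frozenset()) -> None:
    """Exit non-zero if the plan destroys anything not explicitly allowed."""
    unexpected = [a for a in destructive_changes(plan) if a not in allowed]
    if unexpected:
        raise SystemExit(f"unexpected destroy/replace: {unexpected}")


# Hand-written sample in the shape of the plan JSON (addresses are made up).
sample = {
    "resource_changes": [
        {"address": "module.net.google_compute_network.vpc",
         "change": {"actions": ["update"]}},
        {"address": "module.gpu.google_container_node_pool.pool",
         "change": {"actions": ["delete", "create"]}},
        {"address": "module.gpu.google_service_account.nodes",
         "change": {"actions": ["create"]}},
    ]
}

print(destructive_changes(sample))
# ['module.gpu.google_container_node_pool.pool']
```

In a pipeline, `gate` would run on the parsed plan for protected environments, with `allowed` populated only through an explicit, reviewed override.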

Prompt: “State lock is stuck during an incident rollback.”

Strong answer:

  1. Confirm no apply is still running in CI, local machines, or an automation controller.
  2. Inspect backend lock metadata and timestamps.
  3. Decide if the incident risk justifies force-unlock.
  4. Force-unlock only with incident commander approval and audit note.
  5. Immediately refresh and inspect state before any apply.
  6. Capture prevention: job timeout cleanup, clearer lock ownership, and runbook for backend-specific locks.
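
Steps 1 to 4 amount to a precondition check that can be written down so it is applied consistently under pressure. The sketch below is one possible encoding. The function name, its inputs, and the 15-minute minimum lock age are all assumptions chosen for illustration, and real lock metadata varies by backend.

```python
from datetime import datetime, timedelta, timezone


def may_force_unlock(lock_created, now, active_runs, approved_by,
                     min_age=timedelta(minutes=15)):
    """Return (allowed, reasons); any blocking reason means do not unlock."""
    reasons = []
    if active_runs:
        reasons.append(f"active runs still present: {sorted(active_runs)}")
    if now - lock_created < min_age:
        reasons.append("lock is younger than the minimum age")
    if not approved_by:
        reasons.append("no incident commander approval recorded")
    return (not reasons, reasons)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
created = now - timedelta(minutes=40)

print(may_force_unlock(created, now, active_runs=set(), approved_by="incident-commander"))
# (True, [])
```

Returning the reasons rather than a bare boolean gives the audit note something concrete to record.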

Prompt: “A provider upgrade creates a huge no-op-looking plan with default tag and metadata changes.”

Strong answer:

  1. Treat provider upgrades as infrastructure changes, not dependency chores.
  2. Read provider release notes for schema/default changes.
  3. Test in lower environment with production-like resources.
  4. Separate provider upgrade from functional infrastructure changes.
  5. Add plan filters or policy checks for replacement/destruction.
  6. Roll out by environment and keep the previous lockfile/provider version available.
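
The plan filter in step 5 can separate metadata noise from changes that need a human. The sketch below classifies entries from the plan JSON (`terraform show -json`) by comparing top-level keys of `change.before` and `change.after`. The `NOISE_KEYS` set and the category names are assumptions, and the sketch ignores values that are unknown until apply, so treat it as a triage aid rather than a safety proof.

```python
# Assumption: attributes considered metadata-only for triage purposes.
NOISE_KEYS = {"tags", "tags_all", "labels"}


def changed_keys(before, after):
    """Top-level attribute names whose values differ between before and after."""
    before, after = before or {}, after or {}
    return {k for k in set(before) | set(after) if before.get(k) != after.get(k)}


def classify(rc: dict) -> str:
    """Bucket one resource change from the plan JSON."""
    change = rc["change"]
    actions = change["actions"]
    if "delete" in actions:
        return "destructive"
    if actions == ["no-op"]:
        return "no-op"
    if actions == ["update"]:
        diff = changed_keys(change.get("before"), change.get("after"))
        if diff and diff <= NOISE_KEYS:
            return "metadata-only"
    return "needs-review"


tag_only = {"change": {"actions": ["update"],
                       "before": {"tags": {}, "size": 3},
                       "after": {"tags": {"team": "ml"}, "size": 3}}}
resize = {"change": {"actions": ["update"],
                     "before": {"size": 3}, "after": {"size": 4}}}
replace = {"change": {"actions": ["delete", "create"],
                      "before": {"size": 3}, "after": {"size": 3}}}

print([classify(rc) for rc in (tag_only, resize, replace)])
# ['metadata-only', 'needs-review', 'destructive']
```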

Example: automated GPU node quarantine and repair.

  1. Detect unhealthy condition through trusted signals: DCGM, kubelet, pod failures, kernel logs.
  2. Correlate signals to avoid false positives.
  3. Cordon node and mark reason.
  4. Respect PDBs and workload priority before drain.
  5. Drain with timeout and exception handling.
  6. Run repair action: restart runtime, reload driver if safe, reboot, or move to hardware diagnostics.
  7. Validate node: driver, device plugin, DCGM, test pod, workload canary.
  8. Uncordon only after validation.
  9. Emit event, metrics, audit log, and incident annotation.
stateDiagram-v2
  [*] --> Detect
  Detect --> Diagnose
  Diagnose --> Plan
  Plan --> GuardrailCheck
  GuardrailCheck --> Execute: approved
  GuardrailCheck --> HumanReview: risky
  Execute --> Verify
  Verify --> Record
  HumanReview --> Execute: approved
  HumanReview --> Record: rejected
  Record --> [*]
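
The state diagram can be encoded as an explicit transition table so that illegal jumps, such as executing without a guardrail check, fail loudly. This is a minimal sketch. The `approved`, `risky`, and `rejected` events come from the diagram, while the `done` event for unlabeled edges and the `run` helper are assumptions.

```python
TRANSITIONS = {
    ("Detect", "done"): "Diagnose",
    ("Diagnose", "done"): "Plan",
    ("Plan", "done"): "GuardrailCheck",
    ("GuardrailCheck", "approved"): "Execute",
    ("GuardrailCheck", "risky"): "HumanReview",
    ("HumanReview", "approved"): "Execute",
    ("HumanReview", "rejected"): "Record",
    ("Execute", "done"): "Verify",
    ("Verify", "done"): "Record",
}


def run(events, start="Detect"):
    """Replay events from `start` and return the visited states in order."""
    state = start
    path = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {state} on {event!r}")
        state = TRANSITIONS[key]
        path.append(state)
    return path


print(run(["done", "done", "done", "risky", "rejected"]))
# ['Detect', 'Diagnose', 'Plan', 'GuardrailCheck', 'HumanReview', 'Record']
```
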

Guardrails:

  • Max concurrent nodes per pool.
  • Exclude critical pools unless explicit flag.
  • Require approval for destructive or customer-impacting actions.
  • Precheck SLO and capacity headroom before drain.
  • Kill switch.
  • Reconciliation loop should converge, not thrash.
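
Several of these guardrails reduce to a pure precheck that runs before any drain. The sketch below covers the kill switch, critical-pool exclusion, per-pool concurrency, and capacity headroom. The `Guardrails` fields, their default values, and `check_drain` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical guardrail settings; the defaults are illustrative only."""
    max_concurrent_per_pool: int = 1
    critical_pools: frozenset = frozenset()
    min_spare_nodes: int = 1
    kill_switch: bool = False


def check_drain(pool, in_progress, spare_nodes, rails, allow_critical=False):
    """Return blocking reasons; an empty list means the drain may proceed."""
    reasons = []
    if rails.kill_switch:
        reasons.append("kill switch engaged")
    if pool in rails.critical_pools and not allow_critical:
        reasons.append("critical pool requires explicit flag")
    if in_progress >= rails.max_concurrent_per_pool:
        reasons.append("max concurrent nodes per pool reached")
    if spare_nodes < rails.min_spare_nodes:
        reasons.append("insufficient capacity headroom")
    return reasons


rails = Guardrails(critical_pools=frozenset({"inference-prod"}))
print(check_drain("training-a", in_progress=0, spare_nodes=2, rails=rails))
# []
print(check_drain("inference-prod", in_progress=1, spare_nodes=0, rails=rails))
```

Keeping the check free of side effects makes it easy to unit test and to reuse in dry-run mode.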

Pipeline stages:

  1. Static validation: formatting, lint, schema, policy.
  2. Unit tests for modules and scripts.
  3. Plan/diff generation.
  4. Human review for high-risk changes.
  5. Apply to dev/staging/canary pool.
  6. Post-apply health checks.
  7. Progressive promotion.
  8. Automatic rollback or halt on health gate failure.
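
Stages 5 to 8 describe a loop: apply to one stage, check health, and promote only on success. The sketch below halts at the first failing health gate and leaves rollback to the caller. `promote`, the stage names, and the callbacks are illustrative assumptions.

```python
def promote(stages, apply, healthy):
    """Apply stage by stage; stop at the first stage that fails its health gate."""
    completed = []
    for stage in stages:
        apply(stage)
        if not healthy(stage):
            # Halt: later stages are never touched.
            return {"completed": completed, "halted_at": stage}
        completed.append(stage)
    return {"completed": completed, "halted_at": None}


applied = []
result = promote(
    ["dev", "staging", "canary", "prod"],
    apply=applied.append,                     # stand-in for a real apply step
    healthy=lambda stage: stage != "canary",  # simulate a failing canary
)
print(result)   # {'completed': ['dev', 'staging'], 'halted_at': 'canary'}
print(applied)  # ['dev', 'staging', 'canary']
```

The failing canary stops the rollout before production is touched, which is the point of ordering stages by blast radius.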

Prompt: “How do you decide what to automate?”

Answer:

I prioritize workflows that are frequent, high-risk, time-sensitive, or require precision under incident pressure. I start by measuring toil and incident contribution, then build a small workflow with prechecks and observability. If it proves value, I turn it into a supported path with tests, docs, ownership, and adoption metrics.