Automation and IaC
Automation Philosophy
Good automation is not “a script that works once.” It is a controlled operational workflow:
- Idempotent
- Observable
- Authenticated and authorized
- Auditable
- Dry-run capable
- Blast-radius limited
- Rollback-aware
- Tested against real failure cases
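
Several of these properties can be shown in one small function. The sketch below is a minimal illustration, not a production tool: `LabelStore` and `ensure_label` are hypothetical names, and the in-memory dictionary stands in for whatever real system the automation would manage. It shows idempotence (a repeat run does nothing), dry-run support (the default), and an audit trail through logging.

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("automation")


@dataclass
class LabelStore:
    """Hypothetical in-memory stand-in for a real system of record."""
    labels: dict = field(default_factory=dict)


def ensure_label(store: LabelStore, key: str, value: str, dry_run: bool = True) -> str:
    """Converge one label to the desired value.

    Returns "unchanged", "would-change", or "changed".
    """
    current = store.labels.get(key)
    if current == value:
        # Idempotent: a second run with the same input does nothing.
        log.info("label %s already %r, nothing to do", key, value)
        return "unchanged"
    if dry_run:
        # Dry run: describe the change without performing it.
        log.info("DRY RUN: would set %s from %r to %r", key, current, value)
        return "would-change"
    store.labels[key] = value
    # Auditable: record the before and after values.
    log.info("set %s from %r to %r", key, current, value)
    return "changed"
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: the caller has to opt in to mutation, which keeps an accidental invocation harmless.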
Automation Maturity Ladder
| Level | Description |
|---|---|
| Manual runbook | Human executes commands. Useful for discovery, risky at scale. |
| Script | Repeatable but often lacks safety, state, and audit. |
| Tool | Has inputs, validation, logs, errors, dry run, and tests. |
| Workflow | Orchestrates multiple steps with gates, retries, and approvals. |
| Platform primitive | Owned, documented, observable, integrated into normal operations. |
flowchart TD
Manual[Manual runbook] --> Script[Scripted task]
Script --> Idempotent[Idempotent tool]
Idempotent --> Workflow[Workflow with gates]
Workflow --> Primitive[Platform primitive]
Primitive --> Learning[Incident learning and prevention]
IaC Interview Points
Discuss:
- State management and locking.
- Drift detection.
- Immutable artifacts and version pinning.
- Module boundaries.
- Environment promotion.
- Secret handling.
- Policy-as-code.
- Rollback and import strategy for existing resources.
- Testing: lint, plan review, integration, post-apply validation.
Terraform/OpenTofu State Quirks
State is not bookkeeping trivia. It is the authority Terraform/OpenTofu uses to map configuration addresses to real infrastructure IDs. Senior interviewers will expect you to know where state lies, where it protects you, and where it can betray you.
| Quirk | Why it matters | Senior response |
|---|---|---|
| Stale state | A plan can be based on outdated observations if refresh is skipped, blocked, or racing external changes. | Refresh intentionally, inspect provider/API health, and avoid approving high-risk plans from stale CI. |
| Lock stuck after failed apply | The backend may still hold a lock after a crashed job. | Confirm no active run, inspect backend metadata, then force-unlock only with audit and owner approval. |
| Sensitive values in state | Providers often store generated passwords, tokens, connection strings, or private endpoints in state. | Encrypt backend, restrict access, avoid broad CI logs/artifacts, and rotate if state exposure occurs. |
| `count` index churn | Removing an item from the middle of a counted list can shift addresses and propose destructive replacements. | Prefer `for_each` with stable keys for long-lived resources. |
| Provider schema change | Upgrading providers can reinterpret defaults or computed fields and create noisy or dangerous plans. | Pin providers, review changelogs, test upgrade plans in lower environments, and upgrade deliberately. |
| Import mismatch | Imported resources can have config that does not fully match live defaults, causing immediate drift. | Import, then reconcile config until plan is empty or intentionally documented. |
| Moved resource addresses | Refactors without `moved` blocks look like destroy/create even when infrastructure should be preserved. | Use `moved` blocks for code refactors and review address changes as production-risk changes. |
| `ignore_changes` abuse | It can hide real drift and make Terraform stop owning fields people assume it owns. | Use narrowly, document owner of ignored fields, and alert on ignored-field drift if it affects safety. |
| Data source instability | A data source result can change between plan and apply, especially for “latest” images or broad tag selectors. | Pin immutable IDs for production or fail the plan if selected artifacts are not approved. |
| Backend mis-scope | Accidentally pointing prod config at dev state or vice versa can create catastrophic plans. | Isolate credentials/backends, print workspace/backend in CI, and require environment-specific review. |
| Partial apply | Some resources may be changed before a later API call fails. | Treat failed apply as an incident: refresh state, inspect real resources, and converge deliberately. |
| Manual state surgery | `state rm`, `state mv`, and direct state edits can sever ownership or corrupt dependency assumptions. | Prefer declarative `moved` and `import` flows; require peer review, backup, and rollback plan for state. |
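
The `count` index churn row is easier to reason about with a toy model. The sketch below is a simplified simulation of how addresses are assigned, not Terraform's planner: the function names and the `node_pool` resource name are hypothetical. It compares which addresses are disturbed when one item is removed from the middle of a list.

```python
def count_addresses(resource: str, items: list[str]) -> dict[str, str]:
    """Map address -> item for a count-indexed resource."""
    return {f"{resource}[{i}]": item for i, item in enumerate(items)}


def for_each_addresses(resource: str, items: list[str]) -> dict[str, str]:
    """Map address -> item for a for_each resource keyed by item name."""
    return {f'{resource}["{item}"]': item for item in items}


def changed_addresses(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Addresses that are removed, added, or now point at a different item."""
    touched = set(before) | set(after)
    return sorted(a for a in touched if before.get(a) != after.get(a))


pools_before = ["a", "b", "c"]
pools_after = ["a", "c"]  # "b" removed from the middle

# With count, index 1 now holds "c" and index 2 disappears.
count_churn = changed_addresses(
    count_addresses("node_pool", pools_before),
    count_addresses("node_pool", pools_after),
)

# With for_each, only the removed key is affected.
for_each_churn = changed_addresses(
    for_each_addresses("node_pool", pools_before),
    for_each_addresses("node_pool", pools_after),
)
```

In this model the `count` variant disturbs two addresses (`node_pool[1]` and `node_pool[2]`) while the `for_each` variant disturbs only `node_pool["b"]`, which is why stable keys are the safer default for long-lived resources.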
flowchart TD
Change[HCL change] --> Init[Init backend and providers]
Init --> Refresh[Refresh observed state]
Refresh --> Plan[Plan graph diff]
Plan --> Review{Plan matches intent}
Review -- no --> Fix[Fix config, import, moved block, or drift]
Review -- yes --> Lock[Acquire state lock]
Lock --> Apply[Apply actions]
Apply --> Persist[Persist new state]
Persist --> Validate[Post-apply validation]
Validate --> Done[Release lock and record evidence]
State Incident Drills
Prompt: “A Terraform plan wants to replace the production GPU node pool after a harmless module refactor.”
Strong answer:
- Stop the pipeline before apply.
- Identify whether resource addresses changed because of module path, `count` index, `for_each` key, provider alias, or workspace/backend mismatch.
- Compare current state addresses with intended config addresses.
- Use `moved` blocks or `terraform state mv` only after review and backup.
- Re-plan until the diff expresses only intended changes.
- Add a CI gate that fails plans with unexpected destroy/replace actions in protected environments.
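
The CI gate in the last step could look roughly like the sketch below. It assumes the plan has been rendered to JSON (for example with `terraform show -json` on a saved plan file) and reads only the `resource_changes` entries and their `change.actions` lists. The function names, the `allowed` allowlist, and the sample addresses are hypothetical.

```python
import json


def destructive_addresses(plan: dict) -> list[str]:
    """Return addresses whose planned actions include a delete.

    In the plan JSON, a replacement is reported as both "delete" and
    "create" on the same resource, so checking for "delete" covers
    both destroy and replace.
    """
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(change.get("address", "<unknown>"))
    return flagged


def gate(plan_json: str, allowed: frozenset = frozenset()) -> int:
    """Return a process exit code: 0 to continue, 1 to block the apply.

    `allowed` is a hypothetical allowlist of addresses that reviewers
    have explicitly approved for destruction.
    """
    found = destructive_addresses(json.loads(plan_json))
    unexpected = [a for a in found if a not in allowed]
    for address in unexpected:
        print(f"BLOCKED: unexpected destroy/replace of {address}")
    return 1 if unexpected else 0


# Hypothetical plan excerpt, trimmed to the fields the gate reads.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "module.gpu.aws_eks_node_group.pool",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["update"]}},
        {"address": "aws_iam_role.old",
         "change": {"actions": ["delete"]}},
    ]
})
```

A pipeline step would pass the returned code to `sys.exit`, so an unapproved destroy or replace stops the job before apply. Keeping the allowlist explicit means an intentional replacement still requires a reviewed change to the gate's input.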
Prompt: “State lock is stuck during an incident rollback.”
Strong answer:
- Confirm no apply is still running in CI, local machines, or an automation controller.
- Inspect backend lock metadata and timestamps.
- Decide if the incident risk justifies force-unlock.
- Force-unlock only with incident commander approval and audit note.
- Immediately refresh and inspect state before any apply.
- Capture prevention: job timeout cleanup, clearer lock ownership, and runbook for backend-specific locks.
Prompt: “A provider upgrade creates a huge no-op-looking plan with default tag and metadata changes.”
Strong answer:
- Treat provider upgrades as infrastructure changes, not dependency chores.
- Read provider release notes for schema/default changes.
- Test in lower environment with production-like resources.
- Separate provider upgrade from functional infrastructure changes.
- Add plan filters or policy checks for replacement/destruction.
- Roll out by environment and keep the previous lockfile/provider version available.
Safe Repair Workflow
Example: automated GPU node quarantine and repair.
- Detect unhealthy condition through trusted signals: DCGM, kubelet, pod failures, kernel logs.
- Correlate signals to avoid false positives.
- Cordon node and mark reason.
- Respect PDBs and workload priority before drain.
- Drain with timeout and exception handling.
- Run repair action: restart runtime, reload driver if safe, reboot, or move to hardware diagnostics.
- Validate node: driver, device plugin, DCGM, test pod, workload canary.
- Uncordon only after validation.
- Emit event, metrics, audit log, and incident annotation.
stateDiagram-v2
[*] --> Detect
Detect --> Diagnose
Diagnose --> Plan
Plan --> GuardrailCheck
GuardrailCheck --> Execute: approved
GuardrailCheck --> HumanReview: risky
Execute --> Verify
Verify --> Record
HumanReview --> Execute: approved
HumanReview --> Record: rejected
Record --> [*]
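
The state diagram above can be sketched as a transition table. This is a minimal illustration under stated assumptions: the state names come from the diagram, while the `TRANSITIONS` layout, the `run` function, and the `decisions` argument are hypothetical. A real controller would attach actions, timeouts, and audit records to each state.

```python
from typing import Optional

# Transition table mirroring the diagram. A key of None marks an
# unconditional transition; string keys are the labelled branches.
TRANSITIONS: dict[str, dict[Optional[str], str]] = {
    "Detect": {None: "Diagnose"},
    "Diagnose": {None: "Plan"},
    "Plan": {None: "GuardrailCheck"},
    "GuardrailCheck": {"approved": "Execute", "risky": "HumanReview"},
    "HumanReview": {"approved": "Execute", "rejected": "Record"},
    "Execute": {None: "Verify"},
    "Verify": {None: "Record"},
}


def run(decisions: dict[str, str]) -> list[str]:
    """Walk from Detect to Record and return the visited states.

    `decisions` maps each branching state to the label chosen there.
    A missing or unknown label raises instead of guessing a path.
    """
    state = "Detect"
    path = [state]
    while state != "Record":
        options = TRANSITIONS[state]
        if None in options:
            state = options[None]
        else:
            label = decisions.get(state)
            if label not in options:
                raise ValueError(f"no valid decision for {state}: {label!r}")
            state = options[label]
        path.append(state)
    return path
```

Every path ends in `Record`, including the rejected one, so the workflow leaves evidence whether or not a repair was executed.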
Guardrails
- Max concurrent nodes per pool.
- Exclude critical pools unless explicit flag.
- Require approval for destructive or customer-impacting actions.
- Precheck SLO and capacity headroom before drain.
- Kill switch.
- Reconciliation loop should converge, not thrash.
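
Several of these guardrails reduce to a precheck that runs before any drain. The sketch below is one possible shape, assuming a hypothetical `PoolStatus` record and a `may_drain` function. The field names, the default limit of one node, and the headroom rule of at least one spare node are illustrative assumptions, not values from this document.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    """Hypothetical snapshot of a node pool at decision time."""
    name: str
    critical: bool
    nodes_in_repair: int
    spare_capacity_nodes: int


def may_drain(
    pool: PoolStatus,
    *,
    max_concurrent: int = 1,
    allow_critical: bool = False,
    kill_switch: bool = False,
) -> tuple[bool, str]:
    """Return (allowed, reason). Checks run from broadest to narrowest."""
    if kill_switch:
        return False, "kill switch engaged"
    if pool.critical and not allow_critical:
        return False, "critical pool requires explicit flag"
    if pool.nodes_in_repair >= max_concurrent:
        return False, "max concurrent repairs reached"
    if pool.spare_capacity_nodes < 1:
        return False, "no capacity headroom"
    return True, "ok"
```

Returning a reason alongside the decision keeps the refusal auditable: the same string can go into the event, the metric label, and the incident annotation.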
CI/CD For Infrastructure
Pipeline stages:
- Static validation: formatting, lint, schema, policy.
- Unit tests for modules and scripts.
- Plan/diff generation.
- Human review for high-risk change.
- Apply to dev/staging/canary pool.
- Post-apply health checks.
- Progressive promotion.
- Automatic rollback or halt on health gate failure.
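
The last three stages (apply, health check, promote or halt) can be sketched as a loop. This is a minimal illustration: `promote` and its `apply` and `healthy` callbacks are hypothetical, standing in for whatever the pipeline actually runs. It only halts on failure; rolling back would be an additional step.

```python
from typing import Callable, Optional


def promote(
    stages: list[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Apply to each stage in order and stop at the first failed health gate.

    Returns (stages that passed, stage that failed or None).
    """
    passed: list[str] = []
    for stage in stages:
        apply(stage)
        if not healthy(stage):
            # Halt: later stages never receive the change.
            return passed, stage
        passed.append(stage)
    return passed, None
```

For example, with stages `dev`, `staging`, `canary`, and `prod` and a health check that fails on `canary`, the change is applied to the first three and `prod` is never touched.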
Senior Answer Frame
Prompt: “How do you decide what to automate?”
Answer:
I prioritize workflows that are frequent, high-risk, time-sensitive, or require precision under incident pressure. I start by measuring toil and incident contribution, then build a small workflow with prechecks and observability. If it proves value, I turn it into a supported path with tests, docs, ownership, and adoption metrics.