Security and Leadership Q&A

201. Why is RBAC least privilege hard for automation?

Automation often spans resources and namespaces. The trick is granting the exact verbs and scopes needed while separating read, plan, execute, and break-glass roles.

202. What is wrong with giving repair automation cluster-admin?

It hides permission design, expands blast radius, weakens auditability, and lets one bug mutate unrelated resources.

203. How do you scope node-repair RBAC?

Allow read broadly where needed, allow writes only for approved labels, taints, and events, allow drain/cordon only on target pools, and keep destructive actions behind separate approval.
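
As a minimal sketch of that split, the roles can be modeled as plain Python dicts shaped like ClusterRole manifests. The role names and helper functions are hypothetical. RBAC rules match on verbs and resources, not on node labels or on which fields a patch touches, so limiting writes to approved labels/taints and to target pools is assumed to be enforced elsewhere (admission policy or checks inside the automation).

```python
# Sketch only: role names and helpers are illustrative, not a real project's API.
CORE = ""  # the core API group is the empty string

def rule(resources, verbs, api_group=CORE):
    return {"apiGroups": [api_group], "resources": list(resources), "verbs": list(verbs)}

def cluster_role(name, rules):
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": rules,
    }

# Read role: broad visibility, no mutation.
read_role = cluster_role("node-repair-read", [
    rule(["nodes", "pods", "events"], ["get", "list", "watch"]),
])

# Execute role: only the writes that repair needs. No delete on nodes.
execute_role = cluster_role("node-repair-execute", [
    rule(["nodes"], ["get", "patch"]),      # cordon, labels, and taints are node patches
    rule(["pods/eviction"], ["create"]),    # drain evicts pods through this subresource
    rule(["events"], ["create", "patch"]),
])

def allows(role, verb, resource):
    """True if any rule in the role grants the verb on the resource."""
    return any(
        (verb in r["verbs"] or "*" in r["verbs"])
        and (resource in r["resources"] or "*" in r["resources"])
        for r in role["rules"]
    )

def has_wildcard(role):
    """True if any rule uses '*' for verbs, resources, or API groups."""
    return any(
        "*" in r["verbs"] or "*" in r["resources"] or "*" in r["apiGroups"]
        for r in role["rules"]
    )
```

A check like has_wildcard could run in CI so that a convenience wildcard never reaches the cluster unnoticed.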

204. What is the Kubernetes security risk of privileged GPU pods?

Privileged pods can access host capabilities. GPU stack components may need that access, but application workloads generally should not inherit that trust.

205. How do you isolate privileged system components?

Dedicated namespaces, strict RBAC, admission exceptions by service account, signed images, node selectors, audit logs, and policy that blocks app teams from copying the pattern.

206. What replaced PodSecurityPolicy?

Pod Security Standards and Pod Security Admission, plus policy engines such as Kyverno or OPA Gatekeeper for richer controls.
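
Pod Security Admission is configured through namespace labels under the pod-security.kubernetes.io prefix, with one label per mode (enforce, audit, warn). The sketch below builds those labels in Python; the helper names and the example namespace are hypothetical.

```python
PSA_PREFIX = "pod-security.kubernetes.io"
LEVELS = ("privileged", "baseline", "restricted")  # the Pod Security Standards levels

def psa_labels(enforce, audit=None, warn=None):
    """Build Pod Security Admission labels for a namespace."""
    labels = {}
    for mode, level in (("enforce", enforce), ("audit", audit), ("warn", warn)):
        if level is None:
            continue
        if level not in LEVELS:
            raise ValueError(f"unknown Pod Security level: {level!r}")
        labels[f"{PSA_PREFIX}/{mode}"] = level
    return labels

def namespace_manifest(name, labels):
    """Return a Namespace manifest as a dict (name is illustrative)."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": labels},
    }

# Enforce baseline now, and surface restricted-level violations without blocking.
ns = namespace_manifest(
    "team-inference",
    psa_labels("baseline", audit="restricted", warn="restricted"),
)
```

Enforcing one level while auditing a stricter one is a common way to preview the impact of tightening before it blocks anything.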

207. When would you use Kyverno over OPA Gatekeeper?

Kyverno is Kubernetes-native and approachable for validation/mutation/generation. Gatekeeper is strong when teams are comfortable with Rego and want broader policy expressiveness.

208. What is the risk of mutation policies?

They can hide what was actually deployed, create surprising diffs, and conflict with GitOps unless both the rendered and the live state are visible.

209. How do you roll out a strict security policy?

Audit mode, measure violations, fix common legitimate cases, communicate, enforce by namespace/risk tier, and maintain an emergency exception workflow.
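
One way to make the audit-to-enforce step explicit is a small promotion gate. This is a sketch under assumptions: the NamespaceStatus fields, the tier numbering (1 = lowest risk, promoted first), and the zero-violation threshold are all hypothetical choices, not a standard.

```python
from dataclasses import dataclass

@dataclass
class NamespaceStatus:
    name: str
    risk_tier: int          # hypothetical scale: 1 = lowest risk
    open_violations: int    # violations observed while in audit mode
    mode: str = "audit"

def promote(namespaces, max_tier, exception_workflow_ready):
    """Return names of namespaces that may move from audit to enforce.

    Nothing is promoted until an emergency exception workflow exists.
    """
    if not exception_workflow_ready:
        return []
    candidates = sorted(namespaces, key=lambda ns: (ns.risk_tier, ns.name))
    return [
        ns.name
        for ns in candidates
        if ns.mode == "audit"
        and ns.risk_tier <= max_tier
        and ns.open_violations == 0
    ]
```

Raising max_tier one step at a time gives the rollout a natural pause point after each wave.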

210. What should never be stored in Git?

Plaintext secrets, long-lived credentials, private keys, unredacted state files, or sensitive model/customer data.

211. How do GitOps teams manage secrets?

External Secrets Operator, SOPS-encrypted manifests, Sealed Secrets, cloud secret managers, KMS integration, and strict decryption permissions.

212. What is the supply-chain risk in Kubernetes deploys?

Running untrusted or mutable artifacts. Use signed images, pinned digests, SBOMs, vulnerability scanning, provenance, and trusted registries.

213. Why pin images by digest?

Tags can move. Digests provide immutable identity for reproducibility, audit, canary comparison, and rollback.
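
A simple pre-deploy check can flag images that are not pinned. The sketch below assumes sha256 digests and a pod spec represented as a dict; the function names and the example registry are hypothetical.

```python
import re

# An image reference is treated as pinned if it ends with a sha256 digest.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_pinned(image):
    """True if the image reference ends in @sha256:<64 hex chars>."""
    return bool(DIGEST_RE.search(image))

def unpinned_images(pod_spec):
    """Return image references in a pod spec that rely on a mutable tag."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [c["image"] for c in containers if not is_pinned(c["image"])]

example_spec = {
    "containers": [
        {"name": "server", "image": "registry.example.com/server@sha256:" + "a" * 64},
        {"name": "sidecar", "image": "registry.example.com/sidecar:v1.2.3"},
    ],
    "initContainers": [
        {"name": "init", "image": "registry.example.com/init:latest"},
    ],
}
```

A reference that carries both a tag and a digest still counts as pinned here, since the digest is what identifies the content.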

214. What is the security issue with debug containers?

They can expose sensitive runtime state or host capabilities. Access must be RBAC-controlled, audited, and time-bound.

215. How do you secure admission for GPU workloads?

Require approved runtime images, a defined resource envelope, allowed GPU pools, non-privileged app containers, signed artifacts, owner labels, and valid health checks.
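
The checks above can be sketched as a validation function of the kind an admission webhook or CI gate might call. Everything configurable here is an assumption for illustration: the registry prefix, the pool names, the example.com/gpu-pool label key, and the GPU cap. Signature verification is left out because it needs an external verifier.

```python
ALLOWED_REGISTRIES = ("registry.example.com/",)          # hypothetical
ALLOWED_GPU_POOLS = {"inference-a100", "inference-l4"}   # hypothetical
POOL_LABEL = "example.com/gpu-pool"                      # hypothetical label key
MAX_GPUS_PER_CONTAINER = 8                               # hypothetical envelope

def validate_gpu_pod(pod):
    """Return a list of policy violations for a pod manifest dict."""
    errors = []
    meta = pod.get("metadata", {})
    spec = pod.get("spec", {})

    if "owner" not in meta.get("labels", {}):
        errors.append("missing owner label")

    pool = spec.get("nodeSelector", {}).get(POOL_LABEL)
    if pool not in ALLOWED_GPU_POOLS:
        errors.append(f"GPU pool not allowed: {pool}")

    for c in spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        if not c.get("image", "").startswith(ALLOWED_REGISTRIES):
            errors.append(f"{name}: image not from an approved registry")
        if c.get("securityContext", {}).get("privileged", False):
            errors.append(f"{name}: privileged containers are not allowed")
        limits = c.get("resources", {}).get("limits", {})
        if int(limits.get("nvidia.com/gpu", 0)) > MAX_GPUS_PER_CONTAINER:
            errors.append(f"{name}: GPU limit exceeds the envelope")
        if "readinessProbe" not in c:
            errors.append(f"{name}: missing readiness probe")
    return errors
```

Returning every violation at once, rather than failing on the first, tends to make rejections easier for workload owners to fix in one pass.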

216. What is a tenant isolation question for GPU sharing?

Whether tenants trust each other enough for time-slicing/MPS or need stronger isolation through MIG, dedicated GPUs, separate pools, or separate clusters.

217. What is the staff-level answer to "one cluster or many clusters"?

It depends on failure domains, team autonomy, compliance, control-plane scale, network topology, cost, and operational maturity. Avoid one cluster becoming one blast radius.

218. What is a platform paved road?

A supported default path with good security, delivery, observability, and reliability built in, while still allowing controlled exceptions.

219. Why do platform teams fail?

They optimize for control instead of adoption, build abstractions that hide too much, ignore migration cost, or do not measure developer/operator outcomes.

220. How do you earn adoption for a platform capability?

Solve a painful workflow, prove value with metrics, provide migration help, preserve escape hatches, and make the better path easier.

221. What is a good platform SLO?

One tied to user/team outcomes: deploy success, time to provision, incident MTTR, workflow latency, or availability of critical control-plane services.
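
To make an outcome SLO such as deploy success concrete, it helps to compute the indicator and the remaining error budget. The sketch below is illustrative only; the function names and the example numbers are assumptions.

```python
def success_ratio(total, failed):
    """Service level indicator: fraction of events that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - failed) / total

def error_budget_remaining(total, failed, slo_target):
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is a ratio such as 0.99 for a 99% deploy-success SLO.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        return 1.0 if failed == 0 else float("-inf")
    return 1 - failed / allowed_failures

# Example: 1000 deploys in the window, 4 failed, 99% target.
sli = success_ratio(1000, 4)
remaining = error_budget_remaining(1000, 4, 0.99)
```

With these inputs about 10 failures are allowed in the window, so 4 failures leave roughly 60% of the budget.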

222. Why is "number of platform features shipped" a weak metric?

It measures output, not adoption, reliability, toil reduction, or business impact.

223. How do you influence without authority?

Use shared goals, data, prototypes, written tradeoffs, small wins, and decision records that make alignment easier.

224. How do you handle a principal engineer disagreeing with your design?

Clarify constraints, identify the decision owner, write tradeoffs, test disputed assumptions, and avoid making it personal.

225. What is a good design doc decision section?

Options considered, criteria, chosen path, rejected alternatives, risks, mitigation, rollout plan, and revisit triggers.

226. What is a reversible decision?

A decision that can be changed cheaply. Treat it with lighter process than one-way doors such as the data model, a security boundary, or fleet architecture.

227. How do you communicate risk to executives?

Translate technical risk into customer impact, probability, blast radius, mitigation, timeline, and decision needed.

228. How do you communicate risk to engineers?

Give concrete failure modes, evidence, tradeoffs, operational constraints, and the exact invariant to preserve.

229. What is a senior/staff anti-pattern?

Doing all the hard work personally instead of creating systems, standards, and people leverage.

230. How do you mentor during incidents?

Assign scoped roles, explain hypothesis testing, keep safety boundaries clear, and debrief after recovery without slowing mitigation.

231. What is a good answer to "tell me about failure"?

Own the mistake, describe impact, explain mitigation, show what changed permanently, and avoid blaming other teams.

232. What is a bad answer to "conflict"?

One where you won by escalation or charisma without demonstrating empathy, evidence, tradeoff analysis, or durable alignment.

233. How do you show staff-level judgment in behavioral answers?

Talk about ambiguity, constraints, tradeoffs, organizational alignment, measurable outcomes, and systems that outlasted your direct involvement.

234. What is the trap in saying "I am hands-on"?

It can sound like you only execute. Pair it with leverage: you debug deeply and turn learnings into tools, contracts, and standards.

235. How do you answer "why NVIDIA" without sounding generic?

Tie accelerated computing, AI inference at scale, GPU-aware infrastructure, and the hardware/software boundary to your strengths in production systems.

236. What should you ask the hiring manager?

Ask which failure classes, manual workflows, or scaling constraints the team most wants this role to solve in the first six months.

237. What is a strong answer to "what would you do first"?

Map architecture, incidents, toil, SLOs, and ownership; shadow operations; ship one low-risk improvement; then remove a repeat failure mode.

238. How do you avoid proposing a rewrite?

Start with evidence, stabilize the current system, remove high-value pain, then use data to justify deeper architectural change.

239. When is a rewrite justified?

When incremental change cannot meet clear reliability, security, scale, or velocity requirements, and the risk of migrating is lower than the risk of continuing to operate the current system.

240. What is the hardest part of platform migration?

Running old and new paths safely, preserving rollback, avoiding feature gaps, migrating tenants, and keeping teams productive.

241. What is a good migration strategy?

Segment users, define compatibility contracts, build adapters, migrate low-risk workloads first, measure, and keep rollback until confidence is real.

242. How do you design ownership boundaries?

Define APIs, SLOs, escalation, dashboards, runbooks, lifecycle responsibilities, and what each team is explicitly not responsible for.

243. What is operational excellence?

Systems that are understandable, observable, recoverable, secure, cost-aware, and continuously improved from incidents and user feedback.

244. What is the difference between reliability and availability?

Availability is the service being usable when it is needed. Reliability is broader: consistent, correct behavior over time, including latency, correctness, recovery, and operations.
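
The distinction can be shown with a small calculation over request records. This is a simplified sketch: the Request fields and the latency threshold are hypothetical, and it covers only the latency and correctness parts of reliability, not recovery or operations.

```python
from dataclasses import dataclass

@dataclass
class Request:
    served: bool        # the service answered at all
    correct: bool       # the answer was right
    latency_ms: float

def availability(requests):
    """Fraction of requests that were served."""
    if not requests:
        raise ValueError("no requests")
    return sum(r.served for r in requests) / len(requests)

def reliability(requests, latency_slo_ms):
    """Fraction of requests served correctly and within the latency target."""
    if not requests:
        raise ValueError("no requests")
    good = sum(
        r.served and r.correct and r.latency_ms <= latency_slo_ms
        for r in requests
    )
    return good / len(requests)

sample = [
    Request(served=True, correct=True, latency_ms=100),
    Request(served=True, correct=True, latency_ms=900),   # too slow
    Request(served=True, correct=False, latency_ms=100),  # wrong answer
    Request(served=False, correct=False, latency_ms=0),   # not served
]
```

For this sample, three of four requests were served, but only one was served correctly within a 500 ms target, so a service can look available while still being unreliable.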

245. What is resilience?

The system’s ability to absorb failures, degrade gracefully, recover quickly, and prevent local failures from becoming systemic.

246. What is a good staff-level tradeoff answer?

State the goal, list constraints, compare options, identify what you would measure, choose a path, and name the conditions that would change your mind.

247. What makes an interviewer think "this person is senior"?

They clarify, scope, reason by layers, discuss tradeoffs, protect users, use evidence, and turn fixes into durable systems.

248. What makes an interviewer think "this person is staff-level"?

They define contracts, influence teams, reduce ambiguity, design operating models, and create leverage beyond their own code.

249. What is the biggest interview risk for this role?

Sounding tool-familiar but not operationally deep. You need to show how tools behave under failure, scale, and organizational pressure.

250. What is the closing message for the whole interview?

You can operate and improve AI inference infrastructure across Kubernetes, GPU systems, automation, observability, and incident response while leading others through ambiguity.