Engineering IC interview prep.

A senior infra / devtools SWE IC ships products other engineers depend on - CI/CD, kv stores, schedulers, observability, feature flags, dev platforms - often partly open-source, always at the dependency layer of every customer's stack.

What interviewers look for

  • Can the candidate design infra at scale: API + data + state machine + consistency model + degradation behavior - not just sketch boxes?
  • Are they disciplined about API design: REST vs gRPC choice, request shape, idempotency, versioning, breaking-change policy, backward compatibility sustained over multiple quarters?
  • Do they understand the dependency-layer responsibility: their SLO is the customer's SLO, blast radius cascades across customers, the postmortem is published?
  • Can they handle production at the infra layer: read signals, isolate cause, mitigate (rollback, throttle, kill switch) before the cascade widens, then root-cause?
  • Do they show OSS / community fluency where relevant: maintainer dynamics, contribution discipline, the project-vs-product line for OSS-led businesses?
  • Are they DX-aware: error messages, docs, SDKs, onboarding - infra is a product whose users are engineers, and bad DX shows up as support tickets + churn?

Behavioural questions to expect

  1. Walk me through your CV.

    What it tests: Story coherence + genuine fit for the devtools / infra IC seat. Teams want evidence of building things other engineers depend on, API + system discipline, production ownership at scale - not just shipped features.

  2. Tell me about your most impactful technical project.

    What it tests: Depth + ownership + the willingness to defend a technical choice. Tests whether the candidate frames problem → approach (with tradeoffs) → result (quantified) → lessons, rather than just describing a system.

  3. Tell me about a weakness, a failure, or feedback you've received and worked on.

    What it tests: Self-awareness + production discipline. A canonical question across roles. Infra mistakes (shipped a breaking API change, missed a backward-compat gap, owned a P0 that cascaded across customers) are formative for teams, so expect to explain what changed afterwards.

  4. Why devtools + infrastructure - and why this firm vs consumer / SaaS / hyperscaler?

    What it tests: Authentic fit for the dependency-layer + API-disciplined + sometimes-OSS seat: shipping infrastructure other engineers depend on, the discipline of versioning + backward compat, the customer who is an engineer. Tests whether the candidate WANTS this vs the more user-visible alternatives.

  5. Which team or technical area would you want to work on, and why?

    What it tests: Genuine fit + grasp of how devtools / infra areas differ. Tests whether the candidate has a reasoned preference (compute / orchestration / data / observability / DX / API platform / a specific OSS project) rather than 'wherever'.

  6. Why this firm?

    What it tests: Whether the candidate has done the homework. Bar: firm-specific evidence from the product, eng culture, OSS posture, customers, scale, people - not generic 'great tech'.

  7. How would you describe this firm's engineering organisation + infrastructure in your own words?

    What it tests: Whether the candidate has internalized HOW the firm builds infra - org shape, stack, OSS posture, customer-SLO contract - not just that it 'has engineers'. Tests whether they've read the eng blog + RFCs + OSS repo.

  8. How does engineering actually drive value at a devtools / infra firm?

    What it tests: Whether the candidate understands devtools / infra business economics: developer adoption (often via OSS), customer retention + expansion via reliable + well-designed infra, support volume + churn as the bad-DX signal. The engineering levers are velocity, reliability, and community.

Technical concepts to master

Distributed-systems primitives (infra scale)

Consensus (Raft / Paxos)
Distributed consensus protocols (Raft is the modern default; Paxos is the academic ancestor) for replicated state machines (configuration, metadata, leader election).
Sharding + multi-tenant isolation
Shard by tenant (customer) for fairness + blast-radius limitation; per-tenant limits to prevent noisy-neighbour.
Replication + consistency
Sync replication for strong consistency (control plane, metadata); async for throughput (data plane, telemetry).
Graceful degradation
Design infra to degrade gracefully under partial failure - fail open vs fail closed depending on use case; cached-fallback values; backpressure.
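
The graceful-degradation entry above (cached-fallback values, fail open vs fail closed) can be sketched roughly as follows. This is a minimal illustration, not a production pattern: the `DegradingReader` class, its parameters, and the `UpstreamUnavailable` error are hypothetical names introduced here, and the staleness limit is an assumed policy.

```python
import time
from typing import Any, Callable, Dict, Tuple


class UpstreamUnavailable(Exception):
    """Raised when the upstream fails and no acceptable fallback exists."""


class DegradingReader:
    """Hypothetical reader that degrades to cached values on upstream failure."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        max_stale_s: float,
        fail_open: bool,
        default: Any,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale_s = max_stale_s
        self._fail_open = fail_open
        self._default = default
        self._clock = clock
        # key -> (last known good value, time it was stored)
        self._cache: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        try:
            value = self._fetch(key)
        except Exception as exc:
            cached = self._cache.get(key)
            # First choice: a cached value that is not too stale.
            if cached is not None and now - cached[1] <= self._max_stale_s:
                return cached[0]
            # Fail open: keep serving with a safe default.
            if self._fail_open:
                return self._default
            # Fail closed: surface the failure to the caller.
            raise UpstreamUnavailable(key) from exc
        self._cache[key] = (value, now)
        return value
```

Whether to fail open or closed depends on the use case: a feature-flag read might reasonably fail open to a default, while an authorization check would usually fail closed.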

API design + backward compat + deprecation

API design principles
Predictable + consistent (verbs, naming, error model); idempotent for writes; backward-compatible by default; clear deprecation path; documented + SDK'd.
Versioning strategies
URI (/v1/, /v2/), header (X-API-Version), or media-type negotiation; SDKs use semver (major.minor.patch), where a major bump signals a breaking change.
Backward compatibility discipline
Within a version: never remove a field, never change a type, never make optional → required, never break a semantic; additive changes only (new optional fields, new endpoints).
Deprecation workflow
Announce → mark deprecated in code + docs + response headers → publish migration guide + tooling → individual customer outreach → telemetry on usage → sunset on schedule.
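
The backward-compatibility rules above (never remove a field, never change a type, never make optional → required; additive changes only) lend themselves to an automated check. The sketch below assumes a deliberately simplified schema model: the `Field` dataclass and `breaking_changes` function are hypothetical, and treating a new required field as breaking assumes the schema describes a request body.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Simplified field description: a type name and a required flag."""
    type: str
    required: bool = False


Schema = Dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """Return the breaking differences between two versions of a schema."""
    problems: List[str] = []
    for name, old_field in old.items():
        new_field = new.get(name)
        if new_field is None:
            problems.append(f"removed field: {name}")
            continue
        if new_field.type != old_field.type:
            problems.append(
                f"type changed: {name} ({old_field.type} -> {new_field.type})"
            )
        if new_field.required and not old_field.required:
            problems.append(f"optional became required: {name}")
    # New fields are additive only if they are optional.
    for name, new_field in new.items():
        if name not in old and new_field.required:
            problems.append(f"new required field: {name}")
    return problems


v1 = {"id": Field("string", required=True), "ttl": Field("int")}
v1_next = {
    "id": Field("string", required=True),
    "ttl": Field("string"),   # type change: breaking
    "region": Field("string"),  # new optional field: additive
}
print(breaking_changes(v1, v1_next))
```

A check like this could run in CI against the previous released schema; it does not catch semantic breaks, which still need review.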

Open source + community + governance

OSS business models
Open-core (free OSS + paid enterprise features); OSS + paid support; source-available (BSL / SSPL); managed cloud service on top of OSS.
Maintainer responsibilities
Issue / PR triage, RFC reviews, release management, security disclosures, governance for major decisions; an engineering responsibility with publish-the-postmortem accountability.
Contribution + community discipline
Contribution guidelines + code of conduct + RFC process + review cadence + recognition; designed to balance throughput with quality + community health.
OSS license + governance
License choice (permissive, copyleft, business-source) shapes adoption + monetization; governance (founder-led, foundation-governed, BDFL) shapes community trust.

Reliability + SLO + the dependency-layer responsibility

SLO + customer contract
Customer-facing SLO (e.g. 99.9% / 99.99% uptime, P99 latency budget) baked into commercial terms; error budget = 1 - SLO (e.g. a 99.9% SLO leaves 0.1%, roughly 43 minutes per 30 days); breaches trigger remediation + credits.
Blast radius + multi-tenant isolation
How much of the customer base is affected when a service fails; minimized by cell-based architecture (per-cell blast radius), per-tenant rate limits, region isolation, deploys per cell.
Postmortem + customer communication
Blameless postmortems within 5-10 days; customer-facing summary (acknowledge, root cause, prevent) on status page + email; major incidents trigger an exec or eng-lead apology + remediation credit.
Change management + canary discipline
Per-cell + per-region canary; staged rollout (1% → 10% → 50% → 100%) with automated rollback on guardrail breach; feature flags + kill switches for fast disable.
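
The staged rollout with automated rollback described above can be sketched as a small control loop. This is an illustration under assumptions: `staged_rollout`, `set_traffic_percent`, and `guardrail_ok` are hypothetical names, and a real system would also wait for a bake period at each stage before checking guardrails.

```python
from typing import Callable, Sequence

# Stage percentages taken from the text: 1% -> 10% -> 50% -> 100%.
STAGES: Sequence[int] = (1, 10, 50, 100)


def staged_rollout(
    set_traffic_percent: Callable[[int], None],
    guardrail_ok: Callable[[int], bool],
    stages: Sequence[int] = STAGES,
) -> bool:
    """Advance through the stages; roll back to 0% on a guardrail breach.

    Returns True if the rollout reached the final stage, False if it
    was rolled back.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not guardrail_ok(percent):
            # Automated rollback: stop sending traffic to the new version.
            set_traffic_percent(0)
            return False
    return True


# Usage with stand-in callbacks: the guardrail breaches at the 50% stage.
history = []
completed = staged_rollout(history.append, lambda percent: percent < 50)
print(completed, history)
```

Running the same loop per cell or per region keeps a bad release confined to one cell's blast radius.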

Practical drills

  • Design capacity for a feature-flag service that serves 1000 customer apps, each handling 10K QPS average / 30K peak. The flag service is called once per request on average. P99 budget 5ms. (a) Total flag-eval QPS. (b) Why local caching is essential. (c) Rough infra need with + without local caching.
  • Design a CI/CD orchestration service for this firm. Customers connect their git repos + define pipelines; the service runs them. Walk me through.
  • Your kv-store service is alerting on a P99 latency spike (50ms → 500ms) across all customers in one region. Walk me through the next 60 minutes.
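
For the feature-flag capacity drill, the arithmetic can be worked through in a few lines. The customer and QPS figures come from the drill itself; the per-node capacity, SDK instance count, and polling interval are assumptions invented for this sketch and should be replaced with whatever the interviewer allows.

```python
import math

# Given in the drill.
APPS = 1_000
AVG_QPS_PER_APP = 10_000
PEAK_QPS_PER_APP = 30_000
EVALS_PER_REQUEST = 1

# Assumptions (not from the drill): adjust to taste.
NODE_CAPACITY_QPS = 50_000    # flag evals one server node can sustain
SDK_INSTANCES_PER_APP = 100   # app processes holding a local flag cache
POLL_INTERVAL_S = 30          # how often each cache refreshes its rules


def flag_eval_qps(qps_per_app: int) -> int:
    """(a) Total flag evaluations per second across all customer apps."""
    return APPS * qps_per_app * EVALS_PER_REQUEST


def nodes_needed(qps: float) -> int:
    """Server nodes required for a given load, at least one."""
    return max(1, math.ceil(qps / NODE_CAPACITY_QPS))


def polling_qps() -> float:
    """Server load when evals are local and only cache refreshes hit the service."""
    return APPS * SDK_INSTANCES_PER_APP / POLL_INTERVAL_S


avg_qps = flag_eval_qps(AVG_QPS_PER_APP)
peak_qps = flag_eval_qps(PEAK_QPS_PER_APP)
print(f"(a) avg {avg_qps:,} QPS, peak {peak_qps:,} QPS")
print(f"(c) without caching: {nodes_needed(peak_qps)} nodes at peak")
print(f"(c) with caching: {polling_qps():,.0f} QPS, {nodes_needed(polling_qps())} node(s)")
```

For part (b), the usual argument is that a remote call on every request competes with the 5 ms P99 budget and makes the flag service a hard dependency of every customer request, while local evaluation keeps the hot path in-process and turns the service into a background refresh target.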

Smart-question anchors

  • Team + scope - the team's surface area, what the role would specifically own in 6-12 months
  • API + backward-compat posture - the firm's API design + deprecation discipline, breaking-change policy
  • Open source - OSS-led vs open-core vs proprietary; maintainer responsibilities + governance
  • On-call + reliability - SLO + error-budget approach, customer-published postmortems, recent incidents
  • RFC + design culture - how big architectural decisions are made, who writes + reviews RFCs
