Engineering IC interview prep.

An ML engineer at a frontier AI lab or platform firm builds + runs the training infrastructure, inference serving, data pipelines + MLOps that make research output actually run + scale.

What interviewers look for

  • Can the candidate design distributed training at scale - DP / TP / PP / FSDP, parameter sharding, optimizer state, gradient communication - not just say 'we'd use Megatron-equivalent'?
  • Can they design inference serving for low-latency: batching strategies, KV cache, speculative decoding, quantization, multi-replica deployment, autoscaling?
  • Do they understand GPU performance: utilization (MFU + HFU), kernel efficiency, memory hierarchy, communication overhead - can they debug it?
  • Are they production-ML disciplined: deployment patterns, monitoring + drift, A/B testing models, retraining cadence, rollback plans?
  • Can they partner with research scientists: implement an idea correctly, push back when infra reality conflicts with research preference, deliver something that ships?
  • Do they show senior behavioral signals: technical influence, ownership of large infra, ability to navigate the research / eng / product interface?

Behavioural questions to expect

  1. Walk me through your CV.

    What it tests: Story coherence + ML systems engineering scope. Teams want evidence of ML-specific systems work (training, inference, data) + the engineering bar that frontier labs demand - not a pure SWE moving over without ML depth.

  2. Tell me about your most impactful ML systems project.

    What it tests: Depth + ownership + ML systems judgment. Tests whether the candidate frames problem → approach (with parallelism / inference / data tradeoffs) → result (quantified) → lessons.

  3. Tell me about a weakness, a failure, or feedback you've received and worked on.

    What it tests: Self-awareness + ML production discipline. Cross-role canonical.

  4. Why ML engineering - and why this firm vs generic SWE or research?

    What it tests: Authentic fit for the systems-meets-research + scale + production + GPU-aware seat. Tests whether the candidate WANTS the ML-specific challenges (parallelism, GPU, training stability, drift) - not just 'I want to work on AI'.

  5. Which team or area would you want to work on, and why?

    What it tests: Genuine fit + grasp of how ML eng areas differ (training infra / inference / data / MLOps / GPU-specific). Tests whether the candidate has a reasoned preference.

  6. Why this firm?

    What it tests: Whether the candidate has done the homework on the firm's ML eng + research stack.

  7. How would you describe this firm's ML engineering organisation in your own words?

    What it tests: Whether the candidate has internalized HOW the firm builds + runs ML systems.

  8. How does ML engineering actually create value at an AI platform firm?

    What it tests: Whether the candidate understands ML eng economics: training infra scale + cost dominates compute spend; inference cost + latency drive unit economics + customer experience; production reliability protects brand + customer; research-engineering velocity drives capability shipping rate.

Technical concepts to master

Distributed training + parallelism

Parallelism strategy selection
Pick based on memory budget + comms cost: DP if the model + optimizer state fit on one GPU; FSDP if not; add TP for large layers; add PP for many layers; combine at scale.
Communication + topology
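All-reduce for gradient sync (DP); all-gather + reduce-scatter for FSDP; point-to-point activation transfers between stages for PP. Intra-node links are high-bandwidth (NVLink) + inter-node links are lower-bandwidth (InfiniBand), so the most communication-heavy parallelism (typically TP) is kept within a node.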
Training stability + recovery
The failure rate at scale is non-zero, so checkpointing, restart + hardware-failure detection are essential. Loss spikes + numerical instability require monitoring + intervention.
MFU + HFU
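Model FLOPS Utilization = FLOPS the model mathematically requires (at the observed throughput) / peak hardware FLOPS; Hardware FLOPS Utilization counts all FLOPS actually executed, including recomputation such as activation checkpointing, so HFU ≥ MFU. Frontier targets ~50-60% MFU.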
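
The MFU + HFU definitions above can be sketched as a back-of-envelope calculation. This is a minimal sketch, not a profiler: it assumes the common rule of thumb of ~6 FLOPs per parameter per token for training (forward + backward, attention FLOPs ignored), and ~2 extra FLOPs per parameter per token when the forward pass is fully recomputed. The function names, the 7B model size, the throughput figure + the 989e12 peak (an assumed dense BF16 figure for an H100) are illustrative assumptions, not values from this guide.

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPS Utilization using the ~6 * N FLOPs-per-token rule of thumb."""
    model_flops_per_sec = 6.0 * n_params * tokens_per_sec
    return model_flops_per_sec / (n_gpus * peak_flops_per_gpu)


def hfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float, recompute_fraction: float = 0.0) -> float:
    """Hardware FLOPS Utilization: also counts recomputed forward FLOPs.

    recompute_fraction = 1.0 means the whole forward pass is recomputed
    (full activation checkpointing), adding ~2 * N FLOPs per token.
    """
    executed_flops_per_sec = (6.0 + 2.0 * recompute_fraction) * n_params * tokens_per_sec
    return executed_flops_per_sec / (n_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical run: 7B-parameter model, 64 GPUs, 450K tokens/s observed.
    PEAK = 989e12  # assumed dense BF16 peak per GPU; check the vendor datasheet
    print(f"MFU: {mfu(7e9, 450_000, 64, PEAK):.1%}")
    print(f"HFU (full recompute): {hfu(7e9, 450_000, 64, PEAK, 1.0):.1%}")
```

With these assumed inputs MFU comes out at roughly 30%. The gap between HFU + MFU shows how much of the hardware's work is recomputation rather than useful model math.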

Inference serving + low-latency

Continuous batching
Dynamic batching that adds + removes requests at the token level (not request level); handles variable-length input + output natively.
KV cache
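Cache key + value tensors from attention layers across decode steps; trades memory for compute - each autoregressive decode step reuses the cached keys + values instead of recomputing them for all earlier tokens. Cache size grows with batch size × sequence length + often becomes the limit on batch size.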
Quantization + speculative decoding
Quantization (FP8 / INT8 / INT4 weights + activations) trades quality for throughput + memory. Speculative decoding (smaller draft model proposes tokens, large model verifies them in one pass) reduces per-token latency when the draft's acceptance rate is high.
Multi-replica + autoscaling
Replicate inference servers for QPS scale + multi-region for latency; autoscale based on queue depth + utilization.
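
To make the KV cache's footprint concrete, a sizing sketch helps when reasoning about batch size + replica count. It assumes a standard transformer decoder that stores one key + one value vector per layer, per KV head, per token. The function name + the example configuration (32 layers, 8 KV heads, head dimension 128, FP16) are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys + values for every token of every sequence."""
    tensors_per_layer = 2  # one key tensor + one value tensor
    return (tensors_per_layer * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical config: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
    per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
    total = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)
    print(f"per token: {per_token / 2**10:.0f} KiB")
    print(f"batch of 16 at 8192 tokens: {total / 2**30:.0f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so 16 concurrent sequences of 8192 tokens need 16 GiB before weights + activations are counted. This is why fewer KV heads or a quantized cache raise the achievable batch size.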

Data pipelines + feature stores + training-serving consistency

Data pipeline at training scale
Parallel data loading + preprocessing + shuffling at GPU throughput; sharded data format (Parquet / WebDataset / custom) + distributed sampling.
Feature store
Centralised store of computed features with offline (training) + online (serving) consistency; ensures training + serving see the same feature semantics.
Data drift + quality
Monitor input data distribution + statistics over time; drift detection triggers retraining; data quality checks (schema, null, range) at ingestion.
Labeling + active learning
Human labeling pipeline + quality control; active learning prioritises uncertain examples for labeling; synthetic data + human-in-loop refinement.
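
The ingestion-time quality checks named above (schema, null, range) can be sketched with the standard library alone. This is an illustrative sketch, not a production validator: `FieldSpec`, `validate_record` + the example schema are hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """Expected shape of one field (hypothetical helper for illustration)."""
    name: str
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_record(record: Dict[str, Any], schema: List[FieldSpec]) -> List[str]:
    """Return a list of human-readable violations; empty means the record passes."""
    errors: List[str] = []
    for spec in schema:
        if spec.name not in record:                      # schema check
            errors.append(f"{spec.name}: missing field")
            continue
        value = record[spec.name]
        if value is None:                                # null check
            if not spec.nullable:
                errors.append(f"{spec.name}: null not allowed")
            continue
        if not isinstance(value, spec.dtype):            # type check
            errors.append(f"{spec.name}: expected {spec.dtype.__name__}, "
                          f"got {type(value).__name__}")
            continue
        if spec.min_value is not None and value < spec.min_value:   # range checks
            errors.append(f"{spec.name}: below min {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            errors.append(f"{spec.name}: above max {spec.max_value}")
    return errors


# Example schema with made-up fields.
EXAMPLE_SCHEMA = [
    FieldSpec("user_id", int),
    FieldSpec("age", int, min_value=0, max_value=130),
    FieldSpec("country", str, nullable=True),
]

if __name__ == "__main__":
    print(validate_record({"user_id": 1, "age": 30, "country": None}, EXAMPLE_SCHEMA))
    print(validate_record({"user_id": None, "age": 200}, EXAMPLE_SCHEMA))
```

Returning the list of violations, rather than raising on the first one, lets the pipeline count failures per field + decide whether to quarantine a record or a whole batch.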

MLOps + deployment + monitoring + drift

Deployment patterns - canary, shadow, A/B
Canary: small % of traffic to new model. Shadow: new model receives traffic but doesn't affect users (compare offline). A/B: traffic split + measure user outcomes.
Model monitoring
Track latency + throughput + cost (system metrics) AND prediction quality + distribution drift + business metric (ML metrics).
Drift detection + retraining
Statistical tests for input + output distribution shift; threshold triggers retraining (or alert for human review); retraining cadence balances cost vs freshness.
Rollback + safe degradation
Fast rollback to previous model version if new version causes quality regression or system issue; safe degradation to simpler model if main is unavailable.
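
One way to turn "statistical tests for distribution shift + a threshold" into code is the Population Stability Index (PSI) over a single numeric feature. PSI is swapped in here as one common choice; the guide does not prescribe a specific test. The function names, the fixed bin edges + the 0.2 trigger (a widely used rule of thumb, not a universal constant) are assumptions for illustration.

```python
import bisect
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values per bin; out-of-range values go to the edge bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    total = len(values)
    # Floor at eps so an empty bin does not produce log(0) or division by zero.
    return [max(c / total, eps) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population Stability Index between training-time + serving-time samples."""
    expected = bin_fractions(reference, edges)
    actual = bin_fractions(current, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def should_trigger_retraining(psi_value: float, threshold: float = 0.2) -> bool:
    """Assumed policy: alert or retrain once PSI exceeds the threshold."""
    return psi_value > threshold


if __name__ == "__main__":
    edges = [0.0, 0.25, 0.5, 0.75, 1.0]
    reference = [i / 100 for i in range(100)]        # spread over [0, 1)
    shifted = [0.5 + i / 200 for i in range(100)]    # concentrated in [0.5, 1)
    print(should_trigger_retraining(psi(reference, reference, edges)))  # no drift
    print(should_trigger_retraining(psi(reference, shifted, edges)))    # drift
```

In practice the trigger would usually raise an alert for human review before any automatic retraining, matching the "or alert for human review" option above.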

Practical drills

  • Design distributed training for a 100B-parameter transformer LLM on 1024 GPUs. Walk me through.
  • Design low-latency LLM inference serving for this firm's API at 100K QPS with P99 < 100ms TTFT + <50ms per-token.
  • Training is at 30% MFU on 64 H100s. Walk me through how you'd diagnose + improve.
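
For the first drill, a back-of-envelope memory estimate is a reasonable way to open, because it justifies the parallelism choice before any topology discussion. The sketch below assumes mixed-precision Adam at roughly 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights + two Adam moments) + ignores activations, buffers + fragmentation. The function names, the 80 GB GPU capacity + the 50% headroom are hypothetical.

```python
def model_state_gb(n_params: float, shard_degree: int = 1,
                   bytes_per_param: float = 16.0) -> float:
    """Per-GPU weights + gradients + optimizer state in GB (1 GB = 1e9 bytes).

    shard_degree = 1 models plain data parallelism (full replica per GPU);
    shard_degree = N models fully sharding the state across N GPUs.
    """
    return n_params * bytes_per_param / shard_degree / 1e9


def fits_on_gpu(state_gb: float, gpu_memory_gb: float = 80.0,
                headroom: float = 0.5) -> bool:
    """Assumed policy: reserve half of GPU memory for activations + buffers."""
    return state_gb <= gpu_memory_gb * headroom


if __name__ == "__main__":
    n = 100e9  # 100B parameters, as in the drill
    for degree in (1, 8, 64, 1024):
        gb = model_state_gb(n, degree)
        print(f"shard degree {degree:>4}: {gb:8.2f} GB per GPU, fits={fits_on_gpu(gb)}")
```

Under these assumptions an unsharded replica needs about 1,600 GB, so plain DP is ruled out immediately, while sharding across all 1024 GPUs leaves about 1.6 GB of model state per GPU. The interesting part of the answer is then the middle ground: how to combine sharding, TP + PP so that communication stays on fast links.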

Smart-question anchors

  • Team + scope - team's surface area, what the role would own in 6-12 months
  • Stack + scale - training cluster size, framework, inference scale, hardware investment
  • Research-engineering collaboration - how research becomes product, RFC + design review culture
  • On-call + reliability - training-run reliability, inference SLO, postmortem culture
  • Cost + efficiency - GPU utilization targets, FinOps maturity, recent efficiency programs
