Engineering IC interview prep.
An ML engineer at a frontier AI lab or platform firm builds + runs the training infrastructure + the inference serving + the data pipelines + the MLOps that make research output actually run + scale.
What interviewers look for
- Can the candidate design distributed training at scale - DP / TP / PP / FSDP, parameter sharding, optimizer state, gradient communication - not just say 'we'd use Megatron-equivalent'?
- Can they design inference serving for low-latency: batching strategies, KV cache, speculative decoding, quantization, multi-replica deployment, autoscaling?
- Do they understand GPU performance: utilization (MFU + HFU), kernel efficiency, memory hierarchy, communication overhead - can they debug it?
- Are they production-ML disciplined: deployment patterns, monitoring + drift, A/B testing models, retraining cadence, rollback plans?
- Can they partner with research scientists: implement an idea correctly, push back when infra reality conflicts with research preference, deliver something that ships?
- Do they show senior behavioral signals: technical influence, ownership of large infra, ability to navigate the research / eng / product interface?
Behavioural questions to expect
Walk me through your CV.
What it tests: Story coherence + ML systems engineering scope. Teams want evidence of ML-specific systems work (training, inference, data) + the engineering bar that frontier labs demand - not a pure SWE background with no ML depth.
Tell me about your most impactful ML systems project.
What it tests: Depth + ownership + ML systems judgment. Tests whether the candidate frames problem → approach (with parallelism / inference / data tradeoffs) → result (quantified) → lessons.
Tell me about a weakness, a failure, or feedback you've received and worked on.
What it tests: Self-awareness + ML production discipline. Cross-role canonical.
Why ML engineering - and why this firm vs generic SWE or research?
What it tests: Authentic fit for the systems-meets-research + scale + production + GPU-aware seat. Tests whether the candidate WANTS the ML-specific challenges (parallelism, GPU, training stability, drift) - not just 'I want to work on AI'.
Which team or area would you want to work on, and why?
What it tests: Genuine fit + grasp of how ML eng areas differ (training infra / inference / data / MLOps / GPU-specific). Tests whether the candidate has a reasoned preference.
Why this firm?
What it tests: Whether the candidate has done the homework on the firm's ML eng + research stack.
How would you describe this firm's ML engineering organisation in your own words?
What it tests: Whether the candidate has internalized HOW the firm builds + runs ML systems.
How does ML engineering actually create value at an AI platform firm?
What it tests: Whether the candidate understands ML eng economics: training infra scale + cost dominates compute spend; inference cost + latency drive unit economics + customer experience; production reliability protects brand + customer; research-engineering velocity drives capability shipping rate.
Technical concepts to master
Distributed training + parallelism
- Parallelism strategy selection
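- Pick based on memory budget + comms cost: DP if weights + gradients + optimizer state + activations fit on one GPU; FSDP if they don't; add TP when individual layers are too large; add PP when the layer stack is too deep; combine all of them at scale.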
- Communication + topology
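- All-reduce for gradient sync (DP); all-gather + reduce-scatter for FSDP; point-to-point transfer of activations between stages for PP. Intra-node links are high-bandwidth (NVLink) + inter-node links are lower-bandwidth (InfiniBand), so the most communication-heavy parallelism (typically TP) is usually kept inside a node.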
- Training stability + recovery
- At scale, hardware + software failures are routine rather than exceptional; checkpointing, automatic restart, and hardware-failure detection are essential. Loss spikes + numerical instability require monitoring + intervention.
- MFU + HFU
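- Model FLOPS Utilization = FLOPS the model mathematically requires per second / peak hardware FLOPS; Hardware FLOPS Utilization also counts FLOPS the hardware actually executes beyond that (e.g. activation recomputation), so HFU >= MFU. Well-tuned large runs are often quoted at roughly 40-60% MFU, depending on hardware + model.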
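The memory-budget reasoning behind parallelism selection can be sketched with a back-of-envelope estimate. This is a minimal sketch, not a sizing tool: it assumes mixed-precision Adam at 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights + momentum + variance) and ignores activations, buffers, and fragmentation. The function name and the `shard_ways` parameter are hypothetical.

```python
def training_state_gib(params: float, shard_ways: int = 1) -> float:
    """Approximate per-GPU GiB for weights + gradients + Adam state.

    Assumption: mixed-precision Adam, 16 bytes per parameter in total.
    Activations, communication buffers, and fragmentation are NOT included.
    """
    bytes_per_param = 2 + 2 + 12  # bf16 weights, bf16 grads, fp32 master/m/v
    return params * bytes_per_param / shard_ways / 2**30


# A 100B-parameter model, unsharded vs fully sharded across 1024 GPUs.
unsharded = training_state_gib(100e9)
sharded = training_state_gib(100e9, shard_ways=1024)
print(f"unsharded: {unsharded:.0f} GiB, sharded 1024 ways: {sharded:.2f} GiB")
```

Under these assumptions the unsharded state is far beyond any single GPU, which is the argument for sharding; in practice activations usually dominate what is left of the budget, so treat the sharded figure as a lower bound.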
Inference serving + low-latency
- Continuous batching
- Dynamic batching that adds + removes requests at the token level (not request level); handles variable-length input + output natively.
- KV cache
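- Cache key + value tensors from attention layers across decode steps; trades GPU memory for compute, since each autoregressive decode step reuses the cached prefix instead of recomputing it. Cache size grows with batch size + sequence length and often limits how many requests fit in a batch.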
- Quantization + speculative decoding
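- Quantization (FP8 / INT8 / INT4 weights + activations) trades some quality for lower memory use + higher throughput. Speculative decoding (smaller draft model proposes tokens, large model verifies them in one pass) reduces per-token latency, at the price of extra compute when drafts are rejected.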
- Multi-replica + autoscaling
- Replicate inference servers for QPS scale + multi-region for latency; autoscale based on queue depth + utilization.
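The KV cache footprint is worth being able to compute on a whiteboard. The sketch below uses the standard per-token accounting (keys + values, per layer, per KV head); the model shape in the example is a hypothetical configuration chosen for illustration, not any specific model, and the 2-byte element size assumes an fp16 / bf16 cache.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `batch` sequences of `seq_len` tokens."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_sequence = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=1)
print(f"{per_sequence / 2**30:.2f} GiB per 4096-token sequence")
```

Multiplying the per-sequence figure by the target concurrent batch shows quickly whether a latency or throughput target is memory-bound, and why fewer KV heads or a quantized cache are common levers.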
Data pipelines + feature stores + training-serving consistency
- Data pipeline at training scale
- Parallel data loading + preprocessing + shuffling at GPU throughput; sharded data format (Parquet / WebDataset / custom) + distributed sampling.
- Feature store
- Centralised store of computed features with offline (training) + online (serving) consistency; ensures training + serving see the same feature semantics.
- Data drift + quality
- Monitor input data distribution + statistics over time; drift detection triggers retraining; data quality checks (schema, null, range) at ingestion.
- Labeling + active learning
- Human labeling pipeline + quality control; active learning prioritises uncertain examples for labeling; synthetic data + human-in-loop refinement.
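Drift monitoring on an input feature can be illustrated with the Population Stability Index, one of several statistics used for this. This is a minimal sketch: the bin edges, the `eps` floor for empty bins, and the 0.25 alert threshold (a commonly quoted rule of thumb) are assumptions to be tuned per feature, and the function name is hypothetical.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges."""
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_fractions(reference), bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


edges = [0.25, 0.5, 0.75]
reference = [i / 100 for i in range(100)]      # training-time distribution
shifted = [0.5 + i / 200 for i in range(100)]  # serving data pushed upward
ALERT_THRESHOLD = 0.25                         # assumption: rule-of-thumb cut-off
print("drift alert:", psi(reference, shifted, edges) > ALERT_THRESHOLD)
```

Fixing the bin edges from the reference sample keeps the statistic comparable across days; in an interview it is worth noting that an alert should usually trigger investigation first, not blind retraining.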
MLOps + deployment + monitoring + drift
- Deployment patterns - canary, shadow, A/B
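- Canary: small % of live traffic goes to the new model, with rollback if metrics regress. Shadow: new model receives a copy of live traffic, but its outputs are only logged + compared offline, never returned to users. A/B: traffic is split between versions + user outcomes are measured.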
- Model monitoring
- Track latency + throughput + cost (system metrics) AND prediction quality + distribution drift + business metric (ML metrics).
- Drift detection + retraining
- Statistical tests for input + output distribution shift; threshold triggers retraining (or alert for human review); retraining cadence balances cost vs freshness.
- Rollback + safe degradation
- Fast rollback to previous model version if new version causes quality regression or system issue; safe degradation to simpler model if main is unavailable.
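A canary split with fast rollback can be sketched as deterministic hash-based routing. This is an illustrative sketch only: the `route` function, the "stable" / "canary" labels, and the 0.01% bucket granularity are assumptions, and a real deployment would typically do this in the load balancer or serving gateway.

```python
import hashlib


def route(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request (or user) id to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0.01% granularity
    return "canary" if bucket < canary_percent * 100 else "stable"


# The same id always lands on the same side, so a user sees one model version.
assignments = [route(f"user-{i}", canary_percent=5.0) for i in range(10_000)]
print("canary share:", assignments.count("canary") / len(assignments))

# Rollback is a config change: set the canary share to 0 and all traffic is stable.
assert route("user-42", canary_percent=0.0) == "stable"
```

Hashing a stable id rather than sampling randomly per request keeps the experience consistent for each user and makes regressions reproducible for a given id.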
Practical drills
- Design distributed training for a 100B-parameter transformer LLM on 1024 GPUs. Walk me through.
- Design low-latency LLM inference serving for this firm's API at 100K QPS with P99 < 100ms TTFT + <50ms per-token.
- Training is at 30% MFU on 64 H100s. Walk me through how you'd diagnose + improve.
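For the MFU drill, it helps to show how the number is computed before diagnosing it. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward + backward pass (attention FLOPs ignored). The 7B model size, the throughput figure, and the 989 TFLOPS per-GPU peak (an assumed dense bf16 figure for an H100 SXM; check the datasheet for the actual part) are illustrative assumptions.

```python
def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPS Utilization using the ~6 * params FLOPs-per-token approximation."""
    model_flops_per_sec = 6 * params * tokens_per_sec  # forward + backward, dense
    return model_flops_per_sec / (num_gpus * peak_flops_per_gpu)


# Hypothetical run: 7B-parameter model, 64 GPUs, 450k tokens/s measured end to end.
utilization = mfu(params=7e9, tokens_per_sec=450_000, num_gpus=64,
                  peak_flops_per_gpu=989e12)
print(f"MFU ~ {utilization:.1%}")
```

Because throughput is the only measured input, the diagnosis is about where step time goes: data loading stalls, exposed communication, small or unfused kernels, pipeline bubbles, or recomputation that inflates HFU without raising MFU.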
Smart-question anchors
- Team + scope - team's surface area, what the role would own in 6-12 months
- Stack + scale - training cluster size, framework, inference scale, hardware investment
- Research-engineering collaboration - how research becomes product, RFC + design review culture
- On-call + reliability - training-run reliability, inference SLO, postmortem culture
- Cost + efficiency - GPU utilization targets, FinOps maturity, recent efficiency programs
Sourced from
- interviewing.io + Hello Interview + IGotAnOffer, system design canon
- ML systems literature (distributed training + parallelism)
- MLOps + production ML literature (Google ML system design / Continuous Delivery for ML)
- Inference serving + GPU optimization literature (NVIDIA tech blogs + practitioner content)
- Tech Interview Handbook + Eng Leadership Newsletter, behavioral
- Frontier-lab ML engineering blogs (OpenAI / Anthropic / DeepMind / Meta AI engineering content)