Coding Rounds
The coding interview for an AI engineer is the round people most consistently mis-prepare. They drill LeetCode and get blindsided by "build a RAG pipeline in 45 minutes," or they live in notebooks all year and freeze on an LRU cache. The format is bimodal: nearly every loop pairs an AI-specific exercise — retrieval, an agent loop, attention from memory — with a classic data-structures round that is alive and well even at the frontier labs. Prepare for exactly one half and you will fail the other.
The AI-specific half is more predictable than candidates expect. The same dozen exercises recur across companies, and the single most common one — build a small RAG pipeline end to end — can be rehearsed to muscle memory. The DSA half has drifted toward "build a small system" variants (in-memory databases, versioned key-value stores) rather than pure algorithm puzzles. Both halves reward writing clean, working code under a clock more than cleverness.
The AI-specific exercises
A small set of exercises recurs across loops. They cluster into three families — retrieval and agents (applied roles), transformer primitives (research-leaning roles), and production plumbing (the criteria you are graded on even when they are not the named problem). Learn the family that matches your target role first.
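For the transformer-primitives family, the exercise named earlier as "attention from memory" usually means single-head scaled dot-product attention. The following is a minimal NumPy sketch under that assumption; the function names, the `causal` flag, and the array shapes are illustrative choices, not a format any particular company is known to require.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    # Assumed shapes: Q (T_q, d), K (T_k, d), V (T_k, d_v).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (T_q, T_k) scaled similarities
    if causal:
        # Position i may only attend to positions <= i.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights
```

Returning the weights alongside the output makes it easy to check the two properties an interviewer is likely to ask about: rows that sum to one, and a causal mask that zeroes every future position.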
Two exercises, worked
These two sketches are the centerpiece — the exercises most likely to decide an applied loop. Both are short enough to reproduce live and idiomatic enough to defend line by line.
1. A ReAct agent loop from scratch
The interviewer wants to see that the control flow lives in your code, not in the model's good intentions. That means a tool registry, a parse-dispatch-observe cycle, an explicit final-answer stop condition, and a hard step cap so a confused model cannot loop forever.
import json

# Tool registry: name -> callable. The dispatcher only runs what is registered.
TOOLS = {
    "search": lambda q: f"results for {q}",
    # Empty __builtins__ narrows what eval can reach; it is not a complete sandbox.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

SYSTEM = (
    "Solve the task. Each turn, emit JSON: "
    '{"tool": <name>, "args": <string>} to call a tool, '
    'or {"final": <answer>} when done.'
)

def run_agent(task, max_steps=6):
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):  # step cap: the loop is bounded by code
        raw = llm(messages)  # model call (stubbed)
        messages.append({"role": "assistant", "content": raw})
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:  # malformed output is an observation, not a crash
            messages.append({"role": "user", "content": "Invalid JSON. Re-emit."})
            continue
        if not isinstance(action, dict):  # valid JSON, but not an object we can dispatch
            messages.append({"role": "user", "content": "Expected a JSON object. Re-emit."})
            continue
        if "final" in action:  # stop condition: the model declares it is done
            return action["final"]
        tool, args = action.get("tool"), action.get("args")
        if tool not in TOOLS:  # guard the dispatcher against unknown tools
            obs = f"Unknown tool: {tool}"
        else:
            try:
                obs = TOOLS[tool](args)
            except Exception as e:  # tool failure feeds back as an observation
                obs = f"Tool error: {e}"
        messages.append({"role": "user", "content": f"Observation: {obs}"})
    return "Stopped: step cap reached."  # the cap is the safety net, not the happy path
What the interviewer is probing
- Do you bound the loop with a step cap and a real stop condition, or does it run until the model happens to quit? See budgets and halting and the agent loop.
- Do tool errors and bad JSON become observations fed back to the model, or do they crash the run? Resilience is the signal.
- Is the dispatcher guarded (unknown tools handled, `eval` sandboxed) rather than trusting model output blindly?
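On the last point: passing an empty `__builtins__` to `eval` restricts what an expression can reach, but it is generally not regarded as a real sandbox. One common alternative, sketched here as an assumption rather than as the expected answer, is to parse the expression with the standard-library `ast` module and evaluate only a whitelist of arithmetic nodes. The name `safe_calc` and the choice of supported operators are illustrative.

```python
import ast
import operator

# Whitelist of arithmetic operators the calculator tool will evaluate.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_calc(expr):
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        # Accept only plain int/float literals (bool is deliberately excluded).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")
    return str(ev(ast.parse(expr, mode="eval")))
```

Because names, attribute access, and calls all raise `ValueError`, a malicious expression becomes a tool error that the agent loop can feed back as an observation. Exponentiation is left out on purpose so that a single expression cannot demand an enormous computation.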
2. A minimal retrieval function
The heart of RAG, isolated: embed the query, score it against a matrix of document embeddings with cosine similarity, return the top-k chunks. Note where a reranker would slot in.
import numpy as np

def retrieve(query, chunks, doc_embeddings, embed_fn, k=3):
    # doc_embeddings: (N, D) matrix, row i is the embedding of chunks[i]
    q = embed_fn(query)  # (D,)
    # Cosine similarity = normalized dot product. Normalize once, then matmul.
    q = q / (np.linalg.norm(q) + 1e-8)
    docs = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = docs @ q  # (N,) similarity per chunk
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    # Rerank slots in here: pull a wider top-k (say 20) above, then re-score those
    # candidates with a cross-encoder and keep the best k before returning.
    return [(chunks[i], float(scores[i])) for i in top]
What the interviewer is probing
- Do you know cosine similarity is a normalized dot product, and can you vectorize it instead of looping over documents?
- Do you guard the divide-by-zero on a zero vector, and return scores so the caller can threshold?
- Do you know where reranking goes — retrieve wide, rerank narrow — and why a bi-encoder top-k feeds a cross-encoder, not the other way around?
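The "retrieve wide, rerank narrow" pattern from the last question can be sketched as a two-stage function. This is a minimal illustration, not a reference implementation: `cross_score_fn` is a hypothetical stand-in for a cross-encoder that scores a (query, chunk) pair, and `wide_k=20` is just the example width mentioned in the comment above.

```python
import numpy as np

def retrieve_then_rerank(query, chunks, doc_embeddings, embed_fn,
                         cross_score_fn, k=3, wide_k=20):
    # Stage 1 (bi-encoder): cheap cosine similarity over every chunk.
    q = embed_fn(query)
    q = q / (np.linalg.norm(q) + 1e-8)
    docs = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = docs @ q
    wide = np.argsort(scores)[::-1][:wide_k]   # candidate indices, best first

    # Stage 2 (cross-encoder): expensive pairwise scoring, only on the candidates.
    reranked = sorted(
        ((int(i), float(cross_score_fn(query, chunks[i]))) for i in wide),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(chunks[i], s) for i, s in reranked[:k]]
```

The ordering of the stages follows from cost: the bi-encoder scores all N chunks with one matrix multiply, while the cross-encoder needs one model call per pair, so it only ever sees the `wide_k` survivors.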
Does classic DSA still appear?
Yes — emphatically, even at AI-first labs. The flavor has shifted: labs favor "build a small system" variants — in-memory databases, versioned key-value stores, crawlers, iterators — over pure algorithm puzzles. But the round is real, and treating an AI title as a LeetCode exemption is a fast reject.
| Company | Algorithm puzzle | “Build a small system” | AI primitive |
|---|---|---|---|
| OpenAI | no | yes | no |
| Anthropic | no | yes | no |
| xAI | yes | no | no |
| Amazon GenAI | yes | no | yes |
| Databricks | yes | yes | no |
| Meta / LinkedIn / Shopify | yes | no | separate round |
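To make the "build a small system" flavor concrete, here is one possible shape for a versioned key-value store. It is a sketch under assumed requirements (every write bumps a global version, and reads can ask for the value as of any version); the class name `VersionedKV` and its method signatures are illustrative, since real prompts vary.

```python
import bisect

class VersionedKV:
    """In-memory key-value store where every write creates a new global version."""

    def __init__(self):
        self._version = 0
        # key -> (versions, values); versions are appended in ascending order
        self._history = {}

    def put(self, key, value):
        self._version += 1
        versions, values = self._history.setdefault(key, ([], []))
        versions.append(self._version)
        values.append(value)
        return self._version

    def get(self, key, version=None):
        if key not in self._history:
            return None
        versions, values = self._history[key]
        if version is None:
            return values[-1]                  # latest write
        # Find the rightmost write whose version is <= the requested version.
        i = bisect.bisect_right(versions, version)
        return values[i - 1] if i else None
```

Keeping each key's versions sorted means a historical read is a binary search rather than a scan, which is the kind of design choice these rounds tend to ask you to justify.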
Take-homes
The take-home is increasingly the highest-weight round: a 2-to-7-day build followed by a defense walkthrough where interviewers push on every decision. The build matters less than whether you can justify it. A few real reported shapes:
- RAG bot over PDFs (with citations). Ingest documents, answer questions, cite sources. The canonical take-home. Graded on retrieval quality, citation correctness, and whether you built an eval.
- Refactor a messy RAG app (preserve behavior). Take a working-but-ugly codebase into clean architecture without changing API behavior. Tests judgment and restraint more than greenfield flair.
- Agent with real tools (DB + docs + bash). Build an agent with database access, doc-search, and a bash tool, where bash requires explicit human approval. Tests tool design and a safety boundary.
- Multi-agent pipeline (staged). A staged flow such as Spec → Story → LLM-Judge → Rewrite. Tests orchestration, hand-offs, and whether each stage is independently checkable.
- LLM routing + caching (100+ req/s). Route across providers, multi-level caching, provider failover at scale. A systems take-home wearing an LLM hat: cost and resilience are the whole point.
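The core of the routing and caching take-home can be sketched in a few lines. This is a deliberately small illustration with a single in-memory cache level and sequential failover; the `Router` class and the provider callables are hypothetical, and a real submission would add timeouts, cache expiry, and concurrency.

```python
import hashlib

class Router:
    def __init__(self, providers):
        # providers: ordered list of (name, callable) pairs; earlier entries are preferred.
        self.providers = providers
        self.cache = {}

    def complete(self, prompt):
        # Cache on a hash of the prompt so repeated requests skip every provider.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        errors = []
        for name, call in self.providers:
            try:
                result = (name, call(prompt))
            except Exception as e:             # failover: record the error, try the next one
                errors.append(f"{name}: {e}")
                continue
            self.cache[key] = result
            return result
        raise RuntimeError("All providers failed: " + "; ".join(errors))
```

Returning the provider name with the response keeps the routing decision observable, which helps in the defense walkthrough when you are asked how you would measure cost and failover rates.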
The AI-assisted coding round
The defining 2025–26 trend — at Meta, Sierra, Cursor, and Notion — is coding with a model in the room, on a multi-file task larger than you could hand-write in the time. The signal inverts: it is no longer "can you write this from memory" but "can you drive, review, and verify faster than the model can mislead you."
When the model can produce volume, the scarce skill is judgment: knowing what to ask for, recognizing when the output is subtly wrong, and refusing to ship code you cannot defend. This is the same muscle as agent verification — you are the verifier in the loop.
- Drive the model on a larger task: decompose it, specify each part, and keep the architecture in your head — not the model’s.
- Review every diff. Candidates who paste model output without catching its mistakes fail; the bug it introduced is the test.
- Verify against intent, not vibes — run it, read it, check the edge cases the model skipped.
Prep checklist
A concrete, do-this list. Work top to bottom; the early rows are the floor, the later rows are role-dependent.
Heading into the final stretch? Compress everything into the cheat sheet for the night before.