Coding Rounds

The coding interview for an AI engineer is the round people most consistently mis-prepare. They drill LeetCode and get blindsided by "build a RAG pipeline in 45 minutes," or they live in notebooks all year and freeze on an LRU cache. The format is bimodal: nearly every loop pairs an AI-specific exercise — retrieval, an agent loop, attention from memory — with a classic data-structures round that is alive and well even at the frontier labs. Prepare for exactly one half and you will fail the other.

The AI-specific half is more predictable than candidates expect. The same dozen exercises recur across companies, and the single most common one — build a small RAG pipeline end to end — can be rehearsed to muscle memory. The DSA half has drifted toward "build a small system" variants (in-memory databases, versioned key-value stores) rather than pure algorithm puzzles. Both halves reward writing clean, working code under a clock more than cleverness.

bimodal: an AI-specific exercise AND a classic DSA round in the same loop

RAG: build-a-pipeline is the single most common AI coding task

~75: LeetCode easy/medium is a realistic DSA floor

verify: the AI-assisted round grades judgment, not typing speed

The AI-specific exercises

A small set of exercises recurs across loops. They cluster into three families — retrieval and agents (applied roles), transformer primitives (research-leaning roles), and production plumbing (the criteria you are graded on even when they are not the named problem). Learn the family that matches your target role first.

Recurring AI coding exercises (difficulty is rough, calibrated to a 45–60 min live round)

RAG pipeline from scratch (runtime)

Chunk → embed → cosine top-k over an in-memory store → assemble context under a token budget → answer with citations.

The single most common AI coding task. Tests whether you understand retrieval as plumbing, not magic. Medium.

Agent / ReAct loop (runtime)

Tool registry plus dispatcher, an observe → act → decide cycle, a STOP condition and a step cap.

Probes whether you keep control flow in code rather than hoping the model halts. Medium. (Sketched below.)

Semantic search top-k (runtime)

Embed a query, cosine-similarity against a matrix of doc embeddings, return the k nearest.

Amazon GenAI explicitly asks for cosine similarity in NumPy. The core of RAG, isolated. Easy. (Sketched below.)

Self-attention in NumPy (process)

Scaled dot-product attention plus softmax, written from memory.

More common on ML / research loops — Mistral, DeepMind. Tests that you actually know the math. Medium–hard.

Decoding strategies (process)

Temperature scaling, top-k, and top-p / nucleus sampling over a logits vector.

Reveals whether you understand what the sampler does, not just which knob to turn. Medium.

KV cache (runtime)

Implement a cache for past keys/values, or calculate its memory footprint.

A favorite at inference-heavy shops. Connects model internals to cost and latency. Medium.

BPE tokenizer from scratch (process)

Byte-pair merges, train a vocab, encode and decode.

Raschka's walkthrough is the de facto prep. Tests patience with fundamentals. Medium–hard.

Structured-output validate + repair (runtime)

Parse model JSON, validate against a schema, classify VALID / INVALID / UNCLEAR, repair-and-retry on failure.

The everyday reality of shipping LLM output. Tests defensive parsing. Medium.

Resilience plumbing (cost)

Retry-with-backoff, a token-bucket rate limiter, a concurrency-limited worker pool for LLM calls.

Often an evaluation CRITERION even when not a named problem. Skipping it reads as not having shipped. Medium.

Streaming / SSE handling (runtime)

Consume a token stream, handle partial-JSON parsing across chunks.

Shows up wherever latency matters. Tests stateful parsing under partial data. Medium.

LLM-as-judge eval harness (process)

Score model outputs against a rubric, aggregate, report a metric.

The eval-fluency signal in code form. Know RAGAS by name. Medium.

Chatbot with memory / state (runtime)

Multi-turn conversation with persisted history and a context-window policy.

A warm-up at many shops. Tests state management more than ML. Easy–medium.

If the role is applied, drill the runtime-tagged rows first. If it is research-leaning, the process-tagged transformer primitives carry more weight. The cost-tagged plumbing is graded everywhere.
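For the research-leaning primitives, a minimal single-head sketch of scaled dot-product self-attention in NumPy may help anchor the math. The projection matrices `Wq`, `Wk`, `Wv`, the `causal` flag, and the toy shapes are assumptions made for illustration, not any specific model's layout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    # X: (T, D) token embeddings; Wq, Wk, Wv: (D, Dh) projections (hypothetical names).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T), scaled by sqrt(d_k)
    if causal:
        # Mask future positions so token i only attends to tokens <= i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv, causal=True)
```

The two details interviewers tend to check are the `sqrt(d_k)` scaling and the max-subtraction inside softmax; the causal mask is a common follow-up.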

Two exercises, worked

These two sketches are the centerpiece — the exercises most likely to decide an applied loop. Both are short enough to reproduce live and idiomatic enough to defend line by line.

1. A ReAct agent loop from scratch

The interviewer wants to see that the control flow lives in your code, not in the model's good intentions. That means a tool registry, a parse-dispatch-observe cycle, an explicit final-answer stop condition, and a hard step cap so a confused model cannot loop forever.

import json

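# Tool registry: name -> callable. The dispatcher only runs what is registered.
# Caution: eval with empty builtins is a demo shortcut, not a real sandbox;
# production code should use a proper expression parser.
TOOLS = {
    "search": lambda q: f"results for {q}",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}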

SYSTEM = (
    "Solve the task. Each turn, emit JSON: "
    '{"tool": <name>, "args": <string>} to call a tool, '
    'or {"final": <answer>} when done.'
)

def run_agent(task, max_steps=6):
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]

    for _ in range(max_steps):              # step cap: the loop is bounded by code
        raw = llm(messages)                 # model call (stubbed)
        messages.append({"role": "assistant", "content": raw})

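        try:
            action = json.loads(raw)
            if not isinstance(action, dict):    # valid JSON, but not an object
                raise ValueError("expected a JSON object")
        except ValueError:                  # JSONDecodeError is a ValueError; bad output is an observation, not a crash
            messages.append({"role": "user", "content": "Invalid JSON. Re-emit."})
            continue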

        if "final" in action:               # stop condition: the model declares it is done
            return action["final"]

        tool, args = action.get("tool"), action.get("args")
        if tool not in TOOLS:               # guard the dispatcher against unknown tools
            obs = f"Unknown tool: {tool}"
        else:
            try:
                obs = TOOLS[tool](args)
            except Exception as e:          # tool failure feeds back as an observation
                obs = f"Tool error: {e}"

        messages.append({"role": "user", "content": f"Observation: {obs}"})

    return "Stopped: step cap reached."     # the cap is the safety net, not the happy path

What the interviewer is probing

Do you bound the loop with a step cap and a real stop condition, or does it run until the model happens to quit? See also: budgets and halting; the agent loop.
Do tool errors and bad JSON become observations fed back to the model, or do they crash the run? Resilience is the signal.
Is the dispatcher guarded (unknown tools rejected, eval at least stripped of builtins) rather than trusting model output blindly? Be ready to say that stripping builtins is not a true sandbox.
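The stubbed `llm(messages)` call is the other place failures surface. A minimal sketch of a retry wrapper with exponential backoff and jitter follows. It assumes transient failures show up as `TimeoutError` or `ConnectionError`; real SDKs raise their own exception types, so `retry_on` would need adjusting. `call_with_retry` and `flaky_llm` are hypothetical names.

```python
import random
import time

def call_with_retry(fn, *args, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(*args), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except retry_on:
            if attempt == max_attempts:
                raise                                   # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))     # jitter avoids synchronized retries

# Usage: a stand-in model call that times out twice, then succeeds.
calls = {"n": 0}

def flaky_llm(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return f"ok: {prompt}"

slept = []                                              # record delays instead of sleeping
result = call_with_retry(flaky_llm, "hello", sleep=slept.append)
```

Injecting `sleep` keeps the wrapper testable without real waiting, which is also an easy point to make aloud in an interview.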

2. A minimal retrieval function

The heart of RAG, isolated: embed the query, score it against a matrix of document embeddings with cosine similarity, return the top-k chunks. Note where a reranker would slot in.

import numpy as np

def retrieve(query, chunks, doc_embeddings, embed_fn, k=3):
    # doc_embeddings: (N, D) matrix, row i is the embedding of chunks[i]
    q = embed_fn(query)                                  # (D,)

    # Cosine similarity = normalized dot product. Normalize once, then matmul.
    q = q / (np.linalg.norm(q) + 1e-8)
    docs = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = docs @ q                                    # (N,) similarity per chunk

    top = np.argsort(scores)[::-1][:k]                   # indices of the k highest scores

    # Rerank slots in here: raise k above to pull a wider shortlist (say 20), then
    # re-score those candidates with a cross-encoder and keep the best k.
    return [(chunks[i], float(scores[i])) for i in top]
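The pipeline steps after retrieval (assemble context under a token budget, then answer with citations) could look roughly like the sketch below. It assumes `hits` has the shape `retrieve` returns, a list of `(chunk, score)` pairs sorted best first, and it uses a whitespace word count as a crude stand-in for a real tokenizer. `assemble_context` and `build_prompt` are hypothetical helper names.

```python
def assemble_context(hits, token_budget=200, count_tokens=None):
    """hits: list of (chunk_text, score) pairs, best first."""
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())   # crude proxy, not a real tokenizer

    lines, sources, used = [], [], 0
    for text, _score in hits:
        cost = count_tokens(text)
        if used + cost > token_budget:
            continue                                    # skip chunks that would overflow
        used += cost
        sources.append(text)
        lines.append(f"[{len(sources)}] {text}")        # numbered so the model can cite
    return "\n".join(lines), sources

def build_prompt(question, context):
    return (
        "Answer using only the numbered sources below. "
        "Cite sources inline like [1]. If the sources do not answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

hits = [("Paris is the capital of France.", 0.91),
        ("France borders Spain and Italy among others.", 0.62),
        ("The Eiffel Tower opened in 1889.", 0.55)]
context, sources = assemble_context(hits, token_budget=12)
prompt = build_prompt("What is the capital of France?", context)
```

Returning `sources` alongside the prompt lets the caller map a cited `[n]` back to the original chunk when checking citation correctness.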

What the interviewer is probing

Do you know cosine similarity is a normalized dot product, and can you vectorize it instead of looping over documents?
Do you guard the divide-by-zero on a zero vector, and return scores so the caller can threshold?
Do you know where reranking goes — retrieve wide, rerank narrow — and why a bi-encoder top-k feeds a cross-encoder, not the other way around?
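"Retrieve wide, rerank narrow" can be sketched as a two-stage function. This is an illustration under assumptions: `rerank_fn(query, chunk)` stands in for a cross-encoder score, and `toy_embed` / `toy_rerank` are toy placeholders rather than real models.

```python
import numpy as np

def retrieve_then_rerank(query, chunks, doc_embeddings, embed_fn, rerank_fn,
                         k=3, wide_k=20):
    # Stage 1 (bi-encoder): cheap cosine scoring over every document.
    q = embed_fn(query)
    q = q / (np.linalg.norm(q) + 1e-8)
    docs = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-8)
    candidates = np.argsort(docs @ q)[::-1][:wide_k]

    # Stage 2 (cross-encoder stand-in): expensive scoring on the shortlist only.
    reranked = sorted(candidates, key=lambda i: rerank_fn(query, chunks[i]), reverse=True)
    return [chunks[i] for i in reranked[:k]]

VOCAB = ["cat", "dog", "fish", "food"]

def toy_embed(text):
    # Bag-of-words counts over a tiny vocabulary (placeholder for an embedding model).
    words = text.lower().split()
    return np.array([float(words.count(w)) for w in VOCAB])

def toy_rerank(query, chunk):
    # Shared-word count (placeholder for a cross-encoder relevance score).
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["cat food", "dog food", "fish tank", "cat cat cat"]
doc_embeddings = np.stack([toy_embed(c) for c in chunks])
top = retrieve_then_rerank("cat food", chunks, doc_embeddings,
                           toy_embed, toy_rerank, k=1, wide_k=3)
```

The ordering matters for cost: the bi-encoder touches all N documents with one matmul, while the pairwise scorer runs only `wide_k` times.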

Does classic DSA still appear?

Yes — emphatically, even at AI-first labs. The flavor has shifted: labs favor "build a small system" variants — in-memory databases, versioned key-value stores, crawlers, iterators — over pure algorithm puzzles. But the round is real, and treating an AI title as a LeetCode exemption is a fast reject.

Company	Algorithm puzzle	“Build a small system”	AI primitive
OpenAI	no	yes	no
Anthropic	no	yes	no
xAI	yes	no	no
Amazon GenAI	yes	no	yes
Databricks	yes	yes	no
Meta / LinkedIn / Shopify	yes	no	separate round

Reported asks by company: OpenAI uses an in-memory DB with SQL-like ops, a time-based / versioned key-value store, and refactoring of deeply nested code. Anthropic runs a 4-level progressive KV database. xAI asks for an LRU cache in O(1). Amazon GenAI pairs a standard LeetCode problem with cosine similarity in NumPy. Databricks does LeetCode, then multi-level OOP. Meta, LinkedIn, and Shopify keep a classic algorithm round and add a separate AI-assisted round.

The system-y variants worth drilling cold (these recur far more than tree/graph puzzles at AI shops)

In-memory database: SQL-like SET / GET / SCAN over rows. OpenAI's reported staple; it scales cleanly into follow-up requirements.

Progressive key-value store: Anthropic's 4-level build: SET/GET → SCAN with prefix → timestamps and TTL → compaction. Each level layers on the last.

Time-based / versioned KV: Store values at timestamps, query as-of. The classic "TimeMap", reported at OpenAI and also a LeetCode medium.

LRU cache in O(1): Hash map plus doubly-linked list. xAI's reported ask and a perennial. Know it without thinking.

Serialize / deserialize: A tree or a small structure, round-tripped. Tests careful state handling under a format.

Recommended floor: ~75 LeetCode easy/medium for fundamentals, then drill these system-y variants until you can build each from a blank file in under 30 minutes.
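As one example of what building from a blank file looks like, here is a hand-rolled sketch of the O(1) LRU cache using a hash map plus a doubly-linked list. The class and method names are illustrative choices, and returning `None` on a miss is an assumption; some interviewers specify `-1` instead.

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.map = {}                                # key -> node, O(1) lookup
        self.head, self.tail = _Node(), _Node()      # sentinels; head.next is most recent
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        self._unlink(node)
        self._push_front(node)                       # a read makes the key most recent
        return node.value

    def put(self, key, value):
        node = self.map.get(key)
        if node is not None:                         # update in place and refresh recency
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.map) >= self.capacity:
            lru = self.tail.prev                     # least recently used sits before tail
            self._unlink(lru)
            del self.map[lru.key]
        node = _Node(key, value)
        self.map[key] = node
        self._push_front(node)

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")           # touch "a", so "b" becomes least recently used
cache.put("c", 3)        # evicts "b"
```

The sentinel head and tail nodes remove every empty-list and single-node special case, which is where most live attempts go wrong.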

Take-homes

The take-home is increasingly the highest-weight round: a 2-to-7-day build followed by a defense walkthrough where interviewers push on every decision. The build matters less than whether you can justify it. A few real reported shapes:

RAG bot over PDFs

with citations

Ingest documents, answer questions, cite sources. The canonical take-home. Graded on retrieval quality, citation correctness, and whether you built an eval.

Refactor a messy RAG app

preserve behavior

Take a working-but-ugly codebase into clean architecture without changing API behavior. Tests judgment and restraint more than greenfield flair.

Agent with real tools

DB + docs + bash

Build an agent with database access, doc-search, and a bash tool — where bash requires explicit human approval. Tests tool design and a safety boundary.

Multi-agent pipeline

staged

A staged flow such as Spec → Story → LLM-Judge → Rewrite. Tests orchestration, hand-offs, and whether each stage is independently checkable.

LLM routing + caching

100+ req/s

Route across providers, multi-level caching, provider failover at scale. A systems take-home wearing an LLM hat — cost and resilience are the whole point.

Recurring evaluation criteria: modular code, error handling, UNIT TESTS (often 80%+ coverage), production-readiness (caching, cost, PII, rate-limiting), and whether you built an EVAL HARNESS with defined metrics.
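An eval harness with defined metrics does not need to be large. The sketch below assumes a 1 to 5 faithfulness rubric and two injected callables, `generate` (the system under test) and `judge` (an LLM call that replies with a number). Both are stubbed here, and all names are hypothetical.

```python
from statistics import mean

RUBRIC = "Score 1-5 for faithfulness to the reference answer. Reply with the number only."

def parse_score(raw, lo=1, hi=5):
    """Return an int score, or None if the judge reply is unusable."""
    try:
        score = int(raw.strip())
    except (ValueError, AttributeError):
        return None
    return score if lo <= score <= hi else None

def run_eval(examples, generate, judge, pass_threshold=4):
    rows = []
    for ex in examples:
        answer = generate(ex["question"])
        prompt = (f"{RUBRIC}\n\nQuestion: {ex['question']}\n"
                  f"Reference: {ex['reference']}\nAnswer: {answer}")
        rows.append({"question": ex["question"], "score": parse_score(judge(prompt))})
    scored = [r["score"] for r in rows if r["score"] is not None]
    return {
        "n": len(rows),
        "judge_failures": len(rows) - len(scored),   # report these, do not hide them
        "mean_score": mean(scored) if scored else None,
        "pass_rate": sum(s >= pass_threshold for s in scored) / len(scored) if scored else None,
        "rows": rows,
    }

examples = [{"question": "2+2?", "reference": "4"},
            {"question": "Capital of France?", "reference": "Paris"},
            {"question": "Largest planet?", "reference": "Jupiter"}]
fake_answers = {"2+2?": "4", "Capital of France?": "Lyon", "Largest planet?": "Jupiter"}
replies = iter(["5", "2", "not sure"])               # scripted judge replies for the demo
report = run_eval(examples, generate=fake_answers.get, judge=lambda prompt: next(replies))
```

Counting unparseable judge replies separately, rather than scoring them as zero, keeps the headline metric defensible in a walkthrough.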

✓ A fair, scoped take-home

Scoped to a focused build defensible in a walkthrough

Clear evaluation criteria stated up front

Realistic time budget — or paid if it is large

Tests judgment: a refactor, an eval, one good agent

✕ A red flag (unpaid consulting)

A “72-hour round 1” demanding full RAG + agents + UI

No defined criteria — you guess what they want

Output suspiciously close to their actual roadmap

Unpaid, multi-day, and front-loaded before any human contact

A round-one take-home that requires a complete production system in 72 hours is effectively unpaid consulting. It is reasonable to ask whether it is paid, or to propose a scoped subset.

The AI-assisted coding round

The defining 2025–26 trend — at Meta, Sierra, Cursor, and Notion — is coding with a model in the room, on a multi-file task larger than you could hand-write in the time. The signal inverts: it is no longer "can you write this from memory" but "can you drive, review, and verify faster than the model can mislead you."

The signal shifts from writing to verifying

When the model can produce volume, the scarce skill is judgment: knowing what to ask for, recognizing when the output is subtly wrong, and refusing to ship code you cannot defend. This is the same muscle as agent verification — you are the verifier in the loop.

Drive the model on a larger task: decompose it, specify each part, and keep the architecture in your head — not the model’s.
Review every diff. Candidates who paste model output without catching its mistakes fail; the bug it introduced is the test.
Verify against intent, not vibes — run it, read it, check the edge cases the model skipped.

Prep checklist

A concrete, do-this list. Work top to bottom; the early rows are the floor, the later rows are role-dependent.

DSA floor

~75 LeetCode easy/medium + the system-y variants

Cover the fundamentals, then drill the build-a-small-system asks until cold: LRU cache, serialize/deserialize, time-based KV store, in-memory DB.

every variant from a blank file in under 30 min

Retrieval

RAG pipeline blank-to-working in under 60 min

Chunk → embed → cosine top-k → assemble context under a token budget → answer with citations. Rehearse until it is muscle memory.

one working pipeline, no framework

Agents

Agent loop with no framework, then redo in LangGraph

Hand-roll the ReAct loop above — registry, dispatcher, stop condition, step cap. Then rebuild it in LangGraph to speak both fluently.

two implementations of the same agent

Plumbing

Hand-roll resilience primitives

Retry-with-backoff, a token-bucket rate limiter, an async concurrency-limited worker pool, and structured-output validation with a repair/retry loop.

a reusable llm-call wrapper

Internals

Transformer primitives from memory (if research-leaning)

Softmax, scaled dot-product attention, top-k / top-p sampling, KV cache. Write them in NumPy without references.

a from-memory primitives notebook

Evals

An LLM-as-judge harness with defined metrics

Score outputs against a rubric, aggregate, report a number. Know RAGAS by name and what it measures.

a scoring harness you can defend

Streaming

SSE / streaming token handling

Consume a token stream and parse partial JSON across chunks without losing state.

a streaming consumer

AI-assisted

Practice an AI-assisted round on a multi-file task

Take a real multi-file repo and drive a model to extend it — then review and verify its output as if graded on the catches you make.

a reviewed, defended diff

Rows 1–4 are the applied-role floor. Row 5 is for ML / research-leaning loops. Rows 6–8 separate strong candidates from adequate ones.
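For the plumbing row, a token-bucket rate limiter is one primitive that fits in a few lines. This is a non-blocking sketch with an injectable clock so it can be tested without sleeping. The class name, the `try_acquire` method, and the `cost` parameter are illustrative assumptions.

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)     # start full
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, cost=1):
        """Take `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Usage with a fake clock: a burst of 2 passes, the third call is rejected,
# and half a second at 2 tokens/s refills exactly one token.
now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: now[0])
first = [bucket.try_acquire() for _ in range(3)]
now[0] += 0.5
after_wait = bucket.try_acquire()
```

Passing `cost` as the estimated token count of an LLM request, rather than 1 per call, would turn the same bucket into a tokens-per-minute limiter.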

AI Engineering Field Guide

Alexey Grigorev, 2025

A practitioner-maintained map of what AI engineering interviews actually test — useful for calibrating which exercises matter for which role.

AI Engineering Interview Questions

Amit Shekhar (amitshekhariitbhu), 2025

A broad question bank spanning LLM internals, RAG, agents, and evals — good for breadth checks before a loop.

BPE Tokenizer From Scratch

Sebastian Raschka, 2025

The de-facto walkthrough for the byte-pair-encoding tokenizer exercise — work through it once and the from-scratch ask becomes routine.

Heading into the final stretch? Compress everything into the cheat sheet for the night before.
