Cursor
A full VS Code fork with AI as a core architectural component. Multi-model routing across custom MoE models, frontier LLMs, and fine-tuned apply models. Proprietary RAG pipeline backed by 100B+ vector embeddings. Handles over 1M transactions per second at peak.
Architecture
Multi-Model Routing
Different features use different models optimized for their specific task:
| Feature | Model | Purpose |
|---|---|---|
| Tab Completion | Custom MoE (in-house) | Low-latency autocomplete, ~100 candidates per keystroke |
| Chat | User-selected frontier (Claude, GPT-4, etc.) | Complex reasoning and Q&A |
| Agent / Composer | Composer (custom MoE, RL-trained) | Agentic coding with tool use |
| Fast Apply | Fine-tuned Llama-3-70B | Converting edits to full-file rewrites at ~1000 tok/s |
| Embeddings | Custom embedding model | Codebase indexing |
| Compaction | Haiku or similar | Summarizing conversation history |
An Auto mode analyzes request complexity and routes to the optimal model dynamically.
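A minimal sketch of this kind of task-based routing. The `Task` enum, model identifiers, and the complexity-based escalation rule are all illustrative, not Cursor's internal API; they only show where an Auto mode could hook in:

```python
from enum import Enum, auto

class Task(Enum):
    TAB_COMPLETION = auto()
    CHAT = auto()
    AGENT = auto()
    FAST_APPLY = auto()
    EMBEDDING = auto()
    COMPACTION = auto()

# Illustrative routing table mirroring the feature/model mapping above.
ROUTES: dict[Task, str] = {
    Task.TAB_COMPLETION: "custom-moe-tab",
    Task.CHAT: "user-selected-frontier",
    Task.AGENT: "composer",
    Task.FAST_APPLY: "llama-3-70b-apply",
    Task.EMBEDDING: "custom-embed",
    Task.COMPACTION: "haiku",
}

def route(task: Task, complexity: float = 0.0) -> str:
    """Pick a model for a task; an Auto mode could weigh request complexity."""
    if task is Task.CHAT and complexity > 0.8:
        return "frontier-strongest"  # hypothetical escalation rule
    return ROUTES[task]
```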
Tab Completion
Uses a custom Mixture of Experts model designed for long input prompts (extensive code context) but short output (predicted edit). Key behaviors:
- Generates ~100 candidates and uses RL to predict which one the user would prefer (sketched after this list)
- Predicts not just the next token but the next complete edit — multi-line changes and cursor jumps
- Simple insertions appear as ghost text; multi-line changes appear as a diff pop-up
- After accepting, highlights the next logical edit location for a "tab-tab-tab" flow
- KV cache warming: proactively warms the cache with current file contents as the user types, so generation starts with minimal compute when triggered
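A sketch of the generate-then-rerank behavior from the first bullet. `generate_candidates`, `preference_score`, and `ACCEPT_THRESHOLD` are stand-ins for the in-house MoE sampler and the RL-trained preference model:

```python
import random

ACCEPT_THRESHOLD = 0.5  # assumption: suppress low-confidence suggestions

def generate_candidates(context: str, n: int) -> list[str]:
    # Stand-in for the in-house MoE model's sampled edit candidates.
    return [f"candidate-{i}" for i in range(n)]

def preference_score(context: str, candidate: str) -> float:
    # Stand-in for the RL-trained preference model.
    return random.random()

def best_completion(context: str, k: int = 100) -> str | None:
    """Sample ~k candidate edits, keep the one the preference model
    predicts the user is most likely to accept."""
    score, best = max((preference_score(context, c), c)
                      for c in generate_candidates(context, n=k))
    return best if score >= ACCEPT_THRESHOLD else None
```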
Agent Mode (Composer)
The primary agentic interface. Powered by the Composer model — Cursor's proprietary MoE model trained on coding trajectories with access to real development tools during training.
ReAct-style loop: the model decides the next action and tool, the orchestrator executes it, collects the result, and feeds it back into the context. Up to 25 tool calls are allowed before pausing for user review (sketched after the tool table).
| Tool | Function |
|---|---|
| codebase_search | Semantic search over indexed codebase |
| grep_search | Literal text search |
| file_search | Find files by name or path |
| read_file | Read file contents (200-250 lines at a time) |
| write_file | Modify files |
| run_command | Execute terminal commands |
| reapply | Retry edit with a more expensive model |
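A schematic of the loop, assuming a `model` callable that returns either a tool call or a final answer; only the 25-call budget comes from the description above:

```python
from typing import Callable

MAX_TOOL_CALLS = 25  # pause for user review after 25 calls

def react_loop(
    task: str,
    model: Callable[[list[dict]], dict],   # stand-in for the Composer model
    tools: dict[str, Callable[..., str]],  # tool name -> implementation
) -> str:
    """Reason-act-observe: the model picks a tool, the orchestrator runs
    it and appends the observation, and the loop repeats."""
    transcript: list[dict] = [{"role": "user", "content": task}]
    for _ in range(MAX_TOOL_CALLS):
        step = model(transcript)  # returns a tool call or a final answer
        if step["type"] == "final":
            return step["content"]
        observation = tools[step["tool"]](**step["args"])
        transcript.append({"role": "tool", "name": step["tool"], "content": observation})
    return "Paused: tool-call budget exhausted; awaiting user review."
```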
Parallel agents: up to 8 agents can run simultaneously, each in an isolated git worktree. Background agents run in sandboxed cloud environments.
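The isolation could look roughly like this. The `git worktree add` invocation is a real Git command; the surrounding orchestration (`run_agent` and the thread pool) is assumed:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def spawn_worktree(repo: str, agent_id: int) -> str:
    """Give an agent its own checkout and branch via `git worktree add`."""
    path = f"{repo}-agent-{agent_id}"
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", f"agent/{agent_id}", path],
        check=True,
    )
    return path

def run_parallel(repo: str, tasks: list[str],
                 run_agent: Callable[[str, str], str], max_agents: int = 8) -> list[str]:
    """Run up to `max_agents` agents at once, each editing its own
    worktree so their changes never collide."""
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        futures = [pool.submit(run_agent, spawn_worktree(repo, i), task)
                   for i, task in enumerate(tasks)]
        return [f.result() for f in futures]
```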
RAG Pipeline
A five-step indexing and retrieval system:
1. Chunking — Tree-sitter parses code into AST nodes. Sibling nodes are merged into semantically meaningful chunks within token limits.
2. Merkle Tree Sync — A Merkle tree of file hashes detects changes; a sync pass runs every 5-10 minutes and re-indexes only the modified files (a toy version is sketched below).
3. Embedding — Chunks are embedded using Cursor's proprietary embedding model.
4. Vector Storage — Turbopuffer stores 100B+ vectors across 10M+ namespaces. One namespace per (user, codebase) pair. Tiered storage: active namespaces in memory/NVMe, inactive in object storage. Peak write throughput: 10GB/s.
5. Retrieval — User query is embedded, sent to Turbopuffer for nearest-neighbor search, returns file paths and line ranges. Chunks are loaded locally and assembled into the prompt.
File paths are encrypted per-segment on the client before transmission. No code is persistently stored on servers.
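A toy version of step 2's change detection, assuming SHA-256 hashes; the real pipeline's internals are not public:

```python
import hashlib
from pathlib import Path

def merkle(root: Path) -> dict[str, str]:
    """Hash every file; a directory's hash is the hash of its children's
    hashes, so any change bubbles up to the root."""
    hashes: dict[str, str] = {}

    def walk(node: Path) -> str:
        if node.is_file():
            digest = hashlib.sha256(node.read_bytes()).hexdigest()
        else:
            children = "".join(sorted(walk(c) for c in node.iterdir()))
            digest = hashlib.sha256(children.encode()).hexdigest()
        hashes[str(node)] = digest
        return digest

    walk(root)
    return hashes

def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    # Only entries whose hashes differ need re-chunking and re-embedding.
    return [path for path, digest in new.items() if old.get(path) != digest]
```

Because a directory's hash changes whenever any descendant changes, a sync pass can compare subtree hashes top-down and skip entire unchanged directories.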
Speculative Edits
Cursor's key performance innovation for the Apply model:
Instead of using a small draft model (as in traditional speculative decoding), a deterministic algorithm speculates that the output tokens will match the original code. During a rewrite, most of the output is identical to the input, so the system feeds chunks of the original code as the draft and the model agrees with them until it reaches a change point; a sketch follows the bullets below.
- ~1000 tokens/second on the 70B apply model
- ~13x speedup over vanilla Llama-3-70B inference
- Full-file rewrites chosen over diffs because LLMs see far more full-file examples in pre-training
The reapply tool escalates to a more expensive model when the Apply model fails on large or complex files.
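A sketch of the deterministic speculation. `model.verify` (a batched check of a whole draft chunk in one forward pass) and `model.decode_one` are hypothetical stand-ins for the Apply model's inference API:

```python
def speculative_rewrite(original: list[str], model, chunk: int = 16) -> list[str]:
    """Use the original file itself as the draft. The model verifies a
    whole chunk per forward pass, so unchanged runs decode almost for free."""
    out: list[str] = []
    i = 0
    while i < len(original):
        draft = original[i : i + chunk]
        accepted = model.verify(prefix=out, draft=draft)  # hypothetical batched check
        out += draft[:accepted]
        i += accepted
        if accepted < len(draft):              # disagreement marks a real edit point
            out.append(model.decode_one(out))  # hypothetical single decode step
            i += 1  # naive realignment; real systems resynchronize more carefully
    return out
```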
Shadow Workspace
A hidden Electron window runs in the background. When the AI suggests code changes, the shadow workspace applies them and runs the linter/LSP. If errors are found, the AI fixes them before showing results to the user.
This creates the illusion that the AI never makes syntax mistakes. The validation happens in milliseconds.
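The validate-before-show loop might look like this; all three callables and the retry bound are hypothetical stand-ins for the shadow window, the linter/LSP, and the model:

```python
MAX_FIX_ROUNDS = 3  # assumption: bounded retries before surfacing anyway

def validated_edit(edit, apply_in_shadow, run_diagnostics, ask_model_to_fix):
    """Apply the edit in the hidden workspace, lint it, and let the model
    repair its own errors before the user ever sees the change."""
    for _ in range(MAX_FIX_ROUNDS):
        shadow_state = apply_in_shadow(edit)
        errors = run_diagnostics(shadow_state)  # linter / LSP feedback
        if not errors:
            return edit                         # clean: show to the user
        edit = ask_model_to_fix(edit, errors)   # feed diagnostics back
    return edit  # give up after a few rounds and show the best effort
```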
Context Management
Prompt construction uses a JSX-like component system called "Preempt" in which components receive priority assignments. A renderer fits content to the available context window, with priority decaying as distance from the cursor position grows.
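Preempt itself is proprietary; this sketch shows only the scheduling idea of author-assigned priorities with distance-based decay, with an invented tokens-per-character estimate:

```python
from dataclasses import dataclass

@dataclass
class Component:
    text: str
    priority: float  # author-assigned importance
    distance: int    # lines between this content and the cursor

def render(components: list[Component], budget: int, decay: float = 0.01) -> str:
    """Greedily pack the highest effective-priority components into the
    token budget; priority decays with distance from the cursor."""
    def effective(c: Component) -> float:
        return c.priority / (1.0 + decay * c.distance)

    chosen, used = [], 0
    for c in sorted(components, key=effective, reverse=True):
        cost = max(1, len(c.text) // 4)    # rough chars-to-tokens estimate
        if used + cost <= budget:
            chosen.append(c)
            used += cost
    chosen.sort(key=lambda c: c.distance)  # emit near-cursor content first
    return "\n".join(c.text for c in chosen)
```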
Context compaction monitors token usage and summarizes older messages when approaching limits. Retains key signals (failing test names, error types, stack frames) while compressing verbose output.
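A sketch of threshold-triggered compaction. The 80% trigger, the five-message tail, and the signal patterns are assumptions; `summarize` stands in for the compaction model:

```python
from typing import Callable

COMPACT_AT = 0.8  # assumption: compact when 80% of the window is used
KEEP_PATTERNS = ("FAILED", "Error", "Traceback", "  File ")  # key signals

def compact(messages: list[str], used: int, limit: int,
            summarize: Callable[[str], str]) -> list[str]:
    """When usage nears the limit, summarize older messages but keep
    lines carrying failing tests, error types, and stack frames verbatim."""
    if used < COMPACT_AT * limit:
        return messages
    old, recent = messages[:-5], messages[-5:]  # keep the recent tail as-is
    signals = [line for m in old for line in m.splitlines()
               if any(p in line for p in KEEP_PATTERNS)]
    return [summarize("\n".join(old)), *signals, *recent]
```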
Cursor Rules — project-level .cursorrules files specify coding conventions and constraints, injected into every prompt.
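An illustrative .cursorrules file; the specific conventions are hypothetical:

```
# .cursorrules (illustrative)
- Use TypeScript strict mode; never use `any`.
- Every new module needs unit tests under tests/.
- Prefer composition over inheritance.
- Never commit directly to main; open a PR.
```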
Patterns Used
| Pattern | How It's Used |
|---|---|
| ReAct | Agent mode's reason-act-observe loop |
| RAG | Five-step codebase indexing with Turbopuffer |
| Router | Multi-model routing based on task type |
| Tool Router | Agent selects from 10+ tools via function calling |
| Parallel | Up to 8 agents in isolated git worktrees |
| Pipeline | Two-stage plan-then-apply for code edits |
| Streaming | Real-time token delivery for all modes |
| Conversation Summarization | Context compaction for long agent sessions |