Cursor
A full VS Code fork with AI as a core architectural component. Multi-model routing across custom MoE models, frontier LLMs, and fine-tuned apply models. Proprietary RAG pipeline backed by 100B+ vector embeddings. Handles over 1M transactions per second at peak.
Architecture
Multi-Model Routing
Different features use different models optimized for their specific task:
| Feature | Model | Purpose |
|---|---|---|
| Tab Completion | Custom MoE (in-house) | Low-latency autocomplete, ~100 candidates per keystroke |
| Chat | User-selected frontier (Claude, GPT-4, etc.) | Complex reasoning and Q&A |
| Agent / Composer | Composer (custom MoE, RL-trained) | Agentic coding with tool use |
| Fast Apply | Fine-tuned Llama-3-70B | Converting edits to full-file rewrites at ~1000 tok/s |
| Embeddings | Custom embedding model | Codebase indexing |
| Compaction | Haiku or similar | Summarizing conversation history |
An Auto mode analyzes request complexity and routes to the optimal model dynamically.
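A minimal sketch of this kind of task-based routing. The `Task` enum, model identifiers, and the complexity-based escalation rule are all illustrative, not Cursor's internal API; they only show where an Auto mode could hook in:

```python
from enum import Enum, auto

class Task(Enum):
    TAB_COMPLETION = auto()
    CHAT = auto()
    AGENT = auto()
    FAST_APPLY = auto()
    EMBEDDING = auto()
    COMPACTION = auto()

# Illustrative routing table mirroring the feature/model mapping above.
ROUTES: dict[Task, str] = {
    Task.TAB_COMPLETION: "custom-moe-tab",
    Task.CHAT: "user-selected-frontier",
    Task.AGENT: "composer",
    Task.FAST_APPLY: "llama-3-70b-apply",
    Task.EMBEDDING: "custom-embed",
    Task.COMPACTION: "haiku",
}

def route(task: Task, complexity: float = 0.0) -> str:
    """Pick a model for a task; an Auto mode could weigh request complexity."""
    if task is Task.CHAT and complexity > 0.8:
        return "frontier-strongest"  # hypothetical escalation rule
    return ROUTES[task]
```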
Tab Completion
Uses a custom Mixture of Experts model designed for long input prompts (extensive code context) but short output (predicted edit). Key behaviors:
- Generates ~100 candidates and uses RL to predict which one the user would prefer (sketched after this list)
- Predicts not just the next token but the next complete edit — multi-line changes and cursor jumps
- Simple insertions appear as ghost text; multi-line changes appear as a diff pop-up
- After accepting, highlights the next logical edit location for a "tab-tab-tab" flow
- KV cache warming: proactively warms the cache with current file contents as the user types, so generation starts with minimal compute when triggered
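A sketch of the generate-then-rerank behavior from the first bullet. `generate_candidates`, `preference_score`, and `ACCEPT_THRESHOLD` are stand-ins for the in-house MoE sampler and the RL-trained preference model:

```python
import random

ACCEPT_THRESHOLD = 0.5  # assumption: suppress low-confidence suggestions

def generate_candidates(context: str, n: int) -> list[str]:
    # Stand-in for the in-house MoE model's sampled edit candidates.
    return [f"candidate-{i}" for i in range(n)]

def preference_score(context: str, candidate: str) -> float:
    # Stand-in for the RL-trained preference model.
    return random.random()

def best_completion(context: str, k: int = 100) -> str | None:
    """Sample ~k candidate edits, keep the one the preference model
    predicts the user is most likely to accept."""
    score, best = max((preference_score(context, c), c)
                      for c in generate_candidates(context, n=k))
    return best if score >= ACCEPT_THRESHOLD else None
```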
Agent Mode (Composer)
The primary agentic interface. Powered by the Composer model — Cursor's proprietary MoE model trained on coding trajectories with access to real development tools during training.
ReAct-style loop: the model decides the next action and tool, the orchestrator executes it, collects the result, and feeds it back into the context. Up to 25 tool calls are allowed before pausing for user review (sketched after the tool table).
| Tool | Function |
|---|---|
| codebase_search | Semantic search over indexed codebase |
| grep_search | Literal text search |
| file_search | Find files by name or path |
| read_file | Read file contents (200-250 lines at a time) |
| write_file | Modify files |
| run_command | Execute terminal commands |
| reapply | Retry edit with a more expensive model |
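A schematic of the loop, assuming a `model` callable that returns either a tool call or a final answer; only the 25-call budget comes from the description above:

```python
from typing import Callable

MAX_TOOL_CALLS = 25  # pause for user review after 25 calls

def react_loop(
    task: str,
    model: Callable[[list[dict]], dict],   # stand-in for the Composer model
    tools: dict[str, Callable[..., str]],  # tool name -> implementation
) -> str:
    """Reason-act-observe: the model picks a tool, the orchestrator runs
    it and appends the observation, and the loop repeats."""
    transcript: list[dict] = [{"role": "user", "content": task}]
    for _ in range(MAX_TOOL_CALLS):
        step = model(transcript)  # returns a tool call or a final answer
        if step["type"] == "final":
            return step["content"]
        observation = tools[step["tool"]](**step["args"])
        transcript.append({"role": "tool", "name": step["tool"], "content": observation})
    return "Paused: tool-call budget exhausted; awaiting user review."
```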
Parallel agents: up to 8 agents can run simultaneously, each in an isolated git worktree. Background agents run in sandboxed cloud environments.
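The isolation could look roughly like this. The `git worktree add` invocation is a real Git command; the surrounding orchestration (`run_agent` and the thread pool) is assumed:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def spawn_worktree(repo: str, agent_id: int) -> str:
    """Give an agent its own checkout and branch via `git worktree add`."""
    path = f"{repo}-agent-{agent_id}"
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", f"agent/{agent_id}", path],
        check=True,
    )
    return path

def run_parallel(repo: str, tasks: list[str],
                 run_agent: Callable[[str, str], str], max_agents: int = 8) -> list[str]:
    """Run up to `max_agents` agents at once, each editing its own
    worktree so their changes never collide."""
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        futures = [pool.submit(run_agent, spawn_worktree(repo, i), task)
                   for i, task in enumerate(tasks)]
        return [f.result() for f in futures]
```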
RAG Pipeline
A five-step indexing and retrieval system:
1. Chunking — Tree-sitter parses code into AST nodes. Sibling nodes are merged into semantically meaningful chunks within token limits.
2. Merkle Tree Sync — A Merkle tree of file hashes detects changes; a sync pass runs every 5-10 minutes and re-indexes only the modified files (a toy version is sketched below).
3. Embedding — Chunks are embedded using Cursor's proprietary embedding model.
4. Vector Storage — Turbopuffer stores 100B+ vectors across 10M+ namespaces. One namespace per (user, codebase) pair. Tiered storage: active namespaces in memory/NVMe, inactive in object storage. Peak write throughput: 10GB/s.
5. Retrieval — User query is embedded, sent to Turbopuffer for nearest-neighbor search, returns file paths and line ranges. Chunks are loaded locally and assembled into the prompt.
File paths are encrypted per-segment on the client before transmission. No code is persistently stored on servers.
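A toy version of step 2's change detection, assuming SHA-256 hashes; the real pipeline's internals are not public:

```python
import hashlib
from pathlib import Path

def merkle(root: Path) -> dict[str, str]:
    """Hash every file; a directory's hash is the hash of its children's
    hashes, so any change bubbles up to the root."""
    hashes: dict[str, str] = {}

    def walk(node: Path) -> str:
        if node.is_file():
            digest = hashlib.sha256(node.read_bytes()).hexdigest()
        else:
            children = "".join(sorted(walk(c) for c in node.iterdir()))
            digest = hashlib.sha256(children.encode()).hexdigest()
        hashes[str(node)] = digest
        return digest

    walk(root)
    return hashes

def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    # Only entries whose hashes differ need re-chunking and re-embedding.
    return [path for path, digest in new.items() if old.get(path) != digest]
```

Because a directory's hash changes whenever any descendant changes, a sync pass can compare subtree hashes top-down and skip entire unchanged directories.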
Speculative Edits
Cursor's key performance innovation for the Apply model:
Instead of using a small draft model (as in traditional speculative decoding), a deterministic algorithm speculates that the output tokens will match the original code. During a rewrite, most of the output is identical to the input, so the system feeds chunks of the original code as the draft and the model agrees with them until it reaches a change point; a sketch follows the bullets below.
- ~1000 tokens/second on the 70B apply model
- ~13x speedup over vanilla Llama-3-70B inference
- Full-file rewrites chosen over diffs because LLMs see far more full-file examples in pre-training
The reapply tool escalates to a more expensive model when the Apply model fails on large or complex files.
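A sketch of the deterministic speculation. `model.verify` (a batched check of a whole draft chunk in one forward pass) and `model.decode_one` are hypothetical stand-ins for the Apply model's inference API:

```python
def speculative_rewrite(original: list[str], model, chunk: int = 16) -> list[str]:
    """Use the original file itself as the draft. The model verifies a
    whole chunk per forward pass, so unchanged runs decode almost for free."""
    out: list[str] = []
    i = 0
    while i < len(original):
        draft = original[i : i + chunk]
        accepted = model.verify(prefix=out, draft=draft)  # hypothetical batched check
        out += draft[:accepted]
        i += accepted
        if accepted < len(draft):              # disagreement marks a real edit point
            out.append(model.decode_one(out))  # hypothetical single decode step
            i += 1  # naive realignment; real systems resynchronize more carefully
    return out
```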
Shadow Workspace
A hidden Electron window runs in the background. When the AI suggests code changes, the shadow workspace applies them and runs the linter/LSP. If errors are found, the AI fixes them before showing results to the user.
This creates the illusion that the AI never makes syntax mistakes. The validation happens in milliseconds.
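The validate-before-show loop might look like this; all three callables and the retry bound are hypothetical stand-ins for the shadow window, the linter/LSP, and the model:

```python
MAX_FIX_ROUNDS = 3  # assumption: bounded retries before surfacing anyway

def validated_edit(edit, apply_in_shadow, run_diagnostics, ask_model_to_fix):
    """Apply the edit in the hidden workspace, lint it, and let the model
    repair its own errors before the user ever sees the change."""
    for _ in range(MAX_FIX_ROUNDS):
        shadow_state = apply_in_shadow(edit)
        errors = run_diagnostics(shadow_state)  # linter / LSP feedback
        if not errors:
            return edit                         # clean: show to the user
        edit = ask_model_to_fix(edit, errors)   # feed diagnostics back
    return edit  # give up after a few rounds and show the best effort
```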
Context Management
Prompt construction uses a JSX-like component system called "Preempt" in which components receive priority assignments. A renderer fits content to the available context window, with priority decaying as distance from the cursor position grows.
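Preempt itself is proprietary; this sketch shows only the scheduling idea of author-assigned priorities with distance-based decay, with an invented tokens-per-character estimate:

```python
from dataclasses import dataclass

@dataclass
class Component:
    text: str
    priority: float  # author-assigned importance
    distance: int    # lines between this content and the cursor

def render(components: list[Component], budget: int, decay: float = 0.01) -> str:
    """Greedily pack the highest effective-priority components into the
    token budget; priority decays with distance from the cursor."""
    def effective(c: Component) -> float:
        return c.priority / (1.0 + decay * c.distance)

    chosen, used = [], 0
    for c in sorted(components, key=effective, reverse=True):
        cost = max(1, len(c.text) // 4)    # rough chars-to-tokens estimate
        if used + cost <= budget:
            chosen.append(c)
            used += cost
    chosen.sort(key=lambda c: c.distance)  # emit near-cursor content first
    return "\n".join(c.text for c in chosen)
```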
Context compaction monitors token usage and summarizes older messages when approaching limits. Retains key signals (failing test names, error types, stack frames) while compressing verbose output.
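A sketch of threshold-triggered compaction. The 80% trigger, the five-message tail, and the signal patterns are assumptions; `summarize` stands in for the compaction model:

```python
from typing import Callable

COMPACT_AT = 0.8  # assumption: compact when 80% of the window is used
KEEP_PATTERNS = ("FAILED", "Error", "Traceback", "  File ")  # key signals

def compact(messages: list[str], used: int, limit: int,
            summarize: Callable[[str], str]) -> list[str]:
    """When usage nears the limit, summarize older messages but keep
    lines carrying failing tests, error types, and stack frames verbatim."""
    if used < COMPACT_AT * limit:
        return messages
    old, recent = messages[:-5], messages[-5:]  # keep the recent tail as-is
    signals = [line for m in old for line in m.splitlines()
               if any(p in line for p in KEEP_PATTERNS)]
    return [summarize("\n".join(old)), *signals, *recent]
```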
Cursor Rules — project-level .cursorrules files specify coding conventions and constraints, injected into every prompt.
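An illustrative .cursorrules file; the specific conventions are hypothetical:

```
# .cursorrules (illustrative)
- Use TypeScript strict mode; never use `any`.
- Every new module needs unit tests under tests/.
- Prefer composition over inheritance.
- Never commit directly to main; open a PR.
```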
Patterns Used
| Pattern | How It's Used |
|---|---|
| ReAct | Agent mode's reason-act-observe loop |
| RAG | Five-step codebase indexing with Turbopuffer |
| Router | Multi-model routing based on task type |
| Tool Router | Agent selects from 10+ tools via function calling |
| Parallel | Up to 8 agents in isolated git worktrees |
| Pipeline | Two-stage plan-then-apply for code edits |
| Streaming | Real-time token delivery for all modes |
| Conversation Summarization | Context compaction for long agent sessions |