Perplexity
An answer engine that combines real-time web search with LLM generation through a five-stage RAG pipeline. Its orchestration layer is model-agnostic, routing across proprietary Sonar models and frontier LLMs. It indexes 200B+ URLs via Vespa.ai and serves 400M+ queries per month.
Architecture
Five-Stage RAG Pipeline
1. Query Intent Parsing — an LLM parses the user's intent at a semantic level, understanding context and nuance beyond keywords.
2. Live Web Retrieval — the parsed query hits Vespa.ai's real-time index. 200B+ unique URLs, tens of thousands of CPUs, 400+ petabytes of hot storage, tens of thousands of index updates per second. Also uses on-demand crawling for freshest results.
3. Snippet Extraction — rather than full pages, the system extracts the most relevant paragraphs. Documents are chunked so Vespa scores individual paragraphs by relevance.
4. Synthesized Answer with Citations — the curated context is passed to the LLM. The guiding principle: "you are not supposed to say anything that you didn't retrieve." Citations are generated during generation, not post-processed.
5. Conversational Refinement — conversation context enables follow-up questions with iterative searches.
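The five stages above can be sketched as a minimal sequential pipeline. Every function here is a stand-in stub (the names `parse_intent`, `retrieve`, `extract_snippets`, and `generate_with_citations` are illustrative, not Perplexity's API); the point is the data flow and the rule that the answer only draws on retrieved snippets, each carrying an inline citation.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    url: str
    text: str


def parse_intent(query: str) -> str:
    # Stand-in for the LLM intent parser; here it just normalizes the query.
    return query.strip().lower()


def retrieve(parsed: str) -> list[Snippet]:
    # Stand-in for the Vespa retrieval stage over the live index.
    return [Snippet("https://example.com/a", f"document about {parsed}")]


def extract_snippets(docs: list[Snippet], k: int = 3) -> list[Snippet]:
    # Keep only the top-k most relevant paragraphs, not full pages.
    return docs[:k]


def generate_with_citations(parsed: str, snippets: list[Snippet]) -> str:
    # The LLM may only state what was retrieved; each claim carries an
    # inline citation marker generated during generation.
    cited = "; ".join(f"{s.text} [{i + 1}]" for i, s in enumerate(snippets))
    return f"Answer to '{parsed}': {cited}"


def answer(query: str) -> str:
    parsed = parse_intent(query)
    docs = retrieve(parsed)
    snippets = extract_snippets(docs)
    return generate_with_citations(parsed, snippets)
```

Conversational refinement (stage 5) would wrap `answer` in a loop that threads prior turns back into `parse_intent`.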
Ranking Pipeline
Multi-stage ranking under tight latency budgets:
| Stage | Method | Purpose |
|---|---|---|
| Hybrid Retrieval | Dense (vector) + Sparse (BM25) | Cast a wide net with both semantic and keyword matching |
| Early Scoring | Lexical + embedding scorers | Fast candidate narrowing |
| Cross-Encoder Reranking | Transformer-based reranker | Final reranking using authority, freshness, and engagement signals |
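A minimal sketch of the hybrid-retrieval stage, assuming a simple linear score fusion (the actual fusion method and weights are not public). The sparse score here is a toy term-overlap stand-in for BM25, and `alpha` is an assumed tuning weight:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two dense embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_rank(query_vec, query_terms, docs, alpha=0.7):
    # docs: list of (doc_id, dense_vec, term_set).
    # Dense (semantic) and sparse (keyword) scores are fused linearly;
    # the top of this list would then feed the cross-encoder reranker.
    scored = []
    for doc_id, vec, terms in docs:
        dense = cosine(query_vec, vec)
        sparse = len(query_terms & terms) / max(len(query_terms), 1)
        scored.append((alpha * dense + (1 - alpha) * sparse, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

In production the early lexical/embedding scorers prune this candidate set before the expensive cross-encoder runs, keeping the pipeline inside its latency budget.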
Source Validation
Citations are not a post-processing step; they are tightly coupled with generation:
- Inline footnotes link to expandable snippets with title and favicon
- Claims map to source URLs during generation
- Cross-referencing: when 3+ authoritative sources independently confirm information, confidence increases
- Contradictions trigger additional verification; multiple perspectives are cited when legitimate disagreement exists
- Authority scoring, freshness signals, and credibility weighting applied to all sources
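The cross-referencing rule can be sketched as follows. This is a hypothetical simplification: `claim_confidence`, the 0.5 authority cutoff, and the labels are illustrative assumptions, not Perplexity's actual scoring:

```python
def claim_confidence(claims: dict[str, list[str]],
                     authority: dict[str, float],
                     threshold: int = 3) -> dict[str, str]:
    # claims: claim text -> list of source URLs supporting it.
    # authority: URL -> credibility score in [0, 1] (assumed scale).
    # Rule from the notes above: 3+ independent authoritative sources
    # confirming a claim raises confidence; otherwise verify further.
    out = {}
    for claim, urls in claims.items():
        independent = {u for u in urls if authority.get(u, 0.0) >= 0.5}
        out[claim] = "high" if len(independent) >= threshold else "verify"
    return out
```

A real system would also detect contradictions between sources and surface both perspectives rather than silently picking one.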
Deep Research
An agentic capability for complex research tasks:
- Iteratively searches, reads documents, and reasons about what to do next
- Performs dozens of searches automatically, reads hundreds of sources
- Delivers comprehensive reports in 2-4 minutes
- Achieved 21.1% on Humanity's Last Exam benchmark (highest among all models at time of release)
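The agentic search-read-reason loop can be sketched with injected callables. All four callables (`search`, `read`, `decide`, `synthesize`) are stubs standing in for the retrieval backend and the LLM; the structure, not the implementation, is the point:

```python
def deep_research(question, search, read, decide, synthesize, max_steps=30):
    # Agentic loop: search, read the results, then let the model (decide)
    # choose whether to refine the query or stop and write the report.
    notes = []
    query = question
    for _ in range(max_steps):
        for result in search(query):
            notes.append(read(result))
        action = decide(question, notes)
        if action["type"] == "stop":
            break
        query = action["query"]
    return synthesize(question, notes)
```

With dozens of iterations, each refining the query based on what was read so far, the loop accumulates hundreds of sources before synthesis.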
Sonar Models
Perplexity's proprietary model family:
- Base: Fine-tuned from LLaMA 3.3 70B for factuality and readability
- Multi-Token Prediction: MTP heads trained on Perplexity's datasets using 8xH100 nodes
- Speculative Decoding: Llama-1B draft model fine-tuned on the same dataset accelerates inference
- Speed: ~1,200 tokens/second via Cerebras inference infrastructure
- Inference Engine: ROSE, built in Python/PyTorch with Rust for performance-critical paths
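Speculative decoding, as used with the Llama-1B draft model, can be illustrated with a greedy toy version: the small draft model proposes `k` tokens cheaply, the large target model verifies them, and the matched prefix is accepted in one step. This is a generic sketch of the technique, not ROSE's implementation (which batches verification in a single target forward pass):

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_len=32):
    # draft_next / target_next: callables mapping a token context to the
    # next token (greedy). In practice these are the 1B draft model and
    # the full Sonar model respectively.
    out = list(prompt)
    while len(out) < max_len:
        # Draft model proposes k tokens autoregressively (cheap).
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target model verifies; accept the longest matching prefix.
        accepted, ctx = 0, list(out)
        for t in proposed:
            if target_next(ctx) == t:
                out.append(t)
                ctx.append(t)
                accepted += 1
            else:
                break
        if accepted < len(proposed):
            # On mismatch, take the target's own token, so every round
            # still makes at least one token of guaranteed progress.
            out.append(target_next(ctx))
    return out
```

When draft and target agree often, each round emits up to `k + 1` tokens for roughly the cost of one target step, which is how the draft model accelerates inference without changing the target's output.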
Patterns Used
| Pattern | How It's Used |
|---|---|
| RAG | Core five-stage retrieval-augmented generation pipeline |
| Citation | Inline source attribution tightly coupled with generation |
| Pipeline | Parse > Retrieve > Extract > Generate > Refine sequential flow |
| Router | Model-agnostic routing to Sonar or frontier models based on query |
| Guardrails | Cross-source validation and authority scoring |
| Streaming | Real-time token delivery for answers |