Perplexity
An answer engine that combines real-time web search with LLM generation through a five-stage RAG pipeline. Its orchestration layer is model-agnostic, routing across proprietary Sonar models and frontier LLMs. It indexes 200B+ URLs via Vespa.ai and serves 400M+ queries per month.
Architecture
Five-Stage RAG Pipeline
1. Query Intent Parsing — an LLM parses the user's intent at a semantic level, understanding context and nuance beyond keywords.
2. Live Web Retrieval — the parsed query hits Vespa.ai's real-time index. 200B+ unique URLs, tens of thousands of CPUs, 400+ petabytes of hot storage, tens of thousands of index updates per second. Also uses on-demand crawling for freshest results.
3. Snippet Extraction — rather than full pages, the system extracts the most relevant paragraphs. Documents are chunked so Vespa scores individual paragraphs by relevance.
4. Synthesized Answer with Citations — the curated context is passed to the LLM. The guiding principle: "you are not supposed to say anything that you didn't retrieve." Citations are generated during generation, not post-processed.
5. Conversational Refinement — conversation context enables follow-up questions with iterative searches.
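The five stages above can be sketched as a minimal sequential pipeline. Every function here is a stand-in stub (the names `parse_intent`, `retrieve`, `extract_snippets`, and `generate_with_citations` are illustrative, not Perplexity's API); the point is the data flow and the rule that the answer only draws on retrieved snippets, each carrying an inline citation.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    url: str
    text: str


def parse_intent(query: str) -> str:
    # Stand-in for the LLM intent parser; here it just normalizes the query.
    return query.strip().lower()


def retrieve(parsed: str) -> list[Snippet]:
    # Stand-in for the Vespa retrieval stage over the live index.
    return [Snippet("https://example.com/a", f"document about {parsed}")]


def extract_snippets(docs: list[Snippet], k: int = 3) -> list[Snippet]:
    # Keep only the top-k most relevant paragraphs, not full pages.
    return docs[:k]


def generate_with_citations(parsed: str, snippets: list[Snippet]) -> str:
    # The LLM may only state what was retrieved; each claim carries an
    # inline citation marker generated during generation.
    cited = "; ".join(f"{s.text} [{i + 1}]" for i, s in enumerate(snippets))
    return f"Answer to '{parsed}': {cited}"


def answer(query: str) -> str:
    parsed = parse_intent(query)
    docs = retrieve(parsed)
    snippets = extract_snippets(docs)
    return generate_with_citations(parsed, snippets)
```

Conversational refinement (stage 5) would wrap `answer` in a loop that threads prior turns back into `parse_intent`.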
Ranking Pipeline
Multi-stage ranking under tight latency budgets:
| Stage | Method | Purpose |
|---|---|---|
| Hybrid Retrieval | Dense (vector) + Sparse (BM25) | Cast a wide net with both semantic and keyword matching |
| Early Scoring | Lexical + embedding scorers | Fast candidate narrowing |
| Cross-Encoder Reranking | Transformer-based reranker | Final reranking using authority, freshness, and engagement signals |
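A minimal sketch of the hybrid-retrieval stage, assuming a simple linear score fusion (the actual fusion method and weights are not public). The sparse score here is a toy term-overlap stand-in for BM25, and `alpha` is an assumed tuning weight:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two dense embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_rank(query_vec, query_terms, docs, alpha=0.7):
    # docs: list of (doc_id, dense_vec, term_set).
    # Dense (semantic) and sparse (keyword) scores are fused linearly;
    # the top of this list would then feed the cross-encoder reranker.
    scored = []
    for doc_id, vec, terms in docs:
        dense = cosine(query_vec, vec)
        sparse = len(query_terms & terms) / max(len(query_terms), 1)
        scored.append((alpha * dense + (1 - alpha) * sparse, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

In production the early lexical/embedding scorers prune this candidate set before the expensive cross-encoder runs, keeping the pipeline inside its latency budget.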
Source Validation
Citations are not a post-processing step; they are tightly coupled with generation:
- Inline footnotes link to expandable snippets with title and favicon
- Claims map to source URLs during generation
- Cross-referencing: when 3+ authoritative sources independently confirm information, confidence increases
- Contradictions trigger additional verification; multiple perspectives are cited when legitimate disagreement exists
- Authority scoring, freshness signals, and credibility weighting applied to all sources
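The cross-referencing rule can be sketched as follows. This is a hypothetical simplification: `claim_confidence`, the 0.5 authority cutoff, and the labels are illustrative assumptions, not Perplexity's actual scoring:

```python
def claim_confidence(claims: dict[str, list[str]],
                     authority: dict[str, float],
                     threshold: int = 3) -> dict[str, str]:
    # claims: claim text -> list of source URLs supporting it.
    # authority: URL -> credibility score in [0, 1] (assumed scale).
    # Rule from the notes above: 3+ independent authoritative sources
    # confirming a claim raises confidence; otherwise verify further.
    out = {}
    for claim, urls in claims.items():
        independent = {u for u in urls if authority.get(u, 0.0) >= 0.5}
        out[claim] = "high" if len(independent) >= threshold else "verify"
    return out
```

A real system would also detect contradictions between sources and surface both perspectives rather than silently picking one.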
Deep Research
An agentic capability for complex research tasks:
- Iteratively searches, reads documents, and reasons about what to do next
- Performs dozens of searches automatically, reads hundreds of sources
- Delivers comprehensive reports in 2-4 minutes
- Achieved 21.1% on Humanity's Last Exam benchmark (highest among all models at time of release)
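The agentic search-read-reason loop can be sketched with injected callables. All four callables (`search`, `read`, `decide`, `synthesize`) are stubs standing in for the retrieval backend and the LLM; the structure, not the implementation, is the point:

```python
def deep_research(question, search, read, decide, synthesize, max_steps=30):
    # Agentic loop: search, read the results, then let the model (decide)
    # choose whether to refine the query or stop and write the report.
    notes = []
    query = question
    for _ in range(max_steps):
        for result in search(query):
            notes.append(read(result))
        action = decide(question, notes)
        if action["type"] == "stop":
            break
        query = action["query"]
    return synthesize(question, notes)
```

With dozens of iterations, each refining the query based on what was read so far, the loop accumulates hundreds of sources before synthesis.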
Sonar Models
Perplexity's proprietary model family:
- Base: Fine-tuned from LLaMA 3.3 70B for factuality and readability
- Multi-Token Prediction: MTP heads trained on Perplexity's datasets using 8xH100 nodes
- Speculative Decoding: Llama-1B draft model fine-tuned on the same dataset accelerates inference
- Speed: ~1,200 tokens/second via Cerebras inference infrastructure
- Inference Engine: ROSE, built in Python/PyTorch with Rust for performance-critical paths
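Speculative decoding, as used with the Llama-1B draft model, can be illustrated with a greedy toy version: the small draft model proposes `k` tokens cheaply, the large target model verifies them, and the matched prefix is accepted in one step. This is a generic sketch of the technique, not ROSE's implementation (which batches verification in a single target forward pass):

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_len=32):
    # draft_next / target_next: callables mapping a token context to the
    # next token (greedy). In practice these are the 1B draft model and
    # the full Sonar model respectively.
    out = list(prompt)
    while len(out) < max_len:
        # Draft model proposes k tokens autoregressively (cheap).
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target model verifies; accept the longest matching prefix.
        accepted, ctx = 0, list(out)
        for t in proposed:
            if target_next(ctx) == t:
                out.append(t)
                ctx.append(t)
                accepted += 1
            else:
                break
        if accepted < len(proposed):
            # On mismatch, take the target's own token, so every round
            # still makes at least one token of guaranteed progress.
            out.append(target_next(ctx))
    return out
```

When draft and target agree often, each round emits up to `k + 1` tokens for roughly the cost of one target step, which is how the draft model accelerates inference without changing the target's output.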
Patterns Used
| Pattern | How It's Used |
|---|---|
| RAG | Core five-stage retrieval-augmented generation pipeline |
| Citation | Inline source attribution tightly coupled with generation |
| Pipeline | Parse > Retrieve > Extract > Generate > Refine sequential flow |
| Router | Model-agnostic routing to Sonar or frontier models based on query |
| Guardrails | Cross-source validation and authority scoring |
| Streaming | Real-time token delivery for answers |