Perplexity

An answer engine that combines real-time web search with LLM generation through a five-stage RAG pipeline. Model-agnostic orchestration across proprietary Sonar models and frontier LLMs. Indexes 200B+ URLs via Vespa.ai. Serves 400M+ queries per month.


Architecture


Five-Stage RAG Pipeline

1. Query Intent Parsing — an LLM parses the user's intent at a semantic level, understanding context and nuance beyond keywords.

2. Live Web Retrieval — the parsed query hits Vespa.ai's real-time index. 200B+ unique URLs, tens of thousands of CPUs, 400+ petabytes of hot storage, tens of thousands of index updates per second. Also uses on-demand crawling for freshest results.

3. Snippet Extraction — rather than full pages, the system extracts the most relevant paragraphs. Documents are chunked so Vespa scores individual paragraphs by relevance.

4. Synthesized Answer with Citations — the curated context is passed to the LLM. The guiding principle: "you are not supposed to say anything that you didn't retrieve." Citations are produced inline during generation rather than added in post-processing.

5. Conversational Refinement — conversation context enables follow-up questions with iterative searches.
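The five stages above can be sketched as a single orchestration function. This is a minimal illustration, not Perplexity's implementation: the `parse_intent`, `retrieve`, `extract`, and `generate` components are hypothetical stand-ins injected by the caller, and the top-8 snippet cutoff is an arbitrary assumption.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    url: str
    text: str
    score: float

def answer(query: str, history: list[str],
           parse_intent, retrieve, extract, generate) -> str:
    """Run the five stages with injected components (all hypothetical)."""
    # 1. Query intent parsing: an LLM rewrites the query using
    #    conversation history, capturing semantic intent beyond keywords.
    intent = parse_intent(query, history)
    # 2. Live web retrieval against the real-time index.
    docs = retrieve(intent)
    # 3. Snippet extraction: score paragraphs and keep only the best.
    snippets = sorted(extract(docs), key=lambda s: s.score, reverse=True)[:8]
    # 4. Grounded generation with inline citations; the model should only
    #    state what appears in the retrieved snippets.
    return generate(intent, snippets)
```

Stage 5 falls out of the same loop: a follow-up question re-enters `answer` with the accumulated `history`, triggering fresh searches with prior context.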


Ranking Pipeline

Multi-stage ranking under tight latency budgets:

Stage | Method | Purpose
Hybrid Retrieval | Dense (vector) + sparse (BM25) | Cast a wide net with both semantic and keyword matching
Early Scoring | Lexical + embedding scorers | Fast candidate narrowing
Cross-Encoder Reranking | Transformer-based reranker | Final result sculpting using authority, freshness, and engagement signals
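The first two stages can be sketched as a score fusion over candidate documents. This is an illustrative linear blend, assuming a cosine dense score and a raw term-frequency stand-in for BM25; the actual scorers, weights, and cutoffs are not public.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_vec, query_terms, docs, alpha=0.5, top_k=100):
    """Fuse dense (semantic) and sparse (keyword) scores, keep top_k
    candidates for the downstream cross-encoder reranker."""
    scored = []
    for doc in docs:
        dense = cosine(query_vec, doc["vec"])                    # semantic match
        sparse = sum(doc["tf"].get(t, 0) for t in query_terms)   # keyword match (BM25 stand-in)
        scored.append((alpha * dense + (1 - alpha) * sparse, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

The surviving candidates would then be passed pairwise with the query through a transformer cross-encoder, which is far more accurate but too expensive to run over the full candidate set.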

Source Validation

Citations are not post-processing — they're tightly coupled with generation:

  • Inline footnotes link to expandable snippets with title and favicon
  • Claims map to source URLs during generation
  • Cross-referencing: when 3+ authoritative sources independently confirm information, confidence increases
  • Contradictions trigger additional verification; multiple perspectives are cited when legitimate disagreement exists
  • Authority scoring, freshness signals, and credibility weighting applied to all sources
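The cross-referencing rule above can be made concrete with a toy classifier. The 0.7 authority threshold and the stance labels are invented for illustration; only the "3+ confirming sources" rule and the contradiction-triggers-verification behavior come from the description above.

```python
def assess_claim(sources: list[dict], min_confirmations: int = 3) -> str:
    """Toy cross-referencing: each source is a dict with a 'stance'
    ('confirm' or 'contradict') and an 'authority' score in [0, 1]."""
    confirms = [s for s in sources
                if s["stance"] == "confirm" and s["authority"] >= 0.7]
    contradicts = [s for s in sources if s["stance"] == "contradict"]
    if contradicts and confirms:
        # Legitimate disagreement: trigger verification, cite both sides.
        return "needs_verification"
    if len(confirms) >= min_confirmations:
        # Independent confirmation by authoritative sources boosts confidence.
        return "high_confidence"
    return "low_confidence"
```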

Deep Research

An agentic capability for complex research tasks:

  • Iteratively searches, reads documents, and reasons about what to do next
  • Performs dozens of searches automatically, reads hundreds of sources
  • Delivers comprehensive reports in 2-4 minutes
  • Achieved 21.1% on Humanity's Last Exam benchmark (highest among all models at time of release)
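The search-read-reason loop above can be sketched as follows. The `search`, `read`, and `reason` callables are hypothetical stubs, and the step budget is an arbitrary assumption; the point is the control flow, where the reasoner decides after each round whether to search again or write the report.

```python
def deep_research(question, search, read, reason, max_steps=30):
    """Agentic loop sketch: search, read the results, then let the
    reasoner choose the next action until it is ready to report."""
    notes = []
    query = question
    for _ in range(max_steps):
        for result in search(query):
            notes.append(read(result))          # read each retrieved document
        action, payload = reason(question, notes)
        if action == "report":
            return payload                      # comprehensive report text
        query = payload                         # refined follow-up search
    # Step budget exhausted: fall back to reporting whatever was gathered.
    return "\n".join(notes)
```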

Sonar Models

Perplexity's proprietary model family:

  • Base: Fine-tuned from Llama 3.3 70B for factuality and readability
  • Multi-Token Prediction: MTP heads trained on Perplexity's datasets using 8xH100 nodes
  • Speculative Decoding: Llama-1B draft model fine-tuned on the same dataset accelerates inference
  • Speed: ~1,200 tokens/second via Cerebras inference infrastructure
  • Inference Engine: ROSE, built in Python/PyTorch with Rust for performance-critical paths
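Speculative decoding, the source of much of that speed, works by letting the small draft model propose several tokens cheaply while the large target model verifies them in a single pass. Below is a greedy sketch; `draft_next` and `verify` are hypothetical stand-ins for the Llama-1B draft and the Sonar target, and the acceptance logic is simplified (real implementations use probabilistic acceptance).

```python
def speculative_decode(prompt, draft_next, verify, draft_len=4, max_new=16):
    """Greedy speculative decoding sketch.
    draft_next(ctx) -> one draft token;
    verify(ctx, proposal) -> (accepted_prefix, correction) where correction
    is the target's token at the first disagreement, or None if the whole
    proposal is accepted. Both callables are hypothetical model stubs."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft model cheaply proposes draft_len tokens autoregressively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(out + proposal))
        # Target model verifies the whole proposal in one forward pass.
        accepted, correction = verify(out, proposal)
        out.extend(accepted)
        if correction is not None:
            out.append(correction)  # target's token replaces the rejected one
    return out[len(prompt):len(prompt) + max_new]
```

Because the same dataset was used to fine-tune both draft and target, the draft's proposals are accepted often, so several tokens are emitted per expensive target-model pass.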

Patterns Used

Pattern | How It's Used
RAG | Core five-stage retrieval-augmented generation pipeline
Citation | Inline source attribution tightly coupled with generation
Pipeline | Parse > Retrieve > Extract > Generate > Cite sequential flow
Router | Model-agnostic routing to Sonar or frontier models based on query
Guardrails | Cross-source validation and authority scoring
Streaming | Real-time token delivery for answers