Benchmarks

Understanding how we evaluate and measure long-context performance with industry-standard benchmarks.

NIAH (Needle In A Haystack)

The Needle In A Haystack (NIAH) benchmark is a fundamental test for evaluating a model's ability to retrieve specific information from long contexts. It simulates real-world scenarios where critical information is embedded within large documents.

How NIAH Works

1. Insert the "Needle": A specific piece of information (fact, statement, or data point) is inserted at a random position within a long context.
2. Create the "Haystack": The needle is surrounded by distractor text (usually unrelated documents or filler content) to create a long context ranging from 4K to 128K+ tokens.
3. Ask a Question: The model is queried about the needle information without being told where it's located.
4. Measure Success: Success is measured by whether the model can accurately retrieve and respond with the needle information.
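The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a standard harness: the function names, the sentence-level filler, and the substring-match scoring rule are all simplifying assumptions.

```python
import random

def build_niah_test(filler_sentences, needle, question, expected, seed=0):
    """Steps 1-2: insert the needle at a random position in the haystack."""
    rng = random.Random(seed)
    pos = rng.randint(0, len(filler_sentences))  # needle depth is random
    context = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return {"context": " ".join(context), "question": question, "expected": expected}

def score_niah(test_case, model_answer):
    """Step 4: count the retrieval as correct if the answer contains the needle value."""
    return test_case["expected"] in model_answer
```

In practice the filler would be real unrelated documents totalling 4K to 128K+ tokens, and scoring is often softened (e.g. normalized string match) to tolerate phrasing differences.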

Example NIAH Test

Needle (inserted at position ~50K tokens):

"The secret code to access the database is XJ-947-ALPHA-23."

Haystack:

[100K tokens of unrelated text about various topics: history, science, literature, etc.]

Question:

"What is the secret code to access the database?"

Expected Answer:

"XJ-947-ALPHA-23"

Why NIAH Matters

Tests Attention Mechanisms: Evaluates whether models can maintain attention across very long sequences.
Real-World Relevance: Mirrors common use cases like document Q&A, contract analysis, and codebase navigation.
Position Sensitivity: Reveals the "lost in the middle" phenomenon, in which retrieval accuracy drops for needles placed mid-context.
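Position sensitivity is typically measured by sweeping the needle's insertion depth from 0% to 100% of the context and recording accuracy at each depth. A minimal sketch of the case generation, where the depth grid and word-level insertion are illustrative assumptions:

```python
def depth_sweep_cases(filler_words, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Build one test context per needle depth (fraction of context length)."""
    cases = []
    for depth in depths:
        pos = int(round(depth * len(filler_words)))
        words = filler_words[:pos] + [needle] + filler_words[pos:]
        cases.append({"depth": depth, "context": " ".join(words)})
    return cases
```

Plotting accuracy against depth (and against total context length) produces the familiar NIAH heatmap; "lost in the middle" shows up as a dip around the 40-60% depths.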

NIAH with Datablocks

Datablocks excel at NIAH benchmarks because the compressed KV cache maintains attention over the entire context. Traditional approaches suffer from:

  • Attention degradation at longer contexts, on top of quadratic attention cost
  • Position bias (struggling with information in the middle)
  • High computational cost for repeated retrievals

With datablocks, you train once on the haystack and can perform unlimited needle retrieval queries efficiently.

MTOB (Multi-Token On Bench)

The Multi-Token On Bench (MTOB) benchmark evaluates a model's ability to process and reason over multiple related pieces of information spread across long contexts. Unlike NIAH, which tests retrieval of a single fact, MTOB measures multi-hop reasoning ability.

How MTOB Works

1. Distribute Information: Multiple related facts or data points are scattered throughout a long context at different positions.
2. Add Complexity: Each piece of information may be partial, requiring integration of multiple sources to answer questions.
3. Test Reasoning: Questions require synthesizing information from multiple locations, not just retrieving a single fact.
4. Measure Multi-Hop: Success depends on the ability to perform multi-hop reasoning across the long context.
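Case construction differs from NIAH only in step 1: several facts are inserted at distinct positions instead of one. A minimal sketch, with the function name and sentence-level filler as illustrative assumptions:

```python
import random

def build_mtob_test(filler_sentences, facts, question, seed=0):
    """Scatter related facts at distinct random positions in a long context.

    `facts` is a list of partial statements; answering `question` should
    require combining several of them (multi-hop reasoning).
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(filler_sentences) + 1), len(facts)))
    context = list(filler_sentences)
    # Insert from the back so earlier insertion points stay valid.
    for pos, fact in sorted(zip(positions, facts), reverse=True):
        context.insert(pos, fact)
    return {"context": " ".join(context), "question": question}
```

Scoring is harder than in NIAH: because the answer is synthesized rather than copied, evaluations usually check for the final derived value (or use a judge model) instead of a simple substring match.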

Example MTOB Test

Distributed Information:

Position 10K tokens: "Alice works in the engineering department."

Position 45K tokens: "The engineering department budget is $2.5 million."

Position 80K tokens: "Department budgets increased by 15% this year."

Multi-Hop Question:

"What was the engineering department's budget last year, and which department does Alice work in?"

Expected Reasoning:

1. Find Alice's department → Engineering (from position 10K)

2. Find Engineering's current budget → $2.5M (from position 45K)

3. Calculate last year's budget → $2.5M / 1.15 ≈ $2.17M (using info from position 80K)

Expected Answer:

"Alice works in the engineering department, which had a budget of approximately $2.17 million last year (before the 15% increase to the current $2.5 million)."

Key Differences from NIAH

Aspect          | NIAH                           | MTOB
----------------|--------------------------------|------------------------------
Information     | Single fact/needle             | Multiple distributed facts
Task Type       | Retrieval                      | Multi-hop reasoning
Complexity      | Simple lookup                  | Synthesis & reasoning
Context Length  | 4K - 128K tokens               | 8K - 200K+ tokens
Real-World Use  | Document search, fact checking | Analysis, summarization, Q&A

Why MTOB Matters

Tests Reasoning Ability: Goes beyond simple retrieval to measure true understanding and reasoning capabilities.
Enterprise Scenarios: Reflects complex real-world tasks like contract analysis, financial audits, and research synthesis.
Context Integration: Measures how well models maintain coherent understanding across long documents.

MTOB with Datablocks

Datablocks are particularly powerful for MTOB tasks because:

  • The KV cache preserves relationships between all tokens in the context
  • Multi-hop reasoning is accelerated since the entire context is already processed
  • You can ask multiple complex questions without reprocessing the context each time
  • Cost-effective for scenarios requiring repeated analysis of the same documents

Other Long-Context Benchmarks

LongBench

A comprehensive benchmark suite covering 16 tasks across 6 categories: single-doc QA, multi-doc QA, summarization, few-shot learning, code completion, and synthetic tasks. Context lengths range from 6K to 200K tokens.

RULER

Retrieval-based evaluation with tasks like multi-hop QA, aggregation, and variable tracking. Tests context lengths from 4K to 128K tokens with increasing difficulty.

ZeroSCROLLS

Zero-shot long-context understanding benchmark focusing on summarization, QA, and citation prediction tasks with contexts averaging 10K+ tokens.

L-Eval

Long-form evaluation with 20+ tasks including closed-book QA, open-domain QA, and mathematical reasoning. Context lengths range from 3K to 8K tokens.

For detailed benchmark results and comparisons, visit our benchmark comparison page.

Next Steps