Benchmarks
Understanding how we evaluate and measure long-context performance with industry-standard benchmarks.
NIAH (Needle In A Haystack)
The Needle In A Haystack (NIAH) benchmark is a fundamental test for evaluating a model's ability to retrieve specific information from long contexts. It simulates real-world scenarios where critical information is embedded within large documents.
How NIAH Works
A short, unique fact (the "needle") is inserted at a controlled depth inside a long body of unrelated text (the "haystack"). The model is then asked a question that can only be answered by retrieving the needle, and the test is repeated across context lengths and insertion depths to map where retrieval breaks down.
Example NIAH Test
Needle (inserted at position ~50K tokens):
"The secret code to access the database is XJ-947-ALPHA-23."
Haystack:
[100K tokens of unrelated text about various topics: history, science, literature, etc.]
Question:
"What is the secret code to access the database?"
Expected Answer:
"XJ-947-ALPHA-23"
Why NIAH Matters
Single-fact retrieval is the minimum bar for any long-context application: a model that cannot find one needle cannot be trusted with document search, fact checking, or compliance review over large inputs. NIAH also exposes position bias, since accuracy often drops for facts placed in the middle of the context.
NIAH with Datablocks
Datablocks excel at NIAH benchmarks because the compressed KV cache maintains attention over the entire context. Traditional approaches suffer from:
- Attention degradation at longer contexts, on top of quadratic compute cost
- Position bias: models struggle with information in the middle of the context (the "lost in the middle" effect)
- High computational cost for repeated retrievals
With datablocks, you train on the haystack once and can then run unlimited needle-retrieval queries efficiently.
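The cost argument can be made concrete with some back-of-the-envelope arithmetic. The model below is an illustration (it ignores per-query prompt and output tokens), not a measured figure:

```python
def context_tokens_processed(context_len: int, num_queries: int,
                             cached: bool) -> int:
    """Context tokens prefilled across all queries.

    Without a reusable cache, the full haystack is reprocessed for
    every query; with a compressed KV cache it is processed once
    up front. (Illustrative cost model only.)"""
    return context_len if cached else context_len * num_queries

# 100K-token haystack, 50 needle-retrieval queries:
print(context_tokens_processed(100_000, 50, cached=False))  # 5000000
print(context_tokens_processed(100_000, 50, cached=True))   # 100000
```

The gap grows linearly with the number of queries, which is why reuse matters most for repeated retrieval over the same document.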
MTOB (Multi-Token On Bench)
The Multi-Token On Bench (MTOB) benchmark evaluates a model's ability to process and reason over multiple related pieces of information spread across a long context. Unlike NIAH, which tests single-fact retrieval, MTOB measures multi-hop reasoning.
How MTOB Works
Several related facts are distributed at different positions across a long context. The model must locate each fact and combine them, often performing arithmetic or other inference along the way, to answer a single question.
Example MTOB Test
Distributed Information:
Position 10K tokens: "Alice works in the engineering department."
Position 45K tokens: "The engineering department budget is $2.5 million."
Position 80K tokens: "Department budgets increased by 15% this year."
Multi-Hop Question:
"What was the engineering department's budget last year, and which department does Alice work in?"
Expected Reasoning:
1. Find Alice's department → Engineering (from position 10K)
2. Find Engineering's current budget → $2.5M (from position 45K)
3. Calculate last year's budget → $2.5M / 1.15 ≈ $2.17M (using info from position 80K)
Expected Answer:
"Alice works in the engineering department, which had a budget of approximately $2.17 million last year (before the 15% increase to the current $2.5 million)."
Key Differences from NIAH
| Aspect | NIAH | MTOB |
|---|---|---|
| Information | Single fact/needle | Multiple distributed facts |
| Task Type | Retrieval | Multi-hop reasoning |
| Complexity | Simple lookup | Synthesis & reasoning |
| Context Length | 4K - 128K tokens | 8K - 200K+ tokens |
| Real-World Use | Document search, fact checking | Analysis, summarization, Q&A |
Why MTOB Matters
Most real workloads, such as analyzing a contract, summarizing a report, or answering questions about a codebase, require combining facts scattered across a document rather than finding just one of them. A model can score perfectly on NIAH and still fail multi-hop tasks, so MTOB is a stronger signal of practical long-context ability.
MTOB with Datablocks
Datablocks are particularly powerful for MTOB tasks because:
- The KV cache preserves relationships between all tokens in the context
- Multi-hop reasoning is accelerated since the entire context is already processed
- You can ask multiple complex questions without reprocessing the context each time
- Cost-effective for scenarios requiring repeated analysis of the same documents
Other Long-Context Benchmarks
LongBench
A comprehensive benchmark suite covering 16 tasks across 6 categories: single-doc QA, multi-doc QA, summarization, few-shot learning, code completion, and synthetic tasks. Context lengths range from 6K to 200K tokens.
RULER
Retrieval-based evaluation with tasks like multi-hop QA, aggregation, and variable tracking. Tests context lengths from 4K to 128K tokens with increasing difficulty.
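One of RULER's task families, variable tracking, can be sketched as a simple generator. The format below is an illustration of the idea, not RULER's actual data pipeline:

```python
import random
import string

def variable_tracking_task(num_hops: int, seed: int = 0):
    """Generate a variable-tracking chain in the spirit of RULER.

    Produces one assignment like "VAR QKZT = 71205" followed by
    num_hops - 1 aliasing hops ("VAR MWPA = QKZT"), plus a question
    that requires resolving the final alias back to the value."""
    rng = random.Random(seed)
    names: list[str] = []
    while len(names) < num_hops:  # draw unique random variable names
        name = "".join(rng.choices(string.ascii_uppercase, k=4))
        if name not in names:
            names.append(name)
    value = rng.randint(10_000, 99_999)
    lines = [f"VAR {names[0]} = {value}"]
    lines += [f"VAR {cur} = {prev}" for prev, cur in zip(names, names[1:])]
    question = f"What is the value of {names[-1]}?"
    return "\n".join(lines), question, value

chain, question, answer = variable_tracking_task(num_hops=5)
```

In the real benchmark, the assignment lines are scattered through long filler text, so the model must both retrieve each hop and chain them together.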
ZeroSCROLLS
Zero-shot long-context understanding benchmark focusing on summarization, QA, and citation prediction tasks with contexts averaging 10K+ tokens.
L-Eval
Long-form evaluation with 20+ tasks including closed-book QA, open-domain QA, and mathematical reasoning. Context lengths range from 3K to 8K tokens.
For detailed benchmark results and comparisons, visit our benchmark comparison page.
Next Steps
Try Datablocks
Get started with datablocks to see how they perform on long-context tasks.
View Benchmarks
Compare performance across different benchmarks and models.
Best Practices
Learn how to optimize your datablocks for specific use cases.
Test in Playground
Experiment with long-context queries in our interactive playground.