Using Datablocks for Inference
Learn how to use your trained datablocks to achieve fast, cost-effective long-context inference.
Overview
Once you've trained a datablock, you can use it across millions of queries without reprocessing the original context. Simply reference the datablock ID in your inference requests, and the pre-computed KV cache is loaded for you instead of the context being processed from scratch.
Key Benefits
- First load: takes 2-5 seconds to load the KV cache into GPU memory
- Subsequent queries: near-instant context loading (milliseconds)
- Cost savings: pay datablock rates instead of processing 100K+ input tokens on every request
Basic Inference with Datablocks
Use the chat completions API with the datablocks parameter:
```python
import requests

API_KEY = "your-api-key"  # replace with your API key

response = requests.post(
    "/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "qwen",
        "messages": [
            {
                "role": "user",
                "content": "What are the key findings about neural scaling laws?"
            }
        ],
        "datablocks": [
            {
                "id": "your-datablock-id",
                "source": "wandb"  # or "local"
            }
        ]
    }
)

answer = response.json()["choices"][0]["message"]["content"]
print(answer)
```
OpenAI-Compatible API
Our API follows the OpenAI chat completions format with the addition of the datablocks parameter. This makes it easy to integrate into existing applications.
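If you already use the official OpenAI Python client, you can point it at the same endpoint instead of hand-rolling requests. The following is a minimal sketch: the base URL is a placeholder for your deployment, the API key reuses the `API_KEY` value from the example above, and it assumes the extra datablocks field can be forwarded through the client's `extra_body` parameter.

```python
from openai import OpenAI

# Placeholder base URL; substitute your deployment's /api/v1 endpoint.
client = OpenAI(api_key=API_KEY, base_url="https://your-deployment.example.com/api/v1")

response = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "user", "content": "What are the key findings about neural scaling laws?"}
    ],
    # extra_body forwards non-standard fields; we assume the server accepts the
    # same datablocks list shown in the requests example above.
    extra_body={
        "datablocks": [
            {"id": "your-datablock-id", "source": "wandb"}
        ]
    },
)
print(response.choices[0].message.content)
```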
Using Multiple Datablocks
You can reference multiple datablocks in a single request to combine context from different sources:
```python
response = requests.post(
    "/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "qwen",
        "messages": [
            {
                "role": "user",
                "content": "Compare the findings from both research papers."
            }
        ],
        "datablocks": [
            {
                "id": "research-paper-1",
                "source": "wandb"
            },
            {
                "id": "research-paper-2",
                "source": "wandb"
            }
        ]
    }
)
```
The model will have access to the full context from all datablocks, allowing it to reason across multiple documents efficiently.
Caching and Performance
Server-Side Caching
Datablocks are automatically cached on our servers after the first load. Subsequent requests using the same datablock will benefit from near-instant loading times.
Cache Retention
Frequently used datablocks stay in the cache longer. Cache retention is based on:
- Usage frequency (more frequent = longer retention)
- Datablock size (smaller datablocks are easier to cache)
- Available GPU memory across the cluster
Optimizing for Performance
For production workloads with consistent traffic, your datablocks will remain cached, delivering consistently fast inference times. Consider batching related queries to maximize cache hits.
Inference Pricing
| Component | Cost | Notes |
|---|---|---|
| Datablock load (first time) | $0.001 | One-time per cache session |
| Datablock load (cached) | $0.0001 | Near-instant loading |
| User query tokens | Standard input rate | Only your prompt, not the context |
| Output tokens | Standard output rate | Same as normal inference |
Cost Savings Example
Comparing traditional vs. datablock-based inference for 1000 queries on a 100K-token document:
Traditional Approach
100K tokens × 1000 queries × $0.60/1M = $60.00
With Datablocks
$0.001 first load + (999 × $0.0001 cached loads) ≈ $0.10
99.8% cost reduction on context processing
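For planning, the same arithmetic as a small helper; the per-unit prices mirror the table above, and the $0.60/1M input rate is simply the figure used in this example.

```python
def context_cost(queries: int, context_tokens: int,
                 input_rate_per_m: float = 0.60,
                 first_load: float = 0.001,
                 cached_load: float = 0.0001) -> tuple[float, float]:
    """Return (traditional, datablock) context-processing cost in dollars."""
    traditional = context_tokens * queries * input_rate_per_m / 1_000_000
    with_datablocks = first_load + (queries - 1) * cached_load
    return traditional, with_datablocks

traditional, with_datablocks = context_cost(queries=1000, context_tokens=100_000)
print(f"traditional: ${traditional:.2f}, datablocks: ${with_datablocks:.2f}")
# traditional: $60.00, datablocks: $0.10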
Advanced Usage Patterns
Batch Processing
Process multiple queries against the same datablock in rapid succession:
- First query loads the datablock into the cache
- Subsequent queries use the cached version
- Maximize throughput and minimize costs (see the sketch below)
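A minimal batching sketch, reusing the `requests` import, headers, and hypothetical datablock ID from the earlier examples; the first call pays the cold-start load and the remaining calls hit the warm cache.

```python
questions = [
    "Summarize the methodology.",
    "Which datasets were used?",
    "List the main limitations.",
]

answers = []
for question in questions:
    response = requests.post(
        "/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        json={
            "model": "qwen",
            "messages": [{"role": "user", "content": question}],
            # Reusing one datablock across the batch keeps it warm in the cache.
            "datablocks": [{"id": "your-datablock-id", "source": "wandb"}],
        },
    )
    answers.append(response.json()["choices"][0]["message"]["content"])
```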
Multi-Turn Conversations
Maintain conversation history while using datablocks:
- Include previous messages in the messages array
- The datablock provides the base context
- Conversation history adds query-specific context (see the example below)
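A multi-turn sketch under the same assumptions as the examples above: the running history travels in `messages`, while the datablock supplies the document context on every turn.

```python
history = []

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    response = requests.post(
        "/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        json={
            "model": "qwen",
            "messages": history,  # prior turns plus the new question
            "datablocks": [{"id": "your-datablock-id", "source": "wandb"}],
        },
    )
    answer = response.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What problem does the paper address?"))
print(ask("How does that connect to its second contribution?"))
```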
Streaming Responses
Stream responses for a real-time user experience:
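A streaming sketch, assuming the endpoint supports an OpenAI-style `stream` flag and returns server-sent events (`data:` lines ending in `[DONE]`); adapt the parsing to whatever your deployment actually emits.

```python
import json

with requests.post(
    "/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": "Summarize the key findings."}],
        "datablocks": [{"id": "your-datablock-id", "source": "wandb"}],
        "stream": True,  # assumed to behave like the OpenAI streaming flag
    },
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```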
Dynamic Datablock Selection
Choose datablocks based on user queries:
- Route queries to relevant datablocks
- Combine multiple datablocks as needed
- Build intelligent retrieval systems (a simple routing sketch follows)
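A deliberately simple routing sketch: keyword matching decides which hypothetical datablock IDs to attach, and a production system might use embeddings or a classifier instead.

```python
# Hypothetical topic-to-datablock mapping.
DATABLOCKS = {
    "scaling": {"id": "research-paper-1", "source": "wandb"},
    "alignment": {"id": "research-paper-2", "source": "wandb"},
}

def select_datablocks(query: str) -> list:
    matched = [block for topic, block in DATABLOCKS.items() if topic in query.lower()]
    # Fall back to attaching everything when no topic keyword matches.
    return matched or list(DATABLOCKS.values())

query = "How do the scaling results compare across model sizes?"
response = requests.post(
    "/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": query}],
        "datablocks": select_datablocks(query),
    },
)
```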
Inference Best Practices
Optimize for Caching
- Reuse the same datablocks across multiple queries
- Batch related queries together to maintain cache hits
- For production apps, maintain consistent traffic patterns
Query Optimization
- Keep user prompts concise; you only pay for prompt tokens
- Let the datablock provide the context automatically
- Use specific questions to get targeted answers
Error Handling
- Handle cold-start delays gracefully (the first load may take 2-5 seconds)
- Implement retry logic for transient failures (see the sketch below)
- Monitor datablock availability and status
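A retry sketch, assuming transient failures surface as ordinary HTTP errors, timeouts, or connection errors; the backoff values are illustrative, and the generous timeout leaves room for a 2-5 second cold-start load.

```python
import time

def query_with_retry(payload: dict, max_attempts: int = 3) -> dict:
    """POST a chat completion, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                "/api/v1/chat/completions",
                headers={"Authorization": f"Bearer {API_KEY}",
                         "Content-Type": "application/json"},
                json=payload,
                timeout=30,  # covers the 2-5s cold-start load comfortably
            )
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```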