Using Datablocks for Inference
Learn how to use your trained datablocks to achieve fast, cost-effective long-context inference.
Overview
Once you've trained a datablock, you can use it across millions of queries without reprocessing the original context. Simply reference the datablock ID in your inference requests, and the pre-computed KV cache is loaded for you instead of the context being processed from scratch.
Key Benefits
- First load: takes 2-5 seconds to load the KV cache into GPU memory
- Subsequent queries: near-instant context loading (milliseconds)
- Cost savings: pay datablock rates instead of processing 100K+ input tokens on every request
Basic Inference with Datablocks
Use the chat completions API with the datablocks parameter:
```python
import requests

API_KEY = "your-api-key"  # replace with your API key

response = requests.post(
    "/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "qwen",
        "messages": [
            {
                "role": "user",
                "content": "What are the key findings about neural scaling laws?"
            }
        ],
        "datablocks": [
            {
                "id": "your-datablock-id",
                "source": "wandb"  # or "local"
            }
        ]
    }
)

answer = response.json()["choices"][0]["message"]["content"]
print(answer)
```
OpenAI-Compatible API
Our API follows the OpenAI chat completions format with the addition of the datablocks parameter. This makes it easy to integrate into existing applications.
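If you already use the official OpenAI Python client, you can point it at the same endpoint instead of hand-rolling requests. The following is a minimal sketch: the base URL is a placeholder for your deployment, the API key reuses the `API_KEY` value from the example above, and it assumes the extra datablocks field can be forwarded through the client's `extra_body` parameter.

```python
from openai import OpenAI

# Placeholder base URL; substitute your deployment's /api/v1 endpoint.
client = OpenAI(api_key=API_KEY, base_url="https://your-deployment.example.com/api/v1")

response = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "user", "content": "What are the key findings about neural scaling laws?"}
    ],
    # extra_body forwards non-standard fields; we assume the server accepts the
    # same datablocks list shown in the requests example above.
    extra_body={
        "datablocks": [
            {"id": "your-datablock-id", "source": "wandb"}
        ]
    },
)
print(response.choices[0].message.content)
```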
Using Multiple Datablocks
You can reference multiple datablocks in a single request to combine context from different sources:
```python
response = requests.post(
    "/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "qwen",
        "messages": [
            {
                "role": "user",
                "content": "Compare the findings from both research papers."
            }
        ],
        "datablocks": [
            {
                "id": "research-paper-1",
                "source": "wandb"
            },
            {
                "id": "research-paper-2",
                "source": "wandb"
            }
        ]
    }
)
```
The model will have access to the full context from all datablocks, allowing it to reason across multiple documents efficiently.
Caching and Performance
Server-Side Caching
Datablocks are automatically cached on our servers after the first load. Subsequent requests using the same datablock will benefit from near-instant loading times.
Cache Retention
Frequently used datablocks stay in the cache longer. Cache retention is based on:
- Usage frequency (more frequent = longer retention)
- Datablock size (smaller datablocks are easier to cache)
- Available GPU memory across the cluster
Optimizing for Performance
For production workloads with consistent traffic, your datablocks will remain cached, delivering consistently fast inference times. Consider batching related queries to maximize cache hits.
Inference Pricing
| Component | Cost | Notes |
|---|---|---|
| Datablock load (first time) | $0.001 | One-time per cache session |
| Datablock load (cached) | $0.0001 | Near-instant loading |
| User query tokens | Standard input rate | Only your prompt, not the context |
| Output tokens | Standard output rate | Same as normal inference |
Cost Savings Example
Comparing traditional vs. datablock-based inference for 1000 queries on a 100K-token document:
Traditional Approach
100K tokens × 1000 queries × $0.60/1M = $60.00
With Datablocks
$0.001 first load + (999 × $0.0001 cached loads) ≈ $0.10
99.8% cost reduction on context processing
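For planning, the same arithmetic as a small helper; the per-unit prices mirror the table above, and the $0.60/1M input rate is simply the figure used in this example.

```python
def context_cost(queries: int, context_tokens: int,
                 input_rate_per_m: float = 0.60,
                 first_load: float = 0.001,
                 cached_load: float = 0.0001) -> tuple[float, float]:
    """Return (traditional, datablock) context-processing cost in dollars."""
    traditional = context_tokens * queries * input_rate_per_m / 1_000_000
    with_datablocks = first_load + (queries - 1) * cached_load
    return traditional, with_datablocks

traditional, with_datablocks = context_cost(queries=1000, context_tokens=100_000)
print(f"traditional: ${traditional:.2f}, datablocks: ${with_datablocks:.2f}")
# traditional: $60.00, datablocks: $0.10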
Advanced Usage Patterns
Batch Processing
Process multiple queries against the same datablock in rapid succession:
- First query loads the datablock into the cache
- Subsequent queries use the cached version
- Maximize throughput and minimize costs (see the sketch below)
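A minimal batching sketch, reusing the `requests` import, headers, and hypothetical datablock ID from the earlier examples; the first call pays the cold-start load and the remaining calls hit the warm cache.

```python
questions = [
    "Summarize the methodology.",
    "Which datasets were used?",
    "List the main limitations.",
]

answers = []
for question in questions:
    response = requests.post(
        "/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        json={
            "model": "qwen",
            "messages": [{"role": "user", "content": question}],
            # Reusing one datablock across the batch keeps it warm in the cache.
            "datablocks": [{"id": "your-datablock-id", "source": "wandb"}],
        },
    )
    answers.append(response.json()["choices"][0]["message"]["content"])
```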
Multi-Turn Conversations
Maintain conversation history while using datablocks:
- Include previous messages in the messages array
- The datablock provides the base context
- Conversation history adds query-specific context (see the example below)
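A multi-turn sketch under the same assumptions as the examples above: the running history travels in `messages`, while the datablock supplies the document context on every turn.

```python
history = []

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    response = requests.post(
        "/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        json={
            "model": "qwen",
            "messages": history,  # prior turns plus the new question
            "datablocks": [{"id": "your-datablock-id", "source": "wandb"}],
        },
    )
    answer = response.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What problem does the paper address?"))
print(ask("How does that connect to its second contribution?"))
```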
Streaming Responses
Stream responses for a real-time user experience:
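A streaming sketch, assuming the endpoint supports an OpenAI-style `stream` flag and returns server-sent events (`data:` lines ending in `[DONE]`); adapt the parsing to whatever your deployment actually emits.

```python
import json

with requests.post(
    "/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": "Summarize the key findings."}],
        "datablocks": [{"id": "your-datablock-id", "source": "wandb"}],
        "stream": True,  # assumed to behave like the OpenAI streaming flag
    },
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```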
Dynamic Datablock Selection
Choose datablocks based on user queries:
- Route queries to relevant datablocks
- Combine multiple datablocks as needed
- Build intelligent retrieval systems (a simple routing sketch follows)
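A deliberately simple routing sketch: keyword matching decides which hypothetical datablock IDs to attach, and a production system might use embeddings or a classifier instead.

```python
# Hypothetical topic-to-datablock mapping.
DATABLOCKS = {
    "scaling": {"id": "research-paper-1", "source": "wandb"},
    "alignment": {"id": "research-paper-2", "source": "wandb"},
}

def select_datablocks(query: str) -> list:
    matched = [block for topic, block in DATABLOCKS.items() if topic in query.lower()]
    # Fall back to attaching everything when no topic keyword matches.
    return matched or list(DATABLOCKS.values())

query = "How do the scaling results compare across model sizes?"
response = requests.post(
    "/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": query}],
        "datablocks": select_datablocks(query),
    },
)
```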
Inference Best Practices
Optimize for Caching
- Reuse the same datablocks across multiple queries
- Batch related queries together to maintain cache hits
- For production apps, maintain consistent traffic patterns
Query Optimization
- Keep user prompts concise; you only pay for prompt tokens
- Let the datablock provide the context automatically
- Use specific questions to get targeted answers
Error Handling
- Handle cold-start delays gracefully (the first load may take 2-5 seconds)
- Implement retry logic for transient failures (see the sketch below)
- Monitor datablock availability and status
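A retry sketch, assuming transient failures surface as ordinary HTTP errors, timeouts, or connection errors; the backoff values are illustrative, and the generous timeout leaves room for a 2-5 second cold-start load.

```python
import time

def query_with_retry(payload: dict, max_attempts: int = 3) -> dict:
    """POST a chat completion, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                "/api/v1/chat/completions",
                headers={"Authorization": f"Bearer {API_KEY}",
                         "Content-Type": "application/json"},
                json=payload,
                timeout=30,  # covers the 2-5s cold-start load comfortably
            )
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```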