# Chat Completions API
Generate chat completions using datablocks for efficient long-context inference.
## Overview
The Chat Completions API allows you to generate responses using language models augmented with datablocks. Datablocks are pre-computed KV caches that store large amounts of context (documents, code repositories, conversations) in a compact form, enabling 26× faster inference while maintaining quality.
### What are Datablocks?

Datablocks are lightweight KV-cache representations of large text corpora, trained using a self-study approach. Instead of passing thousands of tokens of context on every request, you load a datablock once and reuse it for all subsequent queries.
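As a minimal sketch of the load-once, query-many pattern, assuming the service exposes a `/v1/chat/completions` endpoint over HTTP (the host, path, and datablock ID below are illustrative, not part of the documented API):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # illustrative host and path


def build_payload(question: str, datablock_id: str) -> dict:
    """Assemble the JSON body for a datablock-backed chat completion."""
    return {
        "messages": [{"role": "user", "content": question}],
        # The datablock stands in for thousands of tokens of inline context.
        "datablocks": [{"id": datablock_id}],
    }


def chat(question: str, datablock_id: str) -> dict:
    """POST the request and return the decoded JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question, datablock_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the context lives in the datablock rather than in `messages`, the request body stays small no matter how large the underlying corpus is.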
## Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Array of message objects with role and content |
| datablocks | array | Yes | Array of datablock objects to load (must specify at least one) |
| model | string | No | Model identifier (default: "default") |
| max_tokens | integer | No | Maximum tokens to generate (default: 256) |
| temperature | number | No | Sampling temperature 0.0-2.0 (default: 0.0) |
| stream | boolean | No | Stream responses (default: false) |
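Putting the table together, a complete request body exercising every top-level parameter might look like the following (the message contents and datablock ID are illustrative; the defaults match the table above):

```python
import json

# Illustrative request body using every top-level parameter.
request_body = {
    "messages": [
        {"role": "system", "content": "Answer using the loaded codebase."},
        {"role": "user", "content": "Where is rate limiting implemented?"},
    ],
    "datablocks": [{"id": "alice/myrepo/run42"}],  # at least one is required
    "model": "default",   # optional; "default" is the default
    "max_tokens": 256,    # optional; default 256
    "temperature": 0.0,   # optional; 0.0-2.0, default 0.0
    "stream": False,      # optional; default false
}

body_json = json.dumps(request_body)
```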
### Datablock Object
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Datablock identifier (e.g., "username/project/run_id") |
| source | string | No | Source location: "wandb", "huggingface", or "local" (default: "wandb") |
| force_redownload | boolean | No | Force re-downloading the datablock (default: false) |
## Best Practices

### Reuse Datablocks
Datablocks are cached on the server. Once loaded, subsequent requests using the same datablock ID will be significantly faster.
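A sketch of the reuse pattern: several payloads reference the same datablock ID, so only the first request pays the load cost and the rest hit the server-side cache (the ID and questions are illustrative; only payload construction is shown):

```python
# Loaded once on the first request, then served from the server cache.
DATABLOCK = {"id": "alice/myrepo/run42"}

questions = [
    "Summarize the project structure.",
    "Which files handle authentication?",
    "List the external dependencies.",
]

# Every payload points at the same datablock ID.
payloads = [
    {"messages": [{"role": "user", "content": q}], "datablocks": [DATABLOCK]}
    for q in questions
]
```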
### Choose an Appropriate `max_tokens`

Set `max_tokens` to match your use case: shorter responses are faster and more cost-effective.
### Multiple Datablocks
You can load multiple datablocks in a single request to combine context from different sources.
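For example, a single request can combine a documentation datablock with a code-repository datablock (both IDs are hypothetical):

```python
# One request drawing context from two datablocks.
payload = {
    "messages": [
        {"role": "user", "content": "Compare the API docs with the source code."}
    ],
    "datablocks": [
        {"id": "alice/docs/run_2024"},                    # product documentation
        {"id": "alice/myrepo/run42", "source": "local"},  # code repository
    ],
}
```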