Chat Completions API

Generate chat completions using datablocks for efficient long-context inference.

POST /api/v1/chat/completions

Overview

The Chat Completions API generates responses using language models augmented with datablocks. Datablocks are pre-computed KV caches that store large amounts of context (documents, code repositories, conversations) in compact form, enabling 26× faster inference while maintaining quality.

What are Datablocks?

Datablocks are lightweight KV cache representations of large text corpora, trained using a self-study approach. Instead of passing thousands of tokens of context on every request, you load a datablock once and reuse it for all subsequent queries.
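Putting this together, a first request might look like the sketch below. It builds the JSON body described on this page; the server URL and the datablock id are hypothetical placeholders, and the actual POST is shown only as a comment.

```python
import json

# Hypothetical request body for POST /api/v1/chat/completions.
# "acme/reports/run_42" is a placeholder datablock id, not a real one.
payload = {
    "messages": [
        {"role": "user", "content": "Summarize the key findings."}
    ],
    "datablocks": [
        {"id": "acme/reports/run_42"}  # loaded once, cached for later requests
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}

body = json.dumps(payload)

# To send it (assuming a server at localhost:8000):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/api/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# response = urllib.request.urlopen(req)
```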

Request Body

| Parameter | Type | Required | Description |
|---|---|---|---|
| `messages` | array | Yes | Array of message objects with `role` and `content` |
| `datablocks` | array | Yes | Array of datablock objects to load (must specify at least one) |
| `model` | string | No | Model identifier (default: `"default"`) |
| `max_tokens` | integer | No | Maximum tokens to generate (default: `256`) |
| `temperature` | number | No | Sampling temperature, `0.0`–`2.0` (default: `0.0`) |
| `stream` | boolean | No | Stream responses (default: `false`) |

Datablock Object

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Datablock identifier (e.g., `"username/project/run_id"`) |
| `source` | string | No | Source location: `"wandb"`, `"huggingface"`, or `"local"` (default: `"wandb"`) |
| `force_redownload` | boolean | No | Force re-downloading the datablock (default: `false`) |
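As a concrete illustration, here is a datablock object with every optional field spelled out at its documented default (the id itself is a made-up example):

```python
# A datablock object with all fields set explicitly.
# The id follows the "username/project/run_id" pattern; this one is fictional.
datablock = {
    "id": "alice/legal-corpus/run_7",
    "source": "wandb",           # "wandb" (default), "huggingface", or "local"
    "force_redownload": False,   # default; set True to bypass the server cache
}
```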

Best Practices

Reuse Datablocks

Datablocks are cached on the server. Once loaded, subsequent requests using the same datablock ID will be significantly faster.
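One way to take advantage of this, sketched below assuming a simple JSON-over-HTTP client, is to fix the datablock list once and vary only the messages; only the first request pays the loading cost.

```python
import json

# Hypothetical datablock id; the server loads it once, then serves from cache.
DATABLOCKS = [{"id": "alice/legal-corpus/run_7"}]

def build_request(question: str, max_tokens: int = 256) -> str:
    """Build a chat-completions body that reuses the same cached datablock."""
    return json.dumps({
        "messages": [{"role": "user", "content": question}],
        "datablocks": DATABLOCKS,
        "max_tokens": max_tokens,
    })

# Every call references the same datablock id, so requests after the first
# skip the load step entirely.
first = build_request("What does clause 4 cover?")
second = build_request("List the termination conditions.")
```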

Choose Appropriate max_tokens

Set max_tokens based on your use case. Shorter responses are faster and more cost-effective.

Multiple Datablocks

You can load multiple datablocks in a single request to combine context from different sources.
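For example (both ids are placeholders), the `datablocks` array simply lists each source to combine:

```python
import json

# Combining context from two sources in one request.
# Both ids are hypothetical; "source" defaults to "wandb" when omitted.
payload = {
    "messages": [
        {"role": "user", "content": "Compare the two documents."}
    ],
    "datablocks": [
        {"id": "alice/legal-corpus/run_7"},
        {"id": "alice/financials/run_3", "source": "huggingface"},
    ],
}

body = json.dumps(payload)
```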

Related Documentation