Prerequisites
- SDK installed and configured (see Authentication)
- A data plane ID where inference will run
- An API key with appropriate permissions
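
Before sending any request, it can help to confirm that the data plane ID and API key from the list above are actually present. The sketch below assumes they arrive as environment-style variables. The names `INFERENCE_DATA_PLANE_ID` and `INFERENCE_API_KEY`, the `InferenceSettings` shape, and the `loadSettings` helper are hypothetical and not part of the SDK.

```typescript
// Hypothetical settings shape; the SDK may expect different names.
interface InferenceSettings {
  dataPlaneId: string;
  apiKey: string;
}

// Read both prerequisites from a plain key/value map (for example process.env)
// and fail early with a clear message if either one is missing.
function loadSettings(env: Record<string, string | undefined>): InferenceSettings {
  const dataPlaneId = env["INFERENCE_DATA_PLANE_ID"]; // assumed variable name
  const apiKey = env["INFERENCE_API_KEY"]; // assumed variable name
  const missing: string[] = [];
  if (!dataPlaneId) missing.push("INFERENCE_DATA_PLANE_ID");
  if (!apiKey) missing.push("INFERENCE_API_KEY");
  if (!dataPlaneId || !apiKey) {
    throw new Error(`Missing required settings: ${missing.join(", ")}`);
  }
  return { dataPlaneId, apiKey };
}
```

Failing at startup keeps a missing credential from surfacing later as a harder-to-diagnose request error.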
Basic inference request
Submit a simple inference request with a system prompt and user message.
Tracking job completion
Inference jobs are asynchronous. Poll for completion using the job ID.
Configuring inference parameters
Fine-tune the model’s behavior with configuration options:

| Parameter | Effect | Typical Values |
|---|---|---|
| temperature | Controls randomness. Lower = more deterministic | 0.0-0.3 for factual, 0.7-1.0 for creative |
| max_tokens | Limits response length | 100-4000 depending on task |
| top_p | Nucleus sampling threshold | 0.9-1.0 for most cases |
| stop_sequences | Strings that end generation | ["\n\n", "END"] |
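
The typical values in the table can be captured as reusable presets. This is a minimal sketch: the parameter names come from the table, while the `InferenceConfig` type, the two presets, and the `buildConfig` helper are assumptions for illustration, and the SDK's actual configuration type may differ.

```typescript
// Hypothetical config shape; field names follow the parameter table.
interface InferenceConfig {
  temperature?: number;
  max_tokens?: number;
  top_p?: number;
  stop_sequences?: string[];
}

// Presets chosen from the "Typical Values" column.
const FACTUAL: InferenceConfig = { temperature: 0.2, max_tokens: 500, top_p: 0.9 };
const CREATIVE: InferenceConfig = { temperature: 0.9, max_tokens: 2000, top_p: 1.0 };

// Merge a preset with per-request overrides; overrides win on conflict.
function buildConfig(
  preset: InferenceConfig,
  overrides: InferenceConfig = {},
): InferenceConfig {
  return { ...preset, ...overrides };
}

// A low-temperature config for extraction that stops at a blank line or "END".
const extractionConfig = buildConfig(FACTUAL, { stop_sequences: ["\n\n", "END"] });
console.log(JSON.stringify(extractionConfig));
```

Keeping presets separate from overrides makes it easy to see which requests deviate from the defaults and why.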
Multi-turn conversations
Include previous messages for context-aware responses.
Tool-use messages (agent loops)
In agent-style flows, the model emits tool_use content blocks requesting a tool
call and you reply with a matching tool_result block. Both share the same
tool_use_id so the model can correlate the request and response across turns.
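
The correlation described above can be sketched with a small multi-turn message history. The block shapes here are assumptions for illustration: the text confirms only the tool_use and tool_result block types and the shared tool_use_id. The `name`, `input`, and `content` fields, the role values, and the `pendingToolCalls` helper are hypothetical, so check the Model Inference API reference for the actual schema.

```typescript
// Hypothetical content-block shapes; only the block types and tool_use_id
// come from the documentation text.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; tool_use_id: string; name: string; input: Record<string, unknown> }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message {
  role: "user" | "assistant"; // assumed role values
  content: ContentBlock[];
}

// Return the IDs of tool_use blocks that have no tool_result
// carrying the same tool_use_id anywhere in the history.
function pendingToolCalls(messages: Message[]): string[] {
  const answered = new Set<string>();
  for (const message of messages) {
    for (const block of message.content) {
      if (block.type === "tool_result") answered.add(block.tool_use_id);
    }
  }
  const pending: string[] = [];
  for (const message of messages) {
    for (const block of message.content) {
      if (block.type === "tool_use" && !answered.has(block.tool_use_id)) {
        pending.push(block.tool_use_id);
      }
    }
  }
  return pending;
}

const history: Message[] = [
  { role: "user", content: [{ type: "text", text: "What is the weather in Oslo?" }] },
  {
    role: "assistant",
    content: [{ type: "tool_use", tool_use_id: "call_1", name: "get_weather", input: { city: "Oslo" } }],
  },
];
const pendingBefore = pendingToolCalls(history); // the request is still unanswered

// Reply with a tool_result that reuses the same tool_use_id.
history.push({
  role: "user",
  content: [{ type: "tool_result", tool_use_id: "call_1", content: "4 degrees, overcast" }],
});
const pendingAfter = pendingToolCalls(history); // nothing left to answer
console.log(pendingBefore.length, pendingAfter.length);
```

A check like this is useful in an agent loop before sending the next turn, since an unanswered tool request usually indicates a bug in the loop.
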
The legacy { role, text: string } request shape is still accepted for backwards
compatibility; the API auto-canonicalizes it into a single text content block.
Responses always emit the content-block shape. New integrations should use content
blocks directly. See the Model Inference API reference.
Choosing a model
Select the model based on your task requirements:

| Model | Best For |
|---|---|
| anthropic.claude-haiku-4.5 | Fast, simple tasks (classification, extraction) |
| anthropic.claude-sonnet-4.5 | Balanced tasks (summarization, analysis) |
| anthropic.claude-sonnet-4.6 | Latest balanced model with improved reasoning |
| anthropic.claude-opus-4.5 | Complex reasoning (multi-step analysis, nuanced decisions) |
| anthropic.claude-opus-4.6 | Latest most capable model for the most demanding tasks |
| openai.gpt-4.1 | Advanced reasoning tasks |
| openai.o4-mini | Fast, cost-effective tasks |
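
One way to apply the table is a small lookup from task category to model ID. The model IDs are copied from the table. The `TaskKind` categories and the `pickModel` helper are illustrative assumptions, not SDK features, and the mapping is a starting point to adjust for your own workload.

```typescript
// Hypothetical task categories, taken from the "Best For" column.
type TaskKind =
  | "classification"
  | "extraction"
  | "summarization"
  | "analysis"
  | "multi_step_reasoning";

// Model IDs as listed in the table above.
const MODEL_FOR_TASK: Record<TaskKind, string> = {
  classification: "anthropic.claude-haiku-4.5",
  extraction: "anthropic.claude-haiku-4.5",
  summarization: "anthropic.claude-sonnet-4.5",
  analysis: "anthropic.claude-sonnet-4.5",
  multi_step_reasoning: "anthropic.claude-opus-4.5",
};

// Look up the model for a task; the Record type makes the compiler
// flag any TaskKind that has no model assigned.
function pickModel(task: TaskKind): string {
  return MODEL_FOR_TASK[task];
}

console.log(pickModel("classification"));
console.log(pickModel("multi_step_reasoning"));
```

Centralizing the choice in one table makes it simple to move a task category to a newer model later.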
Typing responses
Use TypeScript generics to type the structured output.
Error handling
Handle common inference errors.
Best practices

| Practice | Description |
|---|---|
| Use specific schemas | Define precise JSON Schema to get consistent outputs |
| Choose appropriate models | Use smaller models for simple tasks to save cost and time |
| Set reasonable max_tokens | Avoid unnecessarily large values that increase latency |
| Include system prompts | Guide model behavior with clear instructions |
| Handle failures gracefully | Implement retries for transient errors |
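
The last row of the table can be sketched as a retry wrapper with exponential backoff. This is a generic pattern, not the SDK's built-in behavior. It assumes transient failures surface as errors carrying a numeric HTTP-style `status` (429 or 5xx), which the SDK may or may not do. `isTransient`, `backoffDelayMs`, and `withRetry` are hypothetical helpers.

```typescript
// Assumption: transient failures carry a numeric HTTP-style status (429 or 5xx).
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" && (status === 429 || status >= 500);
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run fn up to maxAttempts times, retrying only on transient errors.
// The generic parameter T preserves the caller's typed result.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const isLastAttempt = attempt + 1 >= maxAttempts;
      if (isLastAttempt || !isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Non-transient errors, such as a rejected schema or an invalid key, are rethrown immediately, since repeating the same request would not help. The injectable `sleep` argument exists so the wrapper can be tested without real delays.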
Related content
- Structured Output Guide: Deep dive into JSON Schema for inference
- Choosing Models: Select the right model for your task
- Model Inference API: Complete API reference
- Tracking Jobs: Monitor job status and handle completion

