## Prerequisites
- SDK installed and configured (see Authentication)
- A data plane ID where inference will run
- An API key with appropriate permissions
## Basic inference request
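The exact client types are not shown on this page, so the sketch below is illustrative: the `InferenceMessage` and `InferenceRequest` shapes, the field names, and the default model are assumptions, not the SDK's real API.

```typescript
// Hypothetical request shape; the real SDK's types may differ.
interface InferenceMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface InferenceRequest {
  dataPlaneId: string; // where inference will run (see Prerequisites)
  model: string;       // e.g. "anthropic.claude-sonnet-4.5"
  messages: InferenceMessage[];
}

// Build a basic request with a system prompt and a user message.
function buildInferenceRequest(
  dataPlaneId: string,
  systemPrompt: string,
  userMessage: string,
): InferenceRequest {
  return {
    dataPlaneId,
    model: "anthropic.claude-sonnet-4.5",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userMessage },
    ],
  };
}

const request = buildInferenceRequest(
  "dp-example-id", // placeholder data plane ID
  "You are a concise technical assistant.",
  "Summarize the attached log output.",
);
```

Swap `buildInferenceRequest` for the SDK's own request builder or submit call; the point is the message ordering, with the system prompt first.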
Submit a simple inference request with a system prompt and user message.

## Tracking job completion
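Polling logic is largely SDK-agnostic. In the sketch below, `fetchStatus` stands in for whatever status call the SDK exposes, and the status names are assumptions for illustration.

```typescript
type JobStatus = "queued" | "running" | "succeeded" | "failed"; // assumed status names

interface JobSnapshot {
  status: JobStatus;
  result?: string;
}

// Poll an asynchronous job until it reaches a terminal state.
// `fetchStatus` stands in for the SDK's status call; `delayMs` and
// `maxAttempts` bound how long we wait overall.
async function pollUntilComplete(
  fetchStatus: (jobId: string) => Promise<JobSnapshot>,
  jobId: string,
  delayMs = 1000,
  maxAttempts = 30,
): Promise<JobSnapshot> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = await fetchStatus(jobId);
    if (snapshot.status === "succeeded" || snapshot.status === "failed") {
      return snapshot;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Job ${jobId} did not complete within ${maxAttempts} attempts`);
}
```

Injecting the fetcher keeps the loop testable and lets you reuse it for any job type the SDK returns.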
Inference jobs are asynchronous. Poll for completion using the job ID.

## Configuring inference parameters
Fine-tune the model’s behavior with configuration options.

| Parameter | Effect | Typical Values |
|---|---|---|
| `temperature` | Controls randomness; lower is more deterministic | 0.0-0.3 for factual, 0.7-1.0 for creative |
| `max_tokens` | Limits response length | 100-4000 depending on task |
| `top_p` | Nucleus sampling threshold | 0.9-1.0 for most cases |
| `stop_sequences` | Strings that end generation | `["\n\n", "END"]` |
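The parameters above might be grouped into a config object along these lines. The snake_case field names mirror the table, but the real SDK's option names may differ.

```typescript
// Hypothetical inference configuration mirroring the table above.
interface InferenceConfig {
  temperature: number;      // lower = more deterministic
  max_tokens: number;       // hard cap on response length
  top_p: number;            // nucleus sampling threshold
  stop_sequences: string[]; // strings that end generation
}

// A factual-extraction profile: low temperature, tight token budget.
const factualConfig: InferenceConfig = {
  temperature: 0.2,
  max_tokens: 500,
  top_p: 0.95,
  stop_sequences: ["\n\n", "END"],
};

// A creative profile: higher temperature, larger budget.
const creativeConfig: InferenceConfig = {
  temperature: 0.9,
  max_tokens: 2000,
  top_p: 1.0,
  stop_sequences: [],
};
```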
## Multi-turn conversations
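Multi-turn context is just an ordered message array. The helper below is a hypothetical sketch of appending the next user turn without mutating the existing history.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Append a new user turn to an existing transcript so the model sees
// the full conversation, not just the latest message.
function withNextTurn(history: ChatMessage[], userMessage: string): ChatMessage[] {
  return [...history, { role: "user", content: userMessage }];
}

const history: ChatMessage[] = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is a data plane?" },
  { role: "assistant", content: "A data plane is where inference workloads run." },
];

// Messages for the next request include every prior turn.
const nextRequestMessages = withNextTurn(history, "How do I pick one?");
```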
Include previous messages for context-aware responses.

## Choosing a model
Select the model based on your task requirements.

| Model | Best For |
|---|---|
| `anthropic.claude-haiku-4.5` | Fast, simple tasks (classification, extraction) |
| `anthropic.claude-sonnet-4.5` | Balanced tasks (summarization, analysis) |
| `anthropic.claude-opus-4.5` | Complex reasoning (multi-step analysis, nuanced decisions) |
| `openai.gpt-4.1` | Advanced reasoning tasks |
| `openai.o4-mini` | Fast, cost-effective tasks |
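If requests are routed programmatically, the table can be encoded as a lookup. The `modelFor` helper and its task categories below are illustrative, not an SDK feature.

```typescript
type TaskKind =
  | "classification"
  | "extraction"
  | "summarization"
  | "analysis"
  | "complex-reasoning";

// Map a task kind to a model ID following the table above.
function modelFor(task: TaskKind): string {
  switch (task) {
    case "classification":
    case "extraction":
      return "anthropic.claude-haiku-4.5"; // fast, simple tasks
    case "summarization":
    case "analysis":
      return "anthropic.claude-sonnet-4.5"; // balanced tasks
    case "complex-reasoning":
      return "anthropic.claude-opus-4.5"; // multi-step, nuanced work
  }
}
```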
## Typing responses
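A minimal sketch of the generics approach, assuming the raw response carries its structured output as a JSON string; `RawInferenceResponse` and `parseStructuredOutput` are hypothetical names, not the SDK's.

```typescript
// Hypothetical raw response carrying a JSON string payload.
interface RawInferenceResponse {
  output: string;
}

interface SentimentResult {
  sentiment: "positive" | "negative" | "neutral";
  confidence: number;
}

// Parse the model's JSON output into a caller-supplied type T.
// A generic like this only asserts the shape at compile time; pair it
// with runtime validation (e.g. a JSON Schema check) in real code.
function parseStructuredOutput<T>(response: RawInferenceResponse): T {
  return JSON.parse(response.output) as T;
}

const raw: RawInferenceResponse = {
  output: '{"sentiment": "positive", "confidence": 0.93}',
};
const result = parseStructuredOutput<SentimentResult>(raw);
```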
Use TypeScript generics to type the structured output.

## Error handling
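Error classes and codes vary by SDK, so the taxonomy below is assumed for illustration. The point is to separate transient failures, which are worth retrying, from caller mistakes, which need a fix before resubmitting.

```typescript
// Hypothetical error taxonomy; the SDK's real error classes may differ.
class InferenceError extends Error {
  constructor(
    public readonly code: "RATE_LIMITED" | "TIMEOUT" | "INVALID_REQUEST" | "UNAUTHORIZED",
    message: string,
  ) {
    super(message);
  }
}

// Decide how to react to a failed call: transient errors can be
// retried, while auth and validation errors need action from the caller.
function handleInferenceError(
  error: unknown,
): "retry" | "fix-request" | "fix-credentials" | "rethrow" {
  if (!(error instanceof InferenceError)) return "rethrow";
  switch (error.code) {
    case "RATE_LIMITED":
    case "TIMEOUT":
      return "retry";
    case "INVALID_REQUEST":
      return "fix-request";
    case "UNAUTHORIZED":
      return "fix-credentials";
  }
}
```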
Handle common inference errors.

## Best practices
| Practice | Description |
|---|---|
| Use specific schemas | Define precise JSON Schema to get consistent outputs |
| Choose appropriate models | Use smaller models for simple tasks to save cost and time |
| Set reasonable `max_tokens` | Avoid unnecessarily large values that increase latency |
| Include system prompts | Guide model behavior with clear instructions |
| Handle failures gracefully | Implement retries for transient errors |
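The "handle failures gracefully" row can be sketched as a generic retry wrapper with exponential backoff; `withRetries` below is illustrative, not part of the SDK.

```typescript
// Retry a call prone to transient failures, with exponential backoff.
// `shouldRetry` lets the caller decide which errors are transient.
async function withRetries<T>(
  call: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (!shouldRetry(error) || attempt === maxAttempts - 1) break;
      // Exponential backoff: 250 ms, 500 ms, 1000 ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Pair it with an error classifier so only transient errors (rate limits, timeouts) are retried, and cap `maxAttempts` to keep worst-case latency bounded.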

