Reasoning Models
Reasoning models are optimized for complex tasks that require multi-step problem solving rather than fast, direct text generation. They are especially useful for mathematics, code analysis, planning, logical deduction, structured decision making, and tasks where the model benefits from thinking through intermediate steps before producing a final answer.
Compared with general text models, reasoning models often trade latency for stronger step-by-step problem solving and better performance on difficult instructions.
When To Use a Reasoning Model
Use a reasoning model when your task involves:
- Multi-step math or symbolic reasoning
- Code debugging, code generation, or code explanation
- Complex planning and decision support
- Long-form analysis that requires intermediate steps
- Tool-using agents that must think between steps
For simple summarization, rewriting, or chat tasks, a standard text model is usually faster and more cost-effective.
How Reasoning Output Works
Many reasoning models return two different kinds of output:
- content: The final answer shown to the user
- reasoning_content: The model's reasoning trace or thinking process
In OpenAI-compatible responses, reasoning_content is typically returned alongside content in the assistant message. Some models may expose reasoning only for supported model families, and some may include parts of their reasoning directly in content instead.
Example response shape:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The answer is 925.",
        "reasoning_content": "25 multiplied by 37 can be computed as..."
      }
    }
  ]
}
Basic Usage
You can call a reasoning model through the same OpenAI-compatible /chat/completions API used for text models.
curl 'https://api.hpc-ai.com/inference/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your_token_here>' \
  --data '{
    "model": "minimax/minimax-m2.5",
    "messages": [
      {
        "role": "user",
        "content": "Solve this step by step: If a train travels at 60 mph for 2.5 hours, how far does it go?"
      }
    ],
    "max_tokens": 1024
  }'
Read the final answer from:
choices[0].message.content
If the selected model supports exposed reasoning, you can also read:
choices[0].message.reasoning_content
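As a concrete sketch, both fields can be pulled out of a parsed response body like this. The helper `extract_answer_and_reasoning` is illustrative, not part of any SDK, and the example operates on a plain dict with no network call:

```python
def extract_answer_and_reasoning(response: dict):
    """Return (content, reasoning) from a /chat/completions response body.

    reasoning_content is only present for models that expose reasoning,
    so it is read defensively and may be None.
    """
    message = response["choices"][0]["message"]
    return message.get("content", ""), message.get("reasoning_content")

# Using the example response shape from above:
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "The answer is 925.",
                "reasoning_content": "25 multiplied by 37 can be computed as...",
            }
        }
    ]
}
answer, reasoning = extract_answer_and_reasoning(response)
```

Reading `reasoning_content` with `.get()` keeps the same code working for model families that never return the field.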
Streaming Reasoning Output
Streaming is especially useful for reasoning models because their outputs may be longer and slower to complete than standard chat models.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.hpc-ai.com/inference/v1",
)

stream = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[
        {
            "role": "user",
            "content": "Solve this carefully: what is the square root of 144, and why?"
        }
    ],
    stream=True,
    max_tokens=1024,
)

reasoning_parts = []
answer_parts = []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Not every model exposes reasoning, so read the field defensively.
    if getattr(delta, "reasoning_content", None):
        reasoning_parts.append(delta.reasoning_content)
    if delta.content:
        answer_parts.append(delta.content)

print("Reasoning:", "".join(reasoning_parts))
print("Answer:", "".join(answer_parts))
When streaming, reasoning models may emit reasoning_content and final answer content in separate chunks.
Multi-Turn Conversations
Reasoning models can be used in normal multi-turn chat flows just like text models:
{
  "model": "minimax/minimax-m2.5",
  "messages": [
    {
      "role": "user",
      "content": "A company has revenue of $3.2M and expenses of $2.4M. What is the profit margin?"
    },
    {
      "role": "assistant",
      "content": "The profit is $0.8M. Profit margin is profit divided by revenue, so the margin is 25%."
    },
    {
      "role": "user",
      "content": "Now explain that in plain English for a non-finance audience."
    }
  ]
}
In standard conversational use, you usually only need to keep the visible assistant content unless a specific model or workflow requires preserving reasoning state.
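That pruning can be done with a small helper before replaying the conversation. `visible_history` is a hypothetical utility, not an SDK function:

```python
def visible_history(messages):
    """Keep only the role and visible content of each turn, dropping
    reasoning traces and other fields not needed for ordinary chat."""
    return [{"role": m["role"], "content": m["content"]} for m in messages]

turns = [
    {"role": "user", "content": "What is the profit margin?"},
    {
        "role": "assistant",
        "content": "The margin is 25%.",
        "reasoning_content": "Profit is $0.8M; 0.8 / 3.2 = 0.25.",
    },
]
# Messages replayed in the next request carry no reasoning_content.
next_request_messages = visible_history(turns)
```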
Reasoning With Tool Calling
For more advanced agents, some reasoning models support interleaved thinking across tool calls. In these flows, the model may reason, call a tool, receive tool output, and continue reasoning before producing the final answer.
In model families that support this behavior, preserving prior reasoning_content across assistant turns can improve step-by-step tool use. This is especially relevant when:
- The assistant calls tools repeatedly in one task
- The model must reason over intermediate tool outputs
- You are building a multi-step agent workflow
If you manually reconstruct assistant messages for the next turn, include reasoning_content when the model and SDK support it.
Example assistant message passed back into a subsequent request:
{
  "role": "assistant",
  "content": "Let me calculate that.",
  "reasoning_content": "I should use the calculator tool for this arithmetic operation.",
  "tool_calls": [
    {
      "id": "call_123",
      "type": "function",
      "function": {
        "name": "calculator",
        "arguments": "{\"operation\":\"add\",\"a\":15,\"b\":27}"
      }
    }
  ]
}
If the selected model does not support preserved reasoning state, the extra field is typically ignored rather than treated as an error. Even so, rely on preserved reasoning only when the selected model explicitly supports it.
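One way to reconstruct such a turn, together with the matching tool result, is sketched below. Both helpers are hypothetical, and the exact fields a model accepts depend on the model family and SDK:

```python
import json

def assistant_turn_with_reasoning(content, reasoning, tool_calls):
    """Build the assistant message to replay in the next request,
    attaching reasoning_content only when a trace is available."""
    msg = {"role": "assistant", "content": content, "tool_calls": tool_calls}
    if reasoning is not None:
        msg["reasoning_content"] = reasoning
    return msg

def tool_result(call_id, result):
    """Build the tool message that carries the tool's output back
    to the model, keyed to the originating tool_call id."""
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

assistant_msg = assistant_turn_with_reasoning(
    content="Let me calculate that.",
    reasoning="I should use the calculator tool for this arithmetic operation.",
    tool_calls=[
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "calculator",
                "arguments": '{"operation":"add","a":15,"b":27}',
            },
        }
    ],
)
follow_up = tool_result("call_123", {"result": 42})
# Append both messages to the conversation before the next request.
```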
Prompting Tips for Reasoning Models
Reasoning models usually perform best when the prompt is clear and goal-oriented.
- State the task directly and include the desired outcome
- Ask for the final answer in a specific format when needed
- For math or logic tasks, specify whether you want a concise answer or a worked solution
- For coding tasks, include constraints such as language, runtime, or framework
- Avoid unnecessary style instructions unless they matter to the task
Example:
{
  "role": "user",
  "content": "Analyze this Python function for time complexity and suggest a more efficient implementation."
}
For some reasoning model families, simpler prompts work better than heavily layered system instructions.
Parameter Recommendations
Reasoning models often respond better to conservative sampling settings than creative chat models.
A reasonable starting point is:
{
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 2048
}
If a model starts to repeat itself or produce unstable reasoning:
- Lower temperature
- Keep top_p moderate
- Increase max_tokens if the final answer is being cut off
- Use streaming for long generations
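The first three adjustments can be expressed as one small tuning step. `stabilize` is an illustrative helper, not an API feature, and the specific deltas are only a starting point:

```python
# Conservative starting point for reasoning models.
REASONING_DEFAULTS = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 2048}

def stabilize(params, floor=0.2):
    """If reasoning becomes repetitive or unstable: lower temperature,
    keep top_p where it is, and leave more room for the final answer."""
    out = dict(params)
    out["temperature"] = round(max(floor, out["temperature"] - 0.2), 2)
    out["max_tokens"] = out["max_tokens"] * 2
    return out

tuned = stabilize(REASONING_DEFAULTS)
```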
Context and Token Budgeting
For reasoning models, total context usage often includes:
- User input
- Conversation history
- Internal reasoning tokens
- Final answer tokens
This means long prompts plus long reasoning traces can consume context quickly.
Best practices:
- Reserve enough room for both reasoning and final output
- Avoid setting max_tokens equal to the full model context length
- Trim old conversation turns when they are no longer needed
- Be careful when storing long reasoning traces in multi-turn sessions
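A crude version of that trimming can be sketched as follows. `trim_history` is a hypothetical helper, and character counts are only a rough stand-in for tokens; real budgeting should use the model's tokenizer:

```python
def trim_history(messages, max_chars=8000):
    """Drop the oldest non-system turns until the transcript fits a
    character budget, always preserving the first system message."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(len(m["content"]) for m in system + rest) > max_chars:
        rest.pop(0)  # drop the oldest turn first
    return system + rest

history = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "x" * 500},
    {"role": "assistant", "content": "y" * 500},
    {"role": "user", "content": "What about now?"},
]
# The oldest user turn is dropped; the system prompt survives.
trimmed = trim_history(history, max_chars=600)
```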
Best Practices
- Use reasoning models only for tasks that benefit from deeper thinking
- Prefer streaming for complex or long-running requests
- Inspect reasoning_content during development to debug model behavior
- Show only the final answer to end users unless your product explicitly needs reasoning traces
- Validate outputs before using them in automated workflows
- Test prompts on real tasks instead of synthetic benchmarks only
Common Issues
Output Is Truncated
This usually happens because:
- max_tokens is too small
- The total context is too large
- Non-streaming requests time out on long reasoning runs
Try increasing max_tokens, shortening inputs, and enabling streaming.
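Truncation by max_tokens is detectable in code: OpenAI-compatible responses report finish_reason == "length" for a cut-off completion, versus "stop" for a normal ending. A minimal check:

```python
def is_truncated(response: dict) -> bool:
    """True when the completion was cut off by the token limit
    (finish_reason == "length" in OpenAI-compatible responses)."""
    return response["choices"][0].get("finish_reason") == "length"

cut_off = {"choices": [{"finish_reason": "length", "message": {"content": "Step 1..."}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "Done."}}]}
```

Checking this flag before trusting an answer is a cheap guard in automated workflows.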
The Model Is Slow
Reasoning models are usually slower than standard text models because they spend more tokens on intermediate thinking.
Try:
- Using a smaller reasoning model
- Reducing the reasoning budget when supported
- Switching to a standard text model for easier tasks
The Model Produces Unstable or Repetitive Reasoning
Try:
- Lowering temperature
- Using a moderate top_p
- Avoiding excessive or conflicting instructions