Reasoning Models

Reasoning models are optimized for complex tasks that require multi-step problem solving rather than fast, direct text generation. They are especially useful for mathematics, code analysis, planning, logical deduction, structured decision making, and tasks where the model benefits from thinking through intermediate steps before producing a final answer.

Compared with general text models, reasoning models often trade latency for stronger step-by-step problem solving and better performance on difficult instructions.

When To Use a Reasoning Model

Use a reasoning model when your task involves:

  • Multi-step math or symbolic reasoning
  • Code debugging, code generation, or code explanation
  • Complex planning and decision support
  • Long-form analysis that requires intermediate steps
  • Tool-using agents that must think between steps

For simple summarization, rewriting, or chat tasks, a standard text model is usually faster and more cost-effective.

How Reasoning Output Works

Many reasoning models return two different kinds of output:

  • content: The final answer shown to the user
  • reasoning_content: The model's reasoning trace or thinking process

In OpenAI-compatible responses, reasoning_content is typically returned alongside content in the assistant message. Some models may expose reasoning only for supported model families, and some may include parts of their reasoning directly in content instead.

Example response shape:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The answer is 925.",
        "reasoning_content": "25 multiplied by 37 can be computed as..."
      }
    }
  ]
}
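As a sketch, a client can read both fields from a parsed response of this shape. Since reasoning_content is optional, use .get() rather than direct indexing so the same code works for models that do not expose reasoning:

```python
import json

# Example response body in the shape shown above (not a live API call).
raw = """
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The answer is 925.",
        "reasoning_content": "25 multiplied by 37 can be computed as..."
      }
    }
  ]
}
"""

response = json.loads(raw)
message = response["choices"][0]["message"]

# The final answer is always in content; reasoning_content may be absent,
# so fall back to None instead of raising a KeyError.
answer = message["content"]
reasoning = message.get("reasoning_content")

print("Answer:", answer)
if reasoning is not None:
    print("Reasoning:", reasoning)
```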

Basic Usage

You can call a reasoning model through the same OpenAI-compatible /chat/completions API used for text models.

curl 'https://api.hpc-ai.com/inference/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your_token_here>' \
  --data '{
    "model": "minimax/minimax-m2.5",
    "messages": [
      {
        "role": "user",
        "content": "Solve this step by step: If a train travels at 60 mph for 2.5 hours, how far does it go?"
      }
    ],
    "max_tokens": 1024
  }'

Read the final answer from:

choices[0].message.content

If the selected model supports exposed reasoning, you can also read:

choices[0].message.reasoning_content
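Because reasoning_content is not part of every SDK's typed message object, it is safer to read it defensively than to index it directly. A minimal sketch (split_message is a hypothetical helper, not part of any SDK) that works for both dict-shaped JSON and SDK-style attribute objects:

```python
from types import SimpleNamespace


def split_message(message):
    """Return (answer, reasoning) from an assistant message, dict or object.

    reasoning_content is optional: not every model family exposes it,
    so fall back to None instead of raising.
    """
    if isinstance(message, dict):
        return message.get("content"), message.get("reasoning_content")
    return (
        getattr(message, "content", None),
        getattr(message, "reasoning_content", None),
    )


# Dict-shaped response (e.g. parsed JSON):
print(split_message({"content": "150 miles", "reasoning_content": "60 mph * 2.5 h = 150 miles"}))

# SDK-style object from a model that does not expose reasoning:
print(split_message(SimpleNamespace(content="150 miles")))
```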

Streaming Reasoning Output

Streaming is especially useful for reasoning models because their outputs may be longer and slower to complete than standard chat models.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.hpc-ai.com/inference/v1",
)

stream = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[
        {
            "role": "user",
            "content": "Solve this carefully: what is the square root of 144, and why?"
        }
    ],
    stream=True,
    max_tokens=1024,
)

reasoning_parts = []
answer_parts = []

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is not a standard field on every SDK's delta
    # object, so read it defensively instead of accessing it directly.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        reasoning_parts.append(reasoning)
    if delta.content:
        answer_parts.append(delta.content)

print("Reasoning:", "".join(reasoning_parts))
print("Answer:", "".join(answer_parts))

When streaming, reasoning models may emit reasoning_content and final answer content in separate chunks.

Multi-Turn Conversations

Reasoning models can be used in normal multi-turn chat flows just like text models:

{
  "model": "minimax/minimax-m2.5",
  "messages": [
    {
      "role": "user",
      "content": "A company has revenue of $3.2M and expenses of $2.4M. What is the profit margin?"
    },
    {
      "role": "assistant",
      "content": "The profit is $0.8M. Profit margin is profit divided by revenue, so the margin is 25%."
    },
    {
      "role": "user",
      "content": "Now explain that in plain English for a non-finance audience."
    }
  ]
}

In standard conversational use, you usually only need to keep the visible assistant content unless a specific model or workflow requires preserving reasoning state.
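This guidance can be sketched as a small helper that drops stored reasoning traces before resending history (strip_reasoning is a hypothetical name, not part of any SDK):

```python
def strip_reasoning(messages):
    """Return a copy of the conversation with reasoning_content removed.

    For ordinary multi-turn chat, only the visible assistant content needs
    to be resent; dropping stored reasoning traces keeps requests smaller.
    """
    return [
        {k: v for k, v in m.items() if k != "reasoning_content"}
        for m in messages
    ]


history = [
    {"role": "user", "content": "What is the profit margin?"},
    {
        "role": "assistant",
        "content": "The margin is 25%.",
        "reasoning_content": "Profit is 0.8M; 0.8 / 3.2 = 0.25.",
    },
]

print(strip_reasoning(history))
```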

Reasoning With Tool Calling

For more advanced agents, some reasoning models support interleaved thinking across tool calls. In these flows, the model may reason, call a tool, receive tool output, and continue reasoning before producing the final answer.

In model families that support this behavior, preserving prior reasoning_content across assistant turns can improve step-by-step tool use. This is especially relevant when:

  • The assistant calls tools repeatedly in one task
  • The model must reason over intermediate tool outputs
  • You are building a multi-step agent workflow

If you manually reconstruct assistant messages for the next turn, include reasoning_content when the model and SDK support it.

Example assistant message passed back into a subsequent request:

{
  "role": "assistant",
  "content": "Let me calculate that.",
  "reasoning_content": "I should use the calculator tool for this arithmetic operation.",
  "tool_calls": [
    {
      "id": "call_123",
      "type": "function",
      "function": {
        "name": "calculator",
        "arguments": "{\"operation\":\"add\",\"a\":15,\"b\":27}"
      }
    }
  ]
}

If the selected model does not support preserved reasoning state, the extra field is typically ignored rather than causing an error. Even so, only rely on preserved reasoning_content when the selected model family explicitly supports it.
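As a sketch, the assistant message above can be rebuilt from a prior response dict before the next request (build_assistant_turn and its carry_reasoning flag are hypothetical, based on the message shape shown):

```python
def build_assistant_turn(message, carry_reasoning=True):
    """Rebuild an assistant message for the next request from a response
    message dict, preserving tool_calls and, optionally, reasoning_content."""
    turn = {"role": "assistant", "content": message.get("content", "")}
    if carry_reasoning and message.get("reasoning_content"):
        turn["reasoning_content"] = message["reasoning_content"]
    if message.get("tool_calls"):
        turn["tool_calls"] = message["tool_calls"]
    return turn


prior = {
    "content": "Let me calculate that.",
    "reasoning_content": "I should use the calculator tool for this arithmetic operation.",
    "tool_calls": [
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "calculator",
                "arguments": "{\"operation\":\"add\",\"a\":15,\"b\":27}",
            },
        }
    ],
}

print(build_assistant_turn(prior))
```

Setting carry_reasoning=False yields the same message without the reasoning trace, for models that do not use preserved reasoning state.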

Prompting Tips for Reasoning Models

Reasoning models usually perform best when the prompt is clear and goal-oriented.

  • State the task directly and include the desired outcome
  • Ask for the final answer in a specific format when needed
  • For math or logic tasks, specify whether you want a concise answer or a worked solution
  • For coding tasks, include constraints such as language, runtime, or framework
  • Avoid unnecessary style instructions unless they matter to the task

Example:

{
  "role": "user",
  "content": "Analyze this Python function for time complexity and suggest a more efficient implementation."
}

For some reasoning model families, simpler prompts work better than heavily layered system instructions.

Parameter Recommendations

Reasoning models often respond better to conservative sampling settings than creative chat models.

A reasonable starting point is:

{
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 2048
}

If a model starts to repeat itself or produce unstable reasoning:

  • Lower temperature
  • Keep top_p moderate
  • Increase max_tokens if the final answer is being cut off
  • Use streaming for long generations

Context and Token Budgeting

For reasoning models, total context usage often includes:

  • User input
  • Conversation history
  • Internal reasoning tokens
  • Final answer tokens

This means long prompts plus long reasoning traces can consume context quickly.

Best practices:

  • Reserve enough room for both reasoning and final output
  • Avoid setting max_tokens equal to the full model context length
  • Trim old conversation turns when they are no longer needed
  • Be careful when storing long reasoning traces in multi-turn sessions
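One simple trimming policy, sketched below, keeps any system message plus the most recent messages (trim_history and its keep_turns parameter are hypothetical, not part of any SDK):

```python
def trim_history(messages, keep_turns=4):
    """Keep the system message (if any) plus the last keep_turns messages.

    A crude but effective way to stop old turns from crowding out the
    context budget needed for reasoning and the final answer.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_turns:]


history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"question {i}"} for i in range(10)]

trimmed = trim_history(history, keep_turns=3)
print(len(trimmed))  # 4: the system message plus the last 3 turns
```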

Best Practices

  • Use reasoning models only for tasks that benefit from deeper thinking
  • Prefer streaming for complex or long-running requests
  • Inspect reasoning_content during development to debug model behavior
  • Show only the final answer to end users unless your product explicitly needs reasoning traces
  • Validate outputs before using them in automated workflows
  • Test prompts on real tasks instead of synthetic benchmarks only

Common Issues

Output Is Truncated

This usually happens because:

  • max_tokens is too small
  • The total context is too large
  • Non-streaming requests time out on long reasoning runs

Try increasing max_tokens, shortening inputs, and enabling streaming.
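A quick way to detect the first cause is to inspect finish_reason, which OpenAI-compatible APIs set to "length" when the token limit is hit. A minimal check against a parsed response (was_truncated is a hypothetical helper):

```python
def was_truncated(response):
    """True if generation stopped because the token limit was reached."""
    return response["choices"][0].get("finish_reason") == "length"


truncated = {
    "choices": [
        {"finish_reason": "length",
         "message": {"role": "assistant", "content": "Step 1:"}}
    ]
}
complete = {
    "choices": [
        {"finish_reason": "stop",
         "message": {"role": "assistant", "content": "Done."}}
    ]
}

print(was_truncated(truncated))  # True
print(was_truncated(complete))   # False
```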

The Model Is Slow

Reasoning models are usually slower than standard text models because they spend more tokens on intermediate thinking.

Try:

  • Using a smaller reasoning model
  • Reducing the reasoning budget when supported
  • Switching to a standard text model for easier tasks

The Model Produces Unstable or Repetitive Reasoning

Try:

  • Lowering temperature
  • Using a moderate top_p
  • Simplifying the prompt
  • Avoiding excessive or conflicting instructions