Reasoning Models
Reasoning models are optimized for complex tasks that require multi-step problem solving rather than fast, direct text generation. They are especially useful for mathematics, code analysis, planning, logical deduction, structured decision making, and tasks where the model benefits from thinking through intermediate steps before producing a final answer.
Compared with general text models, reasoning models often trade latency for stronger step-by-step problem solving and better performance on difficult instructions.
When To Use a Reasoning Model
Use a reasoning model when your task involves:
- Multi-step math or symbolic reasoning
- Code debugging, code generation, or code explanation
- Complex planning and decision support
- Long-form analysis that requires intermediate steps
- Tool-using agents that must think between steps
For simple summarization, rewriting, or chat tasks, a standard text model is usually faster and more cost-effective.
How Reasoning Output Works
Many reasoning models return two different kinds of output:
- content: The final answer shown to the user
- reasoning_content: The model's reasoning trace or thinking process
In OpenAI-compatible responses, reasoning_content is typically returned alongside content in the assistant message. Some models may expose reasoning only for supported model families, and some may include parts of their reasoning directly in content instead.
Example response shape:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The answer is 925.",
        "reasoning_content": "25 multiplied by 37 can be computed as..."
      }
    }
  ]
}
Basic Usage
You can call a reasoning model through the same OpenAI-compatible /chat/completions API used for text models.
curl 'https://api.hpc-ai.com/inference/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your_token_here>' \
  --data '{
    "model": "minimax/minimax-m2.5",
    "messages": [
      {
        "role": "user",
        "content": "Solve this step by step: If a train travels at 60 mph for 2.5 hours, how far does it go?"
      }
    ],
    "max_tokens": 1024
  }'
Read the final answer from:
choices[0].message.content
If the selected model supports exposed reasoning, you can also read:
choices[0].message.reasoning_content
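As a concrete sketch, both fields can be pulled out of a parsed response body like this. The helper `extract_answer_and_reasoning` is illustrative, not part of any SDK, and the example operates on a plain dict with no network call:

```python
def extract_answer_and_reasoning(response: dict):
    """Return (content, reasoning) from a /chat/completions response body.

    reasoning_content is only present for models that expose reasoning,
    so it is read defensively and may be None.
    """
    message = response["choices"][0]["message"]
    return message.get("content", ""), message.get("reasoning_content")

# Using the example response shape from above:
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "The answer is 925.",
                "reasoning_content": "25 multiplied by 37 can be computed as...",
            }
        }
    ]
}
answer, reasoning = extract_answer_and_reasoning(response)
```

Reading `reasoning_content` with `.get()` keeps the same code working for model families that never return the field.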
Streaming Reasoning Output
Streaming is especially useful for reasoning models because their outputs may be longer and slower to complete than standard chat models.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.hpc-ai.com/inference/v1",
)

stream = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[
        {
            "role": "user",
            "content": "Solve this carefully: what is the square root of 144, and why?"
        }
    ],
    stream=True,
    max_tokens=1024,
)

reasoning_parts = []
answer_parts = []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Not every model exposes reasoning, so read the field defensively.
    if getattr(delta, "reasoning_content", None):
        reasoning_parts.append(delta.reasoning_content)
    if delta.content:
        answer_parts.append(delta.content)

print("Reasoning:", "".join(reasoning_parts))
print("Answer:", "".join(answer_parts))
When streaming, reasoning models may emit reasoning_content and final answer content in separate chunks.
Multi-Turn Conversations
Reasoning models can be used in normal multi-turn chat flows just like text models:
{
  "model": "minimax/minimax-m2.5",
  "messages": [
    {
      "role": "user",
      "content": "A company has revenue of $3.2M and expenses of $2.4M. What is the profit margin?"
    },
    {
      "role": "assistant",
      "content": "The profit is $0.8M. Profit margin is profit divided by revenue, so the margin is 25%."
    },
    {
      "role": "user",
      "content": "Now explain that in plain English for a non-finance audience."
    }
  ]
}
In standard conversational use, you usually only need to keep the visible assistant content unless a specific model or workflow requires preserving reasoning state.
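That pruning can be done with a small helper before replaying the conversation. `visible_history` is a hypothetical utility, not an SDK function:

```python
def visible_history(messages):
    """Keep only the role and visible content of each turn, dropping
    reasoning traces and other fields not needed for ordinary chat."""
    return [{"role": m["role"], "content": m["content"]} for m in messages]

turns = [
    {"role": "user", "content": "What is the profit margin?"},
    {
        "role": "assistant",
        "content": "The margin is 25%.",
        "reasoning_content": "Profit is $0.8M; 0.8 / 3.2 = 0.25.",
    },
]
# Messages replayed in the next request carry no reasoning_content.
next_request_messages = visible_history(turns)
```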
Reasoning With Tool Calling
For more advanced agents, some reasoning models support interleaved thinking across tool calls. In these flows, the model may reason, call a tool, receive tool output, and continue reasoning before producing the final answer.
In model families that support this behavior, preserving prior reasoning_content across assistant turns can improve step-by-step tool use. This is especially relevant when:
- The assistant calls tools repeatedly in one task
- The model must reason over intermediate tool outputs
- You are building a multi-step agent workflow
If you manually reconstruct assistant messages for the next turn, include reasoning_content when the model and SDK support it.
Example assistant message passed back into a subsequent request:
{
  "role": "assistant",
  "content": "Let me calculate that.",
  "reasoning_content": "I should use the calculator tool for this arithmetic operation.",
  "tool_calls": [
    {
      "id": "call_123",
      "type": "function",
      "function": {
        "name": "calculator",
        "arguments": "{\"operation\":\"add\",\"a\":15,\"b\":27}"
      }
    }
  ]
}
If the selected model does not support preserved reasoning state, the extra field is typically ignored rather than treated as an error. Even so, rely on preserved reasoning only when the selected model explicitly supports it.
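One way to reconstruct such a turn, together with the matching tool result, is sketched below. Both helpers are hypothetical, and the exact fields a model accepts depend on the model family and SDK:

```python
import json

def assistant_turn_with_reasoning(content, reasoning, tool_calls):
    """Build the assistant message to replay in the next request,
    attaching reasoning_content only when a trace is available."""
    msg = {"role": "assistant", "content": content, "tool_calls": tool_calls}
    if reasoning is not None:
        msg["reasoning_content"] = reasoning
    return msg

def tool_result(call_id, result):
    """Build the tool message that carries the tool's output back
    to the model, keyed to the originating tool_call id."""
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

assistant_msg = assistant_turn_with_reasoning(
    content="Let me calculate that.",
    reasoning="I should use the calculator tool for this arithmetic operation.",
    tool_calls=[
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "calculator",
                "arguments": '{"operation":"add","a":15,"b":27}',
            },
        }
    ],
)
follow_up = tool_result("call_123", {"result": 42})
# Append both messages to the conversation before the next request.
```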
Prompting Tips for Reasoning Models
Reasoning models usually perform best when the prompt is clear and goal-oriented.
- State the task directly and include the desired outcome
- Ask for the final answer in a specific format when needed
- For math or logic tasks, specify whether you want a concise answer or a worked solution
- For coding tasks, include constraints such as language, runtime, or framework
- Avoid unnecessary style instructions unless they matter to the task
Example:
{
  "role": "user",
  "content": "Analyze this Python function for time complexity and suggest a more efficient implementation."
}
For some reasoning model families, simpler prompts work better than heavily layered system instructions.
Parameter Recommendations
Reasoning models often respond better to conservative sampling settings than creative chat models.
A reasonable starting point is:
{
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 2048
}
If a model starts to repeat itself or produce unstable reasoning:
- Lower temperature
- Keep top_p moderate
- Increase max_tokens if the final answer is being cut off
- Use streaming for long generations
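The first three adjustments can be expressed as one small tuning step. `stabilize` is an illustrative helper, not an API feature, and the specific deltas are only a starting point:

```python
# Conservative starting point for reasoning models.
REASONING_DEFAULTS = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 2048}

def stabilize(params, floor=0.2):
    """If reasoning becomes repetitive or unstable: lower temperature,
    keep top_p where it is, and leave more room for the final answer."""
    out = dict(params)
    out["temperature"] = round(max(floor, out["temperature"] - 0.2), 2)
    out["max_tokens"] = out["max_tokens"] * 2
    return out

tuned = stabilize(REASONING_DEFAULTS)
```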
Context and Token Budgeting
For reasoning models, total context usage often includes:
- User input
- Conversation history
- Internal reasoning tokens
- Final answer tokens
This means long prompts plus long reasoning traces can consume context quickly.
Best practices:
- Reserve enough room for both reasoning and final output
- Avoid setting max_tokens equal to the full model context length
- Trim old conversation turns when they are no longer needed
- Be careful when storing long reasoning traces in multi-turn sessions
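A crude version of that trimming can be sketched as follows. `trim_history` is a hypothetical helper, and character counts are only a rough stand-in for tokens; real budgeting should use the model's tokenizer:

```python
def trim_history(messages, max_chars=8000):
    """Drop the oldest non-system turns until the transcript fits a
    character budget, always preserving the first system message."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(len(m["content"]) for m in system + rest) > max_chars:
        rest.pop(0)  # drop the oldest turn first
    return system + rest

history = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "x" * 500},
    {"role": "assistant", "content": "y" * 500},
    {"role": "user", "content": "What about now?"},
]
# The oldest user turn is dropped; the system prompt survives.
trimmed = trim_history(history, max_chars=600)
```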
Best Practices
- Use reasoning models only for tasks that benefit from deeper thinking
- Prefer streaming for complex or long-running requests
- Inspect reasoning_content during development to debug model behavior
- Show only the final answer to end users unless your product explicitly needs reasoning traces
- Validate outputs before using them in automated workflows
- Test prompts on real tasks instead of synthetic benchmarks only
Common Issues
Output Is Truncated
This usually happens because:
- max_tokens is too small
- The total context is too large
- Non-streaming requests time out on long reasoning runs
Try increasing max_tokens, shortening inputs, and enabling streaming.
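Truncation by max_tokens is detectable in code: OpenAI-compatible responses report finish_reason == "length" for a cut-off completion, versus "stop" for a normal ending. A minimal check:

```python
def is_truncated(response: dict) -> bool:
    """True when the completion was cut off by the token limit
    (finish_reason == "length" in OpenAI-compatible responses)."""
    return response["choices"][0].get("finish_reason") == "length"

cut_off = {"choices": [{"finish_reason": "length", "message": {"content": "Step 1..."}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "Done."}}]}
```

Checking this flag before trusting an answer is a cheap guard in automated workflows.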
The Model Is Slow
Reasoning models are usually slower than standard text models because they spend more tokens on intermediate thinking.
Try:
- Using a smaller reasoning model
- Reducing the reasoning budget when supported
- Switching to a standard text model for easier tasks
The Model Produces Unstable or Repetitive Reasoning
Try:
- Lowering temperature
- Using a moderate top_p
- Avoiding excessive or conflicting instructions