Vision-language Models
Vision-language models (VLMs) can understand both images and text in the same request. You can use them for image captioning, visual question answering, OCR-like extraction, chart and document understanding, screenshot analysis, and multi-image comparison.
For most integrations, use the OpenAI-compatible /chat/completions API. The request format is the same basic message format used for text models, but the content of a message can be an array that mixes text blocks and image blocks.
What Vision-language Models Can Do
Typical use cases include:
- Describe or summarize the contents of an image
- Answer questions about objects, text, layout, or relationships inside an image
- Extract text or structured information from receipts, forms, labels, or screenshots
- Compare multiple images in one request
- Understand charts, tables, slides, and other visual documents
- Combine visual input with conversation history for multi-turn workflows
Request Format
To send an image, pass a `messages` array to `/chat/completions`. In a vision request, `content` is usually an array of blocks:
- A `text` block for instructions or questions
- One or more `image_url` blocks for image inputs
Basic structure:
```json
{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    }
  ]
}
```
The `url` can point to:
- A public image URL
- A `data:image/...;base64,...` data URL for a local image you encoded yourself
Some vision models also support an optional `detail` field inside `image_url`, such as `low`, `high`, or `auto`. Use it only if your selected model supports image detail control.
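As a sketch, the optional `detail` hint can be attached when building the image block. The `image_block` helper below is hypothetical, not part of any SDK; it only assembles the request structure shown above:

```python
def image_block(url, detail=None):
    """Build an image_url content block, optionally with a detail hint.

    Hypothetical helper: whether "detail" is honored depends on the model.
    """
    block = {"type": "image_url", "image_url": {"url": url}}
    if detail is not None:
        block["image_url"]["detail"] = detail  # e.g. "low", "high", or "auto"
    return block

print(image_block("https://example.com/image.jpg", detail="low"))
```

If the model ignores `detail`, the block is still valid; the hint is simply dropped.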
Single Image Example
```bash
curl 'https://api.hpc-ai.com/inference/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your_token_here>' \
  --data '{
    "model": "moonshotai/kimi-k2.5",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is happening in this image? Answer in 3 bullet points."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'
```
Read the generated answer from `choices[0].message.content`.
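If you call the endpoint directly rather than through an SDK, the answer sits at the same path in the raw JSON body. A sketch with a trimmed stand-in response (the string below is illustrative, not a real API reply):

```python
import json

# Stand-in for the JSON body returned by /chat/completions:
raw = '{"choices": [{"message": {"role": "assistant", "content": "A dog on a beach."}}]}'

body = json.loads(raw)
answer = body["choices"][0]["message"]["content"]
print(answer)
```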
Base64 Image Input
If your image is stored locally or is not publicly accessible, encode it as base64 and send it as a data URL.
Python Example
```python
import base64
import os

from openai import OpenAI


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


client = OpenAI(
    api_key=os.environ["INFERENCE_API_KEY"],
    base_url=os.environ["INFERENCE_BASE_URL"],
)

image_base64 = encode_image("sample.jpg")

response = client.chat.completions.create(
    model=os.environ["INFERENCE_MODEL"],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the visible text from this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_base64}"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```
JavaScript / TypeScript Example
```typescript
import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.INFERENCE_API_KEY,
  baseURL: process.env.INFERENCE_BASE_URL,
});

const imageBase64 = fs.readFileSync("sample.jpg", "base64");

const response = await client.chat.completions.create({
  model: process.env.INFERENCE_MODEL!,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize the key information in this screenshot." },
        {
          type: "image_url",
          image_url: {
            url: `data:image/jpeg;base64,${imageBase64}`,
          },
        },
      ],
    },
  ],
  max_tokens: 300,
});

console.log(response.choices[0].message.content);
```
Multiple Images
You can include multiple image blocks in the same message. This is useful for comparison, before-and-after analysis, or asking the model to merge information across several screenshots or pages.
```json
{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images and explain the main differences."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image-1.jpg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image-2.jpg"
          }
        }
      ]
    }
  ]
}
```
Keep in mind:
- Multiple images increase token usage and latency
- Some model families handle only a small number of images well
- For smaller or older VLMs, fewer images often produce better results
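Building the message above by hand gets repetitive as the image count grows. A minimal sketch of a builder, where `multi_image_message` is a hypothetical helper rather than an SDK function:

```python
def multi_image_message(prompt, urls):
    """Build one user message: a text block followed by one image block per URL.

    Hypothetical helper; mind the per-model limits on image count noted above.
    """
    content = [{"type": "text", "text": prompt}]
    for url in urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

msg = multi_image_message(
    "Compare these two images and explain the main differences.",
    ["https://example.com/image-1.jpg", "https://example.com/image-2.jpg"],
)
```

The resulting dict can be passed directly in the `messages` array of a `/chat/completions` request.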
Prompting Tips
Vision quality depends heavily on the prompt. These practices usually help:
- Tell the model exactly what to look for, such as text, objects, layout, defects, or differences
- Ask for a specific output format, such as bullets, JSON, or a table
- Mention whether you want OCR, summarization, classification, or comparison
- For document images, ask for extracted fields by name
- For multi-image prompts, label the images in your instruction, such as "first image" and "second image"
Example:
```json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Read this invoice image and return JSON with invoice_number, vendor_name, invoice_date, and total_amount."
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/invoice.png"
      }
    }
  ]
}
```
Common Use Cases
OCR-like Extraction
Use a document image plus explicit field instructions:
```json
{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract the product name, serial number, and price from this label. Return valid JSON."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/label.jpg"
          }
        }
      ]
    }
  ],
  "response_format": {
    "type": "json_object"
  }
}
```
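Even with `json_object` mode, it is worth validating the model's answer before using it downstream. A sketch, using a stand-in response string and the field names from the prompt above (the `parse_label_fields` helper is hypothetical):

```python
import json


def parse_label_fields(raw):
    """Parse the model's JSON answer and check that the expected keys exist."""
    data = json.loads(raw)  # raises ValueError if the model returned invalid JSON
    expected = {"product_name", "serial_number", "price"}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


# Stand-in for choices[0].message.content from a real response:
fields = parse_label_fields(
    '{"product_name": "Widget", "serial_number": "SN-1", "price": "9.99"}'
)
```

On a validation failure you can retry the request, optionally feeding the error back to the model in a follow-up message.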
Screenshot or UI Analysis
This is useful for QA workflows, design review, or support automation:
```json
{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Analyze this application screenshot and identify any visible errors, warnings, or UX issues."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/app-screen.png"
          }
        }
      ]
    }
  ]
}
```
Token Usage and Cost
Images are converted into visual tokens before the model answers. That means image inputs contribute to usage and billing, just like text tokens.
Actual token usage depends on the selected model and, in some cases:
- Image resolution
- Number of images
- Whether the model supports `detail` control
- How the provider preprocesses images internally
If your requests become expensive or slow:
- Use fewer images
- Resize very large images before upload
- Crop to the relevant region when possible
- Use lower-detail image settings if your chosen model supports them
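Note that base64 data URLs also inflate the request payload itself: base64 encodes every 3 input bytes as 4 output characters, roughly a 33% overhead before any tokenization. A sketch that estimates the on-the-wire size of a data URL (payload size only, not visual token count, which is model-specific):

```python
def data_url_size(image_bytes, mime="image/jpeg"):
    """Approximate character count of a base64 data URL for the given bytes."""
    prefix = f"data:{mime};base64,"
    # base64 maps each 3-byte group to 4 characters, padding the final group
    encoded_len = 4 * ((len(image_bytes) + 2) // 3)
    return len(prefix) + encoded_len

# A 3 MB JPEG becomes roughly 4 MB of request payload:
print(data_url_size(b"\x00" * 3_000_000))
```

This is one more reason to resize or crop large images before encoding them.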
Limitations and Best Practices
- Use a model that explicitly supports image input. Not all chat models are vision-capable.
- Check the current model catalog before hard-coding a model name, because the available VLM list may change over time.
- Public image URLs should be directly accessible by the inference service.
- For local files, base64 data URLs are often the most reliable option.
- Very large images, too many images, or low-quality screenshots can reduce accuracy.
- Some models perform better on natural images, while others are stronger at documents, OCR, or charts.
- If you need strict machine-readable output, combine vision input with `response_format`.
Choosing a Vision Model
When selecting a model, consider:
- Visual understanding quality
- OCR and document parsing performance
- Multi-image support
- Latency and cost
- Maximum context and image handling limits
If you are unsure which model to use, start by checking the current available models in your platform catalog and choose one that explicitly supports visual input.