Vision-language Models

Vision-language models (VLMs) can understand both images and text in the same request. You can use them for image captioning, visual question answering, OCR-like extraction, chart and document understanding, screenshot analysis, and multi-image comparison.

For most integrations, use the OpenAI-compatible /chat/completions API. Requests use the same message format as text models; the difference is that a message's content can be an array that mixes text blocks and image blocks.

What Vision-language Models Can Do

Typical use cases include:

  • Describe or summarize the contents of an image
  • Answer questions about objects, text, layout, or relationships inside an image
  • Extract text or structured information from receipts, forms, labels, or screenshots
  • Compare multiple images in one request
  • Understand charts, tables, slides, and other visual documents
  • Combine visual input with conversation history for multi-turn workflows

Request Format

To send an image, pass a messages array to /chat/completions. In a vision request, content is usually an array of blocks:

  • A text block for instructions or questions
  • One or more image_url blocks for image inputs

Basic structure:

{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    }
  ]
}

The url can point to:

  • A public image URL
  • A data:image/...;base64,... data URL for a local image you encoded yourself

Some vision models also support an optional detail field inside image_url, such as low, high, or auto. Use it only if your selected model supports image detail control.
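If your model does support detail control, a small helper (hypothetical, not part of any SDK) can build the block either way; the "low"/"high"/"auto" values are the common convention assumed here:

```python
from typing import Optional

def image_block(url: str, detail: Optional[str] = None) -> dict:
    """Build an image_url content block, optionally adding a detail hint."""
    block = {"type": "image_url", "image_url": {"url": url}}
    if detail is not None:
        # "low", "high", and "auto" are the common convention (assumed here);
        # models without detail control may ignore or reject this field.
        block["image_url"]["detail"] = detail
    return block

print(image_block("https://example.com/image.jpg", detail="low"))
```

Omitting detail keeps the block compatible with models that do not accept the field.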

Single Image Example

curl 'https://api.hpc-ai.com/inference/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your_token_here>' \
  --data '{
    "model": "moonshotai/kimi-k2.5",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is happening in this image? Answer in 3 bullet points."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

Read the generated answer from:

choices[0].message.content

Base64 Image Input

If your image is stored locally or is not publicly accessible, encode it as base64 and send it as a data URL.

Python Example

import base64
import os

from openai import OpenAI


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


client = OpenAI(
    api_key=os.environ["INFERENCE_API_KEY"],
    base_url=os.environ["INFERENCE_BASE_URL"],
)

image_base64 = encode_image("sample.jpg")

response = client.chat.completions.create(
    model=os.environ["INFERENCE_MODEL"],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the visible text from this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

JavaScript / TypeScript Example

import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.INFERENCE_API_KEY,
  baseURL: process.env.INFERENCE_BASE_URL,
});

const imageBase64 = fs.readFileSync("sample.jpg", "base64");

const response = await client.chat.completions.create({
  model: process.env.INFERENCE_MODEL!,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize the key information in this screenshot." },
        {
          type: "image_url",
          image_url: {
            url: `data:image/jpeg;base64,${imageBase64}`,
          },
        },
      ],
    },
  ],
  max_tokens: 300,
});

console.log(response.choices[0].message.content);

Multiple Images

You can include multiple image blocks in the same message. This is useful for comparison, before-and-after analysis, or asking the model to merge information across several screenshots or pages.

{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images and explain the main differences."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image-1.jpg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image-2.jpg"
          }
        }
      ]
    }
  ]
}

Keep in mind:

  • Multiple images increase token usage and latency
  • Some model families handle only a small number of images well
  • For smaller or older VLMs, fewer images often produce better results
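Building on the points above, a hypothetical helper (not part of any SDK) can assemble a multi-image message and number the images in the text so the instruction can refer to them:

```python
def build_multi_image_message(instruction: str, image_urls: list) -> dict:
    """Build one user message: a text block followed by one block per image."""
    # Numbering the images in the text lets the instruction refer to
    # "image 1", "image 2", and so on.
    labels = " ".join(f"Image {i + 1} is attached." for i in range(len(image_urls)))
    content = [{"type": "text", "text": f"{instruction} {labels}".strip()}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": content}

message = build_multi_image_message(
    "Compare these images and explain the main differences.",
    ["https://example.com/image-1.jpg", "https://example.com/image-2.jpg"],
)
```

The returned dict can be passed directly as an element of the messages array.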

Prompting Tips

Vision quality depends heavily on the prompt. These practices usually help:

  • Tell the model exactly what to look for, such as text, objects, layout, defects, or differences
  • Ask for a specific output format, such as bullets, JSON, or a table
  • Mention whether you want OCR, summarization, classification, or comparison
  • For document images, ask for extracted fields by name
  • For multi-image prompts, label the images in your instruction, such as "first image" and "second image"

Example:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Read this invoice image and return JSON with invoice_number, vendor_name, invoice_date, and total_amount."
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/invoice.png"
      }
    }
  ]
}

Common Use Cases

OCR-like Extraction

Use a document image plus explicit field instructions:

{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract the product name, serial number, and price from this label. Return valid JSON."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/label.jpg"
          }
        }
      ]
    }
  ],
  "response_format": {
    "type": "json_object"
  }
}
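Even with response_format set to json_object, it is worth validating the reply before using it downstream. A defensive sketch, assuming the reply uses snake_case keys matching the fields requested in the prompt above (the exact key names the model picks are not guaranteed):

```python
import json

def parse_label_fields(reply_text: str) -> dict:
    """Parse the model's JSON reply and check for the fields we asked for."""
    data = json.loads(reply_text)  # raises json.JSONDecodeError on invalid JSON
    expected = {"product_name", "serial_number", "price"}
    missing = expected - data.keys()
    if missing:
        raise KeyError(f"model reply is missing fields: {sorted(missing)}")
    return data

# Illustrative reply text only; a real reply comes from
# response.choices[0].message.content
sample = '{"product_name": "Widget", "serial_number": "SN-123", "price": "9.99"}'
print(parse_label_fields(sample))
```

Naming the exact keys in the prompt, as the example above does, makes this kind of validation much more reliable.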

Screenshot or UI Analysis

This is useful for QA workflows, design review, or support automation:

{
  "model": "moonshotai/kimi-k2.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Analyze this application screenshot and identify any visible errors, warnings, or UX issues."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/app-screen.png"
          }
        }
      ]
    }
  ]
}

Token Usage and Cost

Images are converted into visual tokens before the model answers. That means image inputs contribute to usage and billing, just like text tokens.

Actual token usage depends on the selected model and may also vary with:

  • Image resolution
  • Number of images
  • Whether the model supports detail control
  • How the provider preprocesses images internally

If your requests become expensive or slow:

  • Use fewer images
  • Resize very large images before upload
  • Crop to the relevant region when possible
  • Use lower-detail image settings if your chosen model supports them
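Base64 encoding expands data by exactly 4/3 (plus padding), so a quick size check can flag local images worth resizing before you embed them as data URLs. The 5 MB threshold below is an arbitrary assumption for illustration, not a platform limit:

```python
import base64
import os

def base64_len(n_bytes: int) -> int:
    """Exact length of the base64 text produced for n_bytes of binary data."""
    return ((n_bytes + 2) // 3) * 4  # 4 output chars per 3 input bytes, padded

def should_resize(path: str, limit: int = 5 * 1024 * 1024) -> bool:
    """True if the data-URL payload for this file would exceed the limit.

    The default limit is a hypothetical threshold; check your platform's
    actual request size limits.
    """
    return base64_len(os.path.getsize(path)) > limit
```

For example, a 6 MB JPEG becomes roughly 8 MB of base64 text, which is a lot of request payload for a single image.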

Limitations and Best Practices

  • Use a model that explicitly supports image input. Not all chat models are vision-capable.
  • Check the current model catalog before hard-coding a model name, because the available VLM list may change over time.
  • Public image URLs should be directly accessible by the inference service.
  • For local files, base64 data URLs are often the most reliable option.
  • Very large images, too many images, or low-quality screenshots can reduce accuracy.
  • Some models perform better on natural images, while others are stronger at documents, OCR, or charts.
  • If you need strict machine-readable output, combine vision input with response_format.

Choosing a Vision Model

When selecting a model, consider:

  • Visual understanding quality
  • OCR and document parsing performance
  • Multi-image support
  • Latency and cost
  • Maximum context and image handling limits

If you are unsure which model to use, start by checking the current available models in your platform catalog and choose one that explicitly supports visual input.