
Seeing with GPT‑4o: Building with OpenAI’s Vision Capabilities

14 min read
Raymond F
Published January 12, 2026

Over the last few years, developers have gone from using language models for text-only chat to relying on them as general-purpose perception systems. You're not only building chatbots; you're building apps that use text, audio, and vision to understand and act on the world around them.

GPT-4o is the most capable step yet: a single model that can read an image, understand layout and structure, extract text, compare visuals, and reason about what it sees, all in the same conversation where it interprets your instructions.

This unlocks a huge range of tasks that previously required custom computer-vision tooling: reading dashboards, parsing PDFs, analyzing charts, understanding UI state, or answering questions about a screenshot. The challenge isn't just sending an image to GPT-4o; it's knowing how to pair visuals with text instructions, how to structure prompts for consistent output, and how to validate what the model "saw."

This guide walks through those best practices and shows how to build reliable, real-world workflows using GPT-4o Vision.

How GPT‑4o Vision Works

GPT-4o was OpenAI's first flagship model trained end-to-end across text, images, audio, and video, using a unified multimodal architecture. Earlier generations relied on a separate vision encoder that fed into a language model.

GPT-4o works differently: it processes every modality (pixels, text tokens, waveforms) through the same attention mechanisms. This architectural shift lets the model interpret visual information using the same reasoning stack it uses for language, making perception far more reliable and context-aware.

For developers, the practical upshot is that any image you provide is native context. Screenshots, photos, diagrams, charts, scanned documents: GPT-4o treats them all as first-class input. The model "reads" an image by extracting salient features (objects, layout, text, structure, spatial relationships) and fuses those representations with your text instructions.

Example of how ChatGPT 4o reads an image file to extract salient features

The result is a single context window where the model can understand, reference, and reason about visual and textual elements simultaneously.

Understanding the Shared Budget of Multimodal Context Windows

GPT-4o supports large context windows (commonly up to 128K tokens), but this budget is shared across all modalities. Images consume part of it, and the cost depends on resolution and visual complexity. High-resolution images use more tokens; downsampled or low-detail images use fewer. Multiple images add up quickly.

You can control this tradeoff with the detail parameter, which accepts low, high, or auto.

  • Setting detail to low processes the image at 512×512 and costs a fixed 85 tokens, regardless of the original resolution. This is useful when you don't need fine-grained detail (e.g., identifying dominant colors or shapes).

  • For high detail mode, the token cost scales with image size. The model first scales the image to fit within a 2048×2048 square, scales the shortest side to 768px, then divides it into 512×512 tiles, charging 170 tokens per tile plus a base cost of 85 tokens. A 1024×1024 image in high mode costs roughly 765 tokens; a 2048×4096 image costs around 1,105 tokens. These numbers matter when you're processing many images or working near the limits of context.

You can use these formulas to estimate image token costs before you send a request.
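To make the arithmetic concrete, here is a minimal sketch of that tiling math in JavaScript. It assumes the scaling rules described above (fit within 2048×2048, shortest side to 768px, 85 base tokens plus 170 per 512×512 tile); treat it as an estimate rather than a billing guarantee.

```javascript
// Rough token estimate for an image sent with detail: "high".
// Assumes the tiling rules described above; actual billing may vary slightly.
function estimateHighDetailTokens(width, height) {
  // 1. Scale to fit within a 2048x2048 square (never upscale)
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // 2. Scale so the shortest side is 768px
  // (assumption: small images are not upscaled)
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // 3. 85 base tokens + 170 tokens per 512x512 tile
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateHighDetailTokens(1024, 1024)); // 765
console.log(estimateHighDetailTokens(2048, 4096)); // 1105
```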

Because context is shared, GPT-4o can engage in persistent multimodal reasoning. You can refer back to previous images ("zoom into the label on the door"), pair text instructions with visual input ("analyze this table using the criteria below"), or compare several images in one request ("highlight the UI differences between A and B"). This continuity enables real workflows such as UI debugging, multi-step chart interpretation, and multi-image document extraction.

When Should You Use Vision vs. Text-Only Models?

Use GPT-4o Vision when the information you need is actually in the pixels:

  • Screenshots and UI state analysis (buttons, modals, navigation, errors)

  • Charts, plots, tables, dashboards

  • Documents where layout matters (invoices, PDFs, forms)

  • Real-world scenes (objects, signage, labels, spatial relations)

  • Image comparison and diffing

  • OCR and structured text extraction

If the task relies purely on structured or textual input (parsing JSON, writing SQL, generating code), text-only models are typically more cost-efficient.

That said, vision has known limitations. The model struggles with medical images, non-Latin text (Japanese, Korean, etc.), small or rotated text, precise spatial reasoning (like chess positions), and accurate object counting. Panoramic and fisheye images also cause problems. If your use case falls into one of these categories, you'll likely need specialized tooling or additional validation.

How Does GPT-4o Process Visual Input?

The underlying architecture is complex, but the high-level pipeline is conceptually straightforward:

  1. Visual Encoding. The image is converted into internal feature representations that capture objects, text, spatial layout, and semantic cues.

  2. Multimodal Fusion. These visual features are merged with text tokens into a single shared attention space. This is the core innovation: the model attends to pixels and words using the same transformer layers.

  3. Autoregressive Output. GPT-4o generates text (or other outputs) conditioned on the fused context. Because the model reasons jointly across all modalities, it can cite what it "saw," follow instructions that reference specific regions, and combine visual cues with logic or domain knowledge.

GPT-4o accepts PNG, JPEG, WEBP, and non-animated GIF formats, with a 50MB total payload limit and a cap of 500 images per request.

This pipeline is what allows GPT-4o to move beyond traditional OCR or computer vision tasks and perform true visual reasoning, from interpreting a UI flow to analyzing the shape of a data trend in a chart.

Working with GPT-4o and Images

Now that we've covered how GPT-4o processes visual input, let's look at how to actually use it. The API is straightforward: you send an image alongside your text prompt, and the model responds based on both.

To get started, grab an OpenAI API key and add it to your environment:

```shell
export OPENAI_API_KEY="<your OpenAI API key>"
```

You can provide images to GPT-4o in three ways:

  • URL: A direct link to a publicly accessible image

  • Base64: The image encoded as a data URL string

  • File ID: A reference to a file you've uploaded via OpenAI's Files API

You can include multiple images in a single request by adding them to the content array, but remember that each image consumes tokens and adds to your bill.
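For reference, here is roughly what the first two options look like as content parts in a request (the URL and file path are placeholders), with the detail setting from earlier applied per image:

```javascript
import * as fs from "fs";

// Content parts for two of the three input methods (values are placeholders).
// File IDs uploaded via the Files API are the third option; check OpenAI's
// docs for the exact shape in the API you're using.

// 1. Public URL — OpenAI fetches the image; `detail` controls the token cost
const urlPart = {
  type: "image_url",
  image_url: { url: "https://example.com/dashboard.png", detail: "low" },
};

// 2. Base64 data URL — useful for local files or images behind auth
const base64Image = fs.readFileSync("./dashboard.png").toString("base64");
const base64Part = {
  type: "image_url",
  image_url: { url: `data:image/png;base64,${base64Image}`, detail: "high" },
};

// Either part goes into the `content` array of a user message,
// alongside your text instructions.
```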

Pairing Images with Instructions

The order and structure of your content array matters. A few guidelines:

  • Put instructions before the image. The model processes content sequentially. Placing your text prompt first primes it for what to look for, which generally improves extraction accuracy.

  • Be explicit about image references. If you're sending multiple images, label them ("Image A shows the before state, Image B shows after") and reference those labels in your instructions.

  • Keep instructions close to the image they reference. In multi-turn conversations, don't assume the model will connect a prompt in message 3 with an image from message 1. Repeat or summarize context when needed.
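Putting these guidelines together, a two-image comparison request might look like the following sketch, with the instructions first and each image labeled explicitly (the URLs are placeholders):

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        // Instructions come first and name the images explicitly
        {
          type: "text",
          text:
            "Image A is the current checkout screen; Image B is the redesign. " +
            "List the UI differences between A and B as bullet points.",
        },
        { type: "text", text: "Image A:" },
        { type: "image_url", image_url: { url: "https://example.com/checkout-a.png" } },
        { type: "text", text: "Image B:" },
        { type: "image_url", image_url: { url: "https://example.com/checkout-b.png" } },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```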

Extracting Structured Data from a Screenshot

Let's walk through a practical example: extracting pricing information from a screenshot. We'll use an image of Stream's pricing page:

Stream Chat pricing landing page

We'll ask the model to return structured JSON describing what's on this page. Here's the code:

```javascript
import OpenAI from "openai";
import * as fs from "fs";

const openai = new OpenAI();

async function analyzeImage(imagePath) {
  const imageBuffer = fs.readFileSync(imagePath);
  const base64Image = imageBuffer.toString("base64");

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `Extract pricing information from this page. Respond with JSON:
{
  "product_name": "the product being priced",
  "company": "company name",
  "mau_options": ["available MAU tiers from slider"],
  "tiers": [
    {
      "name": "tier name",
      "tagline": "short description",
      "price": "price or FREE",
      "billing_period": "annual/monthly/custom",
      "monthly_price": "monthly equivalent if shown",
      "mau": "included MAU",
      "concurrent_connections": 0,
      "features": ["list of features"],
      "cta": "call to action button text"
    }
  ]
}`,
          },
          {
            type: "image_url",
            image_url: {
              url: `data:image/png;base64,${base64Image}`,
            },
          },
        ],
      },
    ],
    response_format: { type: "json_object" },
  });

  return JSON.parse(response.choices[0].message.content);
}

const analysis = await analyzeImage("./screen.png");
console.log(JSON.stringify(analysis, null, 2));
```

A few things to note about this code:

  1. Image encoding. We read the image from disk and convert it to a Base64 string. The image_url field accepts this as a data URL with the appropriate MIME type (data:image/png;base64,...).

  2. Mixed content array. The content field contains both a text object (our extraction instructions and target schema) and an image object. GPT-4o processes these together in a single context.

  3. JSON mode. Setting response_format: { type: "json_object" } tells the model to return valid JSON. Combined with the schema we provided in the prompt, this gives us structured, parseable output.

  4. Schema as prompt. We're not using a formal schema validation layer here. Instead, we embed the expected JSON structure directly in the prompt. The model follows this template and fills in the values based on what it sees in the image.

Running this against Stream's pricing page produces:

```json
{
  "product_name": "Chat",
  "company": "Stream",
  "mau_options": ["10K MAU", "25K MAU", "50K MAU"],
  "tiers": [
    {
      "name": "Build",
      "tagline": "Start building for free",
      "price": "FREE",
      "billing_period": "custom",
      "monthly_price": null,
      "mau": "1,000 MAU",
      "concurrent_connections": 100,
      "features": [
        "No Credit Card Required",
        "Community Support",
        "30 Days of Free Support"
      ],
      "cta": "Start Coding"
    },
    {
      "name": "Start",
      "tagline": "Robust Chat features",
      "price": "$399.00",
      "billing_period": "annual",
      "monthly_price": "$499.00 monthly",
      "mau": "10,000 MAU",
      "concurrent_connections": 500,
      "features": [
        "Advanced Moderation & Filters",
        "2 Billion Records",
        "Global EDGE Network"
      ],
      "cta": "Start Coding"
    },
    {
      "name": "Elevate",
      "tagline": "Start Chat features plus more",
      "price": "$599.00",
      "billing_period": "annual",
      "monthly_price": "$675.00 monthly",
      "mau": "10,000 MAU",
      "concurrent_connections": 500,
      "features": [
        "Multi Tenancy/Teams",
        "Advanced Search",
        "HIPAA"
      ],
      "cta": "Start Coding"
    },
    {
      "name": "Enterprise",
      "tagline": "Enterprise-grade service, bigger annual discounts",
      "price": "Contact Us",
      "billing_period": "custom",
      "monthly_price": null,
      "mau": "Scale to millions of users",
      "concurrent_connections": 0,
      "features": [
        "AI Moderation",
        "99.999% SLA",
        "Dedicated Servers"
      ],
      "cta": "Contact Us"
    }
  ]
}
```

The model correctly identified all four pricing tiers, extracted the feature lists, parsed the MAU slider options, and even captured details like the annual vs. monthly price distinction.

This is the kind of task that would traditionally require custom scraping logic or a dedicated OCR pipeline. With GPT-4o, you define the output structure you want, and the model handles the visual parsing.
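Because the schema here is enforced only by the prompt, a lightweight sanity check on the parsed result is cheap insurance before you trust it downstream. Here's a minimal sketch; the required keys mirror the prompt above, and you could swap in a proper JSON Schema validator for stricter guarantees:

```javascript
// Minimal sanity check on the parsed response from analyzeImage() above.
// This mirrors the schema embedded in the prompt; it is not a formal validator.
function validatePricingData(data) {
  const requiredTopLevel = ["product_name", "company", "mau_options", "tiers"];
  const missing = requiredTopLevel.filter((key) => !(key in data));
  if (missing.length > 0) {
    throw new Error(`Missing top-level keys: ${missing.join(", ")}`);
  }
  if (!Array.isArray(data.tiers) || data.tiers.length === 0) {
    throw new Error("Expected at least one pricing tier");
  }
  for (const tier of data.tiers) {
    if (typeof tier.name !== "string" || !Array.isArray(tier.features)) {
      throw new Error(`Malformed tier: ${JSON.stringify(tier)}`);
    }
  }
  return data;
}

const validated = validatePricingData(analysis); // `analysis` from the example above
```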

Designing Prompts for Structured Output

Getting consistent, structured output from GPT-4o requires more than just asking for JSON or Markdown. You need to constrain the model's behavior with explicit instructions. A well-designed, structured prompt typically includes:

  1. Role and task. Tell the model what persona to adopt and what it's doing.

  2. Output schema. Specify the exact format (JSON schema, Markdown template, table headers).

  3. Inclusion/exclusion rules. Define what belongs in the output and what doesn't.

  4. Examples. Show the model what a correct response looks like.

  5. Self-validation. Ask the model to check its work before returning the final output.

You won't always need all five, but the more structured your target output, the more constraints help.

Template 1: Chart Interpretation

When you need the model to analyze a chart and return findings in a predictable format, give it a rigid Markdown template:

```
You are a data analyst. Analyze the provided chart.

Output only in the following Markdown template. Do not add extra sections.

## Template
1. Chart type: <one of: line, bar, stacked bar, scatter, pie, area, heatmap, table, other>
2. Top 3 key values
3. Trends/patterns: 3-5 bullet points, each must reference a specific axis value
4. Anomalies: <bullets or "None visible">
5. Conclusion: <2-3 sentence summary>

## Rules
- If a label is unreadable, write "unclear" instead of guessing.
- No external knowledge; use only what's visible in the chart.

## Before finalizing, check:
- Every trend mentions numbers or axis locations.
- No extra text outside the template.
```

The self-check at the end catches standard failure modes: vague trend descriptions and extraneous commentary.

Template 2: UI Screenshot Review

For structured data you'll process programmatically, JSON is cleaner than Markdown. This template asks the model to audit a UI and return issues in a parseable format:

```
You are a product UX reviewer. Inspect the UI screenshot.

Return only JSON matching this schema:

{
  "screen_purpose": "",
  "main_elements": [
    {
      "type": "button|text|input|nav|card|table|chart|icon|other",
      "label": "",
      "location": "top-left|top|top-right|left|center|right|bottom-left|bottom|bottom-right"
    }
  ],
  "issues": [
    {
      "severity": "low|medium|high",
      "category": "layout|copy|accessibility|consistency|interaction|performance|other",
      "description": "",
      "evidence": ""
    }
  ],
  "suggested_improvements": [
    {
      "priority": "p0|p1|p2",
      "change": "",
      "expected_impact": ""
    }
  ]
}

Rules:
- "evidence" must reference a specific element and its location.
- If uncertain about something, use "unknown" rather than guessing.
- Output valid JSON only. No markdown fences, no comments.
```

The enum-style constraints (severity, category, priority) make downstream processing easier since you know exactly what values to expect.
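For example, a downstream check can simply reject any issue whose severity or category falls outside those enums. A minimal sketch (the allowed values mirror the template above):

```javascript
// Drop issues whose enum fields fall outside the values the prompt allows.
const SEVERITIES = ["low", "medium", "high"];
const CATEGORIES = [
  "layout", "copy", "accessibility", "consistency",
  "interaction", "performance", "other",
];

function filterValidIssues(review) {
  return review.issues.filter(
    (issue) =>
      SEVERITIES.includes(issue.severity) && CATEGORIES.includes(issue.category)
  );
}

// Usage: `review` is the parsed JSON returned by the UI review prompt above.
// const highPriority = filterValidIssues(review).filter((i) => i.severity === "high");
```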

Template 3: Multi-Image Comparison

When comparing two images (UI versions, chart snapshots, design iterations), a table format works well because it forces parallel structure (ChatGPT did this unprompted in the example above, likely because similar guidance is embedded in its own instructions):

```
Compare Image A and Image B.

Output a Markdown table with these columns: Area, A, B, Impact, Confidence.

Rules:
- Area must be one of: layout, color/typography, content/data, interaction, performance, other.
- Impact must be: none, low, medium, high.
- Confidence must be: low, medium, high.
- One row per difference. If no differences in an area, omit the row.
```

This works well for UI diffing, A/B test analysis, or tracking changes across chart versions.

Template 4: Metric Extraction

When extracting tabular data from an image (dashboards, reports, spec sheets), define the exact table header you want:

```
Extract metrics from this image.

Output only a Markdown table with this exact header:

| Metric | Value | Unit | Evidence | Notes |

Rules:
- One metric per row.
- Evidence must quote or point to where the value appears in the image.
- If a field is missing, put "N/A".
- No extra rows beyond the metrics you find.
```

Markdown tables are useful when you need consistent fields across many items, or when the output will be copied into docs or spreadsheets. The "Evidence" column forces the model to ground its extractions in what it actually saw, which reduces hallucination.

How to Debug Visual Reasoning

GPT-4o will sometimes get things wrong. Text gets misread, objects get miscounted, and spatial relationships get confused. When this happens, you need a systematic way to diagnose and fix the problem.

Common failure modes include:

  • Hallucinated details. The model invents text, numbers, or UI elements that aren't in the image. This is especially common with small or low-contrast text.

  • Misread text. Characters get swapped, especially in stylized fonts or low-resolution images. "S" becomes "5," "I" becomes "l."

  • Counting errors. The model gives approximate counts rather than exact ones. If you need precision, don't trust it.

  • Spatial confusion. "Left" and "right" get mixed up, or the model misidentifies which label belongs to which element.

If you're running up against these issues, try these diagnostic steps:

  1. Force citations. Add a rule like "For every claim, cite where in the image you see it." This surfaces cases where the model is guessing rather than reading.

  2. Two-pass extraction. In pass one, ask the model only to list what it sees (text, objects, values) without interpretation. In pass two, ask it to reason over that extracted data. This separates perception errors from reasoning errors (see the sketch after this list).

  3. Zoom in on failures. If a specific region is causing problems, crop the image to just that area and resubmit. Fewer distractions often improve accuracy.

  4. Toggle resolution. Try detail: high if you're on low, or vice versa. Sometimes the model performs better with more (or less) visual information.

  5. Rephrase the question. A different framing can produce different results. "What text appears in the header?" may work better than "Read the header."
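Here's a minimal sketch of the two-pass approach from step 2, reusing the Base64 pattern from earlier (the file name, model choice, and prompts are just examples):

```javascript
import OpenAI from "openai";
import * as fs from "fs";

const openai = new OpenAI();

// Helper: send one text prompt plus one image and return the reply text
async function askAboutImage(prompt, base64Image) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${base64Image}` },
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}

const base64Image = fs.readFileSync("./dashboard.png").toString("base64");

// Pass 1: perception only — list what is visible, no interpretation
const extracted = await askAboutImage(
  "List every piece of text, every number, and every labeled element you can read in this image. Do not interpret or summarize.",
  base64Image
);

// Pass 2: reasoning only — work from the extracted list, not the pixels
const analysis = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: `Here is what was extracted from a dashboard screenshot:\n\n${extracted}\n\nWhich metrics look anomalous, and why?`,
    },
  ],
});

console.log(analysis.choices[0].message.content);
```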

If you've tried multiple prompt variations and resolution settings and the model still fails consistently, stop iterating; you've likely hit a fundamental limitation. Medical images, non-Latin scripts, precise spatial tasks, and fine-grained counting are known weak spots. At that point, consider specialized tooling (e.g., dedicated OCR or object detection models) or human review.

Frequently Asked Questions

How can I prompt GPT‑4o to return structured JSON reliably?

GPT-4o can produce structured JSON in two ways: via API enforcement or prompt-only instructions. Prefer API-enforced structured outputs or function calls; when you can't enforce a schema, embed the exact JSON structure you want in the prompt, as shown earlier.
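When you can enforce it, the json_schema response format (Structured Outputs) looks roughly like this sketch; the schema and field names are illustrative, and the image URL is a placeholder:

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the product name and price from this image." },
        { type: "image_url", image_url: { url: "https://example.com/pricing.png" } },
      ],
    },
  ],
  // Structured Outputs: the API enforces this schema instead of relying on the prompt
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "pricing_extraction",
      strict: true,
      schema: {
        type: "object",
        properties: {
          product_name: { type: "string" },
          price: { type: "string" },
        },
        required: ["product_name", "price"],
        additionalProperties: false,
      },
    },
  },
});

console.log(JSON.parse(response.choices[0].message.content));
```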

What are GPT‑4o Vision's limitations with complex or abstract images?

The model is great at semantic understanding ("What is here?") but still weaker on geometric/low-level tasks ("Exactly where is the cat? And how many of them are there?"). Also, it can interpret abstract images in ways that differ from what is actually in the image.

How should multiple images be handled in a single prompt?

Via API: send multiple image_url parts in a single message (or across multiple messages) and refer to them by label. For example, "you will see image A, image B... summarize each image separately."

How do you optimize performance and cost for GPT‑4o Vision tasks?

Vision cost, like text cost, is token-based, and images consume tokens based primarily on their size. A smaller image means fewer tokens, but downsampling so aggressively that details disappear leaves the model unable to tell what's in the image. Start with a lower-detail image, see how GPT-4o performs, and increase the resolution until it produces the desired output. Pair that with tight instructions ("focus only on X") and structured outputs to keep the model from drifting away from what you actually asked.

How do you verify the accuracy of visual reasoning results?

Force the model to cite where it "saw" each of its claims. Perform two-pass checks: pass one is an extraction-only pass where the model extracts text, objects, and values, and pass two reasons over the data from pass one. You can require a confidence score, though this can be hallucinated. Lastly, you can manually review the output and re-ask with different phrasing as you go.

Where GPT-4o Vision Fits

Understanding the architecture, API mechanics, and prompt design for reliable structured output will allow you to use GPT-4o Vision across use cases:

  • UI state analysis. The model can see what's on screen, identify the current app state, and describe the available actions. This is useful for automated testing, accessibility audits, and building agents that interact with interfaces.

  • Text extraction from images and documents. GPT-4o handles OCR-like reading, but goes further by reasoning over layout. It understands that a heading relates to the paragraph below it, or that a label belongs to a specific form field.

  • Chart interpretation. The model can extract numbers, identify trends, and summarize insights from visualizations. Combined with structured output prompts, this turns chart images into usable data.

  • Image comparison. GPT-4o can compare two or more images with spatial awareness, making it functional for UI diffing or tracking visual changes over time. That said, it struggles with fine-grained detail, so don't rely on it for pixel-level precision.

The broader pattern here is that GPT-4o collapses what used to require multiple specialized tools (OCR engines, object detection models, layout parsers) into a single model you can prompt in natural language. The tradeoff is that you're working with a generalist, not a specialist. For many workflows, that's a good deal. For others, you'll still need dedicated tooling.

The best way to figure out where the line is? Build something and see where it breaks.
