
Developer’s Guide to Building Vision AI Pipelines Using Grok

Raymond F
Published March 13, 2026

Grok tends to fly under the radar. While ChatGPT, Claude, and Gemini have found their footing in enterprise workflows and agentic toolchains, Grok remains mostly associated with X, which has overshadowed some genuinely strong capabilities.

Chief among them is vision: Grok can understand and generate images and produce entire videos from a single prompt, and with competitive API pricing, you can build real vision AI pipelines around it.

That’s what we’re going to do here. We'll build a full-stack application that uses all three of Grok's vision capabilities: image understanding, image generation, and video generation. Then we'll chain them together into a real-time pipeline using Vision Agents that watches a live camera feed, describes what it sees, and generates artistic interpretations on the fly.

Let’s Grok.

How Grok's Vision Stack Works

Most AI image generators, including DALL-E 3, Stable Diffusion, Midjourney, and Google's Imagen, are built on diffusion models. They start with noise and iteratively refine it into an image. Aurora, xAI's image generation model behind Grok, takes a different approach entirely.

Aurora is an autoregressive mixture-of-experts network trained to predict the next token from interleaved text and image data. In practical terms, it generates images the same way an LLM generates text: one token at a time, left to right, conditioned on everything that came before. Images are tokenized into discrete tokens that live in the same stream as text tokens, so when you ask Aurora to generate an image, it's doing the same fundamental operation as when Grok answers a question, just predicting the next token in a sequence that happens to decode into pixels instead of words.

This matters for two reasons.

  1. It makes image editing native. Because the model can condition on a mixed sequence of text and image tokens, you can pass in an image and a text instruction ("remove the background," "change the color of the shirt"), and the model generates a modified image by continuing the token sequence. Diffusion models need separate inpainting or img2img pipelines to do this.
  2. It means the model benefits from the same scaling laws that have driven LLM progress. xAI's MoE bet is relevant here: Grok-1 was released as a 314B parameter mixture-of-experts model with 25% of weights active per token. Aurora carries the same architectural philosophy into the visual domain, using a massive model but activating only a fraction of it per token.

Grok's vision capabilities span three distinct subsystems: multimodal understanding (image in, text out), image generation via Aurora (text in, image out), and video generation (text or image in, video out). They're different models with different interaction patterns, but as we'll see in the examples below, they compose naturally into a single pipeline where understanding feeds generation, which feeds video.

With that context, let's build.

Image Understanding with Grok Vision

Many Grok models are capable of image understanding. The best dedicated model at the time of writing is grok-2-vision-1212. It isn't xAI's latest model, but the newer Grok models are broad reasoning models that happen to understand images; they aren't specialized for vision the way Grok-2 Vision is.

The vision API follows the same conversational structure as text completions: you send a message with an image and a text prompt, and Grok responds with its analysis:

```typescript
const client = new OpenAI({
  apiKey: process.env.XAI_API_KEY,
  baseURL: "https://api.x.ai/v1",
});

const response = await client.responses.create({
  model: "grok-2-vision-1212",
  input: [
    {
      role: "user",
      content: [
        {
          type: "input_image",
          image_url: imageInput, // URL or base64 data URI
          detail: "high",
        },
        {
          type: "input_text",
          text: prompt,
        },
      ],
    },
  ],
});
```

A few things to notice. First, xAI's API is OpenAI-compatible, so you can use the standard OpenAI SDK by just changing the baseURL to https://api.x.ai/v1. This makes migration trivial if you're already using GPT-4 Vision. Second, the detail: "high" parameter tells Grok to use more tokens to analyze the image, which matters for detailed screenshots or documents with small text.

The image input can be either a public URL or a base64-encoded data URI. The route handles both:

```typescript
if (contentType.includes("multipart/form-data")) {
  const formData = await req.formData();
  const file = formData.get("image") as File;
  const buffer = Buffer.from(await file.arrayBuffer());
  const base64 = buffer.toString("base64");
  imageInput = `data:${file.type};base64,${base64}`;
} else {
  const body = await req.json();
  imageInput = body.imageUrl;
}
```

To test it, we'll feed Grok a product marketing image from the Stream website. This image is a good test because it contains multiple visual elements: a mobile app mockup, a code editor, UI components, and text at various sizes.

Product marketing image from Stream's website

Here's what Grok returns:

Grok's response to a product marketing image from Stream's website

The analysis is thorough. Grok correctly identifies the two main sections (smartphone screen and code editor), reads the UI text ("KirkLe", "For You", "Following", "Discover"), extracts the full code snippet including variable names and function calls, and even notes the background gradient styling. It captures details such as the "+12" overlay on the image grid, the 120 replies and 1,953 likes counts, and the hashtags in the second post.

This level of detail makes Grok Vision ideal for:

  • Extracting structured data from screenshots, dashboards, and charts without manual transcription
  • Auditing UI implementations by comparing what's rendered against what's expected
  • Processing documents, receipts, and forms where OCR alone would miss layout context
  • Feeding into downstream pipelines, where the text output becomes input for another model (which is exactly what we'll do with the Scene Narrator)

The key takeaway here is that Grok Vision doesn't just caption images. It reads them structurally, distinguishing between UI regions, extracting code verbatim, and understanding spatial relationships between elements. That's what makes it useful as the first stage in a larger vision AI pipeline.
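Taking the structured-data use case from the list above, one way to steer the model toward machine-readable output is to ask for JSON explicitly in the text prompt. This is a sketch, not an official API feature: `buildExtractionRequest` is our own helper name, and the requested JSON shape (`labels` and `metrics` keys) is an illustrative assumption.

```typescript
// Hypothetical helper: builds a request body for grok-2-vision-1212 that asks
// for structured JSON instead of free-form prose. The prompt wording and the
// requested output keys ("labels", "metrics") are illustrative assumptions.
function buildExtractionRequest(imageUrl: string) {
  return {
    model: "grok-2-vision-1212",
    input: [
      {
        role: "user",
        content: [
          { type: "input_image", image_url: imageUrl, detail: "high" },
          {
            type: "input_text",
            text:
              "Extract every visible UI label and numeric metric from this " +
              'screenshot as JSON with keys "labels" and "metrics". ' +
              "Respond with JSON only, no commentary.",
          },
        ],
      },
    ],
  };
}
```

Pass this body to `client.responses.create(...)` exactly as in the earlier example, then run the `output_text` through `JSON.parse` inside a try/catch, since models occasionally wrap JSON in stray prose.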

Image Generation with Grok

Image generation with Grok works through a different model, grok-imagine-image, and a different API structure.

The API is straightforward:

```typescript
const response = await fetch("https://api.x.ai/v1/images/generations", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "grok-imagine-image",
    prompt,
    n: Math.min(Math.max(n || 1, 1), 4), // 1-4 images
    aspect_ratio: "1:1", // also supports 16:9, 9:16, 4:3, 3:4, 3:2, 2:3
  }),
});
```

You can generate up to four images per request and control the aspect ratio across seven presets. The response contains either temporary URLs or base64-encoded image data, depending on the response_format parameter you pass. One thing to watch: the URLs are ephemeral, so if you're building anything production-grade, download and persist the images immediately.
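Since those URLs expire, a persistence step is worth wiring in right away. Here's a minimal sketch assuming a Node environment; `persistImages` and `imageFilename` are our own names, and writing to local disk stands in for whatever storage you actually use (in production, object storage is the more likely target).

```typescript
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical naming scheme, kept pure so it's easy to test in isolation.
function imageFilename(index: number, timestamp: number): string {
  return `grok-image-${timestamp}-${index}.png`;
}

// Download each generated image before its temporary URL expires and write it
// to local disk, returning the saved file paths.
async function persistImages(urls: string[], outDir: string): Promise<string[]> {
  await mkdir(outDir, { recursive: true });
  const saved: string[] = [];
  for (let i = 0; i < urls.length; i++) {
    const res = await fetch(urls[i]);
    if (!res.ok) throw new Error(`Download failed (${res.status}) for image ${i}`);
    const bytes = new Uint8Array(await res.arrayBuffer());
    const filePath = path.join(outDir, imageFilename(i, Date.now()));
    await writeFile(filePath, bytes);
    saved.push(filePath);
  }
  return saved;
}
```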

Here's what Grok produces when asked to reimagine the Stream product screenshot we analyzed earlier. We fed it the output from the grok-2-vision-1212 image understanding model:

Grok output when asked to reimagine a Stream product screenshot

OK, we’re not going to put this on our marketing site (it has a little bit of a Comic Sans look), but it is directionally correct, and generated from nothing but a text prompt. It captured the layout (phone mockup on the left, code on the right) and the blue gradient background, though it took creative liberties with the content. This is typical of image generation models: they understand the composition and style well, but won't reproduce exact text or code.

Let’s try a more interesting prompt:

A glowing crystal-powered rocket launching from the red dunes of Mars, ancient alien ruins lighting up in the background as it soars into a sky full of unfamiliar constellations

Here’s the image:

Grok output when asked to create a glowing crystal-powered rocket

That’s a little more like it. Definitely has the X aesthetic. You can use this for any image generation task:

  • Generating product mockups, social media assets, or placeholder art from text descriptions
  • Creating stylized interpretations
  • Iterative image editing, where you pass an existing image back in with a text instruction to modify it

The more interesting use case for image generation comes when you pair it with understanding. Feed Grok Vision's output directly into Aurora's input, and you get a pipeline that can see a scene, describe it, and reimagine it in any style, all through the same API.
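That chain can be sketched end to end. In the sketch below, `reimagine` and `stylePrompt` are our own names; the endpoints and models match the examples above, and the image response shape (`{ data: [{ url }] }`) is an assumption based on the API's OpenAI compatibility.

```typescript
// Prompt assembly kept pure so it's easy to test.
function stylePrompt(style: string, description: string): string {
  return `${style} of: ${description}`;
}

// Sketch of the understanding -> generation chain: describe an image with
// grok-2-vision-1212, then feed that description to grok-imagine-image.
async function reimagine(imageUrl: string, style: string): Promise<string> {
  const headers = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  };

  // Step 1: describe the scene
  const visionRes = await fetch("https://api.x.ai/v1/responses", {
    method: "POST",
    headers,
    body: JSON.stringify({
      model: "grok-2-vision-1212",
      input: [
        {
          role: "user",
          content: [
            { type: "input_image", image_url: imageUrl, detail: "high" },
            { type: "input_text", text: "Describe this image in 1-2 vivid sentences." },
          ],
        },
      ],
    }),
  });
  const visionData = await visionRes.json();
  // Pull the first output_text block out of the nested response
  const description = visionData.output
    ?.flatMap((o: any) => o.content ?? [])
    .find((c: any) => c.type === "output_text")?.text;
  if (!description) throw new Error("No description returned");

  // Step 2: generate a stylized interpretation of that description
  const imageRes = await fetch("https://api.x.ai/v1/images/generations", {
    method: "POST",
    headers,
    body: JSON.stringify({
      model: "grok-imagine-image",
      prompt: stylePrompt(style, description),
      n: 1,
    }),
  });
  const imageData = await imageRes.json();
  return imageData.data[0].url; // ephemeral -- persist it if you need it later
}
```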

Video Generation with Grok

Video generation uses yet another model, grok-imagine-video, and works differently from the other two endpoints because it's asynchronous.

Step one is submitting the request:

```typescript
const response = await fetch("https://api.x.ai/v1/videos/generations", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "grok-imagine-video",
    prompt,
    duration: 5, // 1-15 seconds
    resolution: "480p", // or "720p"
    aspect_ratio: "16:9",
    image: sourceImage ? { url: sourceImage } : undefined,
  }),
});

const { request_id } = await response.json();
```

You have control over duration (1 to 15 seconds), resolution (480p or 720p), and aspect ratio. Longer durations and higher resolutions take more time to render.

In practice, a 5-second 480p video typically returns in 1 to 3 minutes, while a 15-second 720p video can take closer to 5 minutes. The image field is optional: include it to enable image-to-video animation; leave it out to generate pure text-to-video.

This returns immediately with a request_id. Step two is polling for the result:

```typescript
const statusResponse = await fetch(
  `https://api.x.ai/v1/videos/${requestId}`
);
const data = await statusResponse.json();

// Three possible states:
// { video: { url, duration } } → done
// { status: "pending" }        → still processing
// { status: "expired" }        → timed out
```

The API doesn't return a status field when the video is ready. Instead, you check for the presence of data.video.url. This quirk is worth knowing: if you only check a status string, you'll miss completed videos.

```typescript
const isDone = !!data.video?.url;
const status = isDone ? "done" : data.status || "pending";
```

On the frontend, we poll every 15 seconds with a 10-minute timeout:

```typescript
// Start polling
pollingRef.current = setInterval(() => {
  pollStatus(data.requestId);
}, 15000);

// First check after 10 seconds
setTimeout(() => pollStatus(data.requestId), 10000);
```

Polling every 15 seconds is a reasonable interval. Going faster won't make the video render sooner; it will just burn more API calls.
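Server-side, the same logic can be wrapped into a single helper with the 15-second interval and a hard 10-minute timeout. This is a sketch under the response shapes described above; `waitForVideo` and `parseVideoStatus` are our own names, not part of the xAI API.

```typescript
// "Done" is signalled by the presence of video.url, not a status string --
// parseVideoStatus encodes that quirk in one place.
type VideoStatus =
  | { state: "done"; url: string }
  | { state: "pending" }
  | { state: "expired" };

function parseVideoStatus(data: any): VideoStatus {
  if (data.video?.url) return { state: "done", url: data.video.url };
  if (data.status === "expired") return { state: "expired" };
  return { state: "pending" };
}

// Poll every 15 seconds until the video is ready, expired, or we hit the
// 10-minute deadline.
async function waitForVideo(requestId: string, apiKey: string): Promise<string> {
  const deadline = Date.now() + 10 * 60 * 1000;
  while (Date.now() < deadline) {
    const res = await fetch(`https://api.x.ai/v1/videos/${requestId}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const status = parseVideoStatus(await res.json());
    if (status.state === "done") return status.url;
    if (status.state === "expired") throw new Error("Video generation expired");
    await new Promise((resolve) => setTimeout(resolve, 15_000));
  }
  throw new Error("Timed out waiting for video");
}
```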

The video endpoint also supports image-to-video generation: pass a source image along with a prompt describing how to animate it. This is particularly useful when combined with Grok's image generation, since you can create an image and then bring it to life.

Let’s do just that. We’ll pass it our nifty rocket ship from the image generation and make it lift off:


Make sure your sound is turned up: you don’t just get video, you also get audio. This is ideal for:

  • Generating short product demos or explainer clips from a single text prompt
  • Animating AI-generated images (create a still with grok-imagine-image, then bring it to life with grok-imagine-video)
  • Producing social content with synchronized audio without needing a separate audio generation step

The async pattern requires a little more work to implement than a synchronous call, but it enables workflows that aren't practical with the other endpoints alone. As we'll see, chaining all three capabilities together is where Grok's vision stack really comes alive.

Chaining Groks in a Vision AI Pipeline

The examples above are compelling, but the power of vision AI compounds when you combine understanding and generation models. To do that, we could build a pipeline from scratch, but instead, we’re going to build on Vision Agents, an open-source framework for real-time vision AI applications.

We’ll build a Scene Narrator, a real-time AI agent that:

  1. Captures frames from a live video call
  2. Sends each frame to Grok Vision for scene analysis
  3. Takes that description and generates a stylized artistic interpretation
  4. Publishes the generated image back as a video track
  5. Narrates the scene aloud using text-to-speech

The core processing loop lives in SceneProcessor, which captures one frame every five seconds and runs it through the Grok pipeline:

```python
async def _on_frame(self, frame: av.VideoFrame):
    if self._processing:
        logger.debug("Skipping frame — previous analysis still in progress")
        return

    self._processing = True
    try:
        # Convert frame to base64
        image_b64 = self._frame_to_base64(frame)

        # Step 1: Analyze with Grok Vision
        description = await self._analyze_scene(image_b64)

        # Step 2: Generate stylized image
        stylized_frame = await self._generate_stylized_image(description)
        if stylized_frame:
            await self._video_track.add_frame(stylized_frame)

        # Step 3: Emit event for narration
        self._events.send(
            SceneAnalyzedEvent(description=description, style=self._current_style)
        )
    finally:
        self._processing = False
```

This is the entire pipeline in one method. Every five seconds, a video frame arrives. The _processing guard at the top is important: since both the vision analysis and image generation calls take a few seconds each, a new frame could arrive while the previous one is still being processed. Without this guard, you'd stack overlapping API calls and get a narration pileup. Instead, the processor simply skips any frame that arrives while a previous one is in flight.

Let's walk through each step.

Step 1: Scene Analysis with Grok Vision

The first call sends the captured frame to grok-2-vision-1212, the same vision model we used earlier. The frame is converted to a base64-encoded JPEG and sent as a structured content block:

```python
async def _analyze_scene(self, image_b64: str) -> str | None:
    response = await client.post(
        f"{XAI_BASE_URL}/responses",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        json={
            "model": "grok-2-vision-1212",
            "input": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_image",
                            "image_url": f"data:image/jpeg;base64,{image_b64}",
                            "detail": "high",
                        },
                        {
                            "type": "input_text",
                            "text": (
                                "Describe this scene in 1-2 vivid sentences. "
                                "Focus on key subjects, their actions, the setting, "
                                "and the overall mood. Be specific and evocative."
                            ),
                        },
                    ],
                }
            ],
        },
    )
```

This is the same input_image + input_text pattern from the image understanding section, but the prompt is tuned for brevity. We want 1-2 sentences, not the paragraph-length analysis we got when examining the Stream screenshot. The description needs to be short enough to work as both a narration script (read aloud via TTS) and an image-generation prompt (fed into Aurora in the next step).

The response parsing extracts the text from xAI's nested response format:

```python
for output_item in data.get("output", []):
    if output_item.get("type") == "message":
        for content in output_item.get("content", []):
            if content.get("type") == "output_text":
                return content["text"]
```

Step 2: Stylized Image Generation with Aurora

The description from Step 1 becomes the prompt for grok-imagine-image. The processor prepends the current art style to create the full prompt:

```python
async def _generate_stylized_image(self, description: str) -> av.VideoFrame | None:
    prompt = f"{self._current_style} of: {description}"

    response = await client.post(
        f"{XAI_BASE_URL}/images/generations",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        json={
            "model": "grok-imagine-image",
            "prompt": prompt,
            "n": 1,
            "response_format": "b64_json",
        },
    )
```

So if the vision model described the scene as "A man in glasses sits at a wooden table by a large window, resting his chin on his hand with a coffee mug beside him, bare winter trees visible outside," and the current style is "anime illustration," the full prompt becomes:

anime illustration of: A man in glasses sits at a wooden table by a large window, resting his chin on his hand with a coffee mug beside him, bare winter trees visible outside

Notice we're requesting b64_json as the response format rather than a URL. In the standalone image generation demo, URLs are fine because a browser can load them directly. But here, the generated image needs to be converted into a video frame, so we need the raw pixel data.

The processor decodes the base64 response, resizes it to match the source video dimensions, and converts it to an av.VideoFrame:

```python
img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
img = img.resize(
    (self._frame_width, self._frame_height), Image.Resampling.LANCZOS
)
video_frame = av.VideoFrame.from_ndarray(np.array(img), format="rgb24")
```

This frame is then published to a QueuedVideoTrack, which Stream's video infrastructure picks up and broadcasts to all participants in the call. That's how the anime-style interpretation appears as the Scene Narrator's video feed.

Step 3: Narration

The final step emits a SceneAnalyzedEvent containing the description text. The agent subscribes to this event and speaks the description aloud:

```python
@agent.events.subscribe
async def on_scene_analyzed(event: SceneAnalyzedEvent):
    if not scene_processor.narration_enabled:
        return
    if _narrating:
        logger.debug("Skipping narration — already speaking")
        return

    _narrating = True
    try:
        await agent.say(event.description)
    finally:
        _narrating = False
```

The agent.say() call sends the text through ElevenLabs' TTS (eleven_flash_v2_5) and plays the resulting audio in the video call. There's a second guard here (_narrating) that prevents overlapping speech, the same pattern as the frame processing guard. If the agent is still speaking when a new scene analysis arrives, it skips rather than queuing up a backlog of narrations.

Changing Styles with Voice Commands

The agent also listens for voice input through Deepgram's STT. When a user says something like "switch to watercolor style," the LLM (grok-3) routes that to a registered function:

```python
@llm.register_function(
    description="Change the art style used for image generation."
)
async def set_art_style(style: str) -> Dict[str, Any]:
    result = scene_processor.set_style(style)
    return {"result": result, "current_style": scene_processor.current_style}
```

This updates self._current_style on the processor, so the next frame processed will use the new style prefix in its image-generation prompt. The change takes effect in the next 5-second cycle; no restart is required.

Here's the result with the style set to "anime illustration":

On the left is the raw camera feed. On the right is Grok's interpretation: the vision model described the scene, Aurora reimagined it in anime style, and the result was published back as a video track, all within a few seconds.

So, here's the full data flow for a single frame:

  1. Camera frame captured by Vision Agents at 0.2 fps (one frame every 5 seconds)
  2. Base64 JPEG sent to grok-2-vision-1212 for scene understanding
  3. Text description prepended with art style, sent to grok-imagine-image for generation
  4. Generated image decoded, resized, and published as a video frame via Stream
  5. Same text description sent to ElevenLabs TTS, played as audio in the call

Three Grok models in sequence (grok-2-vision-1212, grok-imagine-image, and grok-3 for conversation routing), plus Deepgram for listening and ElevenLabs for speaking. The 5-second interval gives each cycle enough time to complete before the next frame arrives, and the processing guards ensure graceful degradation if any step runs long.

Prerequisites

To run the full project, you'll need API keys from four services:

  • xAI for Grok Vision, Image Generation, and the LLM (grok-3-latest). Get one at console.x.ai.
  • Stream for video calling infrastructure. Sign up at dashboard.getstream.io.
  • Deepgram for speech-to-text (so the Scene Narrator can hear you).
  • ElevenLabs for text-to-speech (so the Scene Narrator can speak).

If you only want to run the image understanding, generation, and video demos, you just need the xAI key. The other three are only required for the Scene Narrator pipeline.

Clone the repo and copy the environment file:

```shell
git clone https://github.com/argotdev/grok-vision-demo.git
cp .env.example .env
# Add your API keys to .env
```

For the demo app:

```shell
npm install
npm run dev
```

For the Scene Narrator, start the demo app first (it hosts the video call UI at the /narrator page), then in a separate terminal:

```shell
cd scene-narrator
uv run scene_narrator.py run --call-id <your-call-id>
```

The call ID is displayed in the browser after you click "Start Call" on the /narrator page.

Grok Deserves a Seat at the Vision AI Table

The narrative around Grok has been dominated by its relationship with X: the memes, the "spicy mode," the edgy persona. That's a shame, because underneath all of that is a genuinely capable vision stack with an interesting architectural story.

Aurora's autoregressive approach to image generation is a real differentiator, not just a technical curiosity. The fact that you can chain understanding, generation, and video through a single API provider, with OpenAI-compatible SDKs and competitive pricing, makes Grok a practical choice for vision pipelines, not just a novelty.

The Scene Narrator we built here with Vision Agents is one example, but the pattern is general. Any workflow that needs to see, interpret, and create can be built on the same three-model chain: feed an image to Grok Vision, pass the description to Aurora, and optionally animate the result with Grok Imagine Video. Swap in different prompts, styles, or downstream actions, and you have content moderation pipelines, automated product photography, real-time accessibility tools, or creative applications we haven't thought of yet.

Grok's problem was never capability. It was distribution and developer mindshare. Hopefully, this helps with the latter.
