Best Architecture for Real-Time Vision AI Systems

The honest answer is that almost every production system ends up hybrid. You run something small and fast at the edge, like a quantized YOLO on a Jetson or MediaPipe on an Android device, and escalate the ambiguous frames to a beefier model in the cloud. Pure edge wins when latency, privacy, bandwidth, or offline operation is the dominant constraint. Pure cloud only really wins when your model is too big to quantize down to edge hardware and you can afford to spend more than 300ms on the round trip.

What does a real-time vision AI pipeline look like end-to-end?

Most real-time vision pipelines have the same six stages: capture, decode and preprocess, inference, post-process, tracking and state, then action or render. Each one adds latency, and the surprising thing when you start instrumenting a pipeline is that most of the latency isn't actually in the model.
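One low-effort way to check this on your own pipeline is to wrap each stage in a timer and compare the shares. The sketch below is a minimal illustration, not a profiler: `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls are placeholders for real decode and inference work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals_ms = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised.
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0
            self.counts[name] += 1

    def mean_ms(self, name):
        return self.totals_ms[name] / self.counts[name]

    def share(self, name):
        """Fraction of all measured time spent in this stage."""
        total = sum(self.totals_ms.values())
        return self.totals_ms[name] / total if total else 0.0


timer = StageTimer()
for _ in range(3):  # three fake frames
    with timer.stage("decode"):
        time.sleep(0.004)  # placeholder for decode + preprocess
    with timer.stage("inference"):
        time.sleep(0.002)  # placeholder for the model call

for name in timer.totals_ms:
    print(f"{name}: {timer.mean_ms(name):.1f} ms mean, {timer.share(name):.0%} of measured time")
```

A timer like this only sees the software stages it wraps. Sensor exposure, network transit, and display latency happen outside the process, so they have to be measured separately.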

The metric that matters here is glass-to-glass latency: how long it takes from the moment photons hit the sensor to the moment a result reaches a display or actuator. Most of the delay comes from sensor exposure, encoder B-frame buffering, network jitter, and the decoded picture buffer waiting to reorder frames. Some typical stage latencies:

Stage	Typical contribution
Sensor exposure + readout	2-40 ms
H.264/H.265 encode (with B-frames)	30-100 ms
Network transport	50 ms to several seconds
Decode + DPB reorder	5-80 ms
Pixel format conversion (NV12 to RGB)	1-10 ms
Inference (YOLO11s INT8 on T4)	5-15 ms
Tracking + post-process	1-5 ms

Latency budgets vary a lot by use case. Robotics and teleoperation need sub-50ms for closed-loop control. AR overlays and in-call effects must be under 100ms to match perceived motion. Sports broadcast graphics tolerate up to 200ms before they drift out of sync with the feed. Live-stream moderation needs to be under 300ms for blocking actions, though 1-3 seconds is fine for a human review handoff.
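To see how the stage table interacts with these budgets, it helps to add up the ranges. The sketch below copies the table's numbers and makes two assumptions of its own: the upper bound for network transport is set to 3000 ms (the table only says "several seconds"), and the on-device path is assumed to skip encode, network, and decode entirely.

```python
# (low_ms, high_ms) per stage, copied from the stage table above.
# The 3000 ms network upper bound is an illustrative assumption.
STAGES = {
    "sensor": (2, 40),
    "encode": (30, 100),
    "network": (50, 3000),
    "decode": (5, 80),
    "convert": (1, 10),
    "inference": (5, 15),
    "tracking": (1, 5),
}

# Budgets from the paragraph above, in milliseconds.
BUDGETS_MS = {
    "robotics": 50,
    "ar_overlay": 100,
    "broadcast": 200,
    "moderation_block": 300,
}

# Assumption: an on-device pipeline reads raw frames and never encodes them.
ON_DEVICE = ["sensor", "convert", "inference", "tracking"]
STREAMED = list(STAGES)


def total_range(stage_names):
    """Best-case and worst-case sum over the given stages."""
    low = sum(STAGES[s][0] for s in stage_names)
    high = sum(STAGES[s][1] for s in stage_names)
    return low, high


def fits(stage_names, budget_ms):
    """'always' if the worst case fits, 'sometimes' if only the best case does."""
    low, high = total_range(stage_names)
    if high <= budget_ms:
        return "always"
    if low <= budget_ms:
        return "sometimes"
    return "never"


for use_case, budget in BUDGETS_MS.items():
    print(use_case, budget, fits(ON_DEVICE, budget), fits(STREAMED, budget))
```

With these numbers the on-device path sums to 9-70 ms, while the streamed path starts at 94 ms before any jitter. That is why a sub-50ms robotics loop can't be served over a network, even though inference itself only accounts for 5-15 ms.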

Most teams end up with roughly the same stack:

GStreamer or NVIDIA DeepStream for capture and decode
NVDEC zero-copy into TensorRT or Triton on the GPU
ByteTrack or BoT-SORT for tracking state
Kafka or MQTT for internal event aggregation
A vector database (Milvus, Qdrant, pgvector) for search and audit
Kubernetes plus KServe for the elastic GPU fleet

When events need to fan out to end users (followers of a creator, subscribers to a camera, viewers of a stream), a managed feed layer, such as Stream Activity Feeds, sits between the event bus and the apps.

Should I run inference at the edge or in the cloud?

Pure edge and pure cloud both work, but the situations where one wins outright are narrower than people tend to assume.

Edge first when any of these hold:

Hard real-time latency budget under 100ms
Raw video legally or contractually can't leave the premises (HIPAA, GDPR, biometrics)
Bandwidth is constrained or metered (cellular cameras, drones)
Operation has to continue offline

Cloud GPU first when:

The model is too large to quantize onto edge silicon (any VLM over 7B parameters, SAM 2 Large, RF-DETR Large)
You need uniformly high accuracy across a fleet
Centralized rollout, A/B testing, and versioning matter more than per-device latency
The video is already in the cloud (RTMP or WebRTC ingest from creators)
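The two checklists can be folded into a small decision helper. This is a rough sketch rather than a rule engine: the `Workload` fields and the `choose_placement` function are hypothetical names, only a subset of the criteria is encoded, and the fallback to "hybrid" when nothing is decisive is an assumption based on how most production systems end up.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a vision workload."""
    latency_budget_ms: float
    raw_video_must_stay_on_prem: bool = False
    bandwidth_constrained: bool = False
    must_work_offline: bool = False
    model_fits_on_edge: bool = True
    video_already_in_cloud: bool = False


def choose_placement(w):
    """Return 'edge', 'cloud', or 'hybrid' from the checklists above."""
    edge_first = (
        w.latency_budget_ms < 100
        or w.raw_video_must_stay_on_prem
        or w.bandwidth_constrained
        or w.must_work_offline
    )
    cloud_first = (not w.model_fits_on_edge) or w.video_already_in_cloud

    if edge_first and not cloud_first:
        return "edge"
    if cloud_first and not edge_first:
        return "cloud"
    # Conflicting pressures, or none at all: run a cascade across both.
    return "hybrid"


print(choose_placement(Workload(latency_budget_ms=40)))
print(choose_placement(Workload(latency_budget_ms=500, video_already_in_cloud=True)))
print(choose_placement(Workload(latency_budget_ms=80, model_fits_on_edge=False)))
```

The third case is the common one in practice: a tight latency budget and a model that won't fit on the device, which pushes you toward a small edge model with cloud escalation.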

Which models should I use for detection, tracking, and segmentation in 2026?

For most workloads, the right starting point is a YOLO or RT-DETR for detection, ByteTrack or BoT-SORT for tracking, and SAM 2 for segmentation. A vision-language model is worth reaching for only when you genuinely need open vocabulary, reasoning over context, or zero-shot generalization.


Detection

YOLOv8, YOLO11, and YOLO26 are the workhorses. YOLO26 unifies detection, segmentation, classification, pose, and oriented bounding boxes in a single architecture. RT-DETR is the NMS-free transformer alternative, and RT-DETR-L hits 53.0% AP at 114 FPS on a T4, beating equivalent YOLOs on speed and accuracy across most benchmarks.

YOLO has a broader community and an easier export path to TFLite, ONNX, Core ML, and TensorRT, making it the safer default if your deployment targets are diverse. RT-DETR is the better pick when you can stay on CUDA end-to-end and want to skip NMS post-processing entirely. Both run comfortably under 15ms per frame at 640x640 on a T4, L4, or Orin NX with INT8 quantization.
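For context on what an NMS-free detector lets you skip, here is a minimal sketch of greedy non-maximum suppression in plain Python. It is for illustration only: real deployments use batched GPU implementations, and the `(x1, y1, x2, y2)` box format and the 0.5 IoU threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop lower-scored boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```

The cost of this step depends on how many candidate boxes a frame produces, so it adds variable latency on crowded scenes. Removing it is part of the appeal of the NMS-free designs.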

Tracking

The current landscape, in rough order of where each one shines:

ByteTrack: fastest, no appearance model, best for static cameras. Default starting point.
OC-SORT: better than ByteTrack under camera motion.
BoT-SORT and BoT-SORT-ReID: top scores on MOT17 IDF1, MOTA, and HOTA. Adds Kalman state for width and height, camera motion compensation, and optional ReID embeddings. The right choice when ID switches are expensive.
StrongSORT: DeepSORT modernized, highest ID stability in crowded scenes, requires a GPU.
DeepSORT: largely superseded but still ships in DeepStream.

Segmentation

SAM 2 from Meta is the default for promptable or interactive segmentation. It uses a streaming memory transformer, runs 6x faster than SAM 1 on images, hits near real-time on all but the largest checkpoint on an A100, and needs 3x fewer interactions to segment a video. Mask R-CNN still has its place for closed-vocabulary, high-throughput batch jobs.

Vision-language models

These are the ones to consider for the verification stage of a cascade, captioning, open-vocabulary detection, or video Q&A:

Frontier closed VLMs (OpenAI, Anthropic, Google): highest accuracy, API only. Good for low-volume verification or human-in-the-loop dashboards.
Qwen2.5-VL (3B, 7B, 72B): the open-weights leader. The 7B fits in 16GB of VRAM at 4-bit, and supports native image resolution and video at fps=1.0.
PaliGemma 2 and Gemma 3 multimodal: Google open weights, strong fine-tuning ergonomics, native bounding-box output, edge-deployable variants.
Florence-2: purpose-built for detection, segmentation, OCR, and captioning via prompt-conditioned seq2seq. A practical edge VLM choice.
Moondream: a compact VLM optimized for edge devices and low-latency video processing. Useful for real-time captioning and scene analysis with minimal resource demands.
LLaVA-Guard, ShieldGemma 2: purpose-built multimodal safety models for moderation pipelines.
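When sizing one of these for a GPU, a useful first check is the memory taken by the weights alone: parameter count times bits per weight, divided by eight. The helper below is a back-of-the-envelope sketch; `weight_memory_gb` is a hypothetical name, and the result excludes activations, the KV cache, and runtime overhead, all of which grow with image resolution and the number of video frames.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for params in (3, 7, 72):
    for bits in (16, 8, 4):
        print(f"{params}B at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

A 7B model at 4-bit needs about 3.5 GB for weights, which leaves most of a 16GB card for everything else. A 72B model at 4-bit needs about 36 GB for weights alone, so it won't fit on a single 16GB or 24GB GPU.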

Task-specific models still beat VLMs by one or two orders of magnitude in cost, latency, and per-class accuracy. The places a VLM earns its keep are open-vocabulary or zero-shot tasks, reasoning over context (for example, "is this person holding a weapon while in a school"), the second stage of a cascade, or labeling and bootstrapping a dataset.
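The cascade pattern comes down to routing by confidence: accept what the fast model is sure about, drop what is clearly noise, and send only the middle band to the expensive model. The sketch below shows that routing step in isolation. The `Detection` type, the `route` function, and both thresholds are illustrative assumptions that you would tune against your own data.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical output of the fast first-stage model."""
    label: str
    confidence: float


def route(detections, accept_at=0.80, discard_below=0.30):
    """Split first-stage detections into accepted and escalated lists.

    Anything under discard_below is dropped; anything between the two
    thresholds goes to the second stage (for example, a VLM) for verification.
    """
    accepted, escalated = [], []
    for d in detections:
        if d.confidence >= accept_at:
            accepted.append(d)
        elif d.confidence >= discard_below:
            escalated.append(d)
    return accepted, escalated


frame = [
    Detection("person", 0.95),
    Detection("weapon", 0.55),
    Detection("weapon", 0.10),
]
accepted, escalated = route(frame)
print([d.label for d in accepted], [d.label for d in escalated])
```

The width of the escalation band sets your cloud bill. A narrower band means fewer second-stage calls, but more mistakes from the small model go through unchecked.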

Wiring these models together for a real-time call usually means pairing a fast specialized model (YOLO, Roboflow, Moondream) with a real-time VLM (Gemini Live, OpenAI Realtime, Qwen3-VL) over a WebRTC track. The Vision Agents framework is built around exactly this composition. A live coaching app with pose detection and a real-time VLM ends up looking something like this:

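# Import paths assume the Vision Agents package layout (vision_agents.core
# plus per-provider plugins); check them against the version you install.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, ultralytics

# A coaching agent: YOLO pose runs on incoming frames, Gemini Realtime responds.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Coach"),
    instructions="Watch the swing and give feedback.",
    llm=gemini.Realtime(),
    processors=[ultralytics.YOLOPose(model="yolo11s-pose")],
)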

The framework handles WebRTC ingestion, routing frames into the processor, the VLM call, and returning the response over the same track. Stream's edge network targets sub-500ms call join time and under 30ms in-call audio/video latency, leaving your inference budget mostly intact for the model serving stage.

What are the most common architectural mistakes teams make?

When something goes wrong in production, it's usually one of these.

Over-engineering with cloud when edge would work. If your latency budget is under 300ms and the model has under 50M parameters, the network round-trip time alone usually justifies an edge deployment.
Under-engineering with edge when accuracy requires cloud processing. A YOLOv8n on a Coral TPU is not a substitute for an RF-DETR Large or Florence-2 when the cost of a false positive is high. Use a cascade so you get edge speed and cloud accuracy on the frames that need it.
Treating video codec and pixel format conversion as free. Every NV12-to-RGB conversion on the host costs 5-10ms at 1080p. Multiply that by 30 cameras, and the CPU is the bottleneck, not the GPU.
Not using NVDEC and NVENC. CPU H.265 transcoding is roughly 10-50x slower than GPU at the same quality. Snap reduced HEVC transcoding cost by 80% just by moving to a GPU.
Static batching in interactive workloads. Dynamic batching with a 5-20ms max-delay is almost always the right default. Static batching guarantees bad tail latency.
No graceful degradation under network jitter. Pipelines have to drop stale frames, reduce target FPS, and fall back to local-only inference when packets get lost.
Privacy and retention oversights. Embeddings are safer to store than frames, and they should be encrypted at rest. Unless you have an explicit consent flow, default to face detection without recognition and document the policy. GDPR, BIPA, and the EU AI Act all materially constrain biometric tracking, and the legal landscape continues to shift.
Cold-start surprises. CUDA graph compilation, TensorRT engine builds, and model loads can add 30 to 300 seconds on a cold start. The fix is to pre-warm with a synthetic inference at startup and pin model files to NVMe.
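The dynamic batching item is easier to reason about with a small simulation. The sketch below is not a serving implementation; it only models the dispatch rule, where a batch goes out when it is full or when the max delay has elapsed since its first request, whichever comes first. The function name, the arrival times, and the default limits are all illustrative assumptions.

```python
def dynamic_batches(arrivals_ms, max_batch=8, max_delay_ms=10.0):
    """Group sorted request arrival times into batches.

    Returns a list of (dispatch_time_ms, [arrival times in the batch]).
    A batch is dispatched when it holds max_batch requests, or max_delay_ms
    after its first request arrived, whichever happens first.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        if current and t - current[0] >= max_delay_ms:
            # The deadline expired before this request showed up.
            batches.append((current[0] + max_delay_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch:
            # Batch is full: dispatch immediately.
            batches.append((t, current))
            current = []
    if current:
        # Trailing partial batch goes out at its deadline.
        batches.append((current[0] + max_delay_ms, current))
    return batches


arrivals = [0, 2, 4, 30, 31]
for dispatch_at, batch in dynamic_batches(arrivals, max_batch=3, max_delay_ms=10.0):
    waits = [dispatch_at - t for t in batch]
    print(f"dispatch at {dispatch_at} ms, batch={batch}, max wait={max(waits)} ms")
```

Under this rule, no request waits longer than the max delay before dispatch, which is what bounds the queueing part of tail latency. Serving layers typically expose the same idea as configuration; Triton's `dynamic_batching` block, for example, has a `max_queue_delay_microseconds` setting.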

One subtler issue worth flagging: VLMs hallucinate small text. Scores, jersey numbers, signs, license plates, and timestamps are all common failure modes. If your application depends on reading text from video, pair the VLM with a dedicated OCR model, such as PaddleOCR or Florence-2's OCR head, rather than relying on the VLM alone.
