
Chat Application Architecture, Explained

23 min read
Raymond F
Published May 12, 2026

TL;DR

  • Wide-column stores like Cassandra handle messages while Redis holds read state, because each subsystem's access patterns differ significantly.
  • Presence alone generates a write on every connect, disconnect, and heartbeat, making it orders of magnitude more write-heavy than messages.
  • End-to-end encryption prevents the server from searching, moderating, or generating push notification previews; transport-plus-at-rest encryption keeps those capabilities intact.
  • The core message flow takes weeks to build, but mobile reconnection, group fan-out, and multi-device sync realistically take years to production-harden.

Chat is deceptively simple (what’s difficult about sending text messages back and forth?), but straightforwardly hard.

It’s not just text between two people. It might be text between hundreds. It might not be text at all, but a 200MB video. It might be that the video has to travel over a cellular connection that keeps dropping. It might be that the sender is in Sydney, and everyone else is in São Paulo. It might be that you want to know who really liked the video.

Each of these features pulls the chat application architecture in a different direction, and a real chat system has to handle them all at once.

What a Chat Application Has To Do

Before we draw any boxes, it helps to be specific about what the system is on the hook for. A lot of the architecture decisions later in this article only make sense once you take the full requirement list seriously.

A real chat application has to handle all of the following, and handle them simultaneously:

  • Deliver messages in order, without losing them. Within a single conversation, messages must appear in the order they were sent. Duplicates are fine if the client can dedupe them, but lost messages are never OK.
  • Keep every device in sync. A user signs in on their phone, their laptop, and a tablet they forgot about on the couch. All three show the same messages, the same read state, and the same unread counts. When they mark something read on the phone, the laptop's notification badge clears a second later.
  • Handle the recipient being offline. Their phone is in airplane mode; they're on a flight or underground. The message has to persist somewhere durable and replay cleanly when they come back, with no gaps and no duplicates.
  • Support groups ranging from 3 to 50,000 people. A family group chat and a large community channel are both "group chats," but they're very different architectural problems, and a real system handles both.
  • Deliver receipts and typing indicators. Sent, delivered, read, "someone is typing." These aren’t trivial. In a large group, a single message can trigger thousands of receipts, and the system must handle that traffic without choking.
  • Track presence. Users want to know who's online, who's offline, and sometimes when someone was last seen. It sounds like a small feature, but it's one of the higher-volume subsystems in the whole application.
  • Handle media. Images, video, audio, documents. A text message is a few hundred bytes. A video can be a few hundred megabytes. They can't go through the same pipe.
  • Let users scroll back through history and search it. Someone wants to find the restaurant their friend recommended eight months ago. Both the scroll-back and the search are load-bearing on the storage model.

All this, and the system has to be low-latency, consistent within a conversation, highly available, and horizontally scalable. The interesting part is what happens when those goals conflict.

The Architecture at a Glance

A production chat system is a handful of specialized subsystems, each solving a different, hard problem, stitched together by a transport layer and a few shared data stores. Before we go deep on any one of them, here's the high-level map.

The system breaks down into six zones, plus a handful of supporting infrastructure:

  • The edge and transport layer is how clients connect and stay connected. An edge load balancer routes traffic, a service discovery layer picks the right chat server for a given user, and a pool of WebSocket servers holds the long-lived connections to every online client. This is the stateful tier, and most of the interesting operational work happens here.
  • The message path is what happens to a message from send to delivery. A message service handles ingestion, assigns IDs, and writes to durable storage. A message queue sits behind it to decouple delivery from ingestion, so a burst of traffic in one channel doesn't block the rest of the system.
  • The group and fan-out layer is how a single send becomes many deliveries. A group service owns membership and metadata. A group message handler determines who should receive a given message at this time and routes it to the appropriate WebSocket servers. For large channels, this is where the hardest scaling work lives.
  • The ephemeral state layer holds presence, typing indicators, and read cursors. It's separate from message delivery because the access patterns are different: very high write frequency, low durability requirements, and mostly served from memory.
  • The media path handles everything that isn't a text message. An asset service takes uploads, writes to blob storage, and serves downloads through a CDN. It runs parallel to the message path because media is heavy, and the WebSocket servers should stay lightweight.
  • The user and identity layer is auth, profiles, contacts, and channel memberships. It's the most conventional part of the whole system: usually a relational store with a cache in front of it.

Around all of this sits supporting infrastructure: push notifications for offline delivery, monitoring, and moderation hooks. Important in production, mostly out of scope for the architecture walkthrough.

A few other things are deliberately out of scope: client-side architecture (offline queues, local storage, and device-level conflict resolution), bot and integration platforms, and the full design of end-to-end encryption protocols. Each deserves its own article.

With the map in place, let's go through each zone.

The Edge and Transport Layer

HTTP doesn't work for chat on its own. It's client-initiated, so the server can't push a new message to a client unless the client requests it. You can fake it with polling or long polling, but both have obvious problems: polling generates huge request volume for mostly empty responses, and long polling holds connections open anyway while being harder to reason about than a proper bidirectional protocol.

WebSocket is the standard answer. After an HTTP handshake, the connection stays open, and both sides can write to it whenever they want. That's what real-time messaging needs.

What the transport layer has to handle

  • The WebSocket handshake and connection lifecycle
  • Heartbeats and dead-connection detection
  • Reconnection with backoff when connections drop
  • Routing the user to a chat server with capacity, ideally in their region
  • Falling back to HTTP for non-realtime operations like login, profile updates, and history fetches

A common mistake in naive designs is pushing everything through WebSocket. Most chat applications use WebSocket for the live message stream and standard HTTP for everything else. Login, profile, contact list, media upload, and history pagination are all request-response operations that don't benefit from a persistent connection.

Connection lifecycle on mobile

Mobile is where the transport layer earns its complexity. A user walks from WiFi to cellular, and the connection drops. They background the app, and the OS kills the socket after 30 seconds. They get a phone call. They go underground.

A robust chat client spends a surprising amount of code on reconnect logic:

// Simplified reconnection with exponential backoff
class ChatConnection {
  constructor(url) {
    this.url = url;
    this.backoff = 1000; // start at 1s; doubles on each failed attempt
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onclose = () => this.handleDisconnect();
    this.ws.onmessage = (e) => this.handleMessage(e);
    this.ws.onopen = () => {
      this.backoff = 1000;
      this.syncMissedMessages();
    };
  }

  handleDisconnect() {
    const delay = Math.min(this.backoff, 30000);
    setTimeout(() => this.connect(), delay);
    this.backoff *= 2;
  }

  syncMissedMessages() {
    // Ask the server for everything since our last known message
    this.send({ type: 'sync', cursor: this.lastMessageId });
  }
}

The syncMissedMessages call is the hook into the sync subsystem we'll cover later. Every reconnect is also a sync, because there's almost always a gap.

Alternatives worth knowing about:

  • Server-Sent Events are one-way and simpler, useful for activity feeds and notifications, but not for interactive chat
  • MQTT is common in mobile chat stacks because it has smaller headers and better battery characteristics than WebSocket
  • Raw TCP shows up in legacy systems and in a few very-high-scale deployments, but WebSocket is the default for new work

Managed chat infrastructure handles the WebSocket lifecycle, reconnection with backoff, and regional routing out of the box, which is most of what the transport layer needs to do correctly.

Stateless and Stateful Services

Most web applications are stateless all the way through. A request comes in; any server can handle it; the server writes to a database; the response goes out; and nobody cares who did the work. Chat is different. The chat servers that hold WebSocket connections are stateful, and that fact has ripple effects throughout the whole architecture.

A user's connection is pinned to a specific chat server for the length of their session. Which means:

  • The load balancer in front of the chat tier can't round-robin requests. It needs sticky routing, usually by a user ID hash or a session token.
  • Deploying the chat tier means migrating connections. Clients experience this as a brief reconnect, which is fine if the sync layer handles it correctly.
  • Capacity planning cares about concurrent connections, not requests per second. A single well-tuned server running Go or Elixir can hold a million concurrent WebSocket connections, but the memory footprint per connection matters enormously when you're trying to pack that many into one box.
  • Losing a chat server drops every connection on it. Clients reconnect, service discovery places them somewhere new, and they sync. It's expected and routine, not an incident.

Everything else in the system (auth, profiles, contact lists, history fetches, media uploads) is stateless as usual. Regular load balancers, regular scaling, and regular boring operations work. The interesting architecture is all in the stateful tier.

Service discovery

When a client wants to connect, service discovery answers a single question: which chat server should I use? The answer depends on which region the client is in, which chat servers have capacity, and whether the user already has an active session that should resume rather than start fresh.

Apache Zookeeper is the textbook answer and still shows up in plenty of production stacks. Modern alternatives include DNS-based discovery, Consul, etcd, and the service mesh primitives built into Kubernetes. The specific choice matters less than making sure the discovery tier itself is highly available, because every new session depends on it.
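
To make that concrete, here is a minimal client-side sketch of a discovery lookup before opening the WebSocket. The endpoint path, response fields, and resume-token behavior are illustrative assumptions, not any particular product's API.

// Sketch: resolve a chat server via a hypothetical discovery endpoint,
// then connect. The /v1/chat-server path and response shape are assumptions.
async function connectViaDiscovery(userId, authToken) {
  const res = await fetch(`https://discovery.example.com/v1/chat-server?user=${userId}`, {
    headers: { Authorization: `Bearer ${authToken}` },
  });
  if (!res.ok) throw new Error(`discovery failed: ${res.status}`);

  // Assumed response: { host, resumeToken }. host is a server with capacity
  // in the caller's region; resumeToken is present if an existing session
  // should resume rather than start fresh.
  const { host, resumeToken } = await res.json();

  const url = resumeToken
    ? `wss://${host}/ws?resume=${encodeURIComponent(resumeToken)}`
    : `wss://${host}/ws`;
  return new WebSocket(url);
}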

Data Modeling for Messages

The "SQL vs NoSQL" framing most articles use for chat is too simplistic. Real systems use multiple stores because different subsystems have fundamentally different access patterns.

What you're actually storing:

Data | Access pattern | Typical store
Messages | Append, read recent, paginate backward | Wide-column (Cassandra, ScyllaDB) or sharded relational
Channels and memberships | Read-heavy, occasional writes | Relational, with a Redis cache in front
Users and profiles | Standard CRUD | Relational
Read state and cursors | Very high write frequency, per user | Redis or similar, periodic persistence
Presence | Ephemeral, extremely high churn | In-memory only, usually not persisted
Media metadata | Low volume, pointer to blob storage | Relational
Message search index | Full-text, eventually consistent | Elasticsearch, OpenSearch, or Meilisearch

Messages are the interesting one. The access pattern is append-heavy with a strong recency bias: most reads are for the last few hours of a channel, occasional reads go further back. Wide-column stores are a natural fit because they handle this pattern well and scale horizontally by partitioning on channel ID.

Message ordering depends on the ID scheme, and there are three real options.

  1. Timestamps don't work as ordering keys. Server clocks drift. Two messages sent in the same millisecond on different servers can't be ordered by timestamp alone, and even NTP-synced clocks can disagree by tens of milliseconds under load.
  2. Global sequence generators work, but add a dependency. Snowflake-style IDs give you 64-bit values that are sortable by time and unique across the fleet. They require a generator service or a library to encode machine IDs, and they become a point of contention in very high-throughput systems.
  3. Per-channel sequence numbers are usually enough. You don't need global ordering across the whole system. You need ordering within a conversation. A simple per-channel counter, managed by the chat server or database that owns writes for that channel, gives you that (see the sketch below).
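
As an illustration of option 3, here is a minimal sketch that uses an atomic Redis counter (via ioredis) as the per-channel sequence. The key naming, and the assumption that all writes for a channel flow through this path, are illustrative.

// Sketch: per-channel message IDs from an atomic Redis counter.
// Key names are illustrative; assumes one write path per channel.
const Redis = require('ioredis');
const redis = new Redis();

async function nextMessageId(channelId) {
  // INCR is atomic, so concurrent senders in the same channel still get
  // distinct, monotonically increasing IDs.
  return redis.incr(`channel:${channelId}:seq`);
}

async function persistMessage(channelId, senderId, content) {
  const messageId = await nextMessageId(channelId);
  // The write to the message store (Cassandra or similar) is elided here.
  return { channelId, messageId, senderId, content, createdAt: Date.now() };
}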

The message table for a group channel ends up looking something like this:

CREATE TABLE messages (
  channel_id    BIGINT,
  message_id    BIGINT,   -- per-channel sequence or Snowflake
  sender_id     BIGINT,
  content       TEXT,
  content_type  TEXT,
  created_at    TIMESTAMP,
  PRIMARY KEY ((channel_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

channel_id is the partition key because nearly every read is scoped to a single channel. message_id is the clustering key in descending order because the most common read is “give me the last N messages in this channel.”

One-On-One Message Flow

The actual message flow is straightforward in the happy path and interesting in the edge cases.

The happy path is when both users are online, connected to chat servers in the same region:

  1. Alice's client sends the message over its WebSocket to chat server A.
  2. Server A persists the message via the message service, which assigns an ID and writes to the message store.
  3. Server A acks the send back to Alice's client, which transitions the message from “sending” to “sent.”
  4. Server A looks up where Bob is connected. This is a call to a routing service or a pub/sub lookup backed by Redis.
  5. Server A forwards the message to Bob's chat server B.
  6. Server B delivers the message over Bob's WebSocket.
  7. Bob's client acks delivery, which propagates back to Alice as a “delivered” receipt.
  8. When Bob's client marks the message read, that propagates back as a “read” receipt.

The fun thing about chat is how many of the edge cases actually happen:

  • Bob is offline. The message persists, a push notification fires via APNs or FCM, and the message replays on reconnect via the sync subsystem. Delivery receipt waits until Bob's client actually comes back.
  • Bob has three devices online. Fan out to all three chat servers holding his connections, track per-device delivery, and let the client-side sync handle cross-device read state.
  • Bob's device is connected but backgrounded. Deliver via WebSocket and send a push notification. The client deduplicates by message ID when it comes back to the foreground.
  • Network partition between server A and server B. This is where CAP shows up concretely. Most systems queue the message at server A, retry, and rely on the durable message store as the source of truth. Alice's client shows “sending” longer than usual, but the message doesn't get lost or reordered.

The important architectural point is that every message has a durable home in the message store before the system tries to deliver it. Delivery is best-effort and retryable; persistence is the guarantee.
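
A sketch of what that ordering looks like in a send handler: persist first, then ack, then best-effort delivery. The injected services (messageStore, routingTable, sockets, pushService) are hypothetical stand-ins, not a specific library's API.

// Sketch of the persist-first send path. All injected services are
// hypothetical stand-ins for real infrastructure.
function createSendHandler({ messageStore, routingTable, sockets, pushService }) {
  return async function handleSend({ senderId, recipientId, channelId, content, clientTempId }) {
    // 1. Durable write before any delivery attempt.
    const message = await messageStore.append(channelId, senderId, content);

    // 2. Ack the sender: the message is "sent" even if delivery lags.
    sockets.sendTo(senderId, { type: 'ack', clientTempId, messageId: message.id });

    // 3. Best-effort, retryable delivery to wherever the recipient is connected.
    const connections = await routingTable.lookup(recipientId);
    if (connections.length === 0) {
      await pushService.notify(recipientId, message); // offline: push now, replay on reconnect
      return;
    }
    for (const conn of connections) {
      sockets.forward(conn.serverId, conn.deviceId, { type: 'message', message });
    }
  };
}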

Group Chat and Fan-Out

A 1:1 message is one write and at most one delivery. A message to a 1,000-person group is one write and up to 1,000 deliveries. A message to a 100,000-member channel in a large community can be a single write and 100,000 deliveries, most of them to users who aren't actively reading and never will be.

Handling this well is the single biggest architectural choice in a chat system.

Write fan-out (push model)

When a message is sent, write a copy to each recipient's inbox, usually represented as a per-user message queue or an indexed slice of the message table. The strength of write fan-out is that, when a user opens a channel, their inbox is already populated. Unread counts and read state are easy because they're per-user by construction.

The weaknesses:

  • Writes scale linearly with group size. A 10,000-person channel does 10,000 writes per message.
  • Storage amplifies with group size. The same message lives in many places.
  • Deleting or editing a message means touching all the copies.

The write fan-out model is the right approach for direct messages and small-to-medium groups, up to a few hundred members.

Read fan-out (pull model)

Store the message once, against the channel. When a user opens the channel, they read from it directly. The strengths of read fan-out are:

  • One write per message regardless of group size.
  • No storage amplification.
  • Edits and deletes are trivial because there's one copy.

But reads are more expensive because the system has to assemble each user's view at read time. Real-time delivery to connected users still requires a fan-out step, even if storage doesn't, and unread counts become harder because they can't just be "count the things in your inbox."

Still, read fan-out is the right model for large channels where most members are passive readers.

 | Write fan-out (push) | Read fan-out (pull)
How it works | When a message is sent, copy it into each recipient's inbox. | Store the message once against the channel. Recipients read from the channel when they open it.
Writes per message (1,000-member channel) | 1,000 | 1
Storage cost | Amplifies with group size. Same message stored 1,000 times. | Stored once regardless of group size.
Read cost | Cheap. The user's inbox is already assembled. | More expensive. System assembles the user's view at read time.
Unread counts | Easy. Count unread items in the inbox. | Harder. Requires comparing the user's read cursor to the channel's messages.
Edits and deletes | Expensive. Have to update every copy. | Trivial. One copy to update.
Real-time delivery | Same fan-out work to connected WebSocket servers either way. | Same fan-out work to connected WebSocket servers either way.
Best for | Direct messages and small-to-medium groups (up to a few hundred members). | Large channels where most members are passive readers.

Production chat platforms typically combine the two. Direct messages and small groups use write fan-out for the fast reads and simple unread tracking. Large channels use read fan-out to keep writes and storage under control. The threshold between them is usually configurable per channel type, and "channel type" itself becomes a first-class concept in the data model.
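
A sketch of what the hybrid looks like at the routing decision, assuming a configurable member-count threshold. The threshold value and the service names (inboxes, channelLog, deliveryQueue) are illustrative.

// Sketch: pick a fan-out strategy per message based on channel type and size.
// Threshold and service names are assumptions, not a fixed recipe.
const WRITE_FANOUT_MAX_MEMBERS = 500;

async function routeMessage(channel, message, { inboxes, channelLog, deliveryQueue }) {
  if (channel.type === 'dm' || channel.memberCount <= WRITE_FANOUT_MAX_MEMBERS) {
    // Write fan-out: copy into each member's inbox for cheap reads and easy unread counts.
    await Promise.all(
      channel.memberIds.map((userId) => inboxes.append(userId, message))
    );
  } else {
    // Read fan-out: store once against the channel; readers pull when they open it.
    await channelLog.append(channel.id, message);
  }
  // Real-time delivery to currently connected members happens either way.
  await deliveryQueue.publish({ channelId: channel.id, messageId: message.id });
}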

The role of the message queue

A durable queue (Kafka, Pulsar, NATS, or similar) sits between message ingestion and delivery. This matters for several reasons:

  • It absorbs bursts. A viral post in a large channel can spike fan-out work by orders of magnitude. The queue lets ingestion stay fast while delivery catches up.
  • It allows retry. Failed deliveries (a chat server that went down, a push notification service that's slow) can be retried without blocking the send path.
  • It enables decoupling. The message service doesn't need to know where every recipient is connected; it publishes to the queue, and consumers (group message handlers) pick up the work, as in the sketch below.
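
In code, the decoupling can look something like the sketch below, where queue is a generic stand-in for a Kafka, Pulsar, or NATS client and the topic name is illustrative.

// Sketch: ingestion publishes fan-out work; a separate consumer delivers.
// queue, routingTable, sockets, and messageStore are hypothetical stand-ins.
async function ingestMessage(messageStore, queue, message) {
  await messageStore.append(message.channelId, message); // durable first
  await queue.publish('message-fanout', {                // cheap, fast publish
    channelId: message.channelId,
    messageId: message.id,
  });
}

function startFanoutConsumer(queue, routingTable, sockets, messageStore) {
  queue.subscribe('message-fanout', async ({ channelId, messageId }) => {
    const message = await messageStore.get(channelId, messageId);
    const recipients = await routingTable.connectedMembers(channelId);
    for (const conn of recipients) {
      sockets.forward(conn.serverId, conn.deviceId, { type: 'message', message });
    }
    // A thrown error leaves the job on the queue for retry, so a slow or failed
    // delivery never blocks the send path.
  });
}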

Building this correctly at scale is nontrivial, which is one reason teams reach for managed chat infrastructure. Stream's chat product, for instance, handles write/read fan-out, batched receipts, and the queue plumbing behind the SDKs.

Presence, Typing, and Receipts

Presence, typing, and read receipts live in their own services, separate from the message path. The workloads are different enough that mixing them would hurt both. Presence alone generates a write every time someone connects, disconnects, or sends a heartbeat, which is orders of magnitude more volume than actual messages.

At the same time, none of it needs to be durable. If the presence service restarts, clients reconnect, and the state rebuilds itself in seconds. Messages get the opposite treatment: lower write volume, but every single one has to survive forever. Keeping the two subsystems separate lets each be optimized for its own workload, usually with Redis or in-process memory for the ephemeral side and a durable store for messages.

Presence

Who's online right now, and who needs to know. Presence has two hard parts:

  1. Writes are constant. Every connect, disconnect, and heartbeat is a presence event. A user with a flaky connection generates presence churn every few minutes.
  2. Fan-out is wide. If Alice has 500 contacts who subscribe to her presence, every status change fans out to 500 subscribers. Multiply by the user base and presence traffic can easily exceed message traffic.

The usual design, sketched in code after the list:

  • Store presence in Redis or in-process state on the chat servers, keyed by user ID
  • Use a pub/sub layer (Redis pub/sub, NATS, or similar) to broadcast changes to interested subscribers
  • Accept that presence is eventually consistent, with a few seconds of lag being normal
  • For very large audiences, use coarse-grained presence ("online" vs "offline") rather than fine-grained ("last seen 2 minutes ago"). Coarse-grained presence can be computed lazily at read time from the connection state.
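
A minimal sketch of the Redis variant using ioredis: a short-TTL key refreshed on every heartbeat, with a pub/sub broadcast on transitions. Key and channel names are illustrative.

// Sketch: presence in Redis with TTL-based expiry and pub/sub broadcast.
// Key and channel names are illustrative.
const Redis = require('ioredis');
const redis = new Redis();

const PRESENCE_TTL_SECONDS = 60; // a couple of missed heartbeats marks the user offline

async function heartbeat(userId) {
  // Refresh on every connect and heartbeat; expiry handles crashed servers for free.
  const wasOnline = await redis.exists(`presence:${userId}`);
  await redis.set(`presence:${userId}`, 'online', 'EX', PRESENCE_TTL_SECONDS);
  if (!wasOnline) {
    await redis.publish(`presence-updates:${userId}`, JSON.stringify({ userId, status: 'online' }));
  }
}

async function isOnline(userId) {
  // Coarse-grained presence, computed lazily from connection state.
  return (await redis.exists(`presence:${userId}`)) === 1;
}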

Typing indicators

Even more ephemeral than presence. Almost always implemented as a fire-and-forget pub/sub event with a short TTL. The only tricky part is the client-side debouncing: don't send a typing event on every keystroke, send one when typing starts and another when it stops or times out.

A minimal client-side implementation:

// Client-side typing indicator with debouncing
class TypingIndicator {
  constructor(channelId, send) {
    this.channelId = channelId;
    this.send = send;
    this.isTyping = false;
    this.stopTimer = null;
  }

  onKeystroke() {
    if (!this.isTyping) {
      this.send({ type: 'typing_start', channel: this.channelId });
      this.isTyping = true;
    }
    // Reset the stop timer on every keystroke.
    // If the user pauses for 3 seconds, we send typing_stop.
    clearTimeout(this.stopTimer);
    this.stopTimer = setTimeout(() => this.stop(), 3000);
  }

  stop() {
    if (!this.isTyping) return;
    this.send({ type: 'typing_stop', channel: this.channelId });
    this.isTyping = false;
    clearTimeout(this.stopTimer);
  }
}

Read receipts

Per-user, per-channel cursor showing the last message ID the user has seen. Updates every time they read, which in a high-traffic channel can be many updates per minute per user.

A possible design, sketched after the list:

  • Store the cursor in a fast KV store (Redis is typical)
  • Persist to durable storage periodically or on significant events
  • Propagate changes to the user's other devices via pub/sub on their personal channel
  • Derive "read by" views for group messages by comparing per-user cursors to message IDs
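
A minimal sketch of that design with ioredis: the cursor lives in a per-user hash, and every update publishes to the user's personal channel so other devices converge. Key names are illustrative.

// Sketch: per-user, per-channel read cursors in Redis, fanned out to the user's
// other devices over their personal pub/sub channel. Key names are illustrative.
const Redis = require('ioredis');
const redis = new Redis();

async function markRead(userId, channelId, messageId) {
  // Fast KV write; periodic persistence to durable storage happens elsewhere.
  await redis.hset(`read-cursors:${userId}`, channelId, messageId);

  // Tell the user's other devices so badges and unread counts converge.
  await redis.publish(
    `user:${userId}`,
    JSON.stringify({ type: 'read_cursor', channelId, messageId })
  );
}

async function getReadCursor(userId, channelId) {
  const value = await redis.hget(`read-cursors:${userId}`, channelId);
  return value === null ? null : Number(value);
}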

Read receipts also present an issue at scale. In a 1:1 chat, one delivery receipt and one read receipt per message. Manageable. In a 1,000-person group, you could get 1,000 delivered receipts and 1,000 read receipts for a single message. Do that for every message, and the receipt traffic dwarfs the message traffic.

Developers have a few possible mitigations:

  • Batching. Aggregate receipts on the server and deliver them in batches rather than individually.
  • Sampling or aggregation. Show "read by 847 people" instead of a list of 847 names, and compute the number from a counter rather than reading every receipt.
  • Threshold suppression. Above a group-size threshold, stop tracking individual read receipts and track only the channel-level last-read-message-id per user.

Cutting corners on these subsystems is tempting because nothing is durable and nothing is the source of truth. Users notice anyway. Laggy presence and missing typing indicators make a chat app feel broken even when messaging works perfectly.

Media and Attachments

Media runs parallel to the message path because the traffic profile is completely different. A text message is a few hundred bytes. An image is a few hundred kilobytes. A video can be hundreds of megabytes. Forcing all of that through the WebSocket servers would wreck their connection capacity.

The flow:

  1. The client compresses and optionally encrypts the file locally
  2. The client uploads over HTTP to an asset service, which writes to blob storage (S3, GCS, or Azure Blob) and returns a media ID
  3. The client sends a regular text message that references the media ID
  4. The recipient's client receives the message, sees the media reference, and fetches the media from a CDN using the ID
  5. The CDN caches the file near the recipient for future requests

Like most components in chat, this only tells part of the story. Here, we’re missing:

  • Thumbnails and transcodes. Images need multiple resolutions. Videos need multiple bitrates. These are generated asynchronously by a media pipeline and referenced by the same media ID.
  • Signed URLs. For private conversations, you don't want media fetchable by anyone with the URL. Signed URLs with short TTLs tie access to the recipient.
  • Upload resumability. Large files on mobile connections fail partway through. Chunked upload protocols let the client resume instead of starting over.
  • Deduplication. Hashing the file content and storing it once avoids keeping 100 copies of the same meme that got forwarded through 100 channels.
  • Lifecycle. Many systems delete media after a retention window (30, 60, 90 days) to keep storage costs in check. Policy depends on the product.
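
A client-side sketch of the upload-then-reference flow described above. The asset-service endpoint, response shape, and sendMessage callback are hypothetical.

// Sketch: upload media over HTTP, then send a normal message that only
// references the media ID. The /v1/uploads endpoint and response fields are assumptions.
async function sendImageMessage(channelId, file, { authToken, sendMessage }) {
  // 1. The heavy bytes go to the asset service over HTTP, not the WebSocket.
  const res = await fetch('https://assets.example.com/v1/uploads', {
    method: 'POST',
    headers: { Authorization: `Bearer ${authToken}`, 'Content-Type': file.type },
    body: file,
  });
  if (!res.ok) throw new Error(`upload failed: ${res.status}`);
  const { mediaId } = await res.json();

  // 2. The message itself stays tiny: just a pointer to the media.
  await sendMessage({ channelId, contentType: 'image', content: { mediaId } });

  // 3. Recipients resolve mediaId to a (signed) CDN URL when they render it.
}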

Moderation

Modern chat systems treat moderation as part of the message path rather than a back-office workflow. Communities at any scale generate spam, harassment, scams, and worse, and the regulatory environment in most major markets now requires platforms to address them. The architectural consequence is that there's a moderation subsystem between the message service and the recipient, and it must be fast enough not to add latency to legitimate sends.

Where moderation sits in the flow

Most production systems run moderation in three places, each catching different problems on different timescales.

The synchronous lane runs in the send path itself. Fast classifiers check text, attachments, and links against rule sets and small models, with a tight latency budget to avoid slowing legitimate sends. Anything they catch with high confidence gets blocked before the recipient sees it.

The asynchronous lane branches off the message store. Once a message is durable and delivered, heavier classifiers operate through a queue, examining the message in context and across the user's recent history. Anything they flag gets retroactive action: deletion, warning, suspension, or escalation to a human reviewer.

The continuous lane doesn't look at messages at all. It watches user behavior signals (send rate, prior violations, report counts) and surfaces patterns that don't appear in any single message, such as spam rings and coordinated harassment.

Two tiers of classifier

The interesting architectural detail is what runs in the synchronous lane versus the asynchronous one. They're meaningfully different kinds of model.

  • The synchronous lane uses small, fast NLP classifiers trained for specific harm types (profanity, hate speech, spam, platform circumvention), hash matching against known abuse content, and keyword rules for the obvious cases. These run in milliseconds and have to be conservative, because false positives in the send path are user-visible and frustrating.
  • The asynchronous lane is where LLM-based review lives, and it's the part that's genuinely changed in the last few years. An LLM can evaluate context that a fast classifier can't: whether a conversation is escalating toward harassment, whether an offer is plausibly a scam, or whether a user is trying to move someone off-platform for fraud. LLMs are too slow and too expensive to run on every message in the send path, so they sit behind the queue and handle the harder cases the fast classifiers flag for follow-up.

Most production stacks use both tiers. The fast layer handles obvious garbage at scale; the LLM layer handles nuance.
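
A sketch of how the two lanes can be wired together: a fast check in the send path, with borderline cases enqueued for heavier asynchronous review. The classifier interface, queue, and score thresholds are all hypothetical.

// Sketch: two-tier moderation. fastClassifier, reviewQueue, and the thresholds
// are hypothetical stand-ins for real services and tuned values.
async function moderateBeforeDelivery(message, { fastClassifier, reviewQueue }) {
  // Synchronous lane: a milliseconds budget and conservative thresholds,
  // because false positives here are user-visible.
  const verdict = await fastClassifier.score(message.content);

  if (verdict.score > 0.95) {
    return { action: 'block', reason: verdict.label }; // obvious abuse, never delivered
  }

  if (verdict.score > 0.6) {
    // Asynchronous lane: deliver now, but let a heavier (LLM-backed) reviewer
    // look at it in context and take retroactive action if needed.
    await reviewQueue.publish('async-review', { messageId: message.id, verdict });
  }

  return { action: 'deliver' };
}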

Sync, History, and Unread State

A user opens the app on their phone after being away for two hours. The client needs to know, for every channel they're in:

  • What messages arrived while they were gone
  • What's been read on their other devices in the meantime
  • What the current unread count is
  • Whether anything was edited or deleted

All of this has to happen in as few round trips as possible and produce the same answer that the desktop app shows five seconds later.

Cursor-based sync

Every client tracks, for each channel it's a member of, the ID of the last message it has received. On reconnect or app open, it sends a single request asking for everything newer than its cursor for every relevant channel.

// Client sends on reconnect
{
  type: 'sync',
  channels: [
    { channel_id: 'general', last_message_id: 48291 },
    { channel_id: 'random', last_message_id: 15003 },
    { channel_id: 'dm:alice', last_message_id: 9912 }
  ]
}

// Server returns the delta
{
  type: 'sync_response',
  channels: {
    general: { messages: [...], new_read_cursor: 48350 },
    random: { messages: [], new_read_cursor: 15003 },
    'dm:alice': { messages: [...], unread_count: 3 }
  }
}

The server serves these deltas out of the message store, hitting an in-memory cache of each channel's recent tail when possible.

Multi-device read state

Read cursors are per-user, not per-device. If Alice reads a message on her phone, her desktop should reflect that within seconds. The design:

  • Every user has a personal pub/sub channel that their connected devices subscribe to
  • When any device updates a read cursor, the change publishes to the personal channel
  • Other devices update their local state on receipt

This is a small subsystem, but it's where many chat apps feel subtly off. A notification badge that doesn't clear after you read the message on another device is the kind of thing users don't articulate but do notice.

Unread counts

You can't just compute "messages in the channel after my cursor" at read time because, for a busy channel with millions of messages, that query gets expensive. The typical solution is incremental (a sketch follows the list):

  • When a message arrives in a channel, increment the unread count for every member whose cursor is behind the message.
  • When a user reads up to a new cursor, recompute their unread count by counting messages between the old and new cursor (cheap, bounded by the delta) and zeroing or decrementing accordingly.
  • Store unread counts in Redis for fast access and persist periodically.
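
A sketch of the incremental bookkeeping with ioredis. Key names are illustrative, and countBetween is assumed to be a bounded query against the message store.

// Sketch: incremental unread counts in Redis. Key names are illustrative;
// countBetween(channelId, from, to) is an assumed bounded message-store query.
const Redis = require('ioredis');
const redis = new Redis();

async function onMessageDelivered(channelId, memberIdsBehindCursor) {
  // One cheap increment per member whose read cursor is behind the new message.
  await Promise.all(
    memberIdsBehindCursor.map((userId) =>
      redis.hincrby(`unread:${userId}`, channelId, 1)
    )
  );
}

async function onCursorAdvance(userId, channelId, oldCursor, newCursor, countBetween) {
  // Bounded by the delta: only count messages between the old and new cursor.
  const justRead = await countBetween(channelId, oldCursor, newCursor);
  const remaining = await redis.hincrby(`unread:${userId}`, channelId, -justRead);
  if (remaining < 0) await redis.hset(`unread:${userId}`, channelId, 0); // clamp drift
}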

Edge cases: mentions, threaded replies, and muted channels all interact with the unread state in ways that make this subsystem more complicated than it looks.

History pagination

Scrolling backward through a channel with a year of messages. Two patterns:

  • Offset-based pagination (skip=1000, limit=50) scans and is expensive for deep pages
  • Cursor-based pagination (before_message_id=X, limit=50) uses the primary key index and stays fast regardless of depth

Cursor-based always wins for chat.
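
A sketch of the cursor-based read against the messages table from earlier, using the Node.js Cassandra driver. The connection details are illustrative, and the cursor is just the oldest message_id the client already has.

// Sketch: cursor-based history pagination against the messages table above,
// using cassandra-driver. Connection details are illustrative.
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'chat',
});

async function fetchOlderMessages(channelId, beforeMessageId, limit = 50) {
  // Walks the clustering index directly; cost does not grow with page depth.
  const query = `
    SELECT message_id, sender_id, content, content_type, created_at
    FROM messages
    WHERE channel_id = ? AND message_id < ?
    LIMIT ?`;
  const result = await client.execute(query, [channelId, beforeMessageId, limit], { prepare: true });
  return result.rows;
}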

Tradeoffs

Every interesting decision in chat architecture is a tradeoff between things you'd like to have but can't have all of at once. The five below are the ones that come up most often, and the right answer to each depends on what you're actually building.

Consistency vs. availability

During a network partition, you pick one. Chat systems almost universally pick consistency within a conversation, because ordering violations are user-visible in a way that destroys trust in the product.

The two failure modes, side by side:

Choice | What the user sees
Pick availability | Message appears to send, then reorders or disappears seconds later
Pick consistency | Message sits in "sending" state for an extra second or two

Users learn what the spinner means. They don't learn to trust a system that loses their messages.

Server capability vs. encryption

End-to-end encryption is great for the threat model where the operator shouldn't be able to read messages, and complicated for almost everything else. With true E2EE in place, the server can't do most of the things modern chat products are expected to do:

  • Search messages
  • Moderate content
  • Generate rich previews for push notifications
  • Power AI features like thread summaries or suggested replies
  • Help users recover history when they lose their device, without elaborate key escrow

Transport encryption, plus at-rest encryption, covers most threat models for business and enterprise chat while leaving the server-side capabilities intact. True E2EE using Signal Protocol or MLS is worth the complexity for consumer messengers where users expect WhatsApp-like privacy. For most other products, "encrypted in transit and at rest" is the honest answer.

Real-time fidelity vs. battery and bandwidth

A chat client that pushes every event the moment it happens feels incredibly responsive and absolutely destroys mobile battery life. The opposite extreme, where the client polls every 30 seconds and batches everything, sips battery and feels broken.

Real systems pick a different point on the curve for each event type:

  • Messages: pushed immediately, because users notice the lag
  • Typing indicators: debounced on the client, rate-limited on the server
  • Presence: coarse and periodic, often a refresh rather than an event stream
  • Read receipts: batched while a user is actively reading, individual otherwise

The product feels real-time on the events that matter and quietly conserves resources on the ones that don't.

Write fan-out vs. read fan-out

We talked about this earlier, but it's the architectural choice with the biggest long-term impact, and the one that's hardest to undo. Write fan-out gives you fast reads and easy unread counts; read fan-out gives you cheap writes and no storage amplification. Most production systems end up with a hybrid (write fan-out for DMs and small groups, read fan-out for large channels), with the threshold between them as a configuration rather than an architectural axiom.

If you're starting from scratch, design for the hybrid from day one. Migrating a system with billions of messages from one model to another is genuinely painful, and "we'll figure it out later" often means "we'll write a migration tool in three years."

Build vs. buy

The core message flow (a WebSocket server, a database, and a fan-out function) can be implemented in a few weeks. The edge cases take years:

  • Mobile reconnection and gap recovery across iOS, Android, and web
  • Group fan-out that scales from 2 to 200,000 members without falling over
  • Sync correctness across multiple devices, including ones offline for weeks
  • Presence, typing, and receipts at scale
  • Push notification integration with APNs and FCM, including delivery tracking and token rotation
  • Moderation tooling and the operational workflows around it
  • SDKs for every supported platform, kept current as those platforms evolve
  • Observability, debugging, and the operational knowledge to run the system through incidents

This is where managed chat providers earn their place. Stream ships iOS, Android, React, React Native, and Flutter SDKs that handle the client-side state machine, plus server-side moderation, channel types, threads, reactions, and push notifications, all running on the edge architecture.

Whether to build or buy comes down to one question: Is chat the product or a feature within it?

Scenario | Build | Buy
Chat is the product (messenger, community, support tool) | ✓ Owns the differentiated experience | The provider's defaults become your ceiling
Chat is a feature (marketplace, healthcare, game) | Engineering investment rarely pays back | ✓ Focus on what makes the product yours

Deceptively Simple, Straightforwardly Hard

Chat looks simple from the outside. Every subsystem has a long tail of edge cases that show up when you have real users on real networks.

The architecture in this article is the baseline for a modern chat system. It's a starting point, not a complete specification. Real implementations spend years on the details: the exact sharding strategy, the failure modes of each service, the client-side state machines, the moderation tooling, the observability stack. Every production chat system has its own accumulated scar tissue around these problems.

If you'd rather focus on the parts of your product that aren't chat, this is the architecture Stream implements behind its chat SDKs. You get the WebSocket fleet, fan-out, sync, and SDKs without having to run any of it yourself.
