Most of what you read about distributed systems assumes stable network links, synchronized clocks, and long-lived processes. Then mobile comes along.
Mobile clients drop connections during cell handoffs, get killed by the OS to save battery, run clocks that drift by hundreds of milliseconds, and reconnect in synchronized waves after every outage. The failure modes differ sufficiently from those of server-to-server communication to warrant dedicated architectural attention.
This article examines five areas where mobile-specific constraints change how you design distributed systems: event sequencing, delivery semantics, consistency trade-offs, backpressure, and failure handling.
Event Sequencing in Distributed Systems
You've got half a dozen engineers in a work group chat. Some are in the office, one at home, one on the train, one on a plane. They're triaging a payments outage. The conversation, in the order it actually happened:
Justin is on a congested cellular connection from the train. His messages take 300-400ms to reach the server. Everyone else is on WiFi with sub-20ms latency. If the server sequences by arrival order, everyone else in the channel sees this:
Justin's messages all arrive in a clump after the questions have piled up. "Not yet, two minutes" looks like a response to Priya's merge question. "Go for it" reads as approval for freezing all deploys. Priya holds her merge. Sara freezes everything. Mike, having heard nothing to the contrary, already told customers the team is working on it without knowing whether anyone's actually on it.
Twenty minutes later, after the incident channel has devolved into people undoing each other's work, Marcus, on airplane WiFi, wants to know if he should pick up anything for the team on the way in, oblivious to the chaos that has emerged.
This is a sequencing problem, and on mobile it compounds with every participant. Each client has a different network path to the server, with latency that varies moment to moment. Any system that relies on arrival order to determine display order will produce garbled conversations under real-world conditions, and the more people in the chat, the worse it gets.
Server-side sequencing vs client timestamps
The obvious solution is to use the client's timestamp. The client knows exactly when the user hit send.
The problem is that mobile clocks are unreliable. Android devices with automatic time enabled still drift by several seconds, and there's nothing to stop a user from setting their clock to last Thursday. Server-assigned timestamps avoid client clock skew, but they record arrival order rather than send order, so variable network delay can still misorder events. Client timestamps shouldn't be used for ordering at all.
So the server has to own ordering. There are a few ways to do it:
- Global sequence numbers: An atomic counter (Redis INCR, database sequence) stamps each event with a monotonically increasing ID. Simple and correct, but the counter is a single point of contention at high throughput.
- Snowflake IDs: Twitter's Snowflake scheme packs a 64-bit ID with 41 bits for the timestamp, 10 bits for the machine ID, and 12 bits for the sequence number; Discord uses a similar layout. These sort chronologically without a centralized counter and handle the case where two messages have the same creation time.
- Logical clocks with server annotation: Slack takes a hybrid approach. The client sends an incremented count of the last server response with each new message, giving the server enough context to assign authoritative ordering even when messages arrive out of sequence.
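To make the Snowflake approach concrete, here is a minimal sketch in Python. The epoch, class name, and clock-regression guard are illustrative, not any particular vendor's implementation; the bit layout follows the 41/10/12 split described above.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch; illustrative value

class SnowflakeGenerator:
    """64-bit IDs: 41 bits timestamp | 10 bits machine ID | 12 bits sequence."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            # Guard against the clock stepping backward (e.g. NTP correction)
            now_ms = max(int(time.time() * 1000), self.last_ms)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:
                    # 4096 IDs issued this millisecond: wait for the next one
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs sort chronologically; ties broken by the sequence bits
```

Because the machine ID is baked into each ID, every server can mint IDs independently with no shared counter, which is what removes the single point of contention.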
How far off are mobile clocks?
UW-Madison measured this. Over 95% of mobile clients use Simple Network Time Protocol (SNTP) rather than full NTP. Full NTP cross-references multiple time sources and filters out network jitter. SNTP skips all of that, and the accuracy gap is significant: wireless SNTP had a mean offset of 31ms with a standard deviation of 47ms. Wired connections averaged 4ms. The worst wireless offsets reached 450ms.
Android's implementation is well-documented. Starting with Android 12, AOSP prioritizes NTP over carrier-provided time (NITZ) and uses SNTP with a default timeout of 5000ms. The AOSP documentation is explicit about the limitation: if network latency is asymmetric, meaning the request takes a different amount of time than the response, the theoretical error can reach approximately 2.5 seconds. Apple doesn't publish the same level of detail about iOS time synchronization, though iOS devices are known to use time.apple.com as their NTP source and generally exhibit lower jitter in practice.
Without any network synchronization at all, such as in airplane mode, quartz crystal drift accumulates to roughly a second per day.
Logical clocks for mobile
If physical clocks can't be trusted, the alternative is to track causality directly.
- Lamport clocks do this with O(1) space per timestamp, but the ordering is one-directional: if event A caused event B, A's timestamp is smaller, yet a smaller timestamp doesn't prove causality. They also can't tell you when two events are concurrent, which matters when two users are typing at the same time.
- Vector clocks solve that, but require O(N) space per timestamp, where N is the number of participants. That works for a system with a handful of servers. It doesn't work for a group chat with 50 people.
- Hybrid Logical Clocks thread the needle. They combine a physical time component with a small logical counter, giving you causal ordering in O(1) space while staying close to wall-clock time. They're monotonic, so they handle the case where NTP corrects a clock backward. They fit into a standard 64-bit format (48 bits physical, 16 bits logical). And they're already running in production at CockroachDB, MongoDB, and YugabyteDB.
Reliable event systems often centralize sequencing logic at the infrastructure layer to prevent race conditions across clients.
There's a gotcha for offline-first apps, though. If one client's clock runs a day ahead and it sends changes, other clients can't modify the same object until the next day. Their HLC values will always be lower. The clock skew propagates through the logical layer.
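The HLC update rules are compact enough to sketch in full. This is a simplified Python rendering of the algorithm (class and method names are illustrative), using the 48-bit physical / 16-bit logical packing mentioned above; it also demonstrates the gotcha, since a merged timestamp from a fast clock drags the local clock forward.

```python
import time

class HybridLogicalClock:
    """Simplified HLC: 48-bit physical ms component + 16-bit logical counter."""

    def __init__(self):
        self.pt = 0  # highest physical component seen so far
        self.c = 0   # logical counter, breaks ties within one ms

    def _wall_ms(self) -> int:
        return int(time.time() * 1000)

    def now(self) -> int:
        """Timestamp a local or send event."""
        wall = self._wall_ms()
        if wall > self.pt:
            self.pt, self.c = wall, 0
        else:
            self.c += 1  # wall clock hasn't advanced (or stepped backward)
        return (self.pt << 16) | self.c

    def update(self, remote: int) -> int:
        """Merge a timestamp received from another node."""
        r_pt, r_c = remote >> 16, remote & 0xFFFF
        wall = self._wall_ms()
        if wall > self.pt and wall > r_pt:
            self.pt, self.c = wall, 0
        elif r_pt > self.pt:
            self.pt, self.c = r_pt, r_c + 1  # remote clock is ahead: adopt it
        elif r_pt == self.pt:
            self.c = max(self.c, r_c) + 1
        else:
            self.c += 1
        return (self.pt << 16) | self.c
```

Note the `r_pt > self.pt` branch: once a replica with a clock running a day ahead sends a change, every other replica's HLC jumps a day ahead too, which is exactly the offline-first skew-propagation problem.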
Delivery Semantics and Reliability Models
Exactly-once delivery is provably impossible at the network layer. This isn't a gap waiting to be closed: the Two Generals' Problem shows that agreement over a lossy link can't be guaranteed, and the FLP result proves deterministic consensus is impossible in an asynchronous system with even one faulty process. The workaround is well understood: deliver at least once, make your consumers idempotent, and you get effectively-once processing.
That's manageable between servers. Mobile is a different environment.
Why mobile makes delivery harder
Think about what happens during a cell handoff. Your phone is connected to one tower. You're moving, and the signal from a second tower gets stronger. For a brief window, the connection to the first tower is torn down before the second one is fully established. Any packets in flight during that window are gone. WiFi-to-cellular transitions are worse: the TCP connection gets reset entirely.
Uber's engineering team noted that temporary DNS errors or connection timeouts on a mobile device usually don't mean the backend is down. They mean the phone briefly lost signal. The problem is that from the client's perspective, a network blip and a server outage look the same. Your retry logic has to handle both without knowing which one it's dealing with.
Then the OS gets involved. iOS treats silent push notifications as best-effort, throttling to roughly two or three per hour and blocking them entirely in Low Power Mode. Android's Doze mode restricts network access and defers jobs. High-priority FCM messages get about a 10-second window, but apps that overuse high priority get deprioritized. The difference is measurable: devices with selective battery exemptions hit 98.3% push reliability over 72 hours, compared to 61.7% with blanket battery restrictions.
Retries, backoff, and jitter
When a request fails, the instinct is to retry immediately. That can make things worse. If 100 clients fail at the same time and all retry at the same time, you've just doubled the load on an already struggling server.
Exponential backoff helps, but not as much as you'd expect. AWS simulated 100 clients contending for the same resource and found that plain exponential backoff still produced clusters of retries, because every client backed off to the same intervals. The calls happened less frequently, but they still happened in spikes.
The fix is adding randomness. Full Jitter (sleep = random(0, min(cap, base * 2^attempt))) spreads retries across the entire backoff window instead of clustering them at the boundaries. Without jitter, retries clump into spikes with dead gaps between them. With Full Jitter, the spikes flatten into an approximately constant rate. Total call count dropped by more than half.
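The Full Jitter formula translates directly into code. A sketch in Python, with illustrative base and cap values:

```python
import random

def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """AWS Full Jitter: sleep = random(0, min(cap, base * 2^attempt))."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Plain exponential backoff would put every client at exactly base * 2^attempt,
# producing synchronized spikes. Full Jitter scatters them across the window.
delays = [full_jitter_delay(3) for _ in range(1000)]
assert all(0 <= d <= 4.0 for d in delays)  # window for attempt 3: base * 2^3 = 4.0s
```

The key property is that the delay is drawn from the entire interval, not just the upper boundary, so two clients that failed in the same millisecond almost never retry in the same millisecond.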
The deeper danger with retries is amplification. A backend with 5 layers, where each layer retries 3 times, turns a single failed request into 243 requests at the bottom of the stack. On mobile, the client adds yet another retry layer, and amplification starts before the request even reaches your infrastructure.
Mobile retries also need to survive app crashes, which means the retry state can't just live in memory. Generate the idempotency key client-side, scoped to user intent (one per button tap, not per HTTP request).
Persist it to local storage before the first network attempt. If the app crashes after sending but before getting a response, the restart path finds the pending key and retries with it. Cap retries at 10-15 attempts or 2-5 minutes, because unbounded retries on mobile just drain the battery without accomplishing anything.
Idempotency strategies
Stripe's idempotency key pattern is the industry standard: keys expire after 24 hours, cost about 50-100 bytes each, and the server-side logic is straightforward. The key requirement on mobile is ordering: persist the key before sending the request, not after. Here's what the pattern looks like on iOS:
```swift
func placeOrder(_ order: Order) async throws -> OrderConfirmation {
    // Generate and persist key BEFORE the network call
    let idempotencyKey = UUID().uuidString
    UserDefaults.standard.set(
        idempotencyKey,
        forKey: "pending_order_key"
    )

    // If we crash after this point, the key survives
    let confirmation = try await api.post(
        "/orders",
        body: order,
        headers: ["Idempotency-Key": idempotencyKey]
    )

    // Only clear the key after success
    UserDefaults.standard.removeObject(
        forKey: "pending_order_key"
    )
    return confirmation
}

func recoverPendingOrder() async {
    // On app restart, check for incomplete requests
    guard let pendingKey = UserDefaults.standard.string(
        forKey: "pending_order_key"
    ) else { return }

    // Retry with the same key; the server deduplicates
    try? await api.post(
        "/orders",
        body: lastOrder,
        headers: ["Idempotency-Key": pendingKey]
    )
}
```
The server side is straightforward. When a request arrives with an idempotency key, check if you've seen it before. If not, process the request and store the result keyed to it. If so, return the stored result. INSERT ... ON CONFLICT DO NOTHING handles the atomic check-and-insert in a single query.
One mistake that comes up often: don't cache error responses. If you store a 500 with an idempotency key, every retry with that key returns the cached 500 instead of retrying. A transient server error becomes a permanent failure for that client. Only cache 2xx responses.
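A server-side sketch of that flow in Python, using SQLite's `INSERT OR IGNORE` to stand in for Postgres's `INSERT ... ON CONFLICT DO NOTHING`. The schema and the `process_order` business logic are hypothetical; note that only 2xx results are stored.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, status INTEGER, body TEXT)"
)

orders_created = []

def process_order(order: dict) -> tuple:
    # Hypothetical business logic: should run exactly once per user intent
    orders_created.append(order)
    return 200, {"order_id": len(orders_created)}

def handle_order(key: str, order: dict) -> tuple:
    # Atomic check-and-insert: the first request to arrive claims the key
    cur = db.execute("INSERT OR IGNORE INTO idempotency_keys (key) VALUES (?)", (key,))
    if cur.rowcount == 0:
        # Key seen before: return the stored result instead of reprocessing
        status, body = db.execute(
            "SELECT status, body FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if status is not None:
            return status, json.loads(body)
        return 409, {"error": "original request still in flight"}
    status, body = process_order(order)
    if 200 <= status < 300:
        db.execute(
            "UPDATE idempotency_keys SET status = ?, body = ? WHERE key = ?",
            (status, json.dumps(body), key),
        )
    else:
        # Never cache errors: release the key so a retry can succeed
        db.execute("DELETE FROM idempotency_keys WHERE key = ?", (key,))
    return status, body
```

A retried request with the same key gets the stored response, so the client's crash-recovery path from the iOS example above is safe to call blindly.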
Mature real-time platforms typically handle idempotency, acknowledgment tracking, and retry logic at the infrastructure layer, so application developers can focus on business logic rather than reimplementing delivery guarantees.
Consistency Tradeoffs in Distributed State
Mobile clients spend a lot of time partitioned from your backend. Sometimes it's a full disconnection. Sometimes it's a cell connection that's technically alive but dropping half its packets. Either way, the app needs to keep working. That means you're choosing availability over consistency, and the question becomes: how much consistency can you afford to give up, and where?
The CAP theorem frames this as a binary choice during partitions. The PACELC extension is more useful for mobile because it also covers what happens the rest of the time. Even when a mobile client has a connection, cellular latency is high and variable.
Every read and write path involves a tradeoff: do you wait 400ms for a consistent read from the primary, or serve stale data from a local cache in 20ms? Different features in the same app need different answers.
Segmenting consistency requirements
Don't apply one consistency model to your whole app. Segment by feature, and match each domain to what it actually needs:
| Feature Type | Consistency Model | Rationale |
|---|---|---|
| Social feeds, activity streams | Eventual consistency (AP/EL) | Stale reads are fine; availability and low latency matter more |
| Chat message ordering | Strong ordering, eventual delivery | Server-assigned sequence numbers with client-side gap detection |
| Payments, transactions | Strong consistency (PC/EC) | Correctness is non-negotiable; users accept higher latency |
| Collaborative editing | Strong Eventual Consistency via CRDTs | Offline writes need to merge without conflicts |
In practice, this means a single mobile app might run three or four consistency strategies at once, coordinated at the API gateway or service mesh level.
CRDTs for offline-first mobile
The hard problem with offline-first is: what happens when two users edit the same thing while disconnected? When they come back online, the system has two divergent versions and no way to ask the users what they meant.
CRDTs (Conflict-free Replicated Data Types) are designed for exactly this. They're data structures where any two replicas can be merged automatically, without coordination, and the result is always consistent. The original research was motivated by collaborative editing and mobile computing. There are three main flavors, each with different bandwidth profiles:
- State-based (CvRDTs): Send full state on every sync. Simple, but expensive over cellular.
- Operation-based (CmRDTs): Send only the operations. Smaller payloads, but require causal delivery.
- Delta-state: Send only recent changes. Usually the best fit for mobile.
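The defining CRDT property is that merge is commutative, associative, and idempotent, so replicas converge no matter what order syncs happen in. The simplest state-based example, a grow-only counter, makes this concrete (a sketch, not any particular library's API):

```python
class GCounter:
    """State-based grow-only counter: merge is an element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count contributed by that replica

    def increment(self, n: int = 1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        # Commutative, associative, idempotent: any sync order converges
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

# Concurrent offline edits on two devices merge without coordination
phone, laptop = GCounter("phone"), GCounter("laptop")
phone.increment(3)
laptop.increment(2)
phone.merge(laptop)
laptop.merge(phone)
assert phone.value() == laptop.value() == 5
```

A delta-state variant would ship only the entries changed since the last sync instead of the whole `counts` map, which is what makes it cheaper over cellular.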
In practice, pure CRDTs are rarely the answer. Figma's experience is the clearest illustration. They evaluated the full decision space:
- OT (Operational Transforms). The approach Google Docs uses. Rejected because it requires a continuous connection, ruling out offline use.
- Pure CRDTs. Rejected because CRDTs are designed for fully decentralized systems. Figma has a central server, so the decentralization machinery added complexity without benefit.
- Property-level last-writer-wins with server-defined event order. This is what they shipped. Simpler than either alternative, and it works because they have a single source of truth for ordering.
The broader lesson for mobile architects: CRDTs solve conflict resolution, but they don't solve sync. They don't tell you when to sync, how to manage bandwidth, or how to decide what data each device needs.
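Figma hasn't published their merge code, but property-level last-writer-wins with server-defined order can be sketched in a few lines (the update format, field names, and function are hypothetical):

```python
def merge_property_lww(obj: dict, updates: list) -> dict:
    """Property-level LWW: each property keeps the value written by the
    update with the highest server-assigned sequence number."""
    versions = {}        # property -> seq of the update that last wrote it
    merged = dict(obj)
    for u in sorted(updates, key=lambda u: u["seq"]):  # server order, not client clocks
        for prop, value in u["props"].items():
            if u["seq"] >= versions.get(prop, -1):
                merged[prop] = value
                versions[prop] = u["seq"]
    return merged

# Two clients edit different properties of the same shape while offline.
# The server assigns seq on arrival; per-property LWW preserves both edits.
shape = {"x": 0, "fill": "red"}
updates = [
    {"seq": 12, "props": {"x": 100}},        # client A moved the shape
    {"seq": 13, "props": {"fill": "blue"}},  # client B recolored it
]
assert merge_property_lww(shape, updates) == {"x": 100, "fill": "blue"}
```

Because "last" is defined by the server's sequence, not the client's clock, this sidesteps the mobile clock-skew problems from the sequencing section entirely.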
Session guarantees and cross-device sync
Read Your Writes, Monotonic Reads, Writes Follow Reads, and Monotonic Writes are the four session guarantees, originally formulated for exactly this situation: intermittently connected mobile users.
Read-your-own-writes matters the most for UX, and it's easy to see why. A user sends a message in a chat. The write goes to the primary database. The client immediately reads from a replica that hasn't received the write yet. The message doesn't appear. The user assumes the send failed and taps the button again. Now you have a duplicate, and the user trusts your app a little less. One missing guarantee creates both a data integrity problem and a trust problem.
Cross-device sync is harder still, because the code on one device has no idea what happened on the other. Linear's sync engine is a good reference implementation, but the most interesting thing about it is what they learned: conflicts are actually rare in practice. They use last-writer-wins for most fields and only recently added CRDTs for rich text in issue descriptions. For most mobile apps, having a clear conflict resolution strategy matters more than which strategy you pick.
Backpressure and Flow Control in High-Volume Systems
A normal API call is a brief spike. The request goes out, the response comes back, and the resources get freed. A persistent connection is different. It stays open, and everything it touches accumulates: the event buffer grows, file descriptors stay reserved, heartbeats burn CPU on every tick, and the cellular radio stays in its high-power state as long as data is flowing.
Over hours of a user session, this sustained pressure can push a mobile client toward memory limits that a thousand short-lived API calls would never approach.
The WebSocket flow control gap
When a server pushes events faster than a mobile client can process them, the client has no way to say "slow down." WebSockets have no backpressure mechanism, no equivalent to HTTP/2's WINDOW_UPDATE. The client just buffers until memory runs out or the UI locks up.
The server side has the same problem in reverse. Each connected socket needs a send buffer, and across thousands of concurrent connections, buffer memory alone can consume gigabytes of RAM.
HTTP/2 and WebTransport (HTTP/3/QUIC) both provide native flow control, but until those see wide adoption on mobile, you have to implement application-layer backpressure yourself. That usually means some combination of bounded queue depths, priority-based dropping, and signaling the source to slow down.
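One possible shape for that application-layer backpressure is a bounded per-connection send queue that sheds deferrable events first. A sketch in Python (priority labels, depth cap, and class name are illustrative):

```python
from collections import deque

HIGH, LOW = 0, 1  # e.g. new chat messages vs presence updates

class BoundedSendQueue:
    """Per-connection send queue with a hard depth cap. When full, the
    oldest low-priority event is dropped before any high-priority one."""

    def __init__(self, max_depth: int = 100):
        self.max_depth = max_depth
        self.queues = {HIGH: deque(), LOW: deque()}

    def offer(self, priority: int, event: dict) -> bool:
        if self.depth() >= self.max_depth:
            if self.queues[LOW]:
                self.queues[LOW].popleft()  # shed a deferrable event
            else:
                return False  # queue is all high-priority: signal the source to slow down
        self.queues[priority].append(event)
        return True

    def poll(self):
        # Drain high-priority events before low-priority ones
        for p in (HIGH, LOW):
            if self.queues[p]:
                return self.queues[p].popleft()
        return None

    def depth(self) -> int:
        return sum(len(q) for q in self.queues.values())
```

The `False` return from `offer` is the backpressure signal: the layer feeding this queue should pause or start collapsing events rather than buffering indefinitely.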
Mobile backgrounding destroys persistent connections
iOS kills WebSocket connections when the app goes to the background. You get roughly 30 seconds via beginBackgroundTask to clean up. The recovery cycle is:
- APNs silent push wakes the app
- App reconnects the WebSocket briefly to sync missed state
- App closes the connection before the OS kills it
Android's Doze mode takes a different approach but has the same effect: it cuts off network access and defers jobs and syncs. FCM is the only background delivery channel that reliably works on Android, because it maintains a single persistent connection optimized for Doze and App Standby.
The takeaway from both platforms is the same: you can't keep a persistent connection to a mobile client. Constant reconnection is the normal state of affairs, and your architecture has to be built for it.
Adaptive delivery mechanisms
Flow control on mobile isn't one mechanism. It's several strategies layered together.
Priority classification is the one most teams get wrong. The idea is straightforward: separate time-sensitive events (new messages, typing indicators) from deferrable ones (read receipts, presence updates). FCM high-priority messages bypass Doze; normal-priority messages wait for maintenance windows. The temptation is to mark everything as high priority to guarantee delivery, but Android deprioritizes apps that overuse it. Once you're flagged, even your genuinely urgent messages get delayed.
Collapsible vs non-collapsible messages keep queues from bloating. A "sync new email" notification replaces the previous identical one (collapsible). Chat messages each need individual delivery (non-collapsible). FCM enforces hard limits on both:
- 240 messages per minute per device
- 5,000 messages per hour per device
- 100 pending messages per offline device, after which all stored messages are discarded
Server-side flow control mechanisms are commonly used in large-scale event platforms to protect downstream systems and clients during load spikes.
When that 100-message cap is hit, onDeletedMessages fires on the client. Your app needs a full resync path for any device that's been offline long enough to reach it.
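A toy model of those queue rules makes the resync trigger easy to reason about. This is not FCM's implementation, just a sketch mirroring the documented behavior; the class and flag names are hypothetical, and `needs_full_resync` plays the role of `onDeletedMessages` firing on the client.

```python
from collections import OrderedDict

class OfflineDeviceQueue:
    """Pending messages for one offline device, mimicking FCM's rules:
    collapsible messages replace their predecessor by collapse key, and
    overflowing the cap discards everything, forcing a full resync."""

    CAP = 100  # documented pending-message limit per offline device

    def __init__(self):
        self.pending = OrderedDict()   # key -> message
        self.needs_full_resync = False
        self._next = 0

    def enqueue(self, message: dict, collapse_key=None):
        if collapse_key is not None:
            self.pending[collapse_key] = message  # replaces the previous one
            self.pending.move_to_end(collapse_key)
        else:
            self._next += 1
            self.pending["msg-%d" % self._next] = message
        if len(self.pending) > self.CAP:
            self.pending.clear()           # all stored messages discarded
            self.needs_full_resync = True  # client must do a full state sync

q = OfflineDeviceQueue()
for _ in range(50):
    q.enqueue({"type": "sync_email"}, collapse_key="sync_email")
assert len(q.pending) == 1  # collapsible: only the latest survives
```

Marking high-churn events collapsible is what keeps chat messages (which must be non-collapsible) from being the ones that push a device over the cap.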
Adaptive heartbeats save battery by widening intervals during idle periods and tightening them when the user is active. Batching state diffs at short intervals and dropping stale data before sending reduces both network traffic and client-side processing load.
Designing for Failure as a First-Class Concern
A user taps "Place Order" on a food delivery app. The request reaches the server and succeeds. The server charges their card, creates the order, and sends a response. But the phone switched from WiFi to cellular at exactly the wrong moment, and the response never arrives. The app shows a spinner, then times out. The user sees the order screen again, with no confirmation. They tap "Place Order" a second time.
Without idempotency, they just got charged twice. Without a saga, the second order has no compensating transaction to undo the first. Without client-side state persistence, the app has no idea that the first attempt ever happened. This is a partial failure, and on mobile, a primary operating mode.
Partial failures and compensating transactions
The saga pattern handles multi-step operations that can fail partway through by pairing each step with a compensating transaction that undoes its effect. In the food delivery example, the saga would look like this:
- Create order (compensate: cancel order)
- Charge payment (compensate: refund)
- Notify restaurant (compensate: send cancellation)
If the app disconnects after step 2, the server knows exactly where it stopped. When the client reconnects with the same idempotency key, the server can either resume from step 3 or confirm that the order has already been completed. No double charge, no orphaned order.
On mobile, server-side orchestration is essential. Client-side choreography falls apart when the phone can disconnect at any point in the sequence and lose track of what has been completed. All mutating operations require idempotency keys, and saga state must be durable on both sides.
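A sketch of that server-side orchestration in Python. The step functions and the state format are illustrative; the point is that completed steps are recorded durably after each action, so a reconnecting client resumes rather than restarts.

```python
def run_saga(steps, saga_state: dict):
    """steps: list of (name, action, compensate) tuples.
    saga_state persists the names of completed steps; in a real system it
    lives in a durable store keyed by the client's idempotency key."""
    done = saga_state.setdefault("done", [])
    try:
        for name, action, _compensate in steps:
            if name in done:
                continue          # resume: this step already ran before the disconnect
            action()
            done.append(name)     # persist progress after each step
    except Exception:
        # Roll back completed steps in reverse with their compensations
        for name, _action, compensate in reversed([s for s in steps if s[0] in done]):
            compensate()
            done.remove(name)
        raise

# Client reconnects mid-saga: the first two steps already ran,
# so only the third executes. No double charge.
log = []
steps = [
    ("create_order",   lambda: log.append("order created"),
                       lambda: log.append("order cancelled")),
    ("charge_payment", lambda: log.append("card charged"),
                       lambda: log.append("card refunded")),
    ("notify",         lambda: log.append("restaurant notified"),
                       lambda: log.append("cancellation sent")),
]
state = {"done": ["create_order", "charge_payment"]}
run_saga(steps, state)
assert log == ["restaurant notified"]
```

If a step throws instead, the compensations run in reverse order over the completed prefix, which is the saga guarantee: either all steps complete or their effects are undone.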
Retry storms and the thundering herd
When thousands of mobile clients reconnect simultaneously after an outage, their synchronized retries can extend the outage far beyond the original cause. Fixing this requires work at several levels:
- Full Jitter on reconnection timers: Random delay spread across the full backoff window prevents synchronized spikes. The AWS analysis shows this cuts total server load by more than half with 100 contending clients.
- Maximum retry bounds: Cap at 10-15 attempts or 2-5 minutes elapsed. Unbounded retries drain batteries and pile onto an already struggling server.
- Server-side 429 responses with Retry-After headers: Clients should respect the header rather than computing their own backoff.
Jitter helps everywhere, not just on retries. Heartbeat timers, background sync intervals, polling cycles. Any regular interval becomes a synchronized spike across a large enough client population.
Fault isolation on mobile
Circuit breakers cycle through three states: Closed (normal), Open (fail fast), and Half-Open (test recovery with a single request). Capital One's mobile Edge team uses this to cut off calls that take too long, failing fast instead of letting the client hang.
But on mobile, you need an extra check. The circuit breaker shouldn't trip during airplane mode or no-signal conditions. If it opens due to a client-side connectivity issue, it'll prevent recovery when the signal comes back. Only trip on genuine backend failures: 5xx responses, connection refused. The mobile networking layer needs to classify errors before the circuit breaker sees them.
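A sketch of a breaker that classifies errors before counting them (exception types, threshold, and cooldown are illustrative):

```python
import time
from enum import Enum

class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class BackendError(Exception):       # 5xx, connection refused
    pass

class ConnectivityError(Exception):  # airplane mode, no signal
    pass

class MobileCircuitBreaker:
    """Counts only genuine backend failures. Client-side connectivity
    errors pass through without tripping the breaker, so it can't get
    stuck open after the signal returns."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = 0.0
        self.state = BreakerState.CLOSED

    def call(self, request):
        if self.state is BreakerState.OPEN:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise BackendError("circuit open: failing fast")
            self.state = BreakerState.HALF_OPEN  # allow one probe request
        try:
            result = request()
        except ConnectivityError:
            raise  # not the backend's fault: don't count it
        except BackendError:
            self.failures += 1
            if self.state is BreakerState.HALF_OPEN or self.failures >= self.threshold:
                self.state = BreakerState.OPEN
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = BreakerState.CLOSED
        return result
```

The networking layer's job is to raise the right exception type; the breaker itself never inspects the network, it only reacts to how errors were classified.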
Bulkheads complement this by giving each backend service its own connection pool and timeout config. Without them, one slow service can block everything else by exhausting shared resources.
Uber's mobile failover architecture
Uber published one of the most detailed public accounts of mobile failover. Their handler sits in the networking stack as an interceptor above HTTP/2 and QUIC, running as a finite state machine.
The hard part was telling the difference between the phone losing signal and Uber's edge infrastructure going down. On cellular networks, the error signatures overlap. The results: 25-30% reduction in tail-end HTTPS latencies and low error rates during cloud outages.
Observability on mobile
Backend observability is a solved problem in most organizations. You have distributed tracing, structured logging, dashboards, and alerting. The gap is that all of those tools stop at the edge of your infrastructure. A green backend dashboard can hide a terrible user experience when the real problem lies in the network between the client and the server.
Mobile observability requires a different set of signals. You need to know what the user actually experienced, not just what the server processed:
- Client-side latency: The time from the user tapping a button to seeing a result, including DNS resolution, TLS handshake, request queuing, and render time. Backend p99 might be 50ms while the user waits 3 seconds.
- Network transition failures: How often do requests fail during WiFi-to-cellular handoffs, and how long does recovery take? These don't show up in server logs at all.
- Retry and reconnection patterns: How many retries does a typical session generate? How long do clients spend in a disconnected state? If your retry budget is being exhausted regularly, the backoff parameters need to be tuned.
- Background sync gaps: How often do devices hit the FCM 100-message cap and trigger full resyncs? That's a leading indicator of notification volume problems.
Backend distributed traces show how one request moves through your services. A mobile performance trace is different: it starts and ends on the user's device and can span multiple backend requests, each with its own distributed trace. Don't merge them. They answer different questions, and combining them creates noise rather than insight.
The hardest part of mobile resilience is that many failures are invisible to backend monitoring. Uber's chaos testing program demonstrated this at scale, running over 180,000 automated chaos tests across 47 critical flows in their Rider, Driver, and Eats apps.
The biggest finding was that 70% of resilience risks involved architectural dependency violations, in which non-critical services degraded core user flows. Twelve issues were severe enough to block trip requests or food orders. Two caused application crashes that were only detectable through mobile-side testing, not backend monitoring.
Everything fails everywhere all at once
The thread running through all five sections is the same: mobile clients are unreliable participants in your distributed system, and the architecture must account for this at every layer. Server-side sequencing because you can't trust client clocks. At-least-once delivery with idempotency because you can't guarantee a response will arrive. Segmented consistency because you can't assume a stable connection. Application-layer backpressure because the protocol won't do it for you. And failure handling that treats disconnection as the norm rather than the exception.
None of these problems are new. What's different on mobile is that they all happen at once for every user, in every session.
Interactive systems are only as reliable as the infrastructure coordinating their state and event delivery, which is why many teams evaluate managed platforms when reliability becomes mission-critical.