# AI Message Streaming

When you connect a Large Language Model (LLM) to a Stream Chat channel, the LLM emits a long sequence of small text deltas (tokens or content blocks) before the response is complete. Streaming those deltas into a single chat message, instead of posting one message per chunk or waiting for the full response, gives users the ChatGPT-style live typing experience.

Stream's partial update API is the building block for this pattern. This page explains the two endpoints involved, how to chunk LLM output into a single message, and how to keep that updating message cheap to produce at scale.

## How streaming works

The flow on the server is the same regardless of LLM provider:

1. A user sends a message in a channel that has an AI bot as a member.
2. Your backend receives the new message (over webhook or SQS), forwards it to the LLM, and posts an empty placeholder message to the channel as the bot.
3. As the LLM streams content blocks back, your backend appends them to a local buffer and patches the placeholder message with the latest text using `update_message_partial` or its ephemeral variant.
4. When the LLM finishes, your backend writes the final text with a regular `update_message_partial` call so the message persists, then sends an `ai_indicator.clear` event.

Connected clients receive a `message.updated` event for every patch and re-render the message in place.

## Quick example

A minimal Python loop looks like this. Flushing on every 20th chunk (`chunk_counter % 20 == 0`) keeps the request rate well below per-channel limits while the message still appears to type smoothly.

```python label="Python"
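from getstream import Stream
from getstream.models import MessageRequest

client = Stream(api_key="...", api_secret="...")


async def stream_reply(channel_id, bot_id, llm_stream):
    """Stream LLM deltas into one chat message, flushing every 20th chunk."""
    # `async for` is only valid inside a coroutine, so the loop lives in one.
    channel = client.chat.channel("messaging", channel_id)

    # Post an empty placeholder as the bot; every later update patches it.
    placeholder = channel.send_message(
        MessageRequest(text="", user_id=bot_id, custom={"ai_generated": True}),
    ).data.message

    buffer = ""
    chunk_counter = 0

    async for delta in llm_stream:
        buffer += delta.text
        chunk_counter += 1

        # Interim flush: ephemeral, so nothing is written to the database.
        if chunk_counter % 20 == 0:
            client.chat.ephemeral_message_update(
                id=placeholder.id,
                set={"text": buffer, "generating": True},
                user_id=bot_id,
            )

    # Final commit: persisted, so the reply survives a reload.
    client.chat.update_message_partial(
        id=placeholder.id,
        set={"text": buffer, "generating": False},
        user_id=bot_id,
    )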
```

The interim updates use the ephemeral endpoint, which is the cheap path described below. The final update uses the regular partial endpoint so the assistant's reply is persisted and shows up in channel history.

## Persisted vs ephemeral partial updates

Stream exposes two patch endpoints with the same request shape (`set`, `unset`, `user_id`) but different storage semantics:

| Endpoint                   | Method                           | Persisted?                   | Use it for                                                                        |
| -------------------------- | -------------------------------- | ---------------------------- | --------------------------------------------------------------------------------- |
| `update_message_partial`   | `PATCH /messages/{id}`           | Yes, written to the database | The final commit when the LLM finishes, and any update you want to keep on reload |
| `ephemeral_message_update` | `PATCH /messages/{id}/ephemeral` | No, kept in memory only      | Interim chunks streamed during generation                                         |

Both endpoints fan out the same `message.updated` WebSocket event to every client watching the channel, so the live-typing experience is identical. The ephemeral variant skips the database write, which is what makes it scale to high-frequency token streams: a 2,000-token response no longer translates into 2,000 row writes per assistant turn (or even 100, if you flush every 20th chunk).

The persisted variant is still required for the final state. Without it, anyone reloading the channel after the response completed would see the empty placeholder rather than the assistant's reply.
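To make the arithmetic concrete, the sketch below counts the updates one assistant turn produces under the chunk-count strategy. `count_updates` and `UpdateCounts` are hypothetical helpers, not part of any Stream SDK, and the sketch assumes one content block per token, one placeholder write, and one persisted final commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateCounts:
    ephemeral_updates: int  # interim patches, never written to the database
    persisted_writes: int   # placeholder creation + final commit


def count_updates(total_chunks: int, flush_every: int) -> UpdateCounts:
    """Count the updates one assistant turn produces (hypothetical helper)."""
    if total_chunks < 0 or flush_every < 1:
        raise ValueError("total_chunks must be >= 0 and flush_every >= 1")
    # One interim flush per full group of `flush_every` chunks.
    ephemeral = total_chunks // flush_every
    # Assumption: the placeholder and the final commit are the only writes.
    return UpdateCounts(ephemeral_updates=ephemeral, persisted_writes=2)


print(count_updates(total_chunks=2000, flush_every=20))
# UpdateCounts(ephemeral_updates=100, persisted_writes=2)
```

Under these assumptions the number of database writes stays constant no matter how long the response is; only the number of in-memory fan-outs grows.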

<admonition type="info">

The ephemeral update endpoint is server-side only. It is exposed by the unified server SDKs (Python, Node, Go, Java, .NET, Ruby, PHP) but not by the client-side chat SDKs. Drive streaming from your backend, not from a browser or mobile client.

</admonition>

## Throttle interim updates

Even with the ephemeral endpoint, sending one update per token wastes bandwidth on every connected device and can hit per-channel rate limits. Two strategies work well in production:

- **Chunk count**: flush every Nth content block (20 is a reasonable starting point for Anthropic and OpenAI streams).
- **Time window**: flush at most every 50ms to 100ms, regardless of how many deltas have arrived.

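Pick whichever maps cleanly to your provider's stream events. The chunk-count strategy is shown in the Quick example above; a time-based variant uses a throttled flush coroutine that sends at most one update per window. (A true debounce would wait for the stream to pause before flushing, which is not what you want for a continuous token stream.)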
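One way to sketch the time-window strategy is a small throttle object that the streaming loop consults on every delta. `TimeWindowThrottle` and `stream_with_throttle` are hypothetical names, the `flush` callback stands in for the ephemeral update call, and a plain `for` loop is used for brevity where a real provider stream would likely need `async for`.

```python
import time
from typing import Callable, Iterable, Optional


class TimeWindowThrottle:
    """Allow at most one flush per `interval` seconds (hypothetical helper)."""

    def __init__(self, interval: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval = interval
        self._clock = clock  # injectable so the logic can be tested
        self._last_flush: Optional[float] = None

    def should_flush(self) -> bool:
        """Return True and start a new window if the current one has elapsed."""
        now = self._clock()
        if self._last_flush is None or now - self._last_flush >= self._interval:
            self._last_flush = now
            return True
        return False


def stream_with_throttle(deltas: Iterable[str],
                         flush: Callable[[str], None],
                         throttle: TimeWindowThrottle) -> str:
    """Accumulate deltas and call `flush(buffer)` at most once per window."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        if throttle.should_flush():
            flush(buffer)
    # The caller persists this full text with the final partial update.
    return buffer
```

The first delta flushes immediately, so the message starts filling in without waiting a full window. Text that arrives after the last interim flush is not lost, because the final persisted update always carries the complete buffer.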

## Persisted partial update

Use this for the final commit, and any time you want the change to survive a reload. The same call is documented in [Messages Overview](/chat/docs/php/send-message/#partial-update) for non-AI use cases such as editing custom data.

<Tabs>

```python label="Python"
client.chat.update_message_partial(
    id=msg_id,
    set={"text": final_text, "generating": False},
    user_id=bot_id,
)
```

```js label="JavaScript"
await client.partialUpdateMessage(
  msgId,
  { set: { text: finalText, generating: false } },
  botId,
);
```

```go label="Go"
_, err := client.Chat().UpdateMessagePartial(ctx, msgID, &getstream.UpdateMessagePartialRequest{
    Set: map[string]any{
        "text":       finalText,
        "generating": false,
    },
    UserID: getstream.PtrTo(botID),
})
```

```csharp label="C#"
await chat.UpdateMessagePartialAsync(messageId, new UpdateMessagePartialRequest
{
    Set = new Dictionary<string, object>
    {
        { "text", finalText },
        { "generating", false },
    },
    UserID = botId,
});
```

```java label="Java"
chat.updateMessagePartial(messageId, UpdateMessagePartialRequest.builder()
    .set(Map.of("text", finalText, "generating", false))
    .userID(botId)
    .build()).execute();
```

```ruby label="Ruby"
client.chat.update_message_partial(msg_id, Models::UpdateMessagePartialRequest.new(
  set: { 'text' => final_text, 'generating' => false },
  user_id: bot_id,
))
```

```php label="PHP"
$client->updateMessagePartial($messageId, new Models\UpdateMessagePartialRequest(
    set: (object)["text" => $finalText, "generating" => false],
    userID: $botId,
));
```

</Tabs>

## Ephemeral partial update

Use this for every interim flush. The request body is identical to `update_message_partial`, but the message is not written to the database; only the `message.updated` event is dispatched to watching clients.

<Tabs>

```python label="Python"
client.chat.ephemeral_message_update(
    id=msg_id,
    set={"text": buffer, "generating": True},
    user_id=bot_id,
)
```

```js label="JavaScript"
await client.ephemeralUpdateMessage(
  msgId,
  { set: { text: buffer, generating: true } },
  botId,
);
```

```go label="Go"
_, err := client.Chat().EphemeralMessageUpdate(ctx, msgID, &getstream.EphemeralMessageUpdateRequest{
    Set: map[string]any{
        "text":       buffer,
        "generating": true,
    },
    UserID: getstream.PtrTo(botID),
})
```

```csharp label="C#"
await chat.EphemeralMessageUpdateAsync(messageId, new UpdateMessagePartialRequest
{
    Set = new Dictionary<string, object>
    {
        { "text", buffer },
        { "generating", true },
    },
    UserID = botId,
});
```

```java label="Java"
chat.ephemeralMessageUpdate(messageId, EphemeralMessageUpdateRequest.builder()
    .set(Map.of("text", buffer, "generating", true))
    .userID(botId)
    .build()).execute();
```

```ruby label="Ruby"
client.chat.ephemeral_message_update(msg_id, Models::UpdateMessagePartialRequest.new(
  set: { 'text' => buffer, 'generating' => true },
  user_id: bot_id,
))
```

```php label="PHP"
$client->ephemeralMessageUpdate($messageId, new Models\UpdateMessagePartialRequest(
    set: (object)["text" => $buffer, "generating" => true],
    userID: $botId,
));
```

</Tabs>

## Pairing with AI indicator events

A good streaming UI signals state in addition to text. Stream Chat ships three custom events that the official AI components ([iOS](/chat/docs/sdk/ios/ai-integrations/overview/), [Android](/chat/docs/sdk/android/ai-integrations/overview/), [React](/chat/docs/sdk/react/guides/ai-integrations/), [React Native](/chat/docs/sdk/react-native/guides/ai-integrations/)) listen for:

- `ai_indicator.update` with one of `AI_STATE_THINKING`, `AI_STATE_GENERATING`, `AI_STATE_EXTERNAL_SOURCES`, or `AI_STATE_ERROR`.
- `ai_indicator.clear` to remove the indicator after the final partial update lands.
- `ai_indicator.stop` for cancellation; the SDK aborts the underlying LLM stream when it sees this from the user.

A typical sequence is `AI_STATE_THINKING` (sent right after the placeholder), `AI_STATE_GENERATING` (sent on the first content block), then `ai_indicator.clear` after the final commit.
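The ordering above can be sketched as a generator that yields one event payload per state change. `indicator_events` is a hypothetical helper, and the payload keys (`type`, `ai_state`, `message_id`) are assumptions; check them against the event schema your SDK version expects before sending anything.

```python
from typing import Iterable, Iterator


def indicator_events(message_id: str, deltas: Iterable[str]) -> Iterator[dict]:
    """Yield indicator event payloads in the typical order (hypothetical helper).

    The payload keys used here are assumptions, not a confirmed schema.
    """
    # Sent right after the placeholder is posted.
    yield {"type": "ai_indicator.update", "ai_state": "AI_STATE_THINKING",
           "message_id": message_id}

    generating_sent = False
    for _ in deltas:
        # Switch to GENERATING exactly once, on the first content block.
        if not generating_sent:
            generating_sent = True
            yield {"type": "ai_indicator.update",
                   "ai_state": "AI_STATE_GENERATING",
                   "message_id": message_id}

    # Sent after the final persisted update has landed.
    yield {"type": "ai_indicator.clear", "message_id": message_id}


for event in indicator_events("msg-1", ["Hel", "lo"]):
    print(event["type"], event.get("ai_state", ""))
```

If the LLM returns no content blocks at all, the sketch goes straight from `AI_STATE_THINKING` to `ai_indicator.clear`; an error path would send `AI_STATE_ERROR` instead.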

## Best practices

- Send a placeholder with `text: ""` and `ai_generated: true` before starting the stream. Updating an existing message ID is far cheaper than creating one per chunk, and clients can render the typing indicator against a known message.
- Drive every interim flush through `ephemeral_message_update`. Reserve `update_message_partial` for the final commit; the placeholder itself is created with a regular `send_message` call.
- Throttle to roughly 5 to 20 updates per second. Lower frequencies feel laggy; higher frequencies do not perceptibly improve the UX and burn rate-limit budget.
- Set a custom field such as `generating: true` while the response is in flight, and clear it on the final commit. Client UIs use this to render the typing animation without depending on indicator events alone.
- Tag the assistant message with `ai_generated: true` so your webhook handler skips it instead of re-prompting the LLM with its own output.
- For multi-channel deployments, share a single bot user across channels but scope cancellation tokens to `(channel_id, message_id)` so a stop in one channel does not abort a parallel response elsewhere.
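The last two practices can be sketched with two small helpers. Both `should_prompt_llm` and `CancellationRegistry` are hypothetical, and the webhook filter assumes the `ai_generated` flag appears either at the top level of the message or under a `custom` key, so verify that against your actual payloads.

```python
import threading
from typing import Dict, Tuple

CancelKey = Tuple[str, str]  # (channel_id, message_id)


def should_prompt_llm(message: dict) -> bool:
    """Return False for messages the bot produced itself.

    Assumption: the flag lives at the top level or under `custom`.
    """
    custom = message.get("custom") or {}
    return not (message.get("ai_generated") or custom.get("ai_generated"))


class CancellationRegistry:
    """Track one cancellation flag per in-flight response (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._flags: Dict[CancelKey, threading.Event] = {}

    def start(self, channel_id: str, message_id: str) -> threading.Event:
        """Register a response and return the flag its loop should poll."""
        flag = threading.Event()
        with self._lock:
            self._flags[(channel_id, message_id)] = flag
        return flag

    def cancel(self, channel_id: str, message_id: str) -> bool:
        """Set the flag for exactly one response; False if it is unknown."""
        with self._lock:
            flag = self._flags.get((channel_id, message_id))
        if flag is None:
            return False
        flag.set()
        return True

    def finish(self, channel_id: str, message_id: str) -> None:
        """Drop the flag once the final commit has been written."""
        with self._lock:
            self._flags.pop((channel_id, message_id), None)
```

A streaming loop would check `flag.is_set()` between deltas and stop early when it is set. Because the key includes both IDs, a stop for one message leaves every other response from the same bot user untouched.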

## End-to-end example

The [Build an AI Assistant Using Python](https://getstream.io/blog/python-assistant/) tutorial walks through the full FastAPI server with Anthropic streaming, including bot lifecycle, webhook registration, and cancellation. The patterns shown there apply unchanged to OpenAI, Gemini, or any provider that exposes a streaming API.

For a managed alternative that wires this whole flow up for you, see the [Stream Chat AI SDK](/chat/docs/sdk/react/guides/ai-integrations/stream-chat-ai-sdk/) and the [Stream Chat LangChain SDK](/chat/docs/sdk/react/guides/ai-integrations/stream-chat-langchain-sdk/). Both implement chunked partial updates, AI indicator events, and cancellation on top of the endpoints documented above.


---

This page was last updated at 2026-05-27T11:59:02.636Z.

For the most recent version of this documentation, visit [https://getstream.io/chat/docs/php/ai-message-streaming/](https://getstream.io/chat/docs/php/ai-message-streaming/).