Using Prompt Engineering to Refine a Large Language Model for Content Moderation

16 min read
Chiara Caratelli
Published August 12, 2024

In our previous blog post, we created an agent using OpenAI GPT that can understand whether or not messages contain spam. We connected it to the Stream Chat API so that incoming messages in chat can be automatically flagged and reviewed by moderators using Stream Moderation Dashboard.

To improve our agent's ability to detect spam, we need to refine our prompt using prompt engineering—an essential technique for shaping the responses of large language models. In this blog post, we will see how we can improve this setup to obtain better results via this and other techniques.

The following sections contain many code snippets that can be executed in a notebook environment, such as Google Colab. We will use the same libraries and setup used in the previous blog post and build up from there, so go check it out if you haven’t done it already. You will find links and instructions to get you started.

Let’s restate, more compactly, the evaluation setup we built in the previous post:

py
import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from sklearn.metrics import accuracy_score

# Setup OpenAI prompt
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", """Is this message spam? Return 1 or 0"""),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(
    openai_api_key=your_openai_api_key,
    model="gpt-4o-mini",
    max_tokens=1
)
llm_chain = chat_template | llm


def get_data_sample(sample_size=100):
    """
    Fetches and samples the SMS spam dataset.
    """
    url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
    df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])
    df['label'] = df['label'].apply(lambda x: 1 if x == 'spam' else 0)

    # Sample specified number of negative and positive examples without replacement
    negatives = df[df['label'] == 0].sample(sample_size//2, random_state=42)
    positives = df[df['label'] == 1].sample(sample_size//2, random_state=42)
    df_sampled = pd.concat([negatives, positives]).reset_index(drop=True)
    return df_sampled


def predict_message(llm_chain, message):
    """
    Predicts whether a given message is spam or not using the language model chain.
    """
    try:
        return int(llm_chain.invoke(message).content)
    except ValueError:
        return 0


def get_spam_predictions(llm_chain, messages):
    """
    Predicts whether each message in a list is spam or not.
    """
    predictions = []
    total = len(messages)
    for i, message in enumerate(messages):
        # Print progress every 10%
        if i % (total // 10) == 0:
            print(f"Processing: {i}/{total}")
        predictions.append(predict_message(llm_chain, message))
    return predictions


# Get predictions and calculate accuracy
df = get_data_sample(sample_size=600)
messages = df['message'].to_list()
df['predicted'] = get_spam_predictions(llm_chain, messages)
accuracy = accuracy_score(df['label'], df['predicted'])
print(f"Accuracy: {accuracy}")

Output:

py
Accuracy: 0.898

In this setup, we gathered a sample dataset, evaluated each message with a prompt, and calculated the accuracy score. We have also increased the sample size to 600 messages to have a more reliable metric. Next, let's explore techniques to improve accuracy.

First of all, let us see which predictions the model gets wrong.

py
from sklearn.metrics import confusion_matrix

# Use a separate variable name so we don't shadow the imported function
cm = confusion_matrix(df['label'], df['predicted'])

print("True Negative (TN):", cm[0, 0])
print("False Positive (FP):", cm[0, 1])
print("False Negative (FN):", cm[1, 0])
print("True Positive (TP):", cm[1, 1])

Output:

py
True Negative (TN): 164
False Positive (FP): 36
False Negative (FN): 3
True Positive (TP): 197

We have 36 false positives and 3 false negatives. Let’s sample some of the false positives, which we can retrieve by checking the rows where the predicted label is greater than the true label and extracting the content of the “message” column.

py
df[df['label'] < df['predicted']].sample(5)['message'].values

Output:

py
array(['K ill drink.pa then what doing. I need srs model pls send it to my mail id pa.',
       "Send ur birthdate with month and year, I will tel u ur LIFE PARTNER'S name. and the method of calculation. Reply must.",
       'Hey ! I want you ! I crave you ! I miss you ! I need you ! I love you, Ahmad Saeed al Hallaq ...',
       'Pls give her prometazine syrup. 5mls then <#> mins later feed.',
       'Hey you can pay. With salary de. Only <#> .'],
      dtype=object)

If we run this command a few times, we can see that some of these messages are ambiguous, talk about topics frequently found in spam messages, or contain a lot of typos. Let’s see how we can improve this by prompt engineering and other techniques.
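
You can inspect the few false negatives (spam that slipped through) in the same way, just with the comparison reversed; a minimal sketch reusing the df from above:

py
# False negatives: true label is 1 (spam) but the model predicted 0
false_negatives = df[df['label'] > df['predicted']]
print(false_negatives['message'].values)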

Prompt Engineering

Prompt engineering is the practice of designing, refining, and optimizing the prompts given to language models to elicit specific, desired responses. It involves carefully crafting the input text so the output aligns with the user's objectives, through techniques such as:

  • Clarifying Instructions: Specifying clearly what the model should do, often by setting up a scenario or providing detailed instructions.
  • Few-Shot Learning: Including examples in the prompt to demonstrate the desired output style or response format.
  • Parameter Tuning: Adjusting parameters like temperature, max tokens, and top-p to control the randomness, length, and scope of the responses.

Our initial prompt is not very detailed:

“Is this message spam? Return 1 or 0”

Clarify Instructions

In this example, we are trying to classify spam messages. We have provided the model with a basic prompt, relying on its intuition of what “spam” means. However, the policy we want to enforce might differ from the model’s knowledge. For instance, we might decide to allow promotional messages, or to disallow them entirely. We might also want to include other policies.

Let’s replace the prompt with a clearer version and calculate the predictions again:

py
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system",
         """Your task is to act as a content moderator.
Review the following message from a chat user and classify it based on the criteria below:

1 - Spam: Unsolicited promotions, advertisements, or deceptive content. Note that messages with typos or about typical topics seen in spam messages are not necessarily spam.
0 - Non-Spam: Legitimate conversations or unclear cases.

Return only the corresponding number (1 or 0). Any additional information will result in penalties.
Ensure your judgment is unbiased and consistent with the criteria.
Think step-by-step to make an accurate classification.
"""
         ),
        ("human", "{input}"),
    ]
)
llm_chain = chat_template | llm

df['predicted'] = get_spam_predictions(llm_chain, messages)
accuracy = accuracy_score(df['label'], df['predicted'])
print(f"Accuracy: {accuracy}")

Output:

py
Accuracy: 0.977

Awesome, we have increased the accuracy by more than seven percentage points! Let’s unravel what we have done here:

1. Clarified the Task

We have defined the context of the task better, specifying that the model needs to act as a content moderator and that it will receive a message from a chat user. By writing “think step-by-step” we nudge the model to consider every aspect of the input text, which has been shown to improve performance. This prompt engineering technique, known as “Chain of Thought”, typically involves the model explaining its reasoning in its response. However, in this instance, we apply the technique implicitly, asking the model to follow the reasoning process internally and output only the final decision.

2. Improved Formatting

Now that the prompt is longer, we have improved its formatting. This makes it easier for the model to understand the task and return the correct predictions.

3. Clarified the Definition of Spam

We have specified that spam messages contain promotions and deceptive or unwanted content. By restricting the scope of the definition, we have reduced the number of false positives and increased accuracy.

This section of the prompt is very important, as it directly affects the moderation policy you want to enforce. In the case of spam, the impact was small because the model already had a good intuition of what spam means, but if you need to implement a custom policy, this step is crucial.

We have also defined what should not be considered spam by adding the statement:
”Note that messages with typos or about typical topics seen in spam messages are not necessarily spam.”

We have added this because when looking at false positives earlier, we saw several messages that were about dating or had typos, which are typical of spam messages but were not spam in this case. Having a clear definition of what’s included in spam and what's not helps the model decide in a predictable manner.

Reviewing the wrong predictions to improve the definition is very useful, but you need to be careful. A possible issue is overfitting the prompt to the data it has seen, meaning it will have high accuracy on the current sample but won’t generalize to new data. For this reason, it is important to evaluate the results on a separate test dataset. In this blog post, we are only using a single sample of data as we are not iterating much on the prompt, and the definition is still very general. Still, in production, you should always separate the data you use to optimize the prompt from the data you use to evaluate it.
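
As a minimal sketch of that separation (reusing the df and helpers defined above), you could hold out part of the labeled sample, iterate on the prompt using only the development half, and report accuracy on the held-out half once at the end:

py
from sklearn.model_selection import train_test_split

# Hold out 50% of the labeled messages for final evaluation.
# Iterate on the prompt using df_dev only; score df_test once when you are done.
df_dev, df_test = train_test_split(df, test_size=0.5, stratify=df['label'], random_state=42)

df_test = df_test.copy()
df_test['predicted'] = get_spam_predictions(llm_chain, df_test['message'].to_list())
print(f"Held-out accuracy: {accuracy_score(df_test['label'], df_test['predicted'])}")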

Another approach you can use to improve the policy is to ask the model directly to explain why it made a decision, by adding this instruction to the prompt (don’t forget to also increase the number of output tokens), and then re-evaluate the messages that were wrongly classified. In this case, it is even more important to evaluate the results on a separate test dataset.
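
As a rough illustration (not the exact prompt used in this post), a separate “debug” chain could ask for a short explanation on the misclassified messages only, with a higher token limit; the wording of explanation_template below is an assumption:

py
explanation_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a content moderator. Explain briefly whether the following "
                   "message is spam (1) or not (0), and why."),
        ("human", "{input}"),
    ]
)

# Allow a longer answer so the reasoning is not cut off
explainer = explanation_template | ChatOpenAI(
    openai_api_key=your_openai_api_key, model="gpt-4o-mini", max_tokens=150
)

# Only inspect the messages the classifier got wrong
for msg in df[df['label'] != df['predicted']]['message'].head(5):
    print(msg)
    print(explainer.invoke(msg).content, "\n")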

4. Enforced the Output Format

We also notice that in some cases the model does not return the value in the right format. There are many edge cases that can pop up with an LLM, and they are sometimes hard to predict. For instance, we could get a response such as:

Ok... I'm sorry, but I need a longer message to determine if it contains spam or not. Could you please provide a longer message for me to check?

where the model complains that the message is too short. Or:

This message is probably not spam but I’m not 100% sure because the language is ambiguous

In both these cases, we cannot extract a response. With this new prompt, we have specified that:

  • For unclear cases, the model should output a 0
  • There should be nothing else in the response besides the number

We keep the output as a digit, so we only use one token. This helps save costs in the long run.
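
Even with these instructions, it can be worth hardening the parsing side as well. A small defensive variant of the predict_message helper (an assumption on top of the code defined earlier, not part of the original setup) could look like this:

py
def predict_message_safe(llm_chain, message):
    """
    Like predict_message, but tolerates stray whitespace and falls back
    to 0 (non-spam) for anything that is not exactly "0" or "1".
    """
    raw = llm_chain.invoke(message).content.strip()
    return int(raw) if raw in {"0", "1"} else 0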

Let’s review step-by-step all the changes that we have done in the prompt:

| Prompt | Change | Accuracy |
| --- | --- | --- |
| Is this message spam? Return 1 or 0 | Original | 0.898 |
| You will receive a message. Return 1 if it contains spam, 0 otherwise. | Clarified instructions | 0.951 |
| Your task is to act as a content moderator. Review the following message from a chat user. Return 1 if it contains spam, 0 otherwise. | Clarified role and instructions | 0.967 |
| Your task is to act as a content moderator. Review the following message from a chat user and classify it based on the categories below: 1 - Spam 0 - Non-Spam Return only the corresponding number (1 or 0). Any additional information will result in penalties. | Improved formatting and exception handling | 0.97 |
| Your task is to act as a content moderator. Review the following message from a chat user and classify it based on the categories below: 1 - Spam 0 - Non-Spam Return only the corresponding number (1 or 0). Any additional information will result in penalties. Think step-by-step to make an accurate classification. | Added reminder to think step-by-step | 0.972 |
| Your task is to act as a content moderator. Review the following message from a chat user and classify it based on the categories below: 1 - Spam 0 - Non-Spam Return only the corresponding number (1 or 0). Any additional information will result in penalties. Ensure your judgment is unbiased and consistent. Think step-by-step to make an accurate classification. | Added reminder to stay consistent | 0.975 |
| Your task is to act as a content moderator. Review the following message from a chat user and classify it based on the criteria below: 1 - Spam: Unsolicited promotions, advertisements, or deceptive content. Note that messages with typos or about typical topics seen in spam messages are not necessarily spam. 0 - Non-Spam: Legitimate conversations or unclear cases. Return only the corresponding number (1 or 0). Any additional information will result in penalties. Ensure your judgment is unbiased and consistent with the criteria. Think step-by-step to make an accurate classification. | Added spam policy | 0.977 |

It is important to note that we have increased the prompt size, and with it the number of input tokens. This raises the cost per request, so always balance costs and benefits when improving your prompt. The nice thing about classification tasks is that only one output token is needed, which keeps costs low, as output tokens are generally more expensive than input tokens.
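
To get a feel for this trade-off, you can count prompt tokens and plug in your model's current prices. The snippet below is a rough sketch: the per-token prices are placeholders you should replace with the values from OpenAI's pricing page, and it assumes a tiktoken version that ships the o200k_base encoding used by the gpt-4o family:

py
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

INPUT_PRICE_PER_TOKEN = 0.15 / 1_000_000   # placeholder price, check the pricing page
OUTPUT_PRICE_PER_TOKEN = 0.60 / 1_000_000  # placeholder price, check the pricing page

system_prompt = "Your task is to act as a content moderator. ..."  # the system prompt above
message = messages[0]                                               # one example message

prompt_tokens = len(enc.encode(system_prompt)) + len(enc.encode(message))
cost_per_message = prompt_tokens * INPUT_PRICE_PER_TOKEN + 1 * OUTPUT_PRICE_PER_TOKEN
print(f"~{prompt_tokens} input tokens, roughly ${cost_per_message:.8f} per message")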

Parameter Tuning

The main parameter that can be tuned for GPT is the temperature. Temperature controls the “creativity” or randomness of the generated text. It can go from 0 to 2, and the default value is 1. Temperature affects the probability distribution over all the possible tokens that can be chosen (in our case, 1 or 0). With a temperature of 0, the output is deterministic, meaning that the token with the highest probability will always be chosen. If the temperature is higher, we are more likely to sample tokens with lower probability.

So far, we have been using a default value of 1. As our task is a binary classification that benefits from consistency rather than creativity, we can try lowering this to 0:

py
llm = ChatOpenAI(
    openai_api_key=your_openai_api_key,
    model="gpt-4o-mini",
    temperature=0,  # Temperature is set to zero
    max_tokens=1
)
llm_chain = chat_template | llm

df['predicted'] = get_spam_predictions(llm_chain, messages)
accuracy = accuracy_score(df['label'], df['predicted'])
print(f"Accuracy: {accuracy}")

Output:

py
Accuracy: 0.98

We do not see a meaningful improvement with this prompt and this dataset. This indicates that our prompt is clear enough, and we almost always have a clear preference for one of the two output values. In general, we do not expect this parameter to matter as much as in text generation, as our output is already very constrained (we can have only two values). If there were more uncertainty or a larger pool of output tokens, temperature would have a larger impact.

We will keep the temperature at zero, as a deterministic output is preferred for classification tasks that don’t require creative text generation, and it is good practice to use a low value in these cases. Other parameters that can be tuned are top_p and frequency_penalty. They are less relevant for a binary classification task such as this one, but feel free to experiment with them.
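
As a hedged sketch of how those parameters can be passed (depending on your langchain_openai version they may be accepted as direct keyword arguments or need to go through model_kwargs), the values below are just the defaults and not a tuned configuration:

py
llm = ChatOpenAI(
    openai_api_key=your_openai_api_key,
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=1,
    top_p=1.0,              # nucleus sampling cutoff; lower values restrict token choice
    frequency_penalty=0.0,  # discourages repeated tokens; mostly relevant for longer outputs
)
llm_chain = chat_template | llm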

Few-Shot Learning

Few-shot learning consists of presenting the model with examples of inputs and expected outputs. It is particularly useful in cases where the task is new to the model, as it provides examples of expected behaviors and responses. There are different ways to do this. An example can be the following prompt:

py
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system",
         """Your task is to act as a content moderator.
Review the following message from a chat user and classify it based on the criteria below:

1 - Spam: Unsolicited promotions, advertisements, or deceptive content. Note that messages with typos or about typical topics seen in spam messages are not necessarily spam.
0 - Non-Spam: Legitimate conversations or unclear cases.

Return only the corresponding number (1 or 0). Any additional information will result in penalties.
Ensure your judgment is unbiased and consistent with the criteria.
Think step-by-step to make an accurate classification.
"""
         ),
        ("human", "Baaaaabe! I misss youuuuu ! Where are you ? I have to go and teach my class at 5 ..."),
        ("assistant", "0"),
        ("human", "LIFE has never been this much fun and great until you came in. You made it truly special for me. I won't forget you! enjoy @ one gbp/sms"),
        ("assistant", "1"),
        ("human", "{input}"),
    ]
)

In the case of classification, we do not need to put much emphasis on the specific response format, as it is only a number. The purpose of adding examples is therefore to clarify the policy and cover possible edge cases rather than to teach the model the output format.
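
To measure whether the examples actually help on this dataset, you can rebuild the chain and rerun the same evaluation loop as before; a minimal sketch reusing the helpers defined earlier:

py
llm_chain = chat_template | llm  # chat_template now includes the few-shot examples

df['predicted'] = get_spam_predictions(llm_chain, messages)
print(f"Accuracy with few-shot examples: {accuracy_score(df['label'], df['predicted'])}")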

Using a Larger Model

A larger model (one with more parameters) typically performs better. So far, we have used GPT-4o mini. Let’s try the larger GPT-4o by changing the model parameter as follows:

py
llm = ChatOpenAI(
    openai_api_key=your_openai_api_key,
    model="gpt-4o",
    temperature=0,
    max_tokens=1
)
llm_chain = chat_template | llm

df['predicted'] = get_spam_predictions(llm_chain, messages)
accuracy = accuracy_score(df['label'], df['predicted'])
print(f"Accuracy: {accuracy}")

Output:

py
Accuracy: 0.98

Accuracy has increased slightly by using GPT-4o with the last prompt. This is great, but it comes at the cost of longer evaluation time and higher pricing. You need to run more tests and evaluate whether the benefits of a <1% increase in accuracy justify the costs by asking questions such as:

  • What is the impact associated with an error? How many can you tolerate?
  • What is the volume of requests that you expect?
  • What would the accuracy be in a real-life scenario?

Fine-Tuning the Model

Sometimes, more than prompt engineering is needed to get a correct output, and we need to change the behavior of the model itself. Fine-tuning involves retraining the model on a more specialized dataset to adjust its parameters and improve its understanding of the domain. This process helps the model produce more accurate and relevant responses by focusing on the specific knowledge it needs.

Fine-tuning can help improve accuracy where prompt engineering fails. This is usually only needed in specific cases:

  • Prompt engineering does not work well enough. For instance, if you need to moderate messages containing very specific language that the model has not seen yet, or where the distinctions are very subtle.
  • You want to use a smaller and cheaper/faster model.

To fine-tune a GPT model, check out the instructions from LangChain. There are significant costs associated with fine-tuning, so it should be done only when the benefits justify them. In the case of GPT, inference on a fine-tuned model is roughly 2x more expensive than on the base model.
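
If you do go down this road, the labeled dataset we already have can be converted into the chat-style JSONL format that OpenAI's fine-tuning API expects. A minimal sketch (the output file name and the exact system prompt are assumptions):

py
import json

system_prompt = "Your task is to act as a content moderator. ..."  # reuse the prompt from above

with open("spam_finetune.jsonl", "w") as f:
    for _, row in df.iterrows():
        example = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row["message"]},
                {"role": "assistant", "content": str(row["label"])},
            ]
        }
        f.write(json.dumps(example) + "\n")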

Other Improvements

Enrichment

Providing more context in the prompt can improve the response. So far, we have only added the message text to the prompt, but if you have other metadata available that could be relevant for the detection, you can include it as well. Examples are the username or other user information, which can be passed in the input too. Feel free to experiment with these.
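
As a hedged sketch of what that could look like (the username field and the extra wording are assumptions, not part of the original setup), the prompt template can take multiple variables, and the chain is then invoked with a dictionary:

py
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", "Your task is to act as a content moderator. ..."),  # same criteria as above
        ("human", "Username: {username}\nMessage: {input}"),
    ]
)
llm_chain = chat_template | llm

# The webhook payload from Stream contains user information alongside the message text
prediction = llm_chain.invoke({"username": "win-free-crypto-2024", "input": "Claim your prize now!"})
print(prediction.content)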

Beyond Binary Classification: Enforcing a Full Moderation Policy

Spam is only one of the harms that you might encounter in a chat. Luckily, with a prompt, it is easy to include more policies. You can do so by extending the list:

py
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system",
         """Your task is to act as a content moderator.
Review the following message from a chat user and classify it based on the criteria below:

1 - Spam: Unsolicited promotions, advertisements, or deceptive content. Note that messages with typos or about typical topics seen in spam messages are not necessarily spam.
2 - Toxicity: Offensive or insulting behavior
...
0 - Normal: Legitimate conversations or unclear cases.

Return only the corresponding number. Any additional information will result in penalties.
Ensure your judgment is unbiased and consistent with the criteria.
Think step-by-step to make an accurate classification.
"""),
        ("human", "{input}"),
    ]
)

In the moderation service, you can then take different actions based on the detected category. For instance, you could only flag messages in the case of spam, but report the user for moderators to review, or delete the message, in the case of other violations. Check out our API documentation for some of the other actions you can take to moderate content and review it using Stream’s dashboard.
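
A hedged sketch of how the webhook handler could branch on the detected category (the action mapping below is an example policy, not a recommendation, and reuses the Stream client calls shown in the service further down):

py
def apply_moderation_action(chat, message_id, category):
    """
    Example policy: flag spam for moderator review, delete toxic messages,
    and leave normal messages untouched.
    """
    if category == 1:  # Spam
        chat.flag_message(message_id, user_id="spam-detector")
    elif category == 2:  # Toxicity
        chat.delete_message(message_id)
    # category 0 (normal) requires no action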

Scaling Up

Let’s recap the code for the new spam detection service with the improvements we have made:

py
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from stream_chat import StreamChat
import uvicorn

# Initialize app
app = FastAPI()

# Setup Stream chat. This step is needed as we need to add a user that acts as the reporter to flag the message
chat = StreamChat(api_key=your_app_api_key, api_secret=your_app_api_secret)
chat.upsert_user({"id": "spam-detector", "role": "admin"})

# Setup openai prompt
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system",
         """Your task is to act as a content moderator.
Review the following message from a chat user and classify it based on the criteria below:

1 - Spam: Unsolicited promotions, advertisements, or deceptive content. Note that messages with typos or about typical topics seen in spam messages are not necessarily spam.
0 - Non-Spam: Legitimate conversations or unclear cases.

Return only the corresponding number (1 or 0). Any additional information will result in penalties.
Ensure your judgment is unbiased and consistent with the criteria.
Think step-by-step to make an accurate classification.
"""),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(
    openai_api_key=your_openai_api_key,
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=1
)
llm_chain = chat_template | llm


def predict_message(llm_chain, message):
    """
    Predicts whether a given message is spam or not using the language model chain.
    """
    try:
        return int(llm_chain.invoke(message).content)
    except ValueError:
        return 0


@app.post("/")
async def webhook_handler(request: Request):
    data = await request.json()

    # Only execute this code when the webhook event corresponds to a new message
    if data["type"] == "message.new":
        text = data['message']['text']

        # Model evaluation
        is_spam = predict_message(llm_chain, text)

        if is_spam == 1:
            # Flag the message (the reporter id must match the user created above)
            chat.flag_message(data["message"]["id"], user_id="spam-detector")
            # Optional: delete the message
            # chat.delete_message(data["message"]["id"])

    return JSONResponse(content=data)


if __name__ == "__main__":
    uvicorn.run(app, host='0.0.0.0', port=8000)

If all goes well, you can use this service in your production app and start using Stream moderation leveraging your external engine.

Last but not least, let’s mention some considerations to keep in mind when working with ML models in production.

Maintaining the Model

It is very important to monitor the model's performance to ensure it does not degrade over time (and to improve it when new information becomes available). For instance, users may adopt different language or learn to bypass the filters you put in place. This will lead to errors in the predictions in the form of false negatives and false positives. Some of the false negatives will be flagged by users, while false positives will be identified by moderators disagreeing with the flag. Monitoring and incorporating this information to improve the model is part of maintaining your moderation system.
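
A lightweight way to support this feedback loop is to log every prediction alongside the message ID so it can later be compared with moderator decisions and user reports. A minimal sketch (the log file and fields are assumptions) that could be called from the webhook handler:

py
import csv
from datetime import datetime, timezone

def log_prediction(message_id, text, prediction, path="moderation_log.csv"):
    """
    Appends each prediction to a CSV file for later review against
    moderator decisions and user reports.
    """
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), message_id, text, prediction]
        )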

Scaling Up

We have already touched upon using open-source and self-hosted models in the previous blog post, along with their pros and cons. Something else worth mentioning on this topic is that for classification tasks there are open-source transformer-based models readily available, such as the widely used BERT. BERT can be fine-tuned on your labeled data or on data labeled with the GPT prompt itself. It is more lightweight and faster, making it a good candidate for a high-volume moderation service. Learn how OpenAI approaches its integration with GPT.
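
As a rough sketch of what serving such a model could look like (the model name below is a placeholder for a checkpoint you fine-tune yourself, not an existing drop-in spam detector), the Hugging Face transformers pipeline API keeps inference to a few lines:

py
from transformers import pipeline

# "your-org/bert-spam-classifier" is a placeholder for your own fine-tuned checkpoint
classifier = pipeline("text-classification", model="your-org/bert-spam-classifier")

result = classifier("Congratulations! You have won a free cruise. Reply WIN to claim.")[0]
print(result["label"], result["score"])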

Conclusion

In this blog post, we have demonstrated how to improve a basic spam detection model using prompt engineering, few-shot learning, parameter tuning, and larger models. These techniques enhance accuracy and reliability, making the LLM more effective for content moderation. Try implementing these improvements in your app!
