Advanced Moderation

LAST EDIT Nov 16 2021

Advanced moderation gives Stream Chat a powerful machine learning model that can scan user messages for spammy, explicit, or toxic content and either block or flag that content for moderation follow-up.

The model assigns every message a confidence score between 0 and 1 in each of three categories:

  • Spam: Repetitive content from the same user.

  • Explicit: Vulgar and sexually explicit language.

  • Toxic: Hate speech and abusive language.

After enabling advanced moderation, you can customize it in the Chat Overview > Advanced Moderation settings. In this view, you can adjust the sensitivity and severity of advanced moderation by:

  • Altering the thresholds for each category of objectionable content.

  • Telling the system when to flag and when to block messages based on the scores provided by the moderation model.

An image of the Advanced Moderation settings page.

The previous screenshot shows a set of thresholds that may be appropriate for a community of adults. With these settings, explicit language will not be moderated, but toxic speech and spam will. The flag threshold for toxic speech (0.65) is lower than the flag threshold for spam (0.75), so the moderator is more aggressive about flagging potentially toxic messages for review than spam. The block thresholds for spam and toxic speech are both set to 0.95, so messages in either category are blocked at the same point: when the model is 95% confident that the message is spam or toxic. Explicit content is set to never be flagged and never be blocked.
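The flag/block settings described above can be sketched in Python using the adult-community example values. This is a minimal illustration, not Stream's implementation: the function names, the dictionary layout, treating a score equal to a threshold as reaching it, letting block take precedence over flag, and giving a message the most severe action across its categories are all assumptions.

```python
from typing import Dict, Optional

# Thresholds from the adult-community example; None means "never".
ADULT_COMMUNITY: Dict[str, Dict[str, Optional[float]]] = {
    "spam":     {"flag": 0.75, "block": 0.95},
    "toxic":    {"flag": 0.65, "block": 0.95},
    "explicit": {"flag": None, "block": None},
}

SEVERITY = {"allow": 0, "flag": 1, "block": 2}


def category_action(score: float, flag_at: Optional[float], block_at: Optional[float]) -> str:
    """Map one category's 0-1 score to "allow", "flag", or "block"."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    # Assumption: blocking wins when both thresholds are reached, and a
    # score equal to a threshold counts as reaching it.
    if block_at is not None and score >= block_at:
        return "block"
    if flag_at is not None and score >= flag_at:
        return "flag"
    return "allow"


def message_action(scores: Dict[str, float], thresholds=ADULT_COMMUNITY) -> str:
    """Assumption: a message gets the most severe action of any category."""
    actions = [
        category_action(scores.get(category, 0.0), t["flag"], t["block"])
        for category, t in thresholds.items()
    ]
    return max(actions, key=SEVERITY.__getitem__)


print(message_action({"spam": 0.10, "toxic": 0.70, "explicit": 0.99}))  # flag
print(message_action({"spam": 0.97, "toxic": 0.20, "explicit": 0.00}))  # block
```

Because explicit content uses `None` for both thresholds, even a score of 0.99 in that category leaves the first message at "flag", driven only by its toxic score.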

Testing and Calibration

To see how the model scores messages, use the “Test Messages” button in the top right corner of the advanced moderation settings panel (on the Channel Type settings screen). Clicking this button opens a modal window where you can enter as many test messages as you like and see the score the model gives each message in each category. The results are color-coded to show how each message would be handled under your currently configured thresholds.

As the example above shows, it’s not uncommon for a message to score highly in more than one category. In the lewd message example, the message not only contains explicit language but is also phrased as an attack on someone or something, so it has a high toxic score as well. (Technically speaking, this system is a `multi-label classifier`, not a `multi-class classifier`: each category is scored independently, so one message can score highly in several categories, and the scores do not need to add up to 1.)
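To illustrate the difference, the sketch below turns the same three raw model outputs (logits) into scores in two ways. The logit values are made up for illustration and do not come from Stream's model: a multi-label classifier applies an independent sigmoid to each category, while a multi-class classifier applies a softmax that forces the categories to share a total probability of 1.

```python
import math

LABELS = ("spam", "toxic", "explicit")


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def multi_label_scores(logits):
    # Multi-label: an independent sigmoid per category.
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}


def multi_class_scores(logits):
    # Multi-class: softmax, so categories share one unit of probability.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}


# Made-up raw outputs for a lewd message phrased as an attack.
logits = [-3.0, 2.5, 3.0]
ml = multi_label_scores(logits)
mc = multi_class_scores(logits)
print({k: round(v, 2) for k, v in ml.items()})  # spam ~0.05, toxic ~0.92, explicit ~0.95
print({k: round(v, 2) for k, v in mc.items()})  # spam ~0.0, toxic ~0.38, explicit ~0.62
```

Under the softmax, the toxic and explicit scores pull against each other and neither reaches 0.95; with independent sigmoids, both can be high at once, which matches the behavior described above.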

It may also be useful to note that this advanced moderation system is based on a deep learning algorithm. There are no explicit heuristics deciding how messages are scored; instead, a neural network trained on a large number of labeled examples generates confidence scores for new messages based on the patterns it learned from that training data.

For the curious reader, the model was built as follows:

  1. We use a pre-trained, open source, multilingual DistilBERT model, pre-trained on Wikipedia text in 104 languages.

  2. We selected 16 public datasets containing millions of examples of hate speech, spam, and toxic text.

  3. We added a classification layer on top of the DistilBERT model and trained it on those datasets to classify messages as spam, toxic, and explicit.
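Step 3 can be sketched as a single linear layer that maps a message embedding to one logit per category, followed by an independent sigmoid per category and a binary cross-entropy loss, which is a common setup for multi-label classification. This is a pure-Python illustration under stated assumptions, not Stream's training code: the random vector stands in for a pooled DistilBERT embedding (DistilBERT's hidden size is 768), and `MultiLabelHead`, `bce_loss`, and `sgd_step` are hypothetical names.

```python
import math
import random

LABELS = ("spam", "toxic", "explicit")
HIDDEN = 768  # DistilBERT's hidden size


class MultiLabelHead:
    """Hypothetical linear classification layer with one sigmoid per label."""

    def __init__(self, hidden: int, n_labels: int, seed: int = 0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(hidden)
        self.w = [[rng.uniform(-scale, scale) for _ in range(hidden)]
                  for _ in range(n_labels)]
        self.b = [0.0] * n_labels

    def logits(self, embedding):
        # One raw score per label: w_k . x + b_k
        return [sum(w * x for w, x in zip(row, embedding)) + b
                for row, b in zip(self.w, self.b)]

    def scores(self, embedding):
        # Independent sigmoids, so labels do not compete with each other.
        return [1.0 / (1.0 + math.exp(-z)) for z in self.logits(embedding)]

    def sgd_step(self, embedding, targets, lr=0.01):
        # For a sigmoid output, the gradient of that label's BCE term
        # with respect to its logit is simply (p - y).
        for k, (p, y) in enumerate(zip(self.scores(embedding), targets)):
            grad = p - y
            self.w[k] = [w - lr * grad * x for w, x in zip(self.w[k], embedding)]
            self.b[k] -= lr * grad


def bce_loss(probs, targets, eps=1e-7):
    """Binary cross-entropy averaged over labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


rng = random.Random(42)
# Stand-in for a pooled DistilBERT embedding of one message.
embedding = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
targets = [0, 1, 1]  # this message is labeled toxic and explicit, not spam

head = MultiLabelHead(HIDDEN, len(LABELS))
before = bce_loss(head.scores(embedding), targets)
head.sgd_step(embedding, targets)
after = bce_loss(head.scores(embedding), targets)
print(f"loss before: {before:.3f}, after one step: {after:.3f}")
```

In practice a framework would handle this; for example, Hugging Face Transformers can attach a comparable head with `AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=3, problem_type="multi_label_classification")`.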

Exceptions to Advanced Moderation

Advanced moderation applies to most, but not all, messages sent in chat. The following messages are exempt from advanced moderation:

  • Messages sent by moderators and admins are automatically exempt from all advanced moderation and Blocklist flagging/blocking.

  • Messages sent from server-side SDKs are exempt from all moderation.
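These exemption rules can be expressed as a small check. The sketch below is illustrative only: `OutgoingMessage`, its fields, the role strings, and `is_subject_to_advanced_moderation` are hypothetical and are not part of any Stream SDK.

```python
from dataclasses import dataclass

# Hypothetical role names; real role identifiers may differ.
EXEMPT_ROLES = frozenset({"admin", "moderator"})


@dataclass
class OutgoingMessage:
    text: str
    sender_role: str        # hypothetical, e.g. "user", "moderator", "admin"
    sent_server_side: bool  # True if sent through a server-side SDK


def is_subject_to_advanced_moderation(msg: OutgoingMessage) -> bool:
    if msg.sent_server_side:
        return False  # server-side SDK messages skip all moderation
    if msg.sender_role in EXEMPT_ROLES:
        return False  # moderators and admins skip advanced moderation and the Blocklist
    return True


print(is_subject_to_advanced_moderation(OutgoingMessage("hi", "user", False)))       # True
print(is_subject_to_advanced_moderation(OutgoingMessage("hi", "moderator", False)))  # False
```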

Limitations

Advanced moderation is a powerful tool that helps flag offensive content and lets a small team of moderators be more productive than they would be on their own. However, offensive content is contextual and changes over time. While we have tried to make the system as inclusive and comprehensive as possible, we're not the ultimate authority on what counts as objectionable content.

Please don’t hesitate to email us with any concerns about what is and isn’t being caught by our advanced moderation system.