
Moderation Certification Course

Understanding moderation harm labels and descriptions

This lesson explains how moderation harm labels and descriptions work in Stream. It covers how to define categories, create clear prompts, and use AI to detect harmful content such as bullying, self-harm, scams, and hate speech. You’ll also learn best practices for writing precise labels, testing and refining prompts, and building an effective moderation framework that reduces false positives while protecting your community.

What Are Moderation Categories and Labels?

Stream’s AI LLM Text engine classifies harmful content using labels (the type of harm) and descriptions (the plain-language prompts that tell the AI what each harm looks like). Instead of relying on static keywords, it understands context, intent, and conversation history, making it far more accurate than traditional filters.

Admins define categories in plain language, while the AI applies them consistently across all user-generated content.

How Harm Labels Work

Harm labels are the building blocks of moderation. Each label represents a single harm type you want the AI to detect.

Examples include:

  • Bullying → Insults, threats, or repeated targeting of another user
  • Self Harm → Expressions of suicidal thoughts or encouragement of self-harm
  • Child Safety → Sexual content involving minors or attempts to exploit children
  • Terrorism → Glorification of terrorism or recruitment messaging

You can use Stream’s starter library (Scam, Hate Speech, Sexual Harassment, PII, Platform Bypass, etc.) or define your own custom harms.

Labels and Prompts

Each category is made up of two parts:

  • Harm Label → A short, descriptive name (e.g., “Bullying,” “Self Harm”)
  • Prompt → A clear instruction written in plain, directive language that tells the AI what to look for.

Good Prompt Examples:

  • “Messages where recruiters are trying to headhunt or share job descriptions.”
  • “Fraudulent content, phishing attempts, or deceptive practices.”
  • “Content that promotes hatred, discrimination, or violence against groups.”
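
To make the label + prompt pairing concrete, here is a minimal sketch of how you might keep these definitions in code. The `HarmCategory` shape, and the label names attached to the example prompts, are illustrative assumptions for this lesson, not Stream’s API schema.

```ts
// Hypothetical shape for a moderation category: one harm label, one prompt.
// This mirrors the label + prompt pairing described above, not Stream's API schema.
interface HarmCategory {
  label: string;  // short, descriptive name ("Bullying", "Self Harm")
  prompt: string; // plain, directive instruction telling the AI what to look for
}

const categories: HarmCategory[] = [
  {
    label: "Recruitment Spam", // assumed label name for the first example prompt
    prompt: "Messages where recruiters are trying to headhunt or share job descriptions.",
  },
  {
    label: "Scams & Fraud",
    prompt: "Fraudulent content, phishing attempts, or deceptive practices.",
  },
  {
    label: "Hate Speech",
    prompt: "Content that promotes hatred, discrimination, or violence against groups.",
  },
];
```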

Best Practices:

  • Write each prompt as a direct command.
  • Keep one harm per label.
  • Avoid vague terms like “bad” or “offensive.”
  • Split broad harms into narrower labels (e.g., “bullying” vs. “sexual harassment”).
  • Avoid action terms like “flag” or “block.”

Sample Categories, Labels, and Prompts

| Harm Label | Prompt Example | Notes |
| --- | --- | --- |
| Bullying | Messages where a user insults, threatens, or repeatedly targets another person. | Keep focused on direct attacks toward a user. |
| Self Harm | Messages where a user expresses intent to harm themselves or encourages others to do so. | Capture suicidal or self-harm expressions. |
| Child Safety | Sexual content involving minors or attempts to exploit children. | Critical category; always block. |
| Sexual Content | Messages that contain sexual advances, harassment, or coercion. | Separate from general nudity to reduce false positives. |
| Hate Speech | Content that promotes hatred, discrimination, or violence against groups. | Include examples of protected groups in your policy context. |
| Scams & Fraud | Messages that attempt to deceive users, conduct phishing, or promote fraud. | Especially important for marketplaces and gaming. |
| Terrorism | Content that glorifies terrorism, violent extremism, or attempts to recruit members. | Sensitive category; review at high priority. |
| Platform Bypass | Attempts to evade moderation by using obfuscation, slang, or deliberate misspellings. | Use alongside regex for best coverage. |
| PII | Messages that share personally identifiable information such as phone numbers or emails. | Use regex + semantic filter to catch variations. |
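
The notes for Platform Bypass and PII recommend pairing the semantic label with regex. Below is a minimal sketch of that idea, assuming a simple regex pre-check that runs alongside the AI label; the patterns are simplified examples, not a complete or recommended rule set.

```ts
// Illustrative regex pre-check for obvious PII, meant to run alongside the
// semantic "PII" label rather than replace it. Patterns here are simplified.
const piiPatterns: RegExp[] = [
  /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/, // North American phone numbers
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,       // email addresses
];

function containsObviousPii(text: string): boolean {
  return piiPatterns.some((pattern) => pattern.test(text));
}

// The regex catches the literal email, while the semantic label is still
// needed for obfuscated variants like "john dot doe at example dot com".
console.log(containsObviousPii("Reach me at john.doe@example.com")); // true
console.log(containsObviousPii("john dot doe at example dot com"));  // false
```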

Using Context for Accuracy

The AI considers more than just the message itself; it also takes into account:

  • App context → A description of your platform and audience.
  • Conversation history → The last few messages, to catch split or ongoing harms.
  • Intent → Whether the message is actually harmful or just playful language.

This reduces false positives (e.g., catching “I want to kill myself” but not “That movie kills me”).
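
As a sketch of what “context” can mean in practice, here is one hypothetical shape for a context-aware check. The `ModerationInput` type and its field names are assumptions for illustration; Stream’s actual request format may differ.

```ts
// Hypothetical input to a context-aware moderation check. The field names are
// illustrative; the point is that the message travels with its context.
interface ModerationInput {
  message: string;
  appContext: string;       // description of your platform and audience
  recentMessages: string[]; // the last few messages, to catch split or ongoing harms
}

const input: ModerationInput = {
  message: "That movie kills me",
  appContext: "Casual community app where adults discuss movies and TV shows.",
  recentMessages: [
    "Did you see the new comedy this weekend?",
    "Yes! The theater was packed.",
  ],
};
// With this context, an LLM-based check can read the message as playful
// language rather than an expression of self-harm.
```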

Confidence & Severity

For each harm detected, the AI provides:

  • Confidence Score → A percentage indicating how confident the AI is that the content matches the category. Admins can set thresholds (e.g., only block if confidence is above 95%).
  • Severity Level → A rating of how harmful the content is (low, medium, high, critical). This helps prioritize review queues.
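
One way to picture how these two signals combine is a simple threshold rule, as in the sketch below. The 95% block threshold mirrors the example above, and the severity values come from the list, but the `decide` function itself is an illustrative assumption, not Stream’s built-in action configuration.

```ts
type Severity = "low" | "medium" | "high" | "critical";
type Action = "allow" | "review" | "block";

// Illustrative policy: block only when the AI is very confident or the harm is
// critical; otherwise route uncertain or serious cases to human review.
function decide(confidence: number, severity: Severity): Action {
  if (severity === "critical") return "block";
  if (confidence > 0.95) return "block";
  if (confidence > 0.7 || severity === "high") return "review";
  return "allow";
}

console.log(decide(0.97, "medium")); // "block"  – high confidence
console.log(decide(0.8, "low"));     // "review" – uncertain, send to the queue
console.log(decide(0.4, "low"));     // "allow"
```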

Debugging & Iterating on Labels

Sometimes your prompts won’t work as intended. To refine them, use this checklist:

  • Clarity: Is the prompt direct and unambiguous?
  • Scope: Does the label cover just one harm?
  • Context: Did you include app-specific norms (teen chat vs. pro workplace)?
  • Testing: Have you tested on real conversations?

If the AI flags harmless jokes or misses key harms, adjust prompts one change at a time and retest.
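
A lightweight way to run that testing step is to keep a handful of labeled messages and rerun them after every prompt change. The sketch below assumes a hypothetical `classify` function that calls your moderation engine; it shows the testing pattern, not Stream’s tooling.

```ts
// Hypothetical classifier returning the harm labels detected for a message.
// In practice this would call your moderation engine; here it is a stub.
type Classify = (message: string) => Promise<string[]>;

interface TestCase {
  message: string;
  expectedLabels: string[]; // labels a human reviewer says should fire
}

const testCases: TestCase[] = [
  { message: "I want to kill myself", expectedLabels: ["Self Harm"] },
  { message: "That movie kills me", expectedLabels: [] }, // harmless joke
];

async function evaluate(classify: Classify): Promise<void> {
  for (const test of testCases) {
    const actual = await classify(test.message);
    const missed = test.expectedLabels.filter((l) => !actual.includes(l));
    const falsePositives = actual.filter((l) => !test.expectedLabels.includes(l));
    console.log({ message: test.message, missed, falsePositives });
  }
}

// Example: plug in a stub classifier while wiring up the harness.
void evaluate(async (message) => (message.includes("kill myself") ? ["Self Harm"] : []));
```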

Building Over Time

The best approach is iterative:

  1. Start simple with the most critical harms.
  2. Test and refine using real examples.
  3. Split labels when they’re too broad.
  4. Document and version prompts for your team.
  5. Share proven prompts in an internal library to keep consistency.
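
Documenting and versioning prompts (steps 4 and 5) can be as simple as keeping every revision in a shared record, as in this sketch; the `PromptVersion` shape is an illustrative assumption, not a Stream feature.

```ts
// Illustrative record for a shared, versioned prompt library.
interface PromptVersion {
  label: string;
  prompt: string;
  version: number;
  changeNote: string; // why the prompt was changed
}

const bullyingHistory: PromptVersion[] = [
  {
    label: "Bullying",
    prompt: "Messages that are mean to other users.",
    version: 1,
    changeNote: "Initial draft; too vague, flagged friendly banter.",
  },
  {
    label: "Bullying",
    prompt: "Messages where a user insults, threatens, or repeatedly targets another person.",
    version: 2,
    changeNote: "Narrowed to direct attacks toward a specific user.",
  },
];
```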

Why This Matters

Clear, well-written categories and labels:

  • Make AI moderation more consistent and accurate.
  • Reduce false positives and missed harms.
  • Build trust in your community by enforcing clear, fair standards.

Next, we’ll explore moderation actions: how the system responds once harmful content is detected, and how Admins can configure the right balance between automation and human review.