
Moderation Certification Course

Understanding moderation harm labels and descriptions

This lesson explains how moderation harm labels and descriptions work in Stream. It covers how to define categories, create clear prompts, and use AI to detect harmful content such as bullying, self-harm, scams, and hate speech. You’ll also learn best practices for writing precise labels, testing and refining prompts, and building an effective moderation framework that reduces false positives while protecting your community.

What Are Moderation Categories and Labels?

Stream’s AI LLM Text engine classifies harmful content using labels (the type of harm) and descriptions (the plain-language prompts that tell the AI what each harm looks like). Instead of relying on static keywords, it understands context, intent, and conversation history, making it far more accurate than traditional filters.

Admins define categories in plain language, while the AI applies them consistently across all user-generated content.

How Harm Labels Work

Harm labels are the building blocks of moderation. Each label represents a single harm type you want the AI to detect.

Examples include:

  • Bullying → Insults, threats, or repeated targeting of another user
  • Self Harm → Expressions of suicidal thoughts or encouragement of self-harm
  • Child Safety → Sexual content involving minors or attempts to exploit children
  • Terrorism → Glorification of terrorism or recruitment messaging

You can use Stream’s starter library (Scam, Hate Speech, Sexual Harassment, PII, Platform Bypass, etc.) or define your own custom harms.

Labels and Prompts

Each category is made up of two parts:

  • Harm Label → A short, descriptive name (e.g., “Bullying,” “Self Harm”)
  • Prompt → A clear instruction written in plain, directive language that tells the AI what to look for.

Good Prompt Examples:

  • “Messages where recruiters are trying to headhunt or share job descriptions.”
  • “Fraudulent content, phishing attempts, or deceptive practices.”
  • “Content that promotes hatred, discrimination, or violence against groups.”
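
To make the label + prompt pairing concrete, here is a minimal sketch of how you might keep these definitions in code. The `HarmCategory` shape, and the label names attached to the example prompts, are illustrative assumptions for this lesson, not Stream’s API schema.

```ts
// Hypothetical shape for a moderation category: one harm label, one prompt.
// This mirrors the label + prompt pairing described above, not Stream's API schema.
interface HarmCategory {
  label: string;  // short, descriptive name ("Bullying", "Self Harm")
  prompt: string; // plain, directive instruction telling the AI what to look for
}

const categories: HarmCategory[] = [
  {
    label: "Recruitment Spam", // assumed label name for the first example prompt
    prompt: "Messages where recruiters are trying to headhunt or share job descriptions.",
  },
  {
    label: "Scams & Fraud",
    prompt: "Fraudulent content, phishing attempts, or deceptive practices.",
  },
  {
    label: "Hate Speech",
    prompt: "Content that promotes hatred, discrimination, or violence against groups.",
  },
];
```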

Best Practices:

  • Write each prompt as a direct command.
  • Keep one harm per label.
  • Avoid vague terms like “bad” or “offensive.”
  • Split broad harms into narrower labels (e.g., “bullying” vs. “sexual harassment”).
  • Avoid action terms like “flag” or “block.”

Sample Categories, Labels, and Prompts

| Harm Label | Prompt Example | Notes |
| --- | --- | --- |
| Bullying | Messages where a user insults, threatens, or repeatedly targets another person. | Keep focused on direct attacks toward a user. |
| Self Harm | Messages where a user expresses intent to harm themselves or encourages others to do so. | Capture suicidal or self-harm expressions. |
| Child Safety | Sexual content involving minors or attempts to exploit children. | Critical category; always block. |
| Sexual Content | Messages that contain sexual advances, harassment, or coercion. | Separate from general nudity to reduce false positives. |
| Hate Speech | Content that promotes hatred, discrimination, or violence against groups. | Include examples of protected groups in your policy context. |
| Scams & Fraud | Messages that attempt to deceive users, conduct phishing, or promote fraud. | Especially important for marketplaces and gaming. |
| Terrorism | Content that glorifies terrorism, violent extremism, or attempts to recruit members. | Sensitive category; review at high priority. |
| Platform Bypass | Attempts to evade moderation by using obfuscation, slang, or deliberate misspellings. | Use alongside regex for best coverage. |
| PII | Messages that share personally identifiable information such as phone numbers or emails. | Use regex + semantic filter to catch variations. |
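
The notes for Platform Bypass and PII recommend pairing the semantic label with regex. Below is a minimal sketch of that idea, assuming a simple regex pre-check that runs alongside the AI label; the patterns are simplified examples, not a complete or recommended rule set.

```ts
// Illustrative regex pre-check for obvious PII, meant to run alongside the
// semantic "PII" label rather than replace it. Patterns here are simplified.
const piiPatterns: RegExp[] = [
  /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/, // North American phone numbers
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,       // email addresses
];

function containsObviousPii(text: string): boolean {
  return piiPatterns.some((pattern) => pattern.test(text));
}

// The regex catches the literal email, while the semantic label is still
// needed for obfuscated variants like "john dot doe at example dot com".
console.log(containsObviousPii("Reach me at john.doe@example.com")); // true
console.log(containsObviousPii("john dot doe at example dot com"));  // false
```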

Using Context for Accuracy

The AI considers more than just the message itself; it also takes into account:

  • App context → A description of your platform and audience.
  • Conversation history → The last few messages, to catch split or ongoing harms.
  • Intent → Whether the message is actually harmful or just playful language.

This reduces false positives (e.g., catching “I want to kill myself” but not “That movie kills me”).
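
As a sketch of what “context” can mean in practice, here is one hypothetical shape for a context-aware check. The `ModerationInput` type and its field names are assumptions for illustration; Stream’s actual request format may differ.

```ts
// Hypothetical input to a context-aware moderation check. The field names are
// illustrative; the point is that the message travels with its context.
interface ModerationInput {
  message: string;
  appContext: string;       // description of your platform and audience
  recentMessages: string[]; // the last few messages, to catch split or ongoing harms
}

const input: ModerationInput = {
  message: "That movie kills me",
  appContext: "Casual community app where adults discuss movies and TV shows.",
  recentMessages: [
    "Did you see the new comedy this weekend?",
    "Yes! The theater was packed.",
  ],
};
// With this context, an LLM-based check can read the message as playful
// language rather than an expression of self-harm.
```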

Confidence & Severity

For each harm detected, the AI provides:

  • Confidence Score → A percentage indicating how confident the AI is that the content matches the category. Admins can set thresholds (e.g., only block if confidence is above 95%).
  • Severity Level → A rating of how harmful the content is (low, medium, high, critical). This helps prioritize review queues.
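
One way to picture how these two signals combine is a simple threshold rule, as in the sketch below. The 95% block threshold mirrors the example above, and the severity values come from the list, but the `decide` function itself is an illustrative assumption, not Stream’s built-in action configuration.

```ts
type Severity = "low" | "medium" | "high" | "critical";
type Action = "allow" | "review" | "block";

// Illustrative policy: block only when the AI is very confident or the harm is
// critical; otherwise route uncertain or serious cases to human review.
function decide(confidence: number, severity: Severity): Action {
  if (severity === "critical") return "block";
  if (confidence > 0.95) return "block";
  if (confidence > 0.7 || severity === "high") return "review";
  return "allow";
}

console.log(decide(0.97, "medium")); // "block"  – high confidence
console.log(decide(0.8, "low"));     // "review" – uncertain, send to the queue
console.log(decide(0.4, "low"));     // "allow"
```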

Debugging & Iterating on Labels

Sometimes your prompts won’t work as intended. To refine them, use this checklist:

  • Clarity: Is the prompt direct and unambiguous?
  • Scope: Does the label cover just one harm?
  • Context: Did you include app-specific norms (teen chat vs. pro workplace)?
  • Testing: Have you tested on real conversations?

If the AI flags harmless jokes or misses key harms, adjust prompts one change at a time and retest.
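
A lightweight way to run that testing step is to keep a handful of labeled messages and rerun them after every prompt change. The sketch below assumes a hypothetical `classify` function that calls your moderation engine; it shows the testing pattern, not Stream’s tooling.

```ts
// Hypothetical classifier returning the harm labels detected for a message.
// In practice this would call your moderation engine; here it is a stub.
type Classify = (message: string) => Promise<string[]>;

interface TestCase {
  message: string;
  expectedLabels: string[]; // labels a human reviewer says should fire
}

const testCases: TestCase[] = [
  { message: "I want to kill myself", expectedLabels: ["Self Harm"] },
  { message: "That movie kills me", expectedLabels: [] }, // harmless joke
];

async function evaluate(classify: Classify): Promise<void> {
  for (const test of testCases) {
    const actual = await classify(test.message);
    const missed = test.expectedLabels.filter((l) => !actual.includes(l));
    const falsePositives = actual.filter((l) => !test.expectedLabels.includes(l));
    console.log({ message: test.message, missed, falsePositives });
  }
}

// Example: plug in a stub classifier while wiring up the harness.
void evaluate(async (message) => (message.includes("kill myself") ? ["Self Harm"] : []));
```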

Building Over Time

The best approach is iterative:

  1. Start simple with the most critical harms.
  2. Test and refine using real examples.
  3. Split labels when they’re too broad.
  4. Document and version prompts for your team.
  5. Share proven prompts in an internal library to keep consistency.
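
Documenting and versioning prompts (steps 4 and 5) can be as simple as keeping every revision in a shared record, as in this sketch; the `PromptVersion` shape is an illustrative assumption, not a Stream feature.

```ts
// Illustrative record for a shared, versioned prompt library.
interface PromptVersion {
  label: string;
  prompt: string;
  version: number;
  changeNote: string; // why the prompt was changed
}

const bullyingHistory: PromptVersion[] = [
  {
    label: "Bullying",
    prompt: "Messages that are mean to other users.",
    version: 1,
    changeNote: "Initial draft; too vague, flagged friendly banter.",
  },
  {
    label: "Bullying",
    prompt: "Messages where a user insults, threatens, or repeatedly targets another person.",
    version: 2,
    changeNote: "Narrowed to direct attacks toward a specific user.",
  },
];
```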

Why This Matters

Clear, well-written categories and labels:

  • Make AI moderation more consistent and accurate.
  • Reduce false positives and missed harms.
  • Build trust in your community by enforcing clear, fair standards.

Next, we’ll explore moderation actions: how the system responds once harmful content is detected, and how Admins can configure the right balance between automation and human review.