Every item in your queue was flagged for a reason. Stream’s moderation system combines AI-powered engines, blocklists, and regex filters to detect potential harms. As a moderator, your role is to understand what triggered the flag, review the details, and decide whether to confirm or override the system’s decision.
AI-Powered Categories
The most advanced triggers come from AI models. Unlike blocklists or regex, which look for exact words or patterns, AI uses context, intent, and nuance to classify harmful content.
Stream provides three AI engines you’ll work with most often: Text, Image, and Video.
AI Text Moderation
AI Text Moderation reviews messages, posts, and comments in real time. Unlike simple keyword filters, it evaluates tone, intent, and the last ten messages in the conversation to decide whether something is harmful.
Out-of-the-Box Harm Categories
When you start with AI Text, Stream provides five preconfigured harm categories you’ll see often:
- Scam: Fraudulent content, phishing attempts, or deceptive practices.
- Sexual Harassment: Unwanted sexual advances, comments, or behavior.
- Hate Speech: Content that promotes hatred, discrimination, or violence against groups.
- Personally Identifiable Information (PII): Personal information like phone numbers, addresses, or private details.
- Platform Bypass: Content that attempts to circumvent platform moderation systems.
These categories cover common high-risk harms, but Admins may add custom categories specific to your community (e.g., Gambling, Self-Harm, Child Safety).
Severity Levels (Text Harms)
For each flagged text message, the AI assigns a severity level:
- Low: Minor or borderline violations (e.g., mild insults, harmless spam).
- Medium: More serious issues requiring review (e.g., targeted harassment, repetitive spam).
- High: Clear policy violations with greater impact (e.g., graphic threats, aggressive scams).
- Critical: Zero-tolerance harms requiring immediate enforcement (e.g., child exploitation, terrorist threats, self-harm urgencies).
Moderators should use severity as a triage guide: Critical items should be reviewed and resolved first.
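To make that triage order concrete, here is a minimal Python sketch of sorting a review queue so Critical items surface first. It is illustrative only: the `severity` and `flagged_at` fields are hypothetical and not Stream’s actual queue schema.

```python
# Illustrative sketch only: field names ("severity", "flagged_at") are hypothetical.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def triage(queue_items):
    """Sort flagged items so the most severe (and oldest) appear first."""
    return sorted(
        queue_items,
        key=lambda item: (SEVERITY_ORDER[item["severity"]], item["flagged_at"]),
    )

queue = [
    {"id": "msg_1", "severity": "low", "flagged_at": "2024-05-01T10:02:00Z"},
    {"id": "msg_2", "severity": "critical", "flagged_at": "2024-05-01T10:05:00Z"},
    {"id": "msg_3", "severity": "high", "flagged_at": "2024-05-01T09:58:00Z"},
]

for item in triage(queue):
    print(item["id"], item["severity"])
# msg_2 critical, then msg_3 high, then msg_1 low
```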
AI Image Moderation
Stream’s AI Image engine analyzes uploaded images for unsafe or inappropriate content. This goes beyond metadata, looking at the actual visual content to detect risks.
Supported Harms for Images
You’ll commonly see images flagged for:
- Explicit: Sexual activity or explicit nudity.
- Non-Explicit Nudity: Intimate parts or kissing.
- Swimwear or Underwear: Revealing but non-explicit attire.
- Violence: Depictions of fighting, assault, or weapons.
- Visually Disturbing: Blood, gore, or other graphic material.
- Drugs & Tobacco: Use or promotion of controlled substances.
- Alcohol: Depictions of drinking or alcohol branding.
- Rude Gestures: Offensive hand signals or symbols.
- Gambling: Cards, dice, slot machines, or related promotion.
- Hate Symbols: Flags, signs, or imagery tied to hate groups.
- Personally Identifiable Information (PII): IDs, credit cards, or sensitive screenshots.
- QR Codes: Often flagged because they can contain unmoderated links.
Confidence Scores (Image Harms)
Unlike text moderation, which uses severity, image moderation assigns a confidence score (0–100). This represents how certain the AI is that the image contains the detected harm.
- High confidence (95%+): Very likely to be correct. Often paired with strict actions like Block.
- Medium confidence (70–94%): May require human review. Usually flagged, not blocked outright.
- Low confidence (<70%): Could be a false positive. Often safe to let through unless policy is strict.
Example: An image flagged for “Explicit Nudity” at 98% confidence is almost certainly a violation. An image flagged for “Violence” at 72% confidence might just be a boxing match or movie scene.
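If it helps to see the thresholds as logic, here is a minimal sketch (plain Python, not Stream’s implementation) that maps a confidence score to a suggested handling path using the bands above.

```python
# Minimal sketch, not Stream's implementation: maps an image-harm confidence
# score (0-100) to a suggested handling path using the bands described above.
def suggest_action(confidence: float) -> str:
    if confidence >= 95:
        return "block"            # very likely correct; strict action
    if confidence >= 70:
        return "flag_for_review"  # needs a human look
    return "allow"                # possible false positive; let through unless policy is strict

print(suggest_action(98))  # block
print(suggest_action(72))  # flag_for_review
print(suggest_action(55))  # allow
```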
AI Video Moderation
Video moderation works like image moderation but scans video frames over time. It detects the same harms (explicit content, nudity, violence, drugs, alcohol, gambling, hate symbols, and so on), but in moving footage rather than a single frame.
Example: A short clip of a bar scene may be flagged for Alcohol at 80% confidence. A video of a fight with blood may be flagged for Violence at 97% confidence.
Moderators should always check the context: Was the video truly harmful, or was it part of harmless entertainment (like a movie scene)?
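For intuition, one common way to extend image moderation to video is to sample frames and keep the highest per-harm confidence across them. The sketch below assumes that approach; `score_frame` is a hypothetical stand-in for an image-moderation call, not a Stream SDK function.

```python
# Illustrative sketch only: frame sampling plus per-frame scoring is one common
# way to extend image moderation to video. score_frame() is a hypothetical
# stand-in for an image-moderation call, not a Stream SDK function.
def score_video(frames, score_frame, sample_every=30):
    """Score every Nth frame and keep the highest confidence per harm label."""
    worst = {}
    for i, frame in enumerate(frames):
        if i % sample_every != 0:
            continue
        for label, confidence in score_frame(frame).items():
            worst[label] = max(worst.get(label, 0), confidence)
    return worst  # e.g. {"Violence": 97, "Alcohol": 80}
```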
Semantic Filters
Semantic filters flag content based on meaning, not exact words or simple patterns. Rather than matching only a specific phrase like “buy now,” they also catch paraphrases that mean the same thing (e.g., “purchase immediately,” “acquire right away”). This helps surface harmful or unwanted content even when users avoid exact blocklisted terms.
Admins build semantic filters by giving the system seed phrases that capture the essence of what should be caught (e.g., phishing recruitment, aggressive sales solicitations, grooming cues). The system then generalizes around those examples to detect semantically similar messages. In your queue, you’ll simply see that a Semantic Filter flagged the item with the related label.
What you’ll see & how to review:
- Items flagged for intent, not just wording.
- Messages that don’t match exact blocklist terms but clearly mean the same thing.
- Fewer false positives than blunt blocklists, but you should still review context (tone, audience, ongoing thread).
- Use conversation history to confirm whether the meaning truly fits the filter’s purpose.
Examples you might encounter:
- Commercial pressure: “Grab this deal before it’s gone—act now,” flagged as a sales/solicitation semantic match (vs. exact “buy now”).
- Recruitment/grooming cues: “Let’s keep talking somewhere else,” flagged when seeds target off-platform redirection risks.
- Phishing style: “Verify your account details to avoid suspension,” flagged even without known scam keywords.
Semantic filters are designed to catch evasive phrasing and coded language that slip past blocklists and regex. When these trigger, weigh intent and context: Is the user genuinely pressuring, recruiting, or deceiving, or is it benign? If you see patterns of over- or under-flagging, leave a note so Admins can adjust the seed phrases or add clarifying examples.
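Conceptually, semantic filtering can be thought of as comparing a message’s embedding against seed-phrase embeddings. The sketch below illustrates that idea only; `embed` is a placeholder for any sentence-embedding model, the 0.8 threshold is an arbitrary example value, and none of this reflects Stream’s internal implementation.

```python
# Conceptual sketch, not Stream's implementation. embed() is a hypothetical
# placeholder for any sentence-embedding model; 0.8 is an arbitrary threshold.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_match(message, seed_phrases, embed, threshold=0.8):
    """Flag a message if it is semantically close to any seed phrase."""
    msg_vec = embed(message)
    return any(cosine(msg_vec, embed(seed)) >= threshold for seed in seed_phrases)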
Blocklist Matches
Blocklists are the simplest form of automated moderation: they detect exact word or phrase matches, regardless of context. This makes them a blunt but reliable tool for enforcing “hard no” terms in your community.
How They Work
- Admins define a list of banned words or phrases.
- When a user posts content containing one of those terms, the system flags it.
- Depending on configuration, the system may mask the word (replace it with ****), flag the message for review, or block it entirely.
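As a rough illustration of the masking behavior, here is a minimal Python sketch of exact-word matching with asterisk replacement. It is not Stream’s implementation, and the words in the example blocklist are placeholders.

```python
import re

# Minimal sketch of exact-word blocklist masking (not Stream's implementation).
BLOCKLIST = ["idiot", "stupid"]
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, BLOCKLIST)) + r")\b", re.IGNORECASE)

def mask(text: str) -> str:
    """Replace each blocklisted word with asterisks, keeping the rest intact."""
    return PATTERN.sub(lambda m: "*" * len(m.group()), text)

print(mask("My computer is being stupid"))  # "My computer is being ******"
```

Note that the match is purely lexical, which is exactly why harmless uses of a blocked word still end up in your queue.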
Where Blocklists Are Used
- Profanity Filtering: Prevent offensive terms from appearing in chat.
- Compliance Terms: Ban restricted words (e.g., financial services requiring certain disclosures).
- Cultural Sensitivities: Keep community guidelines aligned with local or brand standards.
Limitations Moderators Should Know
- No Context: Blocklists don’t know intent. “You’re stupid” and “My computer is being stupid” both get flagged.
- Case Insensitivity: “Idiot” and “IDIOT” are treated the same, but creative spellings (e.g., “1d!ot”) may slip through unless regex is used.
- Over-Blocking: Harmless mentions of a blocked word (e.g., “This movie was sick”) can end up in the queue unnecessarily.
Always check the surrounding context of blocklist flags. What looks offensive out of context may be harmless in the full conversation.
Regex Filters
Regex (regular expression) filters are more flexible than blocklists. Instead of matching exact words, they look for patterns in content, which makes them powerful for catching spam and evasion.
How They Work
- Admins write a regex pattern (a rule that describes text structure).
- If content matches the pattern, it’s flagged, and an action is applied.
- Regex can detect things like URLs, phone numbers, repeated characters, or obfuscations.
Examples of Regex Patterns
- Phone Numbers: \d{3}-\d{3}-\d{4} (flags “555-123-4567”)
- Spam Links: (http|https):\/\/[^\s]+ (flags any URL)
- Repetition: (ha){5,} (flags “hahahahahaha” as spam)
- Obfuscation: fr[e3]{2}\s*m[o0]n[e3]y (flags “fr33 m0ney”).
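To see these patterns in action, the short Python snippet below runs them against sample messages with the standard `re` module (the escaped slashes in the URL pattern are dropped because Python raw strings don’t need them).

```python
import re

# The patterns listed above, tried against sample messages with Python's re module.
patterns = {
    "phone_number": r"\d{3}-\d{3}-\d{4}",
    "spam_link": r"(http|https)://\S+",
    "repetition": r"(ha){5,}",
    "obfuscation": r"fr[e3]{2}\s*m[o0]n[e3]y",
}

samples = [
    "Call me at 555-123-4567",
    "Check out https://example.com/deal",
    "hahahahahaha",
    "get fr33 m0ney here",
]

for text in samples:
    hits = [name for name, pat in patterns.items() if re.search(pat, text, re.IGNORECASE)]
    print(text, "->", hits or ["no match"])
```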
Limitations Moderators Should Know
- Complexity: Regex patterns can be confusing or overly broad, leading to false positives.
- Performance: Very broad regex rules can slow down moderation if misconfigured.
- Over-Capture: Regex might flag content that looks like a violation but isn’t (e.g., a joke containing numbers that accidentally matches a phone number pattern).
If you see repeated false positives from regex flags, leave a note in the queue for your Admin. Regex rules can be fine-tuned to reduce noise.
The table below summarizes how blocklists and regex compare:

| Feature | Blocklist | Regex |
|---|---|---|
| Match Type | Exact word/phrase | Pattern (flexible structure) |
| Good For | Profanity, banned terms | Spam, evasion, structured violations |
| Context Awareness | None | Limited (depends on pattern) |
| False Positives | High (no nuance) | Moderate (depends on pattern precision) |
| Moderator Role | Check context carefully | Confirm pattern actually signals harm |
Why Moderators Need to Understand Triggers
Knowing why content was flagged helps you:
- Judge whether the flag was accurate (false positive vs. real violation).
- Apply the right action (e.g., escalate a Critical harm vs. dismiss a harmless match).
- Give feedback to Admins, helping them adjust rules, prompts, and thresholds.
Now that you know the different triggers that bring content into your queue, let’s take a closer look at how to review flagged content in detail, exploring the context, metadata, and classification information that support your decisions.