
How to Build Automated Moderation: From Basic Rules to LLMs

12 min read
Raymond F
Published October 29, 2025
How to Build Automated Moderation cover image

"Discord's AutoMod feature is amazing! How did we get by before that was a thing..."

If you've ever set up a community online, even a small one, you'll have seen the bad side of people. "croc122" knows this. It's not just the flame wars and toxic comments that cause problems; it's also the scammers, spam bots, and trolls. Most of your community members may want to play by the rules, but many people don't.

Moderating messages and comments manually then becomes a considerable challenge. You run into problems of:

  • Speed - Harmful content can spread rapidly before moderators can respond, especially during off-hours or when moderators are overwhelmed.

  • Volume - Popular communities generate thousands of posts and comments daily, making it impossible for human moderators to review everything.

  • Context decisions - Determining what should be moderated requires understanding nuance, sarcasm, inside jokes, and cultural references that vary by community.

  • 24/7 coverage - Communities are active around the clock, but volunteer moderators have limited availability, creating gaps in protection.

  • Psychological toll - Moderators face burnout from constant exposure to hate speech, graphic content, and personal attacks, leading to high turnover and mental health impacts.

This all makes content moderation a problem ideally suited for automation. 

What Is Automated Content Moderation?

Automated content moderation uses algorithms to review user-generated content against predefined rules or AI models, taking actions like blocking, removing, or flagging material without human intervention. These systems process text, images, audio, and video in real-time, operating as the first line of defense before human moderators handle complex cases.

The Three Levels of Automated Moderation Technology

Modern AutoMod systems operate at three distinct levels of technical complexity:

1. Rules-Based Filtering

The most straightforward approach uses explicit pattern matching. Reddit's AutoModerator exemplifies this: moderators can write YAML rules that trigger on exact keywords, regular expression patterns, or metadata, such as account age. A typical rule might remove any post containing blacklisted terms or block links from accounts less than 24 hours old.

Rules-based filtering in automated moderation

These systems excel at catching obvious violations but struggle with context. They'll flag "That game was killer!" as violent content, and users can easily evade them with character substitution (h4te instead of hate) or spacing (h a t e).

2. Machine Learning Classification

ML-based systems use supervised learning models trained on labeled datasets to recognize patterns beyond exact matches. The technical approaches vary by content type:

For text moderation, these systems employ:

  • Traditional classifiers (SVM, Random Forest) that use engineered features like n-grams, TF-IDF scores, and syntactic patterns.

  • Deep learning models, such as BERT or RoBERTa, fine-tuned on toxicity datasets, that output probability scores per category (hate speech: 0.92, threat: 0.15).

  • Ensemble methods that combine multiple specialized models, where one detector targets hate speech, another catches spam, and the results are weighted together.

The key limitation is that these models recognize patterns they've seen before. They miss novel slang, emerging memes, or coded language that is not in the training data. They also struggle with irony, sarcasm, and cultural context.

3. Large Language Model Analysis

LLM-based moderation represents the newest approach, using models like OpenAI's GPT family, Anthropic's Claude, or open-source models like DeepSeek to understand content semantically. Unlike classifiers that output category scores, LLMs can explain why content violates policies and handle nuanced context.

The technical implementation differs fundamentally:

  • Content gets passed to the LLM with the platform's policy as context

  • The model evaluates against multiple policy dimensions simultaneously

  • It returns structured assessments with reasoning, not just scores

LLMs excel at:

  • Understanding context ("kill" in gaming versus threats)

  • Detecting coded language and dog whistles

  • Recognizing policy violations in novel phrasings

  • Providing explanations moderators can review

The tradeoffs include higher latency (200ms-2s versus < 50ms for classifiers), increased cost per evaluation, potential inconsistency between evaluations, and hallucinations.

Below, we will build three automated content moderation systems, one for each approach, to examine how they work and where their strengths and limitations lie.

How Rules-Based Content Filtering Works

Rules-based systems operate by matching patterns against predefined criteria. At their core, they load rules from configuration files, apply regular expression (regex) patterns to incoming content, and trigger actions based on matches.

Writing Effective Regex Patterns for Content Filtering

Rules are typically stored in YAML or JSON for easy maintenance, or are defined in the UI of a platform and loaded from a database. Each rule contains a pattern, severity level, action, and description:

```yaml
spam:
  - pattern: "\\b(buy now|click here|free money|make \\$\\d+)\\b"
    severity: high
    action: block
    description: "Spam/promotional content"
  - pattern: "(.{3,})\\1{3,}"
    severity: medium
    action: flag
    description: "Repeated characters (likely spam)"

harassment:
  - pattern: "\\b(loser|failure|worthless|pathetic)\\b"
    severity: medium
    action: flag
    description: "Personal attacks"
```

The patterns use regex with word boundaries (\b) to avoid false positives. More complex patterns detect behaviors like character repetition for spam (helloooooo) or dollar amounts in promotional content (make $5000).
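
As a quick sanity check, you can exercise these same patterns directly in Python. The sample messages below are made up for illustration:

```python
import re

# Patterns from the YAML config above
SPAM_PROMO = re.compile(r"\b(buy now|click here|free money|make \$\d+)\b", re.IGNORECASE)
REPEATED_CHARS = re.compile(r"(.{3,})\1{3,}")
PERSONAL_ATTACK = re.compile(r"\b(loser|failure|worthless|pathetic)\b", re.IGNORECASE)

samples = [
    "Click here to make $5000 a week!",  # matches the spam/promotional pattern
    "hellohellohellohello",              # matches the repeated-characters pattern
    "You're such a loser",               # matches the personal-attack pattern
    "A closer look shows nothing wrong", # 'closer' is not flagged, thanks to the \b word boundaries
]

for text in samples:
    hits = [name for name, pattern in [
        ("spam", SPAM_PROMO),
        ("repeat", REPEATED_CHARS),
        ("harassment", PERSONAL_ATTACK),
    ] if pattern.search(text)]
    print(f"{text!r} -> {hits or 'clean'}")
```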

How Pattern Matching Engines Process Messages

The moderator processes messages by iterating through all rule categories and applying regex searches:

```python
def moderate_content(self, message: str) -> ModerationResult:
    matched_rules = []
    highest_severity = SeverityLevel.LOW

    # Process based on case sensitivity
    content_to_check = message if self.case_sensitive else message.lower()

    for category, rules in self.rules.get("rules", {}).items():
        for rule in rules:
            pattern = rule["pattern"]
            if not self.case_sensitive:
                pattern = pattern.lower()

            if re.search(pattern, content_to_check, re.IGNORECASE if not self.case_sensitive else 0):
                matched_rules.append(f"{category}: {rule['description']}")
                severity = SeverityLevel(rule["severity"])

                # Update to highest severity found
                if severity_order[severity] > severity_order[highest_severity]:
                    highest_severity = severity
                    final_action = ActionType(rule["action"])
```

This system tracks all matched rules, not just the first match, so it provides moderators with complete context about why content was flagged.

Severity Escalation and Priority Systems

Multiple rule violations trigger the highest severity action. For instance, if a message contains both "repeated characters" (medium severity, flag action) and "buy now click here" spam (high severity, block action), the system blocks the message:

```python
severity_order = {
    SeverityLevel.LOW: 1,
    SeverityLevel.MEDIUM: 2,
    SeverityLevel.HIGH: 3,
    SeverityLevel.CRITICAL: 4
}
```

The system implements three primary actions:

  • BLOCK: Prevents the message from appearing at all

  • FLAG: Allows the message but queues it for human review

  • WARN: Displays the message with a warning to the user

Each action returns a structured result that the platform can handle appropriately:

```python
return ModerationResult(
    is_flagged=len(matched_rules) > 0,
    action=final_action,
    severity=highest_severity,
    matched_rules=matched_rules,
    message=action_message
)
```
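
The snippets above reference a few supporting types that aren't shown. A minimal sketch of what they might look like, with names taken from the snippets (the exact definitions in a real implementation may differ):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SeverityLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class ActionType(str, Enum):
    WARN = "warn"    # show the message with a warning
    FLAG = "flag"    # allow the message but queue it for review
    BLOCK = "block"  # prevent the message from appearing


@dataclass
class ModerationResult:
    is_flagged: bool
    action: ActionType
    severity: SeverityLevel
    matched_rules: List[str] = field(default_factory=list)
    message: Optional[str] = None
```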

When a user enters a potentially harmful phrase, this can then be flagged to an admin:

Moderation violation flagged to an admin

Performance Optimization Techniques

Rules-based systems can achieve sub-millisecond performance through several techniques (a short sketch follows this list):

  • Precompiled regex: Patterns compile once at startup, not per message
  • Early termination: Critical violations can skip remaining checks
  • Efficient ordering: Common violations check first to minimize processing
  • Caching: Frequently seen messages cache their moderation results
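
A sketch of how the first and last of these techniques might look in practice; the cache size and example patterns are arbitrary choices for illustration:

```python
import re
from functools import lru_cache

# Precompile once at startup instead of per message
COMPILED_RULES = [
    ("spam", re.compile(r"\b(buy now|click here|free money)\b", re.IGNORECASE)),
    ("harassment", re.compile(r"\b(loser|failure|worthless|pathetic)\b", re.IGNORECASE)),
]

@lru_cache(maxsize=10_000)
def cached_moderate(message: str) -> tuple:
    """Cache results for frequently repeated messages (e.g., copy-pasted spam)."""
    return tuple(name for name, pattern in COMPILED_RULES if pattern.search(message))

print(cached_moderate("free money, click here!"))  # computed on first call
print(cached_moderate("free money, click here!"))  # served from the cache
```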

The tradeoff for this speed is rigidity. Users quickly learn patterns, substituting characters (fr33 m0ney), adding spaces (b u y n o w), or using Unicode lookalikes (ḅuy now). Maintaining comprehensive rule sets becomes an arms race, with moderators constantly adding variations while trying to avoid false positives.
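
Some systems claw back a little ground by normalizing text before matching: folding Unicode lookalikes, mapping common character substitutions, and collapsing spaced-out letters. A minimal sketch of that idea (the substitution table is illustrative, not exhaustive):

```python
import re
import unicodedata

# Illustrative leetspeak/substitution map; real tables are much larger
SUBSTITUTIONS = str.maketrans({"3": "e", "4": "a", "0": "o", "1": "i", "$": "s", "@": "a"})

def normalize(text: str) -> str:
    # Fold Unicode lookalikes (e.g. "ḅ" -> "b") into ASCII where possible
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase and map common character substitutions
    text = text.lower().translate(SUBSTITUTIONS)
    # Collapse runs of single spaced-out letters ("b u y" -> "buy")
    text = re.sub(r"\b(?:\w )+\w\b", lambda m: m.group(0).replace(" ", ""), text)
    return text

print(normalize("fr33 m0ney"))   # -> "free money"
print(normalize("b u y n o w")) # -> "buynow"
print(normalize("ḅuy now"))      # -> "buy now"
```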

Machine Learning Moderation Using AI Classifiers

ML-based moderation uses pre-trained transformer models to understand content semantically rather than through pattern matching. These systems load specialized models for different aspects of moderation, process text through neural networks, and output probability scores for various violation categories.

Model Architecture and Loading Multiple Classifiers

Modern ML moderators leverage multiple specialized models working in concert. The system loads distinct models for toxicity detection, sentiment analysis, and multi-category classification:

```python
def _load_models(self):
    # Toxicity detection using BERT-based models
    self.toxicity_classifier = pipeline(
        "text-classification",
        model="unitary/toxic-bert",
        device=0 if self.device == "cuda" else -1,
        return_all_scores=True
    )

    # Sentiment analysis for emotional tone
    self.sentiment_analyzer = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        device=0 if self.device == "cuda" else -1,
        return_all_scores=True
    )

    # Multi-label classification for specific violations
    self.multi_toxicity_classifier = pipeline(
        "text-classification",
        model="unitary/unbiased-toxic-roberta",
        device=0 if self.device == "cuda" else -1,
        return_all_scores=True
    )
```

Each model specializes in different aspects. The toxicity model identifies harmful content broadly, while the sentiment analyzer captures emotional context that might indicate harassment. The multi-label classifier categorizes toxicity into specific categories, such as insults, threats, or identity attacks.
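
If you want to poke at one of these models before wiring up a full class, a single pipeline call is enough. This uses the same toxicity model loaded above; exact scores will vary by model and library version:

```python
from transformers import pipeline

# Same toxicity model the moderator class loads above
toxicity = pipeline("text-classification", model="unitary/toxic-bert", return_all_scores=True)

for text in ["You are a monster!!", "That game was killer!"]:
    scores = toxicity(text)[0]  # list of {label, score} dicts, one per category
    print(text, "->", [(s["label"], round(s["score"], 3)) for s in scores])
```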

The Content Processing Pipeline

When content arrives, it passes through each model to build a comprehensive assessment:

```python
def moderate_content(self, text: str) -> MLModerationResult:
    # Toxicity detection
    toxicity_results = self.toxicity_classifier(text)

    # Extract toxicity score from model output
    toxicity_score = 0.0
    for result in toxicity_results[0]:
        if result['label'].upper() in ['TOXIC', '1', 'TOXICITY']:
            toxicity_score = result['score']
            break

    # Sentiment analysis
    sentiment_results = self.sentiment_analyzer(text)
    sentiment_data = max(sentiment_results[0], key=lambda x: x['score'])

    # Multi-category toxicity
    categories = {}
    if self.multi_toxicity_classifier:
        multi_results = self.multi_toxicity_classifier(text)
        for result in multi_results[0]:
            categories[result['label']] = result['score']
```

The models return probability distributions for each label. For a message like "You are a monster!!", the system generates:

  • Toxicity score: 0.958 (95.8% probability of being toxic)
  • Sentiment: NEGATIVE with 0.901 confidence
  • Categories: toxicity=0.995, insult=0.993

Probability distributions for content moderation flags

Setting Threshold-Based Decision Making

Unlike rules that match or don't, ML systems work with probability thresholds that determine actions:

```python
def _get_toxicity_level(self, score: float) -> ToxicityLevel:
    if score < 0.1:
        return ToxicityLevel.SAFE
    elif score < 0.3:
        return ToxicityLevel.LOW
    elif score < 0.6:
        return ToxicityLevel.MEDIUM
    elif score < 0.8:
        return ToxicityLevel.HIGH
    else:
        return ToxicityLevel.SEVERE

# Determine blocking decision
is_toxic = toxicity_score > 0.3
should_block = toxicity_score > 0.6 or toxicity_level in [ToxicityLevel.HIGH, ToxicityLevel.SEVERE]
```

These thresholds create nuanced responses:

  • 0.0-0.1: Safe content, allow through

  • 0.1-0.3: Low toxicity, monitor but allow

  • 0.3-0.6: Medium toxicity, flag for review

  • 0.6-1.0: High/severe toxicity, auto-block

Platforms tune these thresholds based on community standards. Gaming communities might tolerate higher toxicity scores than educational forums.
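
One way to express that tuning is to keep the thresholds as per-community configuration rather than constants. The numbers here are illustrative; real values would come from reviewing labeled data for each community:

```python
# Illustrative per-community thresholds
COMMUNITY_THRESHOLDS = {
    "gaming":    {"flag": 0.45, "block": 0.75},
    "education": {"flag": 0.20, "block": 0.50},
    "default":   {"flag": 0.30, "block": 0.60},
}

def decide(toxicity_score: float, community: str = "default") -> str:
    t = COMMUNITY_THRESHOLDS.get(community, COMMUNITY_THRESHOLDS["default"])
    if toxicity_score >= t["block"]:
        return "block"
    if toxicity_score >= t["flag"]:
        return "flag"
    return "allow"

print(decide(0.55, "gaming"))     # flagged in a gaming community
print(decide(0.55, "education"))  # blocked on an educational forum
```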

Understanding Confidence Scores and Uncertainty

ML models provide confidence scores that indicate certainty about predictions:

```python
# Calculate overall confidence
confidence = max(toxicity_score, 1.0 - toxicity_score)

return MLModerationResult(
    text=text,
    is_toxic=is_toxic,
    toxicity_score=toxicity_score,
    toxicity_level=toxicity_level,
    sentiment=sentiment,
    sentiment_score=sentiment_score,
    should_block=should_block,
    confidence=confidence,
    categories=categories
)
```

A high confidence level (>0.9) indicates clear-cut cases that are suitable for automation. A low confidence level (near 0.5, the minimum possible under this formula) indicates ambiguous content that requires human review. This uncertainty handling prevents the system from making aggressive decisions on borderline content.
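
In practice this often turns into a simple routing rule on top of the classifier output, something like the following (the thresholds are illustrative):

```python
def route_by_confidence(result) -> str:
    """Route an MLModerationResult based on how certain the model is."""
    if result.confidence > 0.9:
        # Clear-cut: act automatically, blocking or allowing per should_block
        return "auto_block" if result.should_block else "auto_allow"
    if result.confidence < 0.6:
        # Ambiguous: send to the human review queue
        return "human_review"
    # Middle ground: apply the normal threshold decision but log it for auditing
    return "threshold_decision"
```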

Model Limitations and Training Data Biases

ML classifiers inherit biases from training data. The toxic-bert model might flag African American Vernacular English at higher rates or miss toxicity in languages underrepresented in training. Models also struggle with:

  • Context beyond single messages (sarcasm, inside jokes)

  • Evolving slang and new forms of harassment

  • Coordinated attacks using individually benign messages

  • Adversarial inputs designed to fool classifiers

Regular retraining on platform-specific data and human feedback helps address these issues, but ML moderation remains an active area of research and refinement.

Advanced Moderation with Large Language Models

LLM-based moderation is a complete shift from pattern matching to semantic understanding. Rather than checking against predefined rules or statistical patterns, LLMs analyze content through natural language reasoning, providing detailed explanations and contextual understanding of why content violates policies.

System Architecture and Prompt Engineering

The core of LLM moderation lies in structured prompting that transforms the model into a specialized content analyst:

```python
def _build_system_prompt(self) -> str:
    policies = self.config.get("policies", {})
    enabled_policies = [name for name, policy in policies.items() if policy.get("enabled", True)]

    prompt = f"""You are an expert content moderator. Analyze messages for policy violations and respond with a structured JSON format.

ENABLED POLICIES: {', '.join(enabled_policies)}

ANALYSIS REQUIREMENTS:
1. Determine if content violates any policies
2. Assess severity level (low, medium, high, critical)
3. Recommend action (allow, flag, block, escalate)
4. Provide confidence score (0.0-1.0)
5. Give clear reasoning
6. Identify specific categories of violations
7. Analyze context and intent
8. Suggest appropriate response if content should be blocked

RESPONSE FORMAT (JSON only):
{{
    "action": "allow|flag|block|escalate",
    "severity": "low|medium|high|critical",
    "confidence": 0.0-1.0,
    "reason": "Clear explanation of the decision",
    "categories": ["list", "of", "violation", "types"],
    "context_analysis": "Analysis of context and intent",
    "suggested_response": "Message to show user if blocked (or null)"
}}"""
```

This prompt engineering transforms a general-purpose LLM into a specialized moderator that understands nuance, explains decisions, and provides actionable responses. The structured JSON output ensures consistent, parseable results while maintaining the model's ability to reason about complex cases.

Multi-Provider Implementation for Redundancy

Production systems support multiple LLM providers for redundancy and cost optimization:

```python
def _initialize_provider(self) -> LLMProvider:
    provider_name = self.config.get("provider", "claude").lower()
    model = self.config.get("model", "claude-3-haiku-20240307")

    if provider_name == "claude":
        api_key = os.getenv("ANTHROPIC_API_KEY")
        return ClaudeProvider(api_key, model)
    elif provider_name == "openai":
        api_key = os.getenv("OPENAI_API_KEY")
        return OpenAIProvider(api_key, model)
```
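
The provider interface itself isn't shown in these excerpts. Based on how the providers are used, a minimal sketch might look like this (a hypothetical definition; the real one may differ):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class LLMProvider(ABC):
    """Common interface each LLM backend implements."""

    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    @abstractmethod
    def moderate_content(self, text: str, system_prompt: str) -> Dict[str, Any]:
        """Return {"response": <raw model output>, "model": <model name>}."""
        ...
```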

Each provider implements the same interface but with model-specific optimizations:

```python
def moderate_content(self, text: str, system_prompt: str) -> Dict[str, Any]:
    message = self.client.messages.create(
        model=self.model,
        max_tokens=1000,
        temperature=0.1,  # Low temperature for consistency
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Please analyze this message for content moderation:\n\n\"{text}\""
        }]
    )
    return {"response": message.content[0].text, "model": self.model}
```

The low temperature (0.1) helps keep moderation decisions consistent across similar content, while the structured prompt steers the model toward detailed, parseable analysis.

Contextual Analysis and Reasoning Capabilities

Unlike ML classifiers that output probability scores, LLMs provide detailed reasoning about their decisions. For the message "You are a monster!!!", the system generates:

  • Reasoning: "The message contains a personal attack that could be considered harassment. While the language is not extremely severe, the accusation of being a 'monster' is a strong insult that could be seen as abusive."

  • Context Analysis: "Without additional context, this message appears to be a direct personal attack on the recipient. The intent seems to be to insult and demean the other person, which violates policies against harassment and toxic behavior."

This reasoning helps moderators understand borderline cases and provides transparency for users about why their content was flagged.

Dynamic Policy Configuration and Flexibility

The system uses YAML configuration to define moderation policies dynamically:

```yaml
policies:
  toxicity:
    enabled: true
    severity: "high"
    description: "Harmful, abusive, or toxic language"
  harassment:
    enabled: true
    severity: "critical"
    description: "Personal attacks, bullying, or harassment"
  misinformation:
    enabled: true
    severity: "high"
    description: "False or misleading information"
```

This configuration gets incorporated into the prompt, allowing platforms to adjust policies without code changes. Decision thresholds sit alongside the policies in the same file:

```yaml
# Decision thresholds determine actions
thresholds:
  block_threshold: 0.8     # Confidence level to automatically block
  flag_threshold: 0.5      # Confidence level to flag for review
  escalate_threshold: 0.9  # Confidence level to escalate to humans
```

Structured Output Processing and JSON Parsing

The system parses structured JSON responses from the LLM to ensure reliability:

```python
def moderate_content(self, text: str) -> LLMModerationResult:
    try:
        response = self.provider.moderate_content(text, self.system_prompt)

        # Extract JSON from response
        response_text = response["response"].strip()
        json_start = response_text.find('{')
        json_end = response_text.rfind('}') + 1
        json_text = response_text[json_start:json_end]

        result_data = json.loads(json_text)

        return LLMModerationResult(
            text=text,
            action=ModerationAction(result_data["action"]),
            severity=SeverityLevel(result_data["severity"]),
            confidence=float(result_data["confidence"]),
            reason=result_data["reason"],
            categories=result_data["categories"],
            context_analysis=result_data["context_analysis"],
            suggested_response=result_data.get("suggested_response"),
            processing_time=processing_time,  # timing code omitted from this excerpt
            model_used=response["model"]
        )
    except (json.JSONDecodeError, KeyError, ValueError) as exc:
        # Fallback branch (not shown in the original excerpt): a malformed LLM
        # response defaults to flagging the content for human review.
        return LLMModerationResult(
            text=text,
            action=ModerationAction.FLAG,
            severity=SeverityLevel.MEDIUM,
            confidence=0.0,
            reason=f"Could not parse LLM response: {exc}",
            categories=[],
            context_analysis="",
            suggested_response=None,
            processing_time=0.0,
            model_used=self.config.get("model", "unknown")
        )
```

This structured approach handles parsing failures gracefully, defaulting to flagging content when the LLM response is malformed.

Advantages Over Traditional Moderation Approaches

LLM moderation excels where rules and ML fail:

  • Context Understanding: The system recognizes that "You're killing it!" in response to a performance differs from "You're killing me!" as harassment. It understands gaming contexts, professional discussions, and cultural references without explicit programming.

  • Novel Content Handling: When new slang, memes, or harassment tactics emerge, LLMs can recognize harmful intent without retraining. They understand coded language, dog whistles, and coordinated harassment campaigns that use individually benign messages.

  • Explanation Generation: Each decision includes actionable feedback:

"Suggested  Response":  "Please  refrain  from  using  abusive  or  insulting  language.  We  want  to  maintain  a  respectful  environment  for  all  users."
  • Multi-dimensional Analysis: A single pass evaluates content across all policy dimensions simultaneously, identifying overlapping violations (harassment + toxicity) and their relative severity.

Current Limitations and Technical Challenges

Despite advantages, LLM moderation faces constraints:

  • Consistency: Different API calls might produce slightly different decisions for identical content.

  • Latency: Real-time applications (live chat, gaming) may find 1-3 second delays unacceptable.

  • Cost at Scale: Moderating millions of messages daily becomes expensive.

  • Context Windows: Long conversation threads might exceed token limits.

  • Prompt Injection: Malicious users might try to manipulate the moderation prompt.

Production systems typically use LLMs as part of a hybrid approach: rules catch obvious spam, ML filters high-volume content, and LLMs handle nuanced cases, appeals, and policy development through analyzing edge cases.
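
A minimal sketch of that tiering, assuming instances of the three moderator classes built above (the glue code and instance names are hypothetical; the attributes come from the earlier snippets):

```python
def hybrid_moderate(message: str) -> str:
    """Run cheap checks first, expensive ones only when needed."""
    # 1. Rules: sub-millisecond, catches obvious spam and blacklisted terms
    rule_result = rules_moderator.moderate_content(message)
    if rule_result.action == ActionType.BLOCK:
        return "blocked_by_rules"

    # 2. ML classifier: tens of milliseconds, filters clearly toxic content
    ml_result = ml_moderator.moderate_content(message)
    if ml_result.should_block and ml_result.confidence > 0.9:
        return "blocked_by_ml"

    # 3. LLM: reserved for ambiguous or flagged content; returns reasoning for reviewers
    if rule_result.is_flagged or 0.3 < ml_result.toxicity_score < 0.9:
        llm_result = llm_moderator.moderate_content(message)
        return f"llm_{llm_result.action.value}"

    return "allowed"
```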

Build or Buy Your Moderation System

Building automated moderation in-house means weeks of developing rule engines, months of training ML models, and ongoing battles with evolving threats. You'll need dedicated engineers, data scientists, and infrastructure just to match basic commercial offerings.

Stream Moderation combines rules-based, ML, and LLM approaches in one enterprise-grade system. Instead of juggling multiple models and maintenance headaches, you get pre-configured filters, custom rules, and automatic improvements.

Your community stays protected while Stream handles the complex technical challenges and continuous updates. Focus on growing your platform while proven moderation technology keeps your users safe.