Content Moderation
Filtering toxic, NSFW, illegal, or unsafe user content — text, images, video.
User-generated content makes products fun and profitable. It also makes them spammy and legally risky. Moderation is non-optional past ~10 active users.
Text moderation
- ★ OpenAI Moderation API — free, fast, decent across categories (hate, harassment, self-harm, sexual, violence, illicit). Default for "I just need to filter chat."
- ★ Anthropic Claude (with text-moderation prompts) — flexible; pair with a custom rubric for domain-specific moderation.
- Perspective API (Jigsaw / Google) — free; great for toxicity / harassment scores; a community-moderation classic.
- Cloudflare Workers AI `@cf/meta/llama-guard-3-8b` — free at low volume; runs at the edge.
- Hive Moderation — paid; very fast; multi-modal.
- Sightengine — paid; multi-modal.
- Akismet — for spam comments specifically; cheap.
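Whichever classifier you pick, you still have to turn its scores into a decision. A minimal sketch of reducing an OpenAI Moderation API-style response to an allow / flag decision — the response shape (`results`, `flagged`, `category_scores`) follows OpenAI's documented format, but the threshold and the sample payload here are illustrative:

```python
def flag_decision(moderation_response: dict, threshold: float = 0.5) -> dict:
    """Reduce one moderation result to an allow decision plus offending categories."""
    result = moderation_response["results"][0]
    # Collect categories whose score crosses our (illustrative) threshold.
    hits = {cat: score
            for cat, score in result["category_scores"].items()
            if score >= threshold}
    return {"allow": not result["flagged"] and not hits, "hits": hits}

# Hypothetical response payload, shaped like the Moderation API's output.
sample = {"results": [{
    "flagged": True,
    "categories": {"harassment": True, "violence": False},
    "category_scores": {"harassment": 0.91, "violence": 0.02},
}]}

decision = flag_decision(sample)
```

Keeping the decision logic separate from the API call makes the threshold tunable per surface (public comments vs. DMs) without touching the integration.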
Image / video / audio moderation
- ★ Hive Moderation — multi-modal (image / video / audio); the gold standard for serious scale.
- Sightengine — competitor; cheaper.
- AWS Rekognition Content Moderation — bundled with AWS.
- Google Cloud Vision SafeSearch — bundled with GCP.
- Cloudflare Images Moderation — beta integrations.
- Azure Content Safety — Microsoft's offering.
- Sumsub / Veriff — KYC / age verification (related but different purpose).
CSAM / illegal content
- ★ PhotoDNA (Microsoft) — free hash-matching against the NCMEC database; the standard.
- Thorn Safer — broader CSAM detection / classification.
- Cloudflare CSAM Scanning Tool — free for Cloudflare customers.
- ★ You are legally required to report CSAM in most jurisdictions (NCMEC in the US). Build the reporting flow.
Prompt injection / LLM-input safety
- ★ Lakera Guard — paid but very accurate; designed for prompt injection.
- Llama Guard 2 / 3 — Meta's open model; deploy via Cloudflare Workers AI, Replicate, or self-host.
- Promptfoo redteam — generates adversarial test cases (AI Evals).
- Anthropic prompt-injection mitigations — built into Claude; add system-prompt defenses.
- Rebuff, NeMo Guardrails — open-source guardrail frameworks.
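Before any of the dedicated tools above, a cheap regex pre-filter can catch the laziest injection attempts. A naive sketch — the patterns are illustrative and this is explicitly not a substitute for Lakera Guard or Llama Guard, since determined attackers trivially rephrase:

```python
import re

# Illustrative patterns only; real attacks paraphrase around static lists.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"you are now",
    r"reveal (the|your) system prompt",
    r"disregard .* rules",
]

def looks_like_injection(user_input: str) -> bool:
    """Cheap first-pass check; route hits to a real guard model."""
    text = user_input.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)
```

Use it to short-circuit obvious junk cheaply, and send everything else (hits included) to the model-based guard for a real verdict.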
Output moderation for AI products
- Run output through the same text-moderation classifier you use for input.
- Block PII leakage with `microsoft/presidio` or `pii-codex`.
- Add tool-call allowlists when agents have side effects.
Hashing / dedup
- pHash (perceptual hash) — via `sharp` plugins or the `image-hash` npm package; flag re-uploads.
- PDQ (Meta) — perceptual hashing standard for content matching.
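Perceptual hashes are compared by Hamming distance, not equality: a re-encoded or lightly cropped re-upload produces a hash a few bits away from the original. A minimal sketch, where the 8-bit threshold is illustrative and should be tuned per hash algorithm and corpus:

```python
def hamming(hash_a: str, hash_b: str) -> int:
    """Bit-level Hamming distance between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def is_reupload(hash_a: str, hash_b: str, max_bits: int = 8) -> bool:
    # Threshold is illustrative; too high and distinct images collide.
    return hamming(hash_a, hash_b) <= max_bits
```

Store hashes of previously actioned content and compare on upload to catch banned material coming back with trivial edits.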
Workflow / queue
- ★ Trust & safety queue — flag → human review → action. Build with Job Queues + an admin panel (CMS or Database GUIs).
- Cinder, Spectrum Labs, Checkstep — paid trust-and-safety platforms.
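The flag → human review → action pipeline is at heart a small state machine, and encoding the legal transitions keeps reviewers from, say, actioning content nobody looked at. A sketch with hypothetical state names:

```python
from enum import Enum

class Status(Enum):
    FLAGGED = "flagged"        # automated filter raised it
    IN_REVIEW = "in_review"    # claimed by a human moderator
    ACTIONED = "actioned"      # content removed / user sanctioned
    DISMISSED = "dismissed"    # false positive, restored

# Only these transitions are legal; anything else is a bug.
TRANSITIONS = {
    Status.FLAGGED: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.ACTIONED, Status.DISMISSED},
}

def advance(current: Status, target: Status) -> Status:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Persist every transition with actor and timestamp — that record doubles as the compliance log the "Patterns to know" section calls for.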
Patterns to know
- Layer defenses — automated filter → low-confidence to human → escalation tier. Don't auto-ban.
- Appeals matter — give users a way to challenge moderation. Builds trust and exposes false positives.
- Region-specific rules — DSA (EU), Online Safety Act (UK), KOSA and state-level equivalents (US). Keep tooling configurable.
- Log everything — what was flagged, why, what action was taken, by whom. Required for compliance.
- PII handling — moderation logs often contain user content; encrypt and access-control.
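The layered-defense pattern reduces to routing on classifier confidence: auto-act only at the extremes, send the ambiguous middle to humans. A sketch where both thresholds are illustrative and should be calibrated against your review queue's capacity:

```python
def route(score: float, low: float = 0.4, high: float = 0.9) -> str:
    """Route a classifier confidence score through layered defenses."""
    if score >= high:
        return "block"          # high confidence: automated action is safe
    if score >= low:
        return "human_review"   # ambiguous: queue for a moderator
    return "allow"              # low confidence: let it through
```

Widening the `low`–`high` band trades moderator workload for fewer wrongful auto-bans; the band is where appeals data should feed back in.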
Pick this if…
- Default text moderation, free: OpenAI Moderation API.
- Multi-modal, serious scale: Hive.
- User-uploaded images / video, on AWS: Rekognition.
- Edge / Cloudflare: Llama Guard via Workers AI.
- Prompt injection in an LLM app: Lakera Guard or Llama Guard.
- CSAM detection (mandatory if you host UGC): PhotoDNA + a reporting flow.