Content Moderation
Filtering toxic, NSFW, illegal, or unsafe user content — text, images, video.
User-generated content makes products fun and profitable. It also makes them spammy and legally risky. Moderation is non-optional past ~10 active users.
Text moderation
- ★ OpenAI Moderation API — free, fast, decent across categories (hate, harassment, self-harm, sexual, violence, illicit). Default for "I just need to filter chat."
- ★ Anthropic Claude (with text-moderation prompts) — flexible; pair with a custom rubric for domain-specific moderation.
- Perspective API (Jigsaw / Google) — free; great for toxicity / harassment scores; a community-moderation classic.
- Cloudflare Workers AI `@cf/meta/llama-guard-3-8b` — free at low volume; runs at the edge.
- Hive Moderation — paid; very fast; multi-modal.
- Sightengine — paid; multi-modal.
- Akismet — for spam comments specifically; cheap.
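Whichever classifier you pick, you still have to turn its scores into a decision. A minimal sketch of reducing an OpenAI Moderation API-style response to an allow / flag decision — the response shape (`results`, `flagged`, `category_scores`) follows OpenAI's documented format, but the threshold and the sample payload here are illustrative:

```python
def flag_decision(moderation_response: dict, threshold: float = 0.5) -> dict:
    """Reduce one moderation result to an allow decision plus offending categories."""
    result = moderation_response["results"][0]
    # Collect categories whose score crosses our (illustrative) threshold.
    hits = {cat: score
            for cat, score in result["category_scores"].items()
            if score >= threshold}
    return {"allow": not result["flagged"] and not hits, "hits": hits}

# Hypothetical response payload, shaped like the Moderation API's output.
sample = {"results": [{
    "flagged": True,
    "categories": {"harassment": True, "violence": False},
    "category_scores": {"harassment": 0.91, "violence": 0.02},
}]}

decision = flag_decision(sample)
```

Keeping the decision logic separate from the API call makes the threshold tunable per surface (public comments vs. DMs) without touching the integration.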
Image / video / audio moderation
- ★ Hive Moderation — multi-modal (image / video / audio); the gold standard for serious scale.
- Sightengine — competitor; cheaper.
- AWS Rekognition Content Moderation — bundled with AWS.
- Google Cloud Vision SafeSearch — bundled with GCP.
- Cloudflare Images Moderation — beta integrations.
- Azure Content Safety — Microsoft's offering.
- Sumsub / Veriff — KYC / age verification (related but different purpose).
CSAM / illegal content
- ★ PhotoDNA (Microsoft) — free hash-matching against the NCMEC database; the standard.
- Thorn Safer — broader CSAM detection / classification.
- Cloudflare CSAM Scanning Tool — free for Cloudflare customers.
- ★ You are legally required to report CSAM in most jurisdictions (NCMEC in the US). Build the reporting flow.
Prompt injection / LLM-input safety
- ★ Lakera Guard — paid but very accurate; designed for prompt injection.
- Llama Guard 2 / 3 — Meta's open model; deploy via Cloudflare Workers AI, Replicate, or self-host.
- Promptfoo redteam — generates adversarial test cases (AI Evals).
- Anthropic prompt-injection mitigations — built into Claude; add system-prompt defenses.
- Rebuff, NeMo Guardrails — open-source guardrail frameworks.
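Before any of the dedicated tools above, a cheap regex pre-filter can catch the laziest injection attempts. A naive sketch — the patterns are illustrative and this is explicitly not a substitute for Lakera Guard or Llama Guard, since determined attackers trivially rephrase:

```python
import re

# Illustrative patterns only; real attacks paraphrase around static lists.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"you are now",
    r"reveal (the|your) system prompt",
    r"disregard .* rules",
]

def looks_like_injection(user_input: str) -> bool:
    """Cheap first-pass check; route hits to a real guard model."""
    text = user_input.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)
```

Use it to short-circuit obvious junk cheaply, and send everything else (hits included) to the model-based guard for a real verdict.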
Output moderation for AI products
- Run output through the same text-moderation classifier you use for input.
- Block PII leakage with `microsoft/presidio` or `pii-codex`.
- Add tool-call allowlists when agents have side effects.
Hashing / dedup
- pHash (perceptual hash) — via `sharp` plugins or the `image-hash` npm package; flag re-uploads.
- PDQ (Meta) — perceptual hashing standard for content matching.
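Perceptual hashes are compared by Hamming distance, not equality: a re-encoded or lightly cropped re-upload produces a hash a few bits away from the original. A minimal sketch, where the 8-bit threshold is illustrative and should be tuned per hash algorithm and corpus:

```python
def hamming(hash_a: str, hash_b: str) -> int:
    """Bit-level Hamming distance between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def is_reupload(hash_a: str, hash_b: str, max_bits: int = 8) -> bool:
    # Threshold is illustrative; too high and distinct images collide.
    return hamming(hash_a, hash_b) <= max_bits
```

Store hashes of previously actioned content and compare on upload to catch banned material coming back with trivial edits.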
Workflow / queue
- ★ Trust & safety queue — flag → human review → action. Build with Job Queues + an admin panel (CMS or Database GUIs).
- Cinder, Spectrum Labs, Checkstep — paid trust-and-safety platforms.
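The flag → human review → action pipeline is at heart a small state machine, and encoding the legal transitions keeps reviewers from, say, actioning content nobody looked at. A sketch with hypothetical state names:

```python
from enum import Enum

class Status(Enum):
    FLAGGED = "flagged"        # automated filter raised it
    IN_REVIEW = "in_review"    # claimed by a human moderator
    ACTIONED = "actioned"      # content removed / user sanctioned
    DISMISSED = "dismissed"    # false positive, restored

# Only these transitions are legal; anything else is a bug.
TRANSITIONS = {
    Status.FLAGGED: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.ACTIONED, Status.DISMISSED},
}

def advance(current: Status, target: Status) -> Status:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Persist every transition with actor and timestamp — that record doubles as the compliance log the "Patterns to know" section calls for.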
Patterns to know
- Layer defenses — automated filter → low-confidence to human → escalation tier. Don't auto-ban.
- Appeals matter — give users a way to challenge moderation. Builds trust and exposes false positives.
- Region-specific rules — DSA (EU), Online Safety Act (UK), KOSA and state-level equivalents (US). Keep tooling configurable.
- Log everything — what was flagged, why, what action was taken, by whom. Required for compliance.
- PII handling — moderation logs often contain user content; encrypt and access-control.
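The layered-defense pattern reduces to routing on classifier confidence: auto-act only at the extremes, send the ambiguous middle to humans. A sketch where both thresholds are illustrative and should be calibrated against your review queue's capacity:

```python
def route(score: float, low: float = 0.4, high: float = 0.9) -> str:
    """Route a classifier confidence score through layered defenses."""
    if score >= high:
        return "block"          # high confidence: automated action is safe
    if score >= low:
        return "human_review"   # ambiguous: queue for a moderator
    return "allow"              # low confidence: let it through
```

Widening the `low`–`high` band trades moderator workload for fewer wrongful auto-bans; the band is where appeals data should feed back in.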
Pick this if…
- Default text moderation, free: OpenAI Moderation API.
- Multi-modal, serious scale: Hive.
- User-uploaded images / video, on AWS: Rekognition.
- Edge / Cloudflare: Llama Guard via Workers AI.
- Prompt injection in an LLM app: Lakera Guard or Llama Guard.
- CSAM detection (mandatory if you host UGC): PhotoDNA + a reporting flow.