Content Moderation

Filtering toxic, NSFW, illegal, or unsafe user content — text, images, video.

User-generated content makes products fun and profitable. It also makes them spammy and legally risky. Moderation is non-optional past ~10 active users.

Text moderation

  • OpenAI Moderation API — free, fast, decent across categories (hate, harassment, self-harm, sexual, violence, illicit). Default for "I just need to filter chat"; see the sketch after this list.
  • Anthropic Claude (with text moderation prompts) — flexible; pair with a custom rubric for domain-specific moderation.
  • Perspective API (Jigsaw / Google) — free; great for toxicity / harassment scores; community-moderation classic.
  • Cloudflare Workers AI @cf/meta/llama-guard-3-8b — free at low volume; runs at the edge.
  • Hive Moderation — paid; very fast; multi-modal.
  • Sightengine — paid; multi-modal.
  • Akismet — for spam comments specifically; cheap.
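
A minimal sketch of that default text-filtering path, using the official openai npm client. The omni-moderation-latest model and the results[0].flagged field match OpenAI's documented /v1/moderations endpoint; rejecting anything flagged outright is an assumption to tune per product.

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Returns false when any category (hate, harassment, sexual, ...) is flagged.
export async function isAllowed(text: string): Promise<boolean> {
  const moderation = await openai.moderations.create({
    model: "omni-moderation-latest",
    input: text,
  });
  return !moderation.results[0].flagged;
}
```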

Image / video / audio moderation

  • Hive Moderation — multi-modal (image / video / audio); the gold standard for serious scale.
  • Sightengine — competitor; cheaper.
  • AWS Rekognition Content Moderation — bundled with AWS; see the sketch after this list.
  • Google Cloud Vision SafeSearch — bundled with GCP.
  • Cloudflare Images Moderation — beta integrations.
  • Azure Content Safety — Microsoft's offering.
  • Sumsub / Veriff — KYC / age verification (related but different purpose).
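
A hedged Rekognition sketch, assuming the upload already lives in S3; bucket and key are placeholders, and the MinConfidence floor (plus the review/block split in the comment) is an assumption to calibrate against your own traffic.

```ts
import {
  RekognitionClient,
  DetectModerationLabelsCommand,
} from "@aws-sdk/client-rekognition";

const rekognition = new RekognitionClient({});

// Returns moderation labels (e.g. "Explicit Nudity", "Violence") above the floor.
export async function moderateUpload(bucket: string, key: string) {
  const { ModerationLabels = [] } = await rekognition.send(
    new DetectModerationLabelsCommand({
      Image: { S3Object: { Bucket: bucket, Name: key } },
      MinConfidence: 60, // assumption: send 60-90 to human review, act on 90+
    })
  );
  return ModerationLabels.map((label) => ({
    name: label.Name,
    parent: label.ParentName,
    confidence: label.Confidence,
  }));
}
```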

CSAM / illegal content

  • PhotoDNA (Microsoft) — free hash-matching against NCMEC's database of known CSAM; the industry standard.
  • Thorn Safer — broader CSAM detection / classification.
  • Cloudflare CSAM Scanning Tool — free for Cloudflare customers.
  • You are legally required to report CSAM in most jurisdictions (NCMEC in the US). Build the reporting flow.

Prompt injection / LLM-input safety

  • Lakera Guard — paid but very accurate; designed for prompt injection.
  • Llama Guard 2 / 3 — Meta's open safety classifier for LLM inputs and outputs; deploy via Cloudflare Workers AI, Replicate, or self-host (see the Worker sketch after this list).
  • Promptfoo redteam — generates adversarial test cases (AI Evals).
  • Anthropic prompt-injection mitigations — built into Claude; add system-prompt defenses.
  • Rebuff, NeMo Guardrails — open-source guardrail frameworks.
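
A hedged Worker sketch for the edge path. It assumes an AI binding named AI in your wrangler config; because Llama Guard deployments differ in reply shape (raw "safe" / "unsafe S1…" text vs. a structured object), the verdict check covers both. Confirm the exact shape against the Workers AI model page.

```ts
export interface Env {
  AI: Ai; // Workers AI binding from wrangler.toml
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { message } = await request.json<{ message: string }>();

    const result: any = await env.AI.run("@cf/meta/llama-guard-3-8b", {
      messages: [{ role: "user", content: message }],
    });

    // Assumption: handle either the raw text verdict or a structured reply.
    const unsafe =
      typeof result?.response === "string"
        ? result.response.trim().startsWith("unsafe")
        : result?.response?.safe === false;

    if (unsafe) {
      return new Response("Blocked by moderation", { status: 400 });
    }
    return new Response("OK");
  },
};
```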

Output moderation for AI products

  • Run output through the same text-moderation classifier you use for input.
  • Block PII leakage with microsoft/presidio or pii-codex.
  • Add tool-call allowlists when agents have side effects (see the sketch after this list).
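
The allowlist is the cheapest of these controls; a sketch with hypothetical tool names, where anything not explicitly listed is refused before the side effect runs.

```ts
// Hypothetical tool names; anything not explicitly listed is refused.
const ALLOWED_TOOLS = new Set(["search_docs", "fetch_order_status"]);

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

export function authorizeToolCall(call: ToolCall): boolean {
  const allowed = ALLOWED_TOOLS.has(call.name);
  if (!allowed) {
    // Feed the refusal into your audit log (see "Log everything" below).
    console.warn("blocked tool call:", call.name);
  }
  return allowed;
}
```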

Hashing / dedup

  • pHash (perceptual hash) — via sharp plugins or the image-hash npm package; flag re-uploads (see the sketch after this list).
  • PDQ (Meta) — perceptual hashing standard for content matching.
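
A re-upload check built on the image-hash npm package's callback API (imageHash(path, bits, precise, cb)); the 16-bit hash size and the 5-bit Hamming threshold are assumptions to tune on your own data.

```ts
import { imageHash } from "image-hash";

// Wrap the callback API in a promise; 16 bits with precise mode on.
function pHash(path: string): Promise<string> {
  return new Promise((resolve, reject) =>
    imageHash(path, 16, true, (err: Error | null, hash: string) =>
      err ? reject(err) : resolve(hash)
    )
  );
}

// Count the bits that differ between two equal-length hex hashes.
function hamming(a: string, b: string): number {
  let bits = 0;
  for (let i = 0; i < a.length; i++) {
    let x = parseInt(a[i], 16) ^ parseInt(b[i], 16);
    while (x) {
      bits += x & 1;
      x >>= 1;
    }
  }
  return bits;
}

// Near-identical images differ by only a few bits, even after resizing.
export async function isReupload(file: string, knownHash: string) {
  return hamming(await pHash(file), knownHash) <= 5;
}
```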

Workflow / queue

  • Trust & safety queue — flag → human review → action. Build with Job Queues + an admin panel (CMS or Database GUIs).
  • Cinder, Spectrum Labs, Checkstep — paid trust-and-safety platforms.

Patterns to know

  • Layer defenses — automated filter → low-confidence to human → escalation tier. Don't auto-ban (see the routing sketch after this list).
  • Appeals matter — give users a way to challenge moderation. Builds trust and exposes false positives.
  • Region-specific rules — DSA (EU), Online Safety Act (UK), KOSA (US, pending). Keep tooling configurable.
  • Log everything — what was flagged, why, what action was taken, by whom. Required for compliance.
  • PII handling — moderation logs often contain user content; encrypt and access-control.
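
A sketch of that layered routing, assuming a single 0–1 score from whichever classifier you run; the thresholds are illustrative, not canonical, and should be calibrated against your false-positive rate.

```ts
type Action = "allow" | "human_review" | "block";

export function route(score: number): Action {
  if (score < 0.3) return "allow"; // confidently clean: publish
  if (score < 0.85) return "human_review"; // uncertain: into the T&S queue
  return "block"; // confidently bad: hide the content, but still no auto-ban
}
```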

Pick this if…

  • Default text moderation, free: OpenAI Moderation API.
  • Multi-modal, serious scale: Hive.
  • User-uploaded images / video, on AWS: Rekognition.
  • Edge / Cloudflare: Llama Guard via Workers AI.
  • Prompt injection in an LLM app: Lakera Guard or Llama Guard.
  • CSAM detection (mandatory if you host UGC): PhotoDNA + a reporting flow.
