Web Scraping & Browser Automation
Headless browsers, scrapers, RPA, and AI-driven web automation.
Headless browsers
- ★ Playwright — the default. Multi-browser (Chromium, Firefox, WebKit), great auto-wait, locators, traces, codegen, scriptable across Node / Python / Java / .NET.
- Puppeteer — Chrome-only; older but still huge install base.
- Selenium / WebDriver — when you need real Selenium grids or BiDi protocol; mature.
- Cypress — better known as a test framework, but works for scraping.
HTML parsing (no browser, just HTML)
- ★ Cheerio — jQuery-like server-side HTML parser; fast for static HTML scraping.
- node-html-parser — fastest pure-JS parser; less ergonomic than Cheerio.
- linkedom — lightweight DOM in Node.
- jsdom — full DOM implementation; heavier, more accurate.
- Parsel — CSS selector library; works with any DOM.
- htmlparser2 — fast streaming parser.
Hosted browser automation
- ★ Browserbase — managed Chrome instances; built for AI agents and scraping; TS SDK; generous free tier. The most popular new pick.
- Browserless — hosted Puppeteer; pay-per-second.
- Steel.dev — open source + hosted; competitor to Browserbase.
- Hyperbrowser — newer hosted browser API.
- ScrapingBee, ScraperAPI, ZenRows — proxy + render services for anti-bot-heavy sites.
- Apify — full scraping platform with marketplace of pre-built actors.
AI-driven browser agents
- ★ Stagehand (Browserbase) — natural-language browser automation on top of Playwright; the most popular AI scraping framework in 2026.
- browser-use — Python; LangChain-flavored browser agent.
- WebVoyager — research agent; less commercial.
- Skyvern, MultiOn, Adept ACT-1 — commercial browser agents.
Crawl orchestration
- ★ Crawlee (Apify) — batteries-included crawling framework on top of Playwright / Puppeteer / Cheerio. Queues, dedup, proxies, fingerprinting.
- node-crawler — older.
- @mozilla/readability — extract main article content from HTML; pair with any scraper.
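To make the "queues, dedup" part of the Crawlee entry concrete, here is a minimal sketch of a crawl frontier using only the standard URL class. It is not Crawlee's API or implementation: `normalizeUrl` and `Frontier` are hypothetical names, and the normalization rules (drop fragments, sort query parameters, keep only http/https) are assumptions chosen for illustration. A real framework adds persistence, retries, and concurrency on top of this idea.

```typescript
// Hypothetical sketch of a crawl frontier: a FIFO queue plus URL dedup.
// normalizeUrl and Frontier are illustrative names, not Crawlee APIs.

// Resolve a link against an optional base and reduce it to a canonical form.
// Returns null for links that should not be crawled.
export function normalizeUrl(raw: string, base?: string): string | null {
  try {
    const u = new URL(raw, base);
    if (u.protocol !== "http:" && u.protocol !== "https:") return null;
    u.hash = "";            // fragments do not change the fetched document
    u.searchParams.sort();  // treat ?a=1&b=2 and ?b=2&a=1 as the same page
    return u.toString();
  } catch {
    return null;            // unparseable links are skipped
  }
}

export class Frontier {
  private queue: string[] = [];
  private seen = new Set<string>();

  // Returns true if the URL was new and got enqueued, false otherwise.
  add(raw: string, base?: string): boolean {
    const url = normalizeUrl(raw, base);
    if (url === null || this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Next URL to crawl in insertion order, or undefined when the queue is empty.
  next(): string | undefined {
    return this.queue.shift();
  }

  get pending(): number {
    return this.queue.length;
  }
}
```

A crawl loop would call `next()`, fetch the page, and feed every discovered link back through `add(link, pageUrl)`. Because the `seen` set is never cleared, each normalized URL is visited at most once per run.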
Anti-detection
- Playwright Extra + stealth plugin — bot fingerprint hiding.
- puppeteer-extra-plugin-stealth — same idea for Puppeteer.
- Camoufox — privacy-hardened Firefox build for scraping.
- Residential proxies — Bright Data, Decodo (formerly Smartproxy), IPRoyal, Oxylabs.
Be a good citizen
- Respect robots.txt — many sites disallow crawling of some or all paths; check before scraping.
- Rate-limit yourself — pacer, p-limit, etc.; treat each origin gently.
- Identify your bot in User-Agent (MyBot/1.0 (+contact@example.com)).
- Handle 429 with exponential backoff.
- Cache aggressively — don't re-fetch the same page within a session.
- Some scraping is illegal in some jurisdictions; this isn't legal advice.
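Several of the points above (per-origin spacing, an identifying User-Agent, backoff on 429, and a session cache) can be combined in one small wrapper. The following is a sketch under stated assumptions, not a recommended library: `createPoliteFetch`, `backoffDelayMs`, and `PoliteOptions` are hypothetical names, the default transport assumes a runtime with a global `fetch` that allows setting User-Agent (such as Node 18+), and the backoff has no jitter and ignores any `Retry-After` header. It does not check robots.txt.

```typescript
// Minimal "polite fetch" sketch: identifies the bot, spaces out requests per
// origin, retries 429 with exponential backoff, and caches bodies in memory.
// All names here (createPoliteFetch, backoffDelayMs, PoliteOptions) are hypothetical.

type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ status: number; text(): Promise<string> }>;

export interface PoliteOptions {
  userAgent: string;     // e.g. "MyBot/1.0 (+contact@example.com)"
  minGapMs: number;      // minimum spacing between requests to one origin
  maxRetries: number;    // how many times to retry after a 429
  baseDelayMs: number;   // first backoff delay; doubles on each retry
  fetchImpl?: FetchLike; // injectable for tests; defaults to global fetch
  sleep?: (ms: number) => Promise<void>;
}

// Exponential backoff: base, 2*base, 4*base, ... (no jitter in this sketch).
export function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

export function createPoliteFetch(opts: PoliteOptions) {
  const doFetch: FetchLike = opts.fetchImpl ?? ((u, i) => fetch(u, i));
  const sleep = opts.sleep ?? defaultSleep;
  const cache = new Map<string, string>();   // URL -> body, session-scoped
  const lastHit = new Map<string, number>(); // origin -> time of last request

  return async function politeFetch(url: string): Promise<string> {
    const cached = cache.get(url);
    if (cached !== undefined) return cached; // do not re-fetch within a session

    const origin = new URL(url).origin;
    for (let attempt = 0; ; attempt++) {
      // Wait until at least minGapMs has passed since the last hit on this origin.
      const wait = (lastHit.get(origin) ?? -Infinity) + opts.minGapMs - Date.now();
      if (wait > 0) await sleep(wait);
      lastHit.set(origin, Date.now());

      const res = await doFetch(url, { headers: { "User-Agent": opts.userAgent } });
      if (res.status === 429) {
        if (attempt >= opts.maxRetries) {
          throw new Error(`429 after ${attempt + 1} attempts: ${url}`);
        }
        await sleep(backoffDelayMs(attempt, opts.baseDelayMs));
        continue;
      }
      if (res.status >= 400) throw new Error(`HTTP ${res.status}: ${url}`);

      const body = await res.text();
      cache.set(url, body);
      return body;
    }
  };
}
```

A caller would build one instance per crawl, for example `createPoliteFetch({ userAgent: "MyBot/1.0 (+contact@example.com)", minGapMs: 1000, maxRetries: 4, baseDelayMs: 500 })`, and route every request through it. For concurrent crawls, a limiter such as p-limit would still be needed, since this sketch only spaces out sequential calls.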
Pick this if…
- Default browser automation: Playwright.
- Static HTML, no JS rendering needed: Cheerio.
- Hosted browsers / AI agents: Browserbase + Stagehand.
- Heavy anti-bot target: ScrapingBee or ZenRows + residential proxies.
- Multi-page crawler: Crawlee.
- Article extraction: @mozilla/readability.