Web Scraping Tutorial with Puppeteer and Playwright

Monday, Dec 29, 2025

Ever needed data from a website that has no API? Or wanted to monitor product prices across multiple e-commerce sites? Web scraping is the solution.

Puppeteer and Playwright are the two most popular libraries for browser automation and web scraping in Node.js. Both can control real browsers, render JavaScript, and extract data from modern dynamic websites.

Web Scraping Use Cases

Before we start coding, let’s understand when web scraping is useful:

Use Case | Example
--- | ---
Price Monitoring | Track product prices on Amazon, eBay
Lead Generation | Collect business data from online directories
Content Aggregation | Gather news from various sources
Research & Analysis | Data for market research, sentiment analysis
Testing & QA | E2E testing, visual regression testing
Archiving | Backup website content, historical screenshots

Important! Before scraping, make sure you understand the legal aspects:

  1. Check Terms of Service - Many websites prohibit scraping in their ToS
  2. Respect robots.txt - This file indicates which pages can be crawled
  3. Rate limiting - Don’t bombard servers with excessive requests
  4. Personal data - Be careful with GDPR and data privacy regulations
  5. Copyright - Scraped content may be protected by copyright
# Check website's robots.txt
curl https://example.com/robots.txt

As a general rule:

  • Scraping public data for personal/research purposes is usually safe
  • Scraping for commercial purposes or re-publishing content can be problematic
  • Always ask for permission if in doubt
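The robots.txt rules mentioned above can also be checked programmatically before scraping a path. Here is a minimal sketch; `isPathAllowed` is our own illustrative helper, and real-world code should prefer a dedicated parser (for example the `robots-parser` package on npm), which also handles wildcards, `Allow` overrides, and `Crawl-delay`:

```typescript
// Minimal robots.txt check (illustrative only).
// Returns true if `path` is NOT matched by a Disallow rule for `User-agent: *`.
export function isPathAllowed(robotsTxt: string, path: string): boolean {
  const lines = robotsTxt.split('\n').map((line) => line.trim());
  let appliesToUs = false;
  const disallowed: string[] = [];

  for (const line of lines) {
    const [rawKey, ...rest] = line.split(':');
    const key = rawKey?.toLowerCase();
    const value = rest.join(':').trim();

    if (key === 'user-agent') {
      appliesToUs = value === '*'; // only track the wildcard group here
    } else if (key === 'disallow' && appliesToUs && value) {
      disallowed.push(value);
    }
  }

  return !disallowed.some((rule) => path.startsWith(rule));
}
```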

Puppeteer vs Playwright: Which to Choose?

Feature | Puppeteer | Playwright
--- | --- | ---
Browser Support | Chrome/Chromium only | Chrome, Firefox, Safari (WebKit)
Developer | Google | Microsoft
Auto-wait | Manual | Built-in smart waiting
Parallel Execution | Basic | Isolated browser contexts
Mobile Emulation | Yes | Yes, with richer device profiles
Network Interception | Yes | Yes, more powerful
Debugging Tools | DevTools | Inspector, Trace Viewer, Codegen
API Style | Callback-based origins | Modern async/await from the start

Recommendation:

  • Puppeteer - If you only need Chrome and are already familiar
  • Playwright - For cross-browser, more complete features, new projects

Project Setup

Install Dependencies

mkdir web-scraper && cd web-scraper
npm init -y
npm install puppeteer playwright
npm install typescript ts-node @types/node -D
npx tsc --init

Project Structure

web-scraper/
├── src/
│   ├── scrapers/
│   │   ├── puppeteer-scraper.ts
│   │   └── playwright-scraper.ts
│   ├── utils/
│   │   ├── browser.ts
│   │   └── helpers.ts
│   └── index.ts
├── data/
│   └── output/
├── package.json
└── tsconfig.json

TypeScript Configuration

// filepath: tsconfig.json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}

Basic Navigation & Selectors

Puppeteer Basic Example

// filepath: src/scrapers/puppeteer-basic.ts
import puppeteer from 'puppeteer';

async function basicScraping() {
  // Launch browser
  const browser = await puppeteer.launch({
    headless: true, // false to see the browser
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  const page = await browser.newPage();

  // Set viewport
  await page.setViewport({ width: 1280, height: 800 });

  // Navigate to page
  await page.goto('https://quotes.toscrape.com', {
    waitUntil: 'networkidle2', // wait until ≤2 network connections for at least 500 ms
  });

  // Get page title
  const title = await page.title();
  console.log('Page Title:', title);

  // Get text content with selector
  const firstQuote = await page.$eval('.quote .text', (el) => el.textContent);
  console.log('First Quote:', firstQuote);

  // Get multiple elements
  const quotes = await page.$$eval('.quote', (elements) =>
    elements.map((el) => ({
      text: el.querySelector('.text')?.textContent,
      author: el.querySelector('.author')?.textContent,
      tags: Array.from(el.querySelectorAll('.tag')).map((tag) => tag.textContent),
    }))
  );
  console.log('All Quotes:', quotes);

  await browser.close();
}

basicScraping();

Playwright Basic Example

// filepath: src/scrapers/playwright-basic.ts
import { chromium } from 'playwright';

async function basicScraping() {
  // Launch browser
  const browser = await chromium.launch({
    headless: true,
  });

  // Create context (isolated session)
  const context = await browser.newContext({
    viewport: { width: 1280, height: 800 },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  });

  const page = await context.newPage();

  // Navigate - Playwright auto-waits for load
  await page.goto('https://quotes.toscrape.com');

  // Get page title
  const title = await page.title();
  console.log('Page Title:', title);

  // Get text with locator (recommended way)
  const firstQuote = await page.locator('.quote .text').first().textContent();
  console.log('First Quote:', firstQuote);

  // Get all quotes with locator
  const quoteLocators = page.locator('.quote');
  const count = await quoteLocators.count();

  const quotes = [];
  for (let i = 0; i < count; i++) {
    const quote = quoteLocators.nth(i);
    quotes.push({
      text: await quote.locator('.text').textContent(),
      author: await quote.locator('.author').textContent(),
      tags: await quote.locator('.tag').allTextContents(),
    });
  }
  console.log('All Quotes:', quotes);

  await browser.close();
}

basicScraping();

Selector Strategies

Choosing the right selector is key to robust scraping:

// Playwright selectors - more flexible
const page = await context.newPage();
await page.goto('https://example.com');

// CSS Selector
await page.locator('div.product-card').click();

// Text selector
await page.locator('text=Add to Cart').click();

// Combining selectors
await page.locator('article:has-text("Featured")').click();

// XPath (if CSS isn't enough)
await page.locator('xpath=//div[@data-testid="product"]').click();

// Role selector (accessibility-based)
await page.locator('role=button[name="Submit"]').click();

// Data attributes (most stable for scraping)
await page.locator('[data-product-id="123"]').click();

Tips for choosing selectors:

  1. Prioritize data attributes - [data-testid="x"] is most stable
  2. Avoid auto-generated classes - .css-1a2b3c can change
  3. Use combinations - .product-card h2 is more specific
  4. Test with browser DevTools - Paste selector in Console

Extracting Data

Extract Table Data

// filepath: src/scrapers/extract-table.ts
import { chromium } from 'playwright';

interface ProductData {
  name: string;
  price: string;
  stock: string;
  rating: string;
}

async function extractTableData(): Promise<ProductData[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://webscraper.io/test-sites/e-commerce/allinone');

  // Wait for product cards to render
  await page.waitForSelector('.thumbnail');

  // Extract data from product cards
  const products = await page.evaluate(() => {
    const items: ProductData[] = [];
    const cards = document.querySelectorAll('.thumbnail');

    cards.forEach((card) => {
      items.push({
        name: card.querySelector('.title')?.getAttribute('title') || '',
        price: card.querySelector('.price')?.textContent?.trim() || '',
        stock: card.querySelector('.pull-right')?.textContent?.trim() || '',
        rating: card.querySelectorAll('.glyphicon-star').length.toString(),
      });
    });

    return items;
  });

  console.log(`Extracted ${products.length} products`);
  console.table(products);

  await browser.close();
  return products;
}

extractTableData();
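The extracted products above are only logged to the console, but the project structure reserves `data/output/` for results. A small sketch for persisting them as JSON; the `saveResults` helper is our own addition, not part of either library:

```typescript
import { mkdirSync, writeFileSync } from 'fs';
import { join } from 'path';

// Persist scraped results as timestamped JSON under data/output/
// (the directory from the project structure above).
export function saveResults<T>(
  data: T[],
  name: string,
  outDir = 'data/output'
): string {
  mkdirSync(outDir, { recursive: true }); // create the folder if missing
  const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
  const filePath = join(outDir, `${name}-${timestamp}.json`);
  writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf-8');
  return filePath;
}
```

Calling `saveResults(products, 'products')` after `extractTableData()` would write something like `data/output/products-2025-12-29T10-00-00-000Z.json`.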

Extract with Pagination

// filepath: src/scrapers/extract-with-pagination.ts
import { chromium, Page } from 'playwright';

interface Quote {
  text: string;
  author: string;
  tags: string[];
}

async function extractQuotesFromPage(page: Page): Promise<Quote[]> {
  return await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.quote')).map((el) => ({
      text: el.querySelector('.text')?.textContent || '',
      author: el.querySelector('.author')?.textContent || '',
      tags: Array.from(el.querySelectorAll('.tag')).map(
        (tag) => tag.textContent || ''
      ),
    }));
  });
}

async function scrapeAllPages() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  let allQuotes: Quote[] = [];
  let currentPage = 1;

  await page.goto('https://quotes.toscrape.com');

  while (true) {
    console.log(`Scraping page ${currentPage}...`);

    // Extract quotes from current page
    const quotes = await extractQuotesFromPage(page);
    allQuotes = [...allQuotes, ...quotes];

    // Check if there's a next button
    const nextButton = page.locator('.next > a');
    const hasNext = (await nextButton.count()) > 0;

    if (!hasNext) {
      console.log('No more pages');
      break;
    }

    // Click next and wait for navigation
    await nextButton.click();
    await page.waitForLoadState('networkidle');

    currentPage++;

    // Rate limiting - don't go too fast
    await page.waitForTimeout(1000);
  }

  console.log(`Total quotes extracted: ${allQuotes.length}`);
  await browser.close();

  return allQuotes;
}

scrapeAllPages();

Handling Dynamic Content

Modern websites often load content with JavaScript. Here’s how to handle it:

Wait for Elements

// filepath: src/scrapers/dynamic-content.ts
import { chromium } from 'playwright';

async function handleDynamicContent() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/dynamic-page');

  // Wait for specific element
  await page.waitForSelector('.dynamic-content', {
    state: 'visible',
    timeout: 10000,
  });

  // Wait for element with specific text
  await page.waitForSelector('text=Data loaded');

  // Wait for network idle (all requests complete)
  await page.waitForLoadState('networkidle');

  // Wait for function condition
  await page.waitForFunction(() => {
    const items = document.querySelectorAll('.list-item');
    return items.length > 5;
  });

  const data = await page.locator('.dynamic-content').textContent();
  console.log('Dynamic content:', data);

  await browser.close();
}

handleDynamicContent();

Infinite Scroll

// filepath: src/scrapers/infinite-scroll.ts
import { chromium } from 'playwright';

async function handleInfiniteScroll() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com/infinite-scroll');

  // Scroll until item count is reached or limit
  const targetItems = 50;
  const maxScrolls = 10;
  let scrollCount = 0;

  while (scrollCount < maxScrolls) {
    // Count current items
    const itemCount = await page.locator('.item').count();
    console.log(`Items loaded: ${itemCount}`);

    if (itemCount >= targetItems) {
      console.log('Target reached!');
      break;
    }

    // Scroll to bottom
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    // Wait for new content
    await page.waitForTimeout(2000);

    // Check if we've reached the end
    const newItemCount = await page.locator('.item').count();
    if (newItemCount === itemCount) {
      console.log('No new items loaded - end of content');
      break;
    }

    scrollCount++;
  }

  // Extract all items
  const items = await page.locator('.item').allTextContents();
  console.log(`Total items: ${items.length}`);

  await browser.close();
}

handleInfiniteScroll();

Anti-Bot Evasion Techniques

Stealth Mode

// filepath: src/utils/stealth-browser.ts
import { chromium, Browser, BrowserContext } from 'playwright';

export async function createStealthBrowser(options: {
  headless?: boolean;
  proxy?: string;
} = {}): Promise<{ browser: Browser; context: BrowserContext }> {
  const browser = await chromium.launch({
    headless: options.headless ?? true,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
    ],
  });

  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
    permissions: ['geolocation'],
    geolocation: { latitude: 40.7128, longitude: -74.006 },
    ...(options.proxy && {
      proxy: { server: options.proxy },
    }),
  });

  // Override navigator.webdriver
  await context.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined,
    });

    // Override plugins
    Object.defineProperty(navigator, 'plugins', {
      get: () => [1, 2, 3, 4, 5],
    });

    // Override languages
    Object.defineProperty(navigator, 'languages', {
      get: () => ['en-US', 'en'],
    });
  });

  return { browser, context };
}

Human-like Behavior

// filepath: src/utils/human-behavior.ts
import { Page } from 'playwright';

export async function randomDelay(min: number, max: number): Promise<void> {
  const delay = Math.random() * (max - min) + min;
  await new Promise((resolve) => setTimeout(resolve, delay));
}

export async function humanType(
  page: Page,
  selector: string,
  text: string
): Promise<void> {
  await page.click(selector);
  
  for (const char of text) {
    await page.keyboard.type(char);
    await randomDelay(50, 150); // Random delay between keystrokes
  }
}

export async function humanScroll(page: Page): Promise<void> {
  const scrollAmount = Math.floor(Math.random() * 500) + 200;
  await page.evaluate((amount) => {
    window.scrollBy({
      top: amount,
      behavior: 'smooth',
    });
  }, scrollAmount);
  await randomDelay(500, 1500);
}

Best Practices

1. Error Handling & Retry

// filepath: src/utils/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    delay?: number;
    backoff?: number;
  } = {}
): Promise<T> {
  const { maxRetries = 3, delay = 1000, backoff = 2 } = options;

  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      console.log(`Attempt ${attempt} failed: ${lastError.message}`);

      if (attempt < maxRetries) {
        const waitTime = delay * Math.pow(backoff, attempt - 1);
        console.log(`Waiting ${waitTime}ms before retry...`);
        await new Promise((resolve) => setTimeout(resolve, waitTime));
      }
    }
  }

  throw lastError;
}
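With the defaults above (`delay = 1000`, `backoff = 2`), wait times between attempts grow exponentially. A quick sketch of the schedule produced by the `delay * Math.pow(backoff, attempt - 1)` formula:

```typescript
// Wait times between attempts for the exponential backoff above:
// delay * backoff^(attempt - 1), for attempts 1..maxRetries-1
function backoffSchedule(delay: number, backoff: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries - 1 }, (_, i) => delay * Math.pow(backoff, i));
}

console.log(backoffSchedule(1000, 2, 4)); // → [ 1000, 2000, 4000 ]
```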

2. Rate Limiting

// filepath: src/utils/rate-limiter.ts
export class RateLimiter {
  private queue: (() => Promise<void>)[] = [];
  private processing = false;
  private requestsThisSecond = 0;
  private lastReset = Date.now();

  constructor(private requestsPerSecond: number = 1) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push(async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (error) {
          reject(error);
        }
      });

      this.processQueue();
    });
  }

  private async processQueue() {
    if (this.processing) return;
    this.processing = true;

    while (this.queue.length > 0) {
      const now = Date.now();

      // Reset counter every second
      if (now - this.lastReset >= 1000) {
        this.requestsThisSecond = 0;
        this.lastReset = now;
      }

      // Wait if rate limit reached
      if (this.requestsThisSecond >= this.requestsPerSecond) {
        await new Promise((resolve) =>
          setTimeout(resolve, 1000 - (now - this.lastReset))
        );
        continue;
      }

      const fn = this.queue.shift();
      if (fn) {
        this.requestsThisSecond++;
        await fn();
      }
    }

    this.processing = false;
  }
}
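The queue-based class above is useful when requests originate from many places in your code. When a single scraper issues requests sequentially, a simpler variant of the same idea, a fixed gap between tasks, is often enough. A sketch (`runThrottled` is our own helper, not a library API):

```typescript
// Simplest form of rate limiting: run tasks one at a time with a fixed gap.
async function runThrottled<T>(
  tasks: (() => Promise<T>)[],
  gapMs: number
): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task());
    await new Promise((resolve) => setTimeout(resolve, gapMs)); // pause between requests
  }
  return results;
}
```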

Conclusion

Web scraping with Puppeteer and Playwright gives you powerful capabilities to extract data from modern websites:

  1. Puppeteer - Solid choice for Chrome-only scraping
  2. Playwright - More powerful with multi-browser support and modern features
  3. Always respect ToS and rate limits - Responsible scraping
  4. Handle dynamic content - Use proper waiting strategies
  5. Anti-bot evasion - User agents, proxies, human behavior simulation
  6. Production-ready - Error handling, retry logic, logging, scheduling

Start with basic scraping, understand the target website, then scale up with advanced techniques as needed.
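Point 6 mentions scheduling, which the tutorial has not shown. A minimal sketch of a periodic runner that keeps the schedule alive when a scrape fails; `scheduleJob` is a hypothetical helper, and production setups usually reach for cron or a dedicated job queue instead:

```typescript
// Minimal scheduler sketch: run a scrape job every `intervalMs`, up to `maxRuns` times.
async function scheduleJob(
  job: () => Promise<void>,
  intervalMs: number,
  maxRuns: number
): Promise<number> {
  let runs = 0;
  while (runs < maxRuns) {
    try {
      await job();
    } catch (err) {
      console.error('Job failed:', err); // log and keep the schedule alive
    }
    runs++;
    if (runs < maxRuns) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return runs;
}
```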

Resources