[Node.js] Trying Out Crawlee (Scraping with TypeScript)

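1. Overview

The previous article used Prisma to insert data into MySQL and query it back. This article uses Crawlee to scrape a web page.

It is aimed at readers with about a year of development experience who want to build something from scratch on their own, so basic terminology is not explained.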

2. What is Crawlee

Crawlee is a web scraping and browser automation library.

3. Installing Node.js

See here for reference.
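
Before creating the project, it may help to confirm that the installed Node.js version is recent enough. The following is a minimal sketch, not part of the original project. It assumes a minimum major version of 16, which is what the Crawlee v3 documentation states as far as I know (adjust the constant if your Crawlee version says otherwise). `MIN_NODE_MAJOR` and `isSupportedNode` are hypothetical names introduced only for this example.

```typescript
// Assumption: minimum major version taken from the Crawlee v3 requirements.
const MIN_NODE_MAJOR = 16;

// Returns true when a version string such as "18.17.0" or "v18.17.0"
// has a major version greater than or equal to minMajor.
function isSupportedNode(version: string, minMajor: number = MIN_NODE_MAJOR): boolean {
  const major = Number.parseInt(version.replace(/^v/, "").split(".")[0] ?? "", 10);
  return Number.isInteger(major) && major >= minMajor;
}

// process.versions.node holds the running Node.js version without the "v" prefix.
console.log(
  `Node.js ${process.versions.node} supported: ${isSupportedNode(process.versions.node)}`
);
```

Running `node -v` in a terminal gives the same version information if a script is not needed.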

4. Creating the project

npx crawlee create crawlee-sample
? Please select the template for your new Crawlee project Getting started example [TypeScript]

added 291 packages, and audited 292 packages in 6s

68 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities
Project crawlee-sample was created. To run it, run "cd crawlee-sample" and "npm start".
  • Select the following template when creating the project
    • Getting started example [TypeScript]

5. Source code

5-1. src/main.ts

import { PlaywrightCrawler } from "crawlee";

type Book = {
  title?: string | null | undefined;
  price?: string | null | undefined;
};

const crawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: {
      headless: true,
    },
  },
  // Runs once for every page that loads successfully.
  // Only the context members actually used below are destructured.
  async requestHandler({ request, page, log }) {
    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // Evaluated inside the browser: map each <article> element to a Book record.
    const books = await page.$$eval("article", ($articles) =>
      $articles.map(($article) => {
        const title: string | null | undefined = $article
          .querySelector("h3>a")
          ?.getAttribute("title");
        const price: string | null | undefined =
          $article.querySelector("div>p")?.textContent;
        const book: Book = { title, price };
        return book;
      })
    );
    console.log("==========================================================");
    for (const book of books) {
      console.log(`Book:${book.title}, ${book.price}`);
    }
    console.log("==========================================================");
  },
  // Called after a request has exhausted its retries, so log it as an error.
  failedRequestHandler({ request, log }) {
    log.error(`Request ${request.url} failed too many times.`);
  },
  maxRequestsPerCrawl: 20,
});

await crawler.addRequests(["https://books.toscrape.com/"]);
await crawler.run();

console.log(`Crawler finished!`);
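
Because `title` and `price` are typed as optional and nullable, a book whose element is missing would be printed as "undefined" or "null" by the loop above. One possible way to guard against that is a type guard that keeps only complete records. This is a sketch, not part of the original project: `CompleteBook`, `isCompleteBook`, and `cleanBooks` are hypothetical names, and the second and third sample records are made up to show the filtering.

```typescript
// Same shape as the Book type in src/main.ts.
type Book = {
  title?: string | null;
  price?: string | null;
};

// Hypothetical helper type: a Book whose fields are guaranteed to be strings.
type CompleteBook = { title: string; price: string };

// Type guard: true only when both fields are present and not blank.
function isCompleteBook(book: Book): book is CompleteBook {
  return (
    typeof book.title === "string" &&
    book.title.trim() !== "" &&
    typeof book.price === "string" &&
    book.price.trim() !== ""
  );
}

// Drops incomplete records and trims surrounding whitespace from the rest.
function cleanBooks(books: Book[]): CompleteBook[] {
  return books
    .filter(isCompleteBook)
    .map(({ title, price }) => ({ title: title.trim(), price: price.trim() }));
}

// Made-up sample input: one complete record and two incomplete ones.
const sample: Book[] = [
  { title: "A Light in the Attic", price: "£51.77" },
  { title: null, price: "£10.00" },
  { title: "Olio", price: undefined },
];

// Only the first record survives the filter.
console.log(cleanBooks(sample));
```

In `requestHandler`, the result of `page.$$eval` could be passed through `cleanBooks` before the printing loop, so the loop only ever sees records with both fields set.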

5-2. Target page

https://books.toscrape.com/ (the URL passed to crawler.addRequests in src/main.ts)

5-3. Data to extract

  • Title
  • Price
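
The price is scraped as display text such as "£51.77", so any sorting or arithmetic needs a numeric value first. The sketch below shows one way to do the conversion. It is not part of the original project: `parsePrice` is a hypothetical helper, and it assumes prices contain a plain decimal number with no thousands separators.

```typescript
// Parses a price label such as "£51.77" into a number.
// Returns null when the label is missing or contains no number.
// Assumption: no thousands separators (e.g. "1,234.56" is not handled).
function parsePrice(label: string | null | undefined): number | null {
  if (!label) return null;
  const match = label.match(/\d+(?:\.\d+)?/);
  return match ? Number.parseFloat(match[0]) : null;
}

console.log(parsePrice("£51.77")); // 51.77
console.log(parsePrice(undefined)); // null
```

Returning `null` instead of `NaN` keeps the "no price found" case explicit, which matches the nullable `price` field of the `Book` type.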

6. Running the crawler

npm start
> crawlee-sample@0.0.1 start
> npm run start:dev


> crawlee-sample@0.0.1 start:dev
> tsx src/main.ts

INFO  PlaywrightCrawler: Starting the crawler.
INFO  PlaywrightCrawler: Title of https://books.toscrape.com/ is 'All products | Books to Scrape - Sandbox'
==========================================================
Book:A Light in the Attic, £51.77
Book:Tipping the Velvet, £53.74
Book:Soumission, £50.10
Book:Sharp Objects, £47.82
Book:Sapiens: A Brief History of Humankind, £54.23
Book:The Requiem Red, £22.65
Book:The Dirty Little Secrets of Getting Your Dream Job, £33.34
Book:The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, £17.93
Book:The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, £22.60
Book:The Black Maria, £52.15
Book:Starving Hearts (Triangular Trade Trilogy, #1), £13.99
Book:Shakespeare's Sonnets, £20.66
Book:Set Me Free, £17.46
Book:Scott Pilgrim's Precious Little Life (Scott Pilgrim #1), £52.29
Book:Rip it Up and Start Again, £35.02
Book:Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991, £57.25
Book:Olio, £23.88
Book:Mesaerion: The Best Science Fiction Stories 1800-1849, £37.59
Book:Libertarianism for Beginners, £51.33
Book:It's Only the Himalayas, £45.17
==========================================================
INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3527,"requestsFinishedPerMinute":15,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3527,"requestsTotal":1,"crawlerRuntimeMillis":3998}
INFO  PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
Crawler finished!

7. Directory structure

.
├── Dockerfile
├── README.md
├── package-lock.json
├── package.json
├── src
│   └── main.ts
├── storage
│   ├── key_value_stores
│   └── request_queues
└── tsconfig.json

4 directories, 6 files

8. Notes

This article covered scraping a web page with Crawlee.

Author profile

Sondon
A systems engineer who loves development.
Currently hooked on table tennis.
