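1. Overview
The previous article covered registering data in MySQL with Prisma and then querying it. This article covers scraping a web page with Crawlee.
It is aimed at readers with about a year of development experience who want to try building something from scratch on their own. For that reason, it does not explain detailed terminology.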
2. What is Crawlee
Crawlee is a web scraping and browser automation library.
3. Installing Node
See here for reference.
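Before creating the project, it can help to confirm that the installed Node.js version is recent enough. The following is a minimal sketch, not part of the original setup: `isSupportedNode` is a hypothetical helper, and the default minimum of 16 is an assumption based on the requirement listed in Crawlee's documentation at the time of writing.

```typescript
// Hypothetical helper: checks whether a Node.js version string meets a minimum major version.
// The default of 16 is an assumption; adjust it to the requirement of the Crawlee version you use.
function isSupportedNode(version: string, minMajor: number = 16): boolean {
  // Accept both "v18.17.0" (process.version style) and "18.17.0".
  const majorText = version.replace(/^v/, "").split(".")[0] ?? "";
  const major = Number.parseInt(majorText, 10);
  return Number.isInteger(major) && major >= minMajor;
}

console.log(isSupportedNode("v18.17.0")); // true
console.log(isSupportedNode("14.21.3")); // false
```

In a real script you would pass `process.version` instead of a literal string.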
4. Creating the project
npx crawlee create crawlee-sample
? Please select the template for your new Crawlee project Getting started example [TypeScript]
added 291 packages, and audited 292 packages in 6s
68 packages are looking for funding
run `npm fund` for details
found 0 vulnerabilities
Project crawlee-sample was created. To run it, run "cd crawlee-sample" and "npm start".
- Select the following template when creating the project
- Getting started example [TypeScript]
5. Source code
5-1. src/main.ts
import { PlaywrightCrawler } from "crawlee";

// Shape of one scraped record; a field is null or undefined when its selector does not match.
type Book = {
  title?: string | null | undefined;
  price?: string | null | undefined;
};

const crawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: {
      // Run the browser without a visible window.
      headless: true,
    },
  },
  async requestHandler({ request, page, enqueueLinks, log, pushData }) {
    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // Runs inside the browser: read the title and price from each <article> element.
    const books = await page.$$eval("article", ($articles) =>
      $articles.map(($article) => {
        const title: string | null | undefined = $article
          .querySelector("h3>a")
          ?.getAttribute("title");
        const price: string | null | undefined =
          $article.querySelector("div>p")?.textContent;
        const book: Book = { title, price };
        return book;
      })
    );

    console.log("==========================================================");
    for (const book of books) {
      console.log(`Book:${book.title}, ${book.price}`);
    }
    console.log("==========================================================");
  },
  // Called once a request has failed on every retry.
  failedRequestHandler({ request, log }) {
    log.info(`Request ${request.url} failed too many times.`);
  },
  // Upper limit on the number of requests processed in one run.
  maxRequestsPerCrawl: 20,
});

await crawler.addRequests(["https://books.toscrape.com/"]);
await crawler.run();
console.log(`Crawler finished!`);
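Because both fields of `Book` may be `null` or `undefined` (for example when a selector does not match), it may be useful to filter out incomplete records before using them. The following is a minimal sketch under that assumption: `CompleteBook` and `isCompleteBook` are hypothetical names not found in the original code, and the sample array reuses titles and prices from the run output with gaps introduced purely for illustration.

```typescript
// Same shape as the Book type in src/main.ts.
type Book = {
  title?: string | null | undefined;
  price?: string | null | undefined;
};

// Hypothetical narrowed type: both fields are non-empty strings.
type CompleteBook = { title: string; price: string };

// Type guard that rejects records with a missing or blank title or price.
function isCompleteBook(book: Book): book is CompleteBook {
  return (
    typeof book.title === "string" &&
    book.title.trim() !== "" &&
    typeof book.price === "string" &&
    book.price.trim() !== ""
  );
}

// Illustrative data only: the second and third entries simulate failed selector matches.
const scraped: Book[] = [
  { title: "A Light in the Attic", price: "£51.77" },
  { title: undefined, price: "£53.74" },
  { title: "Olio", price: null },
];

const complete: CompleteBook[] = scraped.filter(isCompleteBook);
console.log(complete.length); // 1
```

Using a type guard here lets TypeScript treat `title` and `price` as plain strings in later code, so no further null checks are needed.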
5-2. Target page to scrape
https://books.toscrape.com/
5-3. Data retrieved
- Title
- Price
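The price is retrieved as text such as `£51.77`, so any numeric use (sorting, totals) needs a conversion step. The following is a minimal sketch of one way to do that: `parsePrice` is a hypothetical helper that is not part of the original code, and it assumes prices use a period as the decimal separator.

```typescript
// Hypothetical helper: converts a scraped price string such as "£51.77" into a number.
// Returns null when the input is missing or contains no number.
function parsePrice(raw: string | null | undefined): number | null {
  if (raw == null) return null;
  // Drop thousands separators, then take the first run of digits with an optional decimal part.
  const match = raw.replace(/,/g, "").match(/\d+(?:\.\d+)?/);
  return match ? Number.parseFloat(match[0]) : null;
}

console.log(parsePrice("£51.77")); // 51.77
console.log(parsePrice(null)); // null
```

Returning `null` rather than `NaN` keeps the "no price" case explicit for callers.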
6. Running
npm start
> crawlee-sample@0.0.1 start
> npm run start:dev
> crawlee-sample@0.0.1 start:dev
> tsx src/main.ts
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Title of https://books.toscrape.com/ is 'All products | Books to Scrape - Sandbox'
==========================================================
Book:A Light in the Attic, £51.77
Book:Tipping the Velvet, £53.74
Book:Soumission, £50.10
Book:Sharp Objects, £47.82
Book:Sapiens: A Brief History of Humankind, £54.23
Book:The Requiem Red, £22.65
Book:The Dirty Little Secrets of Getting Your Dream Job, £33.34
Book:The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, £17.93
Book:The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, £22.60
Book:The Black Maria, £52.15
Book:Starving Hearts (Triangular Trade Trilogy, #1), £13.99
Book:Shakespeare's Sonnets, £20.66
Book:Set Me Free, £17.46
Book:Scott Pilgrim's Precious Little Life (Scott Pilgrim #1), £52.29
Book:Rip it Up and Start Again, £35.02
Book:Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991, £57.25
Book:Olio, £23.88
Book:Mesaerion: The Best Science Fiction Stories 1800-1849, £37.59
Book:Libertarianism for Beginners, £51.33
Book:It's Only the Himalayas, £45.17
==========================================================
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3527,"requestsFinishedPerMinute":15,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3527,"requestsTotal":1,"crawlerRuntimeMillis":3998}
INFO PlaywrightCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
Crawler finished!
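The `requestsFinishedPerMinute` value of 15 in the final statistics can look odd for a run with a single request. One reading that is consistent with the logged numbers is that it is the number of finished requests divided by the crawler runtime in minutes, rounded to an integer. The sketch below only checks that arithmetic; the formula is an assumption and has not been confirmed against Crawlee's implementation, and `perMinute` is a hypothetical helper.

```typescript
// Values copied from the "Final request statistics" log line.
const requestsFinished = 1;
const crawlerRuntimeMillis = 3998;

// Assumed formula: count divided by runtime in minutes, rounded to the nearest integer.
function perMinute(count: number, runtimeMillis: number): number {
  const minutes = runtimeMillis / 60000;
  return Math.round(count / minutes);
}

console.log(perMinute(requestsFinished, crawlerRuntimeMillis)); // 15
```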
7. Directory structure
.
├── Dockerfile
├── README.md
├── package-lock.json
├── package.json
├── src
│ └── main.ts
├── storage
│ ├── key_value_stores
│ └── request_queues
└── tsconfig.json
4 directories, 6 files
8. Notes
This article covered scraping a web page with Crawlee.
9. References