A Web Scraping Technique 784x Faster Than BeautifulSoup

Overview

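A technique has appeared that is reported to be about 784 times faster than BeautifulSoup, the long-standing standard library for scraping in Python. The claim is that it gets past site defenses and keeps collecting data without stalling or breaking.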

Details

Why BeautifulSoup is slow

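# Conventional approach (requests + BeautifulSoup)
import requests
from bs4 import BeautifulSoup

# Problems:
# 1. Pages are requested one at a time, in order (serial processing)
# 2. Content generated by JavaScript cannot be retrieved
# 3. Vulnerable to bot countermeasures (rate limiting, CAPTCHA)
# 4. Poor memory efficiency with large volumes of data

url = "https://example.com"
response = requests.get(url, timeout=10)  # without a timeout, a stalled server blocks forever
response.raise_for_status()  # fail fast on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")
data = soup.find_all("div", class_="item")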

Fast scraping techniques

The following approaches could plausibly account for a speedup on the order of 784x:

1. Asynchronous processing + parallel requests

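import asyncio
import aiohttp

async def fetch(session, url):
    # Download one page and return its body as text
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()

async def scrape_all(urls):
    # One shared session reuses connections across requests
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)  # fetch all URLs concurrently

# Hypothetical URL list for illustration
url_list = [f"https://example.com/page/{i}" for i in range(1000)]

if __name__ == "__main__":
    # Fetch 1000 URLs concurrently; much faster than fetching serially
    results = asyncio.run(scrape_all(url_list))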
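
The `asyncio.gather` call above starts every request at once, which can overwhelm a target site or exhaust local sockets. Below is a minimal sketch of capping concurrency with `asyncio.Semaphore`. The network call is replaced by `asyncio.sleep` so the example runs offline; `fake_fetch`, `LATENCY`, and the limit of 10 are illustrative assumptions, not part of the original.

```python
import asyncio
import time

LATENCY = 0.05  # assumed per-request latency in seconds (simulated)

async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(LATENCY)
    return f"<html>{url}</html>"

async def fetch_serial(urls):
    # One request at a time: total time is roughly len(urls) * LATENCY.
    return [await fake_fetch(u) for u in urls]

async def fetch_bounded(urls, limit=10):
    # At most `limit` requests are in flight at any moment.
    sem = asyncio.Semaphore(limit)

    async def guarded(u):
        async with sem:
            return await fake_fetch(u)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(u) for u in urls))

def timed(coro):
    # Run a coroutine to completion and measure wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

urls = [f"https://example.com/page/{i}" for i in range(20)]
serial_pages, serial_time = timed(fetch_serial(urls))
bounded_pages, bounded_time = timed(fetch_bounded(urls, limit=10))
print(f"serial: {serial_time:.2f}s, bounded-concurrent: {bounded_time:.2f}s")
```

With 20 simulated URLs and a limit of 10, the bounded version needs roughly two latency periods instead of twenty. Real-world gains depend on server response times and on any rate limits the site enforces.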

2. Playwright / Puppeteer (JavaScript rendering support)

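import asyncio
from playwright.async_api import async_playwright

async def scrape_with_js():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        # Read the DOM after JavaScript has run; extract the text before
        # closing the browser, since element handles do not outlive it
        data = await page.locator(".item").all_text_contents()
        await browser.close()
        return data

if __name__ == "__main__":
    print(asyncio.run(scrape_with_js()))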

3. Frameworks: Crawlee (Node.js) or Scrapy (Python)

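# Scrapy: high throughput; distributed crawling is available through add-ons
# Handled by the framework:
#   - Request queue management
#   - Rate-limit compliance (through its throttling settings)
#   - Retries
#   - Data pipelines
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield one record per element matching the .item selector
        for item in response.css(".item"):
            yield {"title": item.css("h2::text").get()}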
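
The queueing, throttling, and retry behavior listed in the comments above is controlled through Scrapy settings. The following is a sketch of a settings dictionary that could be assigned to a spider's `custom_settings` attribute. The setting names are standard Scrapy settings, but every value and the `myproject.pipelines.CleanItemPipeline` path are illustrative assumptions.

```python
# Hypothetical values; tune them for the target site and its terms of use.
CUSTOM_SETTINGS = {
    "CONCURRENT_REQUESTS": 16,            # global cap on in-flight requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,  # cap per target domain
    "DOWNLOAD_DELAY": 0.5,                # base delay in seconds between requests
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to observed latency
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,                     # retries in addition to the first attempt
    "ROBOTSTXT_OBEY": True,               # respect robots.txt rules
    "ITEM_PIPELINES": {
        # Hypothetical pipeline class path; the number sets execution order.
        "myproject.pipelines.CleanItemPipeline": 300,
    },
}
```

Inside a spider this would be written as `custom_settings = CUSTOM_SETTINGS`, which overrides the project-wide settings for that spider only.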

4. AI-assisted structured data extraction

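# Analyze page content with an LLM (for sites with complex structure)
from anthropic import Anthropic

# Placeholder input; in practice this is the HTML fetched earlier
html_content = "<div class='item'><h2>Widget</h2><span>$9.99</span></div>"

client = Anthropic()  # reads the API key from the ANTHROPIC_API_KEY environment variable
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": f"Extract product names and prices from this HTML: {html_content}"
    }]
)
print(response.content[0].text)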
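
Sending raw HTML to a model spends tokens on markup, scripts, and styles. One possible preprocessing step, sketched below with only the standard library, reduces a page to its visible text before building the prompt. `VisibleTextExtractor`, `html_to_prompt_text`, the `max_chars` cutoff, and the sample HTML are all hypothetical names and values introduced for illustration; the original does not describe this step.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def html_to_prompt_text(html: str, max_chars: int = 4000) -> str:
    # Join visible text chunks line by line and truncate to a size budget.
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]

sample_html = """
<html><head><style>.item{color:red}</style><script>var x = 1;</script></head>
<body><div class="item"><h2>Widget</h2><span>$9.99</span></div></body></html>
"""
prompt_text = html_to_prompt_text(sample_html)
print(prompt_text)
```

The trade-off is that stripping tags also discards structural hints such as class names, so this suits pages where the visible text alone identifies the fields.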

Dealing with bot countermeasures

import random
import time

# User-Agent rotation
headers_list = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."},
]
headers = random.choice(headers_list)  # pick one per request

# Randomized interval between requests
time.sleep(random.uniform(1, 3))

# Proxy rotation (for large-scale scraping)
proxies = {"http": "http://proxy1:port", "https": "http://proxy1:port"}
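
The snippets above show each measure in isolation. As a rough sketch of how rotation and randomized delays might be combined per request, the generator below pairs every URL with a rotated header and a delay. `request_plans`, the placeholder User-Agent strings, and the seed parameter are hypothetical and exist only to make the example deterministic and runnable offline.

```python
import itertools
import random

# Hypothetical placeholder User-Agent strings; substitute real, complete values.
USER_AGENTS = [
    "ExampleAgent/1.0 (placeholder A)",
    "ExampleAgent/1.0 (placeholder B)",
]

def request_plans(urls, min_delay=1.0, max_delay=3.0, seed=None):
    """Yield (url, headers, delay_seconds) for each URL.

    Headers rotate round-robin through USER_AGENTS; the delay is drawn
    uniformly from [min_delay, max_delay].
    """
    rng = random.Random(seed)
    agents = itertools.cycle(USER_AGENTS)
    for url in urls:
        headers = {"User-Agent": next(agents)}
        yield url, headers, rng.uniform(min_delay, max_delay)

plans = list(request_plans(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    seed=0,
))
for url, headers, delay in plans:
    # A real crawler would call time.sleep(delay) and then send the request.
    print(url, headers["User-Agent"], f"{delay:.2f}s")
```

Keeping the plan separate from the actual sleep and request makes the pacing logic easy to test without touching the network.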

Why it matters / when to use it

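When building an automated pipeline that collects large volumes of web data on a regular schedule
When data collection speed has become the bottleneck
When building the data collection layer that supplies web information to AI agents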