Extracting Data from Multiple Pages and Multiple Classes with BeautifulSoup | Learn How to Use Homemade Machine Learning Models and AI

BeautifulSoup is a Python library for parsing markup languages such as HTML and XML and extracting data from them. This article explains how to use BeautifulSoup to extract data from multiple pages and from multiple classes.

Basic Usage of BeautifulSoup

We start with the basic usage of BeautifulSoup. Suppose we have the following HTML.

<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <p class="content">This is content.</p>
    <p class="content">This is also content.</p>
    <p class="other">This is other.</p>
  </body>
</html>

To extract the text of the elements that belong to the "content" class from this HTML, write the following code.

from bs4 import BeautifulSoup
html = '<html><head><title>Example</title></head><body><p class="content">This is content.</p><p class="content">This is also content.</p><p class="other">This is other.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
content_p = soup.find_all('p', {'class': 'content'})
for p in content_p:
    print(p.text)

The code above uses the find_all() method to extract every <p> tag whose class attribute includes "content", and then reads the text attribute of each tag to get its text.
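The same lookup can be written in a few equivalent ways. The sketch below is only an illustration: it compares the attrs dictionary, the class_ keyword, and a CSS selector on a small fragment. The second paragraph carries an extra "highlight" class, which is an assumption added here to show that matching is done against any one of a tag's classes.

```python
from bs4 import BeautifulSoup

# Sample fragment; the extra "highlight" class is hypothetical.
html = """
<p class="content">This is content.</p>
<p class="content highlight">This is also content.</p>
<p class="other">This is other.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways to select <p> tags that carry the "content" class.
by_attrs = soup.find_all("p", {"class": "content"})
by_keyword = soup.find_all("p", class_="content")
by_selector = soup.select("p.content")

# get_text(strip=True) returns the tag's text without surrounding whitespace.
texts = [p.get_text(strip=True) for p in by_attrs]
print(texts)
```

Which form to use is mostly a matter of taste; select() tends to be convenient once the condition involves nesting or several classes.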

Extracting Data from Multiple Classes

Next, we look at how to extract data from multiple classes. Suppose we have the following HTML.

<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <p class="content">This is content 1.</p>
    <p class="content">This is content 2.</p>
    <p class="other">This is other.</p>
  </body>
</html>

To extract the text that belongs to either the "content" class or the "other" class from this HTML, write the following code.

from bs4 import BeautifulSoup
html = '<html><head><title>Example</title></head><body><p class="content">This is content 1.</p><p class="content">This is content 2.</p><p class="other">This is other.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
content_and_other_p = soup.find_all('p', {'class': ['content', 'other']})
for p in content_and_other_p:
    print(p.text)

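In the code above, the second argument of find_all() is a dictionary whose "class" value is a list. This extracts every <p> tag that belongs to either the "content" class or the "other" class, in the order the tags appear in the document.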
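When several classes are matched at once, it is often useful to know which class each piece of text came from. The following is a minimal sketch of one way to do that, grouping the text by class name. The names wanted and texts_by_class are illustrative and not part of BeautifulSoup.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

html = (
    "<html><body>"
    '<p class="content">This is content 1.</p>'
    '<p class="content">This is content 2.</p>'
    '<p class="other">This is other.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

wanted = ["content", "other"]        # classes we want to collect
texts_by_class = defaultdict(list)   # class name -> list of texts

for p in soup.find_all("p", class_=wanted):
    # p.get("class") returns the list of class names on this tag.
    for name in p.get("class", []):
        if name in wanted:
            texts_by_class[name].append(p.get_text(strip=True))

print(dict(texts_by_class))
```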

Extracting Data from Multiple Pages

Next, we look at how to extract data from multiple pages. Suppose we have the following two HTML pages.

page1.html
<html>
  <head>
    <title>Page 1</title>
  </head>
  <body>
    <p class="content">This is content 1 on page 1.</p>
  </body>
</html>
page2.html
<html>
  <head>
    <title>Page 2</title>
  </head>
  <body>
    <p class="content">This is content 1 on page 2.</p>
  </body>
</html>

To extract the text that belongs to the "content" class from these pages, write the following code.

from bs4 import BeautifulSoup
import requests
pages = ['http://example.com/page1.html', 'http://example.com/page2.html']
for page in pages:
    res = requests.get(page)
    soup = BeautifulSoup(res.text, 'html.parser')
    content_p = soup.find_all('p', {'class': 'content'})
    for p in content_p:
        print(p.text)

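The code above uses the requests module to fetch the two HTML pages one at a time. For each page, it uses the find_all() method to extract every <p> tag whose class attribute includes "content", and then reads the text attribute to get the text.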
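In practice a page may fail to load or respond slowly, so the loop is often written a little more defensively. The sketch below is one possible arrangement, not the only one: it separates parsing from downloading, sets a timeout, checks the HTTP status, and pauses between requests. The function names and the delay_seconds and timeout values are assumptions chosen for illustration.

```python
import time

import requests
from bs4 import BeautifulSoup


def extract_content(html):
    """Return the text of every <p> tag with class "content" in one document."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p", class_="content")]


def fetch_html(url, session, timeout=10):
    """Download one page; raise requests.HTTPError on a 4xx or 5xx response."""
    res = session.get(url, timeout=timeout)
    res.raise_for_status()
    return res.text


def scrape_pages(urls, delay_seconds=1.0):
    """Collect "content" text per URL, skipping pages that fail to load."""
    results = {}
    with requests.Session() as session:
        for i, url in enumerate(urls):
            if i > 0:
                time.sleep(delay_seconds)  # pause between requests
            try:
                results[url] = extract_content(fetch_html(url, session))
            except requests.RequestException as exc:
                print(f"skipped {url}: {exc}")
    return results
```

Calling scrape_pages(['http://example.com/page1.html', 'http://example.com/page2.html']) would return a dictionary that maps each URL to its list of texts. Keeping extract_content() free of network access also makes it easy to try out on an HTML string.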

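A Practical Example: Extracting Data from Multiple Classes across Multiple Pages

The following code combines the two techniques and extracts data from multiple classes across multiple pages.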

from bs4 import BeautifulSoup
import requests
pages = ['http://example.com/page1.html', 'http://example.com/page2.html']
for page in pages:
    res = requests.get(page)
    soup = BeautifulSoup(res.text, 'html.parser')
    content_and_other_p = soup.find_all('p', {'class': ['content', 'other']})
    for p in content_and_other_p:
        print(p.text)

The code above extracts the text of every <p> tag that belongs to the "content" class or the "other" class from the two HTML pages.
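Printing the text loses track of which page and which class it came from. As a rough sketch, the extraction can instead return one record per tag. To keep the example runnable offline, the pages are given here as in-memory strings standing in for downloaded HTML. The extract_rows() helper and the "other" paragraph on page2.html are assumptions added for illustration.

```python
from bs4 import BeautifulSoup


def extract_rows(page, html, classes=("content", "other")):
    """Return one dict per matching <p> tag: page name, matched class, text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p", class_=list(classes)):
        # Keep only the class names we searched for.
        matched = [c for c in p.get("class", []) if c in classes]
        rows.append(
            {"page": page, "class": matched[0], "text": p.get_text(strip=True)}
        )
    return rows


# Stand-ins for the HTML that requests.get(...).text would return.
pages = {
    "page1.html": '<p class="content">This is content 1 on page 1.</p>',
    "page2.html": (
        '<p class="content">This is content 1 on page 2.</p>'
        '<p class="other">This is other.</p>'
    ),
}

all_rows = []
for page, html in pages.items():
    all_rows.extend(extract_rows(page, html))

for row in all_rows:
    print(row)
```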

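Caveats When Extracting Data with BeautifulSoup

Keep the following points in mind when extracting data with BeautifulSoup.

If the structure of the HTML changes, extraction can fail (for example, find() returns None when the expected tag is no longer present).
Parsing and searching a large or deeply nested HTML document takes longer.
Depending on how the HTML is structured, extraction can be difficult, for example when the target elements have no distinguishing class or id.

With these points in mind, BeautifulSoup makes data extraction straightforward.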
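The first caveat can be softened by checking the result of find() before using it. The following is a minimal sketch of that idea. The text_or_default() helper and both sample layouts are hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, tag_name, class_name, default=None):
    """Return the text of the first matching tag, or default if none exists."""
    tag = soup.find(tag_name, class_=class_name)
    if tag is None:
        # The expected element is missing, e.g. after a site redesign.
        return default
    return tag.get_text(strip=True)


old_layout = BeautifulSoup('<h1 class="product-name">Widget</h1>', "html.parser")
new_layout = BeautifulSoup('<h1 class="title">Widget</h1>', "html.parser")

found = text_or_default(old_layout, "h1", "product-name")
missing = text_or_default(new_layout, "h1", "product-name", default="(not found)")
print(found, missing)
```

Without such a check, accessing .text on the None returned by find() raises an AttributeError.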

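An Applied Example of Data Analysis with BeautifulSoup

The following is an applied example of using BeautifulSoup for data analysis.

Suppose we want to extract the product name and price from a product page on an e-commerce site and write them to a CSV file. The code looks like this.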

from bs4 import BeautifulSoup
import requests
import csv
url = 'http://example.com/product1.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
# find() returns the first matching tag
product_name = soup.find('h1', {'class': 'product-name'}).text
product_price = soup.find('span', {'class': 'price'}).text
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('product.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product name', 'Price'])  # header row
    writer.writerow([product_name, product_price])

The code above uses the csv module to write the extracted values to a CSV file.
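The same approach extends to several products. The sketch below is an illustration under stated assumptions: the sample HTML strings, product names, and prices are made up, parse_product() and write_products() are hypothetical helpers, and an in-memory buffer stands in for the CSV file so the example runs without network or disk access.

```python
import csv
import io

from bs4 import BeautifulSoup


def parse_product(html):
    """Extract the name and price from one product page (empty string if absent)."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1", class_="product-name")
    price = soup.find("span", class_="price")
    return {
        "name": name.get_text(strip=True) if name is not None else "",
        "price": price.get_text(strip=True) if price is not None else "",
    }


def write_products(rows, file_obj):
    """Write a header row and one CSV row per product."""
    writer = csv.DictWriter(file_obj, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)


# Made-up stand-ins for downloaded product pages.
pages = [
    '<h1 class="product-name">Sample A</h1><span class="price">1,000</span>',
    '<h1 class="product-name">Sample B</h1><span class="price">2,500</span>',
]
rows = [parse_product(html) for html in pages]

# Stands in for open("products.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_products(rows, buffer)
print(buffer.getvalue())
```

Because the prices contain commas, the csv module quotes those fields automatically, which is one reason to prefer it over joining strings by hand.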

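Summary

This article explained how to use BeautifulSoup to extract data from multiple pages and multiple classes. It also presented a practical example of extraction across multiple classes and pages, and an applied example of using BeautifulSoup for data analysis. Because BeautifulSoup makes web scraping and data extraction easy, it is one of the important tools for data analysis.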