Web Scraping with BeautifulSoup: Extracting Data from a Specified Range | Learn How to Use Machine Learning Models and AI by Building Your Own

What Is BeautifulSoup?

BeautifulSoup is a Python library for extracting data from documents written in markup languages such as HTML and XML.
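
As a minimal sketch of what this looks like in practice, the snippet below parses a small HTML string and reads two values out of it. The HTML is made up for illustration and does not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = (
    "<html><head><title>Sample page</title></head>"
    "<body><p id='intro'>Hello, <b>world</b>!</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string          # text of the <title> tag
intro_text = soup.find("p").get_text()  # text of the first <p>, nested tags included

print(page_title)
print(intro_text)
```

Because the input is a plain string, this runs without any network access, which makes it a convenient way to try out selectors before pointing them at a live page.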

Installing BeautifulSoup

To install BeautifulSoup, run the following command in a terminal.

pip install beautifulsoup4

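Basics of Selecting Data with BeautifulSoup

In BeautifulSoup, you select data by tag name, attribute name, and attribute value.

The following example retrieves every link, that is, every element whose tag name is "a".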

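from bs4 import BeautifulSoup
import requests

url = 'http://example.com'
# Download the page; the timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect every <a> tag and print its href attribute (None if it has none)
links = soup.find_all('a')
for link in links:
    print(link.get('href'))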
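
The example above filters by tag name only. As a rough sketch of the other two criteria, attribute name and attribute value, the snippet below runs find_all against a made-up HTML fragment; the link targets and class names are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<ul>
  <li><a href="/home" class="nav">Home</a></li>
  <li><a href="https://example.com/docs" class="nav external" target="_blank">Docs</a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute name only: every <a> that has an href attribute, whatever its value.
with_href = soup.find_all("a", href=True)

# Attribute name and value: <a> tags whose target is exactly "_blank".
new_tab = soup.find_all("a", attrs={"target": "_blank"})

# class is a reserved word in Python, so BeautifulSoup accepts class_ instead.
# A single class name matches elements that carry it among several classes.
nav_links = soup.find_all("a", class_="nav")

hrefs = [a["href"] for a in with_href]
print(hrefs)
```

Filtering with href=True first means the later a["href"] lookup cannot raise a KeyError on anchors that have no href.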

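Restricting the Search Range with BeautifulSoup

To restrict the range, first select a tag, then search for another tag inside it. Only data within the first tag is returned.

The following example gets the text of a span tag with the class name "highlight" that sits inside a div tag with the class name "example".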

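from bs4 import BeautifulSoup
import requests

url = 'http://example.com'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Narrow the search to the div first, then look for the span inside it.
# find() returns None when nothing matches, so check before going further.
example = soup.find('div', class_='example')
highlight = example.find('span', class_='highlight') if example is not None else None
if highlight is not None:
    print(highlight.text)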
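
To make the effect of the range restriction visible, here is a small sketch on a made-up HTML fragment in which a matching span also exists outside the target div. The class name "other" and the span texts are assumptions for illustration. The last line expresses the same range as a CSS selector via select(), as one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<div class="example">
  <span class="highlight">inside 1</span>
  <span class="highlight">inside 2</span>
</div>
<div class="other">
  <span class="highlight">outside</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", class_="example")
# find() returns None when nothing matches, so guard before searching inside.
if container is not None:
    scoped = [s.get_text() for s in container.find_all("span", class_="highlight")]
else:
    scoped = []

# The same range written as a CSS selector: spans nested under div.example.
selected = [s.get_text() for s in soup.select("div.example span.highlight")]

print(scoped)
print(selected)
```

Both approaches skip the span in the second div, because the search never leaves the div with class "example".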

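Concrete Scraping Steps with BeautifulSoup

1. Set the URL of the website you want to scrape.
2. Fetch the website's HTML with the requests module.
3. Parse the HTML with BeautifulSoup.
4. Select and retrieve the data you need.

The following example puts these steps together.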

from bs4 import BeautifulSoup
import requests

url = 'http://example.com'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Get the title
title = soup.find('title')
if title is not None:
    print(title.text)
# Get the body text (assumes the page wraps it in <div class="article">)
article = soup.find('div', class_='article')
if article is not None:
    print(article.text)
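
One possible way to organize the four steps is to separate fetching from parsing, so the extraction logic can be tried on a saved or hand-written HTML string without touching the network. The sketch below is only an illustration: the function names fetch_html and extract_page, the sample HTML, and the "article" class name are assumptions, not part of any library.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url):
    """Set the URL and fetch the page, returning its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_page(html):
    """Parse the HTML and pick out the title and the article text."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    article_tag = soup.find("div", class_="article")  # assumed class name
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "article": article_tag.get_text(" ", strip=True) if article_tag else None,
    }


# Hypothetical HTML standing in for a downloaded page.
sample_html = """
<html><head><title> Sample news </title></head>
<body><div class="article"><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>
"""
result = extract_page(sample_html)
print(result)
```

For a live page the two functions would be chained as extract_page(fetch_html(url)). Missing elements come back as None instead of raising an AttributeError.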

A Data Analysis Example Using BeautifulSoup

The following example scrapes titles and body text from a news site and builds a word cloud with WordCloud.

from bs4 import BeautifulSoup
import requests
from wordcloud import WordCloud

url = 'URL of the news site'  # placeholder: replace with a real URL
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Treat h2 elements as titles and p elements as body text
titles = soup.find_all('h2')
texts = soup.find_all('p')
title_text = ' '.join(title.text for title in titles)
body_text = ' '.join(text.text for text in texts)

# Build the word cloud from both the titles and the body text
wordcloud = WordCloud(width=800, height=800, background_color='white').generate(title_text + ' ' + body_text)
# Save as an image
wordcloud.to_file('wordcloud.png')
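
Since a word cloud is driven largely by how often each word appears, it can help to inspect the word counts of the scraped text before rendering an image. The sketch below does this with the standard library's Counter on a made-up English fragment. The HTML and the simple regular-expression tokenizer are assumptions for illustration, and text in other languages would likely need a different way of splitting words.

```python
from collections import Counter
import re

from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<h2>Python news</h2>
<p>Python makes scraping simple.</p>
<p>Scraping with Python is common.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Join the text of all headings and paragraphs into one string.
parts = [tag.get_text() for tag in soup.find_all(["h2", "p"])]
combined = " ".join(parts)

# Lowercase the text, split it into alphabetic words, and count occurrences.
words = re.findall(r"[a-z]+", combined.lower())
counts = Counter(words)
top = counts.most_common(2)
print(top)
```

Passing a list of tag names to find_all returns the matching elements in document order, so headings and paragraphs stay interleaved as they appear on the page.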

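Summary

BeautifulSoup is a Python library for extracting data from markup languages such as HTML and XML. Once you understand how to select data and how to restrict the search range, web scraping becomes relatively straightforward. The scraped data can then feed further work such as data analysis.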