Even Beginners Can Do It! The Basics of Parsing with BeautifulSoup | Learning How to Use Machine Learning Models and AI by Building Your Own

BeautifulSoup is a Python library for parsing HTML and XML. HTML and XML documents are tree-structured, and BeautifulSoup parses that tree so that you can extract elements and their attributes.

Installing BeautifulSoup

BeautifulSoup can be installed easily with pip. Note that the package name is beautifulsoup4, while the module you import is bs4.

$ pip install beautifulsoup4

Parsing HTML with BeautifulSoup

First, the HTML has to be converted into a BeautifulSoup object. To read the HTML from a file, do the following.

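from bs4 import BeautifulSoup
# encoding="utf-8" assumes the file is saved as UTF-8
with open("sample.html", encoding="utf-8") as fp:
    # Name the parser explicitly; omitting it triggers a warning
    soup = BeautifulSoup(fp, "html.parser")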

To pass the HTML directly as a string, do the following.

from bs4 import BeautifulSoup
html = "<html><body><p>Hello, BeautifulSoup!</p></body></html>"
# "html.parser" is the parser bundled with Python
soup = BeautifulSoup(html, "html.parser")
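
Since the introduction describes HTML as a tree, it may help to see what that looks like once the string is parsed. The following is a minimal sketch that reuses the same one-paragraph HTML; the variable names text, parent_name, and child_names are introduced here only for illustration.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello, BeautifulSoup!</p></body></html>"
# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")

text = p.get_text()          # the text inside the <p> element
parent_name = p.parent.name  # the name of the tag that encloses <p>
# Tag names of the direct children of <body>
child_names = [child.name for child in soup.body.children]

print(text)
print(parent_name)
print(child_names)
```

Each tag knows its parent and its children, which is what makes it possible to walk up and down the document from any element you have found.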

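Extracting elements with BeautifulSoup

BeautifulSoup offers several ways to extract elements. The most common ones are shown below. In each case, find() returns the first matching element, or None if nothing matches.

- Extracting by tag name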

soup.find("h1")

- Extracting by class name

soup.find(class_="header")

- Extracting by id

soup.find(id="title")
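
The calls above each return a single element. As a rough sketch of how they compare with find_all() and CSS selectors via select(), the example below runs them against a small piece of markup. The HTML and all variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this example
html = """
<div class="header" id="title"><h1>Site title</h1></div>
<ul>
  <li class="item">First</li>
  <li class="item">Second</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="item")     # first match only
all_items = soup.find_all("li", class_="item")  # every match, as a list
selected = soup.select("ul > li.item")          # same elements via a CSS selector
missing = soup.find("table")                    # no <table> in the markup

first_text = first_item.get_text()
all_texts = [li.get_text() for li in all_items]
selected_texts = [li.get_text() for li in selected]

print(first_text)
print(all_texts)
print(selected_texts)
print(missing)
```

Because find() returns None when nothing matches, it is usually worth checking the result before accessing attributes or text on it.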

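Getting attributes with BeautifulSoup

To get an attribute of an element, index the element with the attribute name, as you would a dictionary.

element["attribute_name"]

For example, given the following HTML, the src attribute of the img tag can be obtained like this.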

<img src="image.jpg" alt="sample">

img = soup.find("img")
src = img["src"]
print(src)

Output

image.jpg
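
Indexing with square brackets raises KeyError when the attribute is absent. The sketch below shows a few related ways to read attributes that may be useful in that situation. The class attribute on the img tag and the variable names are additions made for this example and are not part of the HTML above.

```python
from bs4 import BeautifulSoup

# The class attribute is added here only to show a multi-valued attribute
html = '<img src="image.jpg" alt="sample" class="thumb large">'
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

src = img["src"]                         # raises KeyError if "src" is missing
title = img.get("title")                 # returns None instead of raising
title_or_default = img.get("title", "")  # or a default value of your choice
classes = img["class"]                   # class is parsed into a list of names
attrs = img.attrs                        # all attributes as a dictionary

print(src)
print(title)
print(classes)
print(attrs["alt"])
```

Using get() is a reasonable choice when scraping real pages, where a given attribute is not guaranteed to be present on every element.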

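An applied example with BeautifulSoup

The following example program fetches titles and links from a news site. Note that the URL is a dummy, and the page is assumed to wrap each story in an article tag.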

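import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/news/"  # dummy URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop here on HTTP errors such as 404
soup = BeautifulSoup(response.content, "html.parser")

# Each <article> is assumed to contain an <h2> title and an <a> link
articles = soup.find_all("article")
for article in articles:
    heading = article.find("h2")
    anchor = article.find("a", href=True)
    if heading is None or anchor is None:
        continue  # skip articles that lack a title or a link
    title = heading.get_text(strip=True)
    link = anchor["href"]
    print(title, link)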
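
Since the URL above is a dummy, the program cannot be run as is. One way to try the extraction logic without network access is to move it into a function that takes an HTML string. The sketch below does this. The function name extract_articles, the sample markup, and the use of urljoin to resolve relative links are all assumptions made for illustration; a real site's structure will differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_articles(html, base_url):
    """Return (title, absolute_link) pairs for each complete <article>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        heading = article.find("h2")
        anchor = article.find("a", href=True)
        if heading is None or anchor is None:
            continue  # skip articles without a title or a link
        title = heading.get_text(strip=True)
        # Resolve relative hrefs such as "/news/1" against the page URL
        link = urljoin(base_url, anchor["href"])
        results.append((title, link))
    return results


# Hypothetical markup standing in for a downloaded page
sample_html = """
<article><h2> First story </h2><a href="/news/1">Read more</a></article>
<article><h2>No link here</h2></article>
<article><h2>Second story</h2><a href="https://www.example.com/news/2">Read more</a></article>
"""

items = extract_articles(sample_html, "https://www.example.com/news/")
for title, link in items:
    print(title, link)
```

Keeping the parsing separate from the HTTP request also makes it easier to test against saved copies of a page before pointing the program at a live site.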