How to Handle UTF-8 with BeautifulSoup: From Basics to Applications | Learn to Build and Use Machine Learning Models and AI

BeautifulSoup is a very handy tool for web scraping, but character encodings such as UTF-8 need care. This article covers BeautifulSoup and UTF-8 from the basics through a practical example.

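BeautifulSoup and UTF-8 Basics

BeautifulSoup is a library for parsing markup languages such as HTML and XML. UTF-8 is a character encoding for Unicode that can represent Japanese and many other languages. Garbled text (mojibake) or decoding errors occur when bytes are decoded with a codec different from the one used to write them, for example when a Shift_JIS page is read as UTF-8.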
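As a minimal sketch of how garbling arises, the snippet below encodes a short Japanese string as UTF-8 and then decodes it once with the right codec and once with the wrong one. The sample string and the choice of latin-1 as the "wrong" codec are assumptions made for illustration only.

```python
# Encode a Japanese string as UTF-8, then decode it correctly and incorrectly.
text = "天気"
data = text.encode("utf-8")        # each of these characters takes 3 bytes in UTF-8
round_trip = data.decode("utf-8")  # correct codec: the original text comes back
garbled = data.decode("latin-1")   # wrong codec: no exception, but the text is mojibake

print(len(data))           # number of bytes, not characters
print(round_trip == text)  # True
print(garbled == text)     # False
```

Note that the wrong codec does not necessarily raise an error. Silent garbling like this is often harder to spot than an exception.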

Specifying UTF-8 in BeautifulSoup

To tell BeautifulSoup that the input is UTF-8, pass from_encoding as follows.

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

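Here, html_doc is the HTML to parse, 'html.parser' selects the parser, and from_encoding='utf-8' names the character encoding. from_encoding only takes effect when html_doc is a bytes object; if html_doc is already a str, it has been decoded and the argument is ignored.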
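The following sketch shows from_encoding applied to a bytes input. The html_bytes value is a made-up sample, not taken from a real page; soup.original_encoding is the attribute where BeautifulSoup records the encoding it used to decode the input.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, encoded to bytes as a server would send it.
html_bytes = (
    "<html><head><title>天気予報</title></head>"
    "<body><p>晴れ</p></body></html>"
).encode("utf-8")

# from_encoding applies because the input is bytes, not str.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

print(soup.original_encoding)  # the encoding BeautifulSoup used for decoding
print(soup.title.string)       # the decoded title text
```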

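Web Scraping Steps with BeautifulSoup and UTF-8

Web scraping with BeautifulSoup and UTF-8 proceeds as follows.

Fetch the HTML from the URL.
Decode the HTML bytes into text (a str) using the page's actual encoding.
Parse the HTML with BeautifulSoup.
Extract the information you need.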
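The steps above can be sketched offline as follows. The fetch step is replaced by an in-memory byte string so the example runs without a network; with requests, res.content would supply these bytes. The sample markup and the assumption that the page is Shift_JIS-encoded are both hypothetical.

```python
from bs4 import BeautifulSoup

# Step 1 (stand-in for the fetch): raw bytes as a server might send them.
# Assumption: this hypothetical page is encoded as Shift_JIS.
raw = (
    "<html><body><h1>週間天気</h1>"
    "<p class='weather'>晴れ</p></body></html>"
).encode("shift_jis")

# Step 2: decode the bytes into a str using the page's actual encoding.
html_text = raw.decode("shift_jis")

# Step 3: parse the decoded text.
soup = BeautifulSoup(html_text, "html.parser")

# Step 4: extract the pieces of interest.
heading = soup.h1.get_text()
weather = soup.find("p", class_="weather").get_text()
print(heading, weather)
```

Once the text is a str, any later output can be written as UTF-8 regardless of the encoding the page arrived in.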

Common UTF-8 Errors in BeautifulSoup and How to Fix Them

When handling UTF-8 with BeautifulSoup, you may run into errors like these.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xXX in position X: invalid start byte
UnicodeEncodeError: 'ascii' codec can't encode characters in position X-Y: ordinal not in range(128)

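When these errors occur, try the following.

Decode the HTML with the encoding the page actually uses, rather than assuming UTF-8.
Pass the raw bytes to BeautifulSoup with from_encoding set to that encoding ('utf-8' for UTF-8 pages).
Convert byte strings to text before processing them (the unicode type in Python 2, str in Python 3), and encode explicitly as UTF-8 on output.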
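A small sketch of both error types and their fixes, using only the standard library. The sample string and the use of Shift_JIS as the mismatched encoding are assumptions chosen to trigger the errors reliably.

```python
# Bytes written as Shift_JIS are not valid UTF-8.
data = "晴れ".encode("shift_jis")

try:
    data.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True  # the wrong codec raises UnicodeDecodeError

fixed = data.decode("shift_jis")                # fix: use the actual encoding
lossy = data.decode("utf-8", errors="replace")  # fallback: bad bytes become U+FFFD

try:
    "晴れ".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True  # ASCII cannot represent Japanese characters

as_utf8 = "晴れ".encode("utf-8")  # fix: encode explicitly as UTF-8

print(decode_failed, encode_failed)
print(fixed)
```

errors="replace" keeps a program running but loses information, so it is better suited to logging or diagnostics than to data you intend to keep.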

An Application Example with BeautifulSoup and UTF-8

The following example program fetches a weekly forecast from a weather site. The URL is a placeholder.

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # placeholder URL
res = requests.get(url, timeout=10)
# Decode res.text with the encoding guessed from the response body.
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'html.parser')

weather_list = []
# Each <li class="forecast"> is assumed to hold one day's forecast.
for li in soup.find_all('li', class_='forecast'):
    weather = li.find('p', class_='weather').string
    temp_max = li.find('div', class_='high-temp temp').span.string
    temp_min = li.find('div', class_='low-temp temp').span.string
    weather_list.append((weather, temp_max, temp_min))
print(weather_list)

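In this program, res.encoding = res.apparent_encoding sets the encoding to the one requests guesses from the response body, so res.text is decoded with that guess. The guess is a heuristic and can be wrong, especially for short pages.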
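As an alternative sketch, BeautifulSoup can be given the raw bytes (res.content in requests) and left to pick the encoding itself, which includes looking at a charset declared in a meta tag. The snippet below uses a hypothetical in-memory Shift_JIS page instead of a real response so that it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical page bytes; with requests these would come from res.content.
page_bytes = (
    '<html><head><meta charset="shift_jis"></head>'
    '<body><p class="weather">晴れ</p></body></html>'
).encode("shift_jis")

# No from_encoding: BeautifulSoup inspects the bytes and the meta declaration.
soup = BeautifulSoup(page_bytes, "html.parser")

weather = soup.find("p", class_="weather").get_text()
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
print(weather)
```

Whether this or apparent_encoding works better depends on the page; when a site declares its charset correctly, passing bytes avoids relying on a statistical guess.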