How to use BeautifulSoup decompose (all, NavigableString) | Learn how to use machine learning models and AI by building your own

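What is BeautifulSoup?

Beautiful Soup is a Python library for extracting information from structured documents such as HTML and XML. It is particularly well suited to parsing HTML and is widely used for tasks such as web scraping.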

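Basic usage of decompose

decompose() is a method on Beautiful Soup tags that removes the tag from the tree and destroys it together with everything inside it. Write it as follows to remove an element.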

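from bs4 import BeautifulSoup

html = '<div class="target"><p>text</p></div><p>other</p>'  # sample input; use your own HTML here
soup = BeautifulSoup(html, 'html.parser')
target = soup.find('div', class_='target')  # first match, or None if nothing matches
if target is not None:
    target.decompose()  # remove the element and everything inside it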

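Here, html is the HTML string to parse and target is the element to delete. Note that find() returns None when nothing matches, so calling decompose() on its result without checking raises an AttributeError.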
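As a self-contained illustration of the call above, the sketch below wraps it in a small helper. The function name remove_first and the sample markup are hypothetical, chosen only for this example. It shows two things: find() may return None, and decompose() itself returns None, so its result should not be assigned or chained.

```python
from bs4 import BeautifulSoup


def remove_first(soup, name, class_name):
    """Remove the first <name class="class_name"> element; return True if one was found.

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """
    target = soup.find(name, class_=class_name)
    if target is None:
        return False
    target.decompose()  # returns None; the tree is modified in place
    return True


sample_html = '<div class="target"><p>remove me</p></div><p>keep me</p>'
soup = BeautifulSoup(sample_html, 'html.parser')

removed = remove_first(soup, 'div', 'target')        # a matching div exists
removed_again = remove_first(soup, 'div', 'target')  # nothing left to remove
result = str(soup)
print(removed, removed_again, result)
```

Returning a bool keeps the caller from having to inspect the tree to learn whether anything changed.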

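Using decompose on all descendants

Beautiful Soup has no separate "decompose all" method. decompose() already removes the specified element together with all of its descendants. To delete every descendant while keeping the element itself in the tree, empty it with clear(), as follows.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
target = soup.find('div', class_='target')
# Destroy every child (tags and text) of target, but keep target itself.
# Do not call target.decompose() first: that would already destroy the children.
target.clear(decompose=True)

Beautiful Soup also provides descendants, a property (a generator, not a method) that yields every tag and text node beneath an element. It is avoided for the deletion here because removing nodes while iterating over it is unreliable.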
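To make the difference between removing an element and emptying it concrete, here is a minimal sketch that compares decompose() with clear(decompose=True) on the same markup. The sample HTML and variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

sample_html = '<div class="target"><h1>title</h1><p>text <b>bold</b></p></div>'

# Case 1: decompose() removes the element together with all of its descendants.
soup_a = BeautifulSoup(sample_html, 'html.parser')
soup_a.find('div', class_='target').decompose()
after_decompose = str(soup_a)

# Case 2: clear() empties the element but leaves the (now empty) tag in the tree.
soup_b = BeautifulSoup(sample_html, 'html.parser')
soup_b.find('div', class_='target').clear(decompose=True)
after_clear = str(soup_b)

print(repr(after_decompose))
print(repr(after_clear))
```

In the first case nothing is left of the div, while in the second an empty div with class target remains.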

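Using decompose with NavigableString

Text inside an element is represented by NavigableString objects. To delete the text inside an element while keeping its tags, remove each text node as follows.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
target = soup.find('div', class_='target')
# find_all(string=True) returns a list of text nodes, so the tree can be modified inside the loop
for string in target.find_all(string=True):
    string.extract()  # detach the text node; the surrounding tags stay in place

Here, find_all(string=True) returns a list of all text nodes inside the specified element. The strings property yields text nodes lazily, so convert it with list() before removing nodes in a loop. Note that target.text is a plain str, so looping over it yields individual characters rather than NavigableString objects.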
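The following sketch shows text removal on a small, self-contained input. The sample markup and variable names are assumptions for illustration. It collects the text nodes into a list before removing them, because removing nodes while iterating a lazy generator can end the iteration early.

```python
from bs4 import BeautifulSoup

sample_html = '<div class="target">intro <p>text</p><a href="http://example.com">link</a></div>'
soup = BeautifulSoup(sample_html, 'html.parser')
target = soup.find('div', class_='target')

# Materialise the text nodes first, then detach each one from the tree.
text_nodes = list(target.find_all(string=True))
for node in text_nodes:
    node.extract()

result = str(soup)
print(result)
```

All tags and attributes survive; only the text nodes are gone.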

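Caveats when using decompose

After calling decompose(), make sure no code still refers to the removed element: it has been destroyed and cannot be reused. If you need the element afterwards, for example to insert it elsewhere, use extract() instead, which removes the element from the tree and returns it.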
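As a hedged sketch of this caveat, the example below contrasts extract(), which hands the removed element back, with decompose(), which destroys it. The sample markup is an assumption, and the decomposed property used at the end is available in Beautiful Soup 4.9.1 and later.

```python
from bs4 import BeautifulSoup

sample_html = '<div id="a"><p>first</p></div><div id="b"><p>second</p></div>'
soup = BeautifulSoup(sample_html, 'html.parser')

# extract() detaches the element and returns it, so it can still be used.
kept = soup.find('div', id='a').extract()
kept_html = str(kept)

# decompose() destroys the element; do not rely on this reference afterwards.
gone = soup.find('div', id='b')
gone.decompose()

remaining = str(soup)
print(kept_html, repr(remaining), kept.decomposed, gone.decomposed)
```

Checking decomposed is a way to guard code that may still hold a reference to a destroyed element.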

An applied example of BeautifulSoup decompose

Suppose we have the following HTML data.

<html>
    <head>
        <title>example</title>
    </head>
    <body>
        <div class="target">
            <h1>title</h1>
            <p>text</p>
            <a href="http://example.com">link</a>
        </div>
        <div class="target">
            <h1>title2</h1>
            <p>text2</p>
            <a href="http://example2.com">link2</a>
        </div>
    </body>
</html>

To delete every div element whose class is target, along with all of its descendants, from this HTML data, write the following.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for target in soup.find_all('div', class_='target'):
    target.decompose()
print(soup.prettify())

In this case, every div element whose class is target is removed together with its descendants, and the following HTML is output.

<html>
 <head>
  <title>
   example
  </title>
 </head>
 <body>
 </body>
</html>
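A variation on the same example, given here as a sketch rather than as part of the original walkthrough: instead of removing whole div elements, it removes only the links inside them. The condensed sample markup is an assumption modelled on the HTML above.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="target"><h1>title</h1><p>text</p><a href="http://example.com">link</a></div>
<div class="target"><h1>title2</h1><p>text2</p><a href="http://example2.com">link2</a></div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# find_all() returns a list, so decomposing inside the loops is safe.
for div in soup.find_all('div', class_='target'):
    for link in div.find_all('a'):
        link.decompose()

remaining_links = soup.find_all('a')
headings = [h1.get_text() for h1 in soup.find_all('h1')]
print(len(remaining_links), headings)
```

The div elements and their headings and paragraphs stay in place; only the a elements are destroyed.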

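Summary

decompose() in Beautiful Soup is a method for removing elements from structured documents such as HTML and XML. It removes the specified element together with all of its descendants; there is no separate "decompose all" method, and applying it to every match means looping over the result of find_all(). Text inside an element consists of NavigableString objects, which can be removed individually while the tags are kept. Make sure that nothing refers to an element after it has been decomposed, because it cannot be reused.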