How to Use Python's re Module: Mastering Regular Expressions

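What Is Python's re Module?

Python's standard library includes the re module for working with regular expressions. A regular expression is a notation for describing patterns in text, and it is used for string-processing tasks such as searching, extraction, replacement, and validation.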

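Basic Regular Expression Patterns

Regular expressions offer many pattern elements. The basic ones are listed below.

.: Matches any single character except a newline (unless the re.DOTALL flag is set).
*: Matches zero or more repetitions of the preceding element (a character, group, or character class).
+: Matches one or more repetitions of the preceding element.
?: Matches zero or one occurrence of the preceding element.
|: Matches either the pattern on its left or the pattern on its right.
(): Groups part of a pattern and captures the text it matches.
[]: Defines a character class, which matches any one of the listed characters.
^: Matches the start of the string (and the start of each line when re.MULTILINE is set).
$: Matches the end of the string or just before a trailing newline (and the end of each line when re.MULTILINE is set).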
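
To make the list above concrete, here is a minimal sketch that tries each metacharacter on a small sample. The sample patterns and strings are made up for illustration, and re.fullmatch is used so that each pattern has to cover the whole sample.

```python
import re

# (pattern, sample) pairs; every pair except the "+" one is expected to match.
samples = [
    (r"h.t", "hot"),        # . : any single character
    (r"ab*c", "ac"),        # * : zero or more "b"
    (r"ab+c", "ac"),        # + : needs at least one "b", so this one fails
    (r"colou?r", "color"),  # ? : the "u" is optional
    (r"cat|dog", "dog"),    # | : either alternative
    (r"(ab)+", "ababab"),   # (): the whole group is repeated
    (r"[aeiou]", "e"),      # []: any one of the listed characters
]

results = {p: re.fullmatch(p, s) is not None for p, s in samples}
for pattern, matched in results.items():
    print(f"{pattern!r}: {matched}")

# ^ and $ anchor to the whole string by default,
# and to every line when re.MULTILINE is passed.
lines = "first\nsecond"
print(re.findall(r"^\w+$", lines))                # []
print(re.findall(r"^\w+$", lines, re.MULTILINE))  # ['first', 'second']
```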

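Main Functions of the re Module and How to Use Them

The re module provides many functions. This section covers the ones used most often.

re.match(pattern, string)

Checks whether the pattern matches at the beginning of the string. It returns a Match object on success and None otherwise.

Example: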

import re
pattern = r"hello"
string = "hello world"
match = re.match(pattern, string)
print(match)

Output:

<re.Match object; span=(0, 5), match='hello'>
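
Because re.match only looks at the start of the string, it is worth seeing what happens when the pattern occurs later. The following is a small sketch (the variable names are illustrative) that also shows how to read details from the Match object once you know it is not None.

```python
import re

string = "hello world"

# "world" is in the string, but not at position 0, so re.match returns None.
not_at_start = re.match(r"world", string)
print(not_at_start)  # None

# Check for None before reading details from the Match object.
m = re.match(r"hello", string)
if m is not None:
    print(m.group())  # hello
    print(m.span())   # (0, 5)
```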

re.search(pattern, string)

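Scans the string and returns a Match object for the first location where the pattern matches, or None if it matches nowhere.

Example: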

import re
pattern = r"world"
string = "hello world"
match = re.search(pattern, string)
print(match)

Output:

<re.Match object; span=(6, 11), match='world'>
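
Note that re.search stops at the first hit even when the pattern occurs several times. A minimal sketch with a made-up sample string:

```python
import re

string = "hello world, wonderful world"

# Three words start with "wo", but only the first is reported.
m = re.search(r"wo\w+", string)
print(m.group(), m.start(), m.end())  # world 6 11

# When nothing matches, the result is None rather than an empty Match.
missing = re.search(r"python", string)
print(missing)  # None
```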

re.findall(pattern, string)

Returns all non-overlapping matches of the pattern in the string as a list.

Example:

import re
pattern = r"o"
string = "hello world"
matches = re.findall(pattern, string)
print(matches)

Output:

['o', 'o']
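
One detail that often surprises people is that the shape of the list returned by re.findall depends on the capture groups in the pattern. The sketch below uses an invented date string to show the three cases.

```python
import re

text = "2023-01-15, 2024-12-31"

# No groups: each item is the whole match.
whole = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(whole)  # ['2023-01-15', '2024-12-31']

# One group: each item is the text of that group only.
years = re.findall(r"(\d{4})-\d{2}-\d{2}", text)
print(years)  # ['2023', '2024']

# Several groups: each item is a tuple with one entry per group.
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(parts)  # [('2023', '01', '15'), ('2024', '12', '31')]
```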

re.sub(pattern, repl, string)

Replaces every part of the string that matches the pattern with the given replacement string and returns the new string.

Example:

import re
pattern = r"world"
string = "hello world"
repl = "Python"
new_string = re.sub(pattern, repl, string)
print(new_string)

Output:

hello Python
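
The replacement does not have to be a fixed string. As a rough sketch, the following shows group backreferences, the count argument, and a replacement function (the helper name double is made up for this example).

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-12-31")
print(reordered)  # 31/12/2024

# count limits how many matches are replaced.
first_only = re.sub(r"o", "0", "hello world", count=1)
print(first_only)  # hell0 world

# A function receives each Match object and returns the replacement text.
def double(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 oranges")
print(doubled)  # 6 apples and 20 oranges
```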

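Practical Examples with the re Module

This section walks through practical examples that use the re module.

Validating Email Addresses

The following function checks whether a string has the general shape of an email address. The pattern is a simplified check and does not cover every address that the email standards allow.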

import re

def validate_email(email):
    # Local part, "@", domain labels, then a top-level domain of two or more letters.
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return re.match(pattern, email) is not None

print(validate_email("example@example.com"))  # True
print(validate_email("example@example"))  # False
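
One caveat with the function above: $ also matches just before a trailing newline, so input that ends in "\n" still passes. A possible variant, sketched here under the hypothetical name validate_email_strict, uses re.fullmatch so that the entire string has to match.

```python
import re

# Same character classes as the pattern above, without the ^ and $ anchors.
PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def validate_email_strict(email):
    # fullmatch requires the whole string to match, so no anchors are needed.
    return re.fullmatch(PATTERN, email) is not None

# With re.match and "$", a trailing newline slips through.
loose = re.match("^" + PATTERN + "$", "example@example.com\n") is not None
print(loose)                                            # True
print(validate_email_strict("example@example.com\n"))  # False
print(validate_email_strict("example@example.com"))    # True
```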

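Extracting URLs

The following function extracts the URLs contained in a string.

import re

def extract_urls(text):
    # re.ASCII restricts \w to [a-zA-Z0-9_]. Without it, \w also matches
    # Japanese characters, and the match would run on into the text after the URL.
    pattern = r"(https?://[\w/:%#\\?\(\)~\.=\+\-]+)"
    return re.findall(pattern, text, flags=re.ASCII)

text = "サイトはhttps://www.example.com/です。"  # "The site is https://www.example.com/."
urls = extract_urls(text)
print(urls)
# ['https://www.example.com/']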
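
If the same pattern is applied to many strings, or if you also need to know where each URL was found, one option is to compile the pattern once and iterate with finditer. This is only a sketch: the name URL_RE and the sample sentence are invented, and the character class is the same simplified one used above, with re.ASCII so that \w stays limited to ASCII letters, digits, and the underscore.

```python
import re

# Compile once and reuse; re.ASCII keeps \w from matching non-ASCII letters.
URL_RE = re.compile(r"https?://[\w/:%#\\?\(\)~\.=\+\-]+", re.ASCII)

text = "See https://www.example.com/ and http://example.org/path?x=1 for details."

found = []
for m in URL_RE.finditer(text):
    # Each Match object carries both the text and its position in the input.
    found.append(m.group())
    print(m.start(), m.group())
```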

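Points to Watch in Pattern Matching

Keep the following points in mind when matching patterns with regular expressions.

Complex patterns can take a long time to process, particularly when nested or overlapping repetition causes heavy backtracking.
A pattern may match at several places in a string; re.search reports only the first match, while re.findall returns all of them.
A pattern may match nothing, in which case re.match and re.search return None and re.findall returns an empty list.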
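
The last two points can be handled with a few lines of defensive code. The helper name first_number below is hypothetical; the idea is simply to check for None before calling group(), since calling it on None raises AttributeError.

```python
import re

def first_number(text):
    # re.search returns None when nothing matches, so test before using it.
    m = re.search(r"\d+", text)
    return m.group() if m else None

print(first_number("room 101, floor 3"))  # 101
print(first_number("no digits here"))     # None

# findall never returns None: it gives every match, or an empty list.
print(re.findall(r"\d+", "room 101, floor 3"))  # ['101', '3']
print(re.findall(r"\d+", "no digits here"))     # []
```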

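Advanced Regular Expression Techniques

Regular expressions also support more advanced techniques. A few of them are introduced here.

Positive Lookahead

A positive lookahead, written (?=...), lets a match succeed only when the specified pattern comes immediately after it. The text inside the lookahead is checked but not included in the match.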

import re
pattern = r"\d+(?=円)"
text = "価格は1000円です。"  # "The price is 1000 yen."
match = re.search(pattern, text)
print(match.group())  # 1000
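
A lookahead is useful when only some numbers in a text should be picked up. In this sketch the sample string is invented: two prices are in yen (円) and one is in dollars (ドル), and only the yen amounts are matched. Because the lookahead consumes nothing, 円 is left untouched by re.sub.

```python
import re

text = "apple 120円, orange 80円, shipping 5ドル"

# Only the numbers directly followed by 円 are returned.
yen_amounts = re.findall(r"\d+(?=円)", text)
print(yen_amounts)  # ['120', '80']

# The lookahead is not part of the match, so 円 survives the replacement.
masked = re.sub(r"\d+(?=円)", "N", text)
print(masked)  # apple N円, orange N円, shipping 5ドル
```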

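Negative Lookahead

A negative lookahead, written (?!...), lets a match succeed only when the specified pattern does not come immediately after it.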

import re
pattern = r"\d+(?!円)"
text = "価格は1000ドルです。"  # "The price is 1000 dollars."
match = re.search(pattern, text)
print(match.group())  # 1000
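
Be careful when combining a negative lookahead with a repeated element. The pattern above does not actually reject a yen price: the engine backtracks and returns a shorter number instead. The sketch below shows the effect and one possible fix, which is to also forbid a following digit.

```python
import re

yen_text = "価格は1000円です。"       # "The price is 1000 yen."
dollar_text = "価格は1000ドルです。"  # "The price is 1000 dollars."

# \d+ first takes "1000", the lookahead sees 円 and fails, so the engine
# backtracks to "100", which is followed by "0" rather than 円.
shortened = re.search(r"\d+(?!円)", yen_text).group()
print(shortened)  # 100

# Also rejecting a following digit prevents the shortened match.
strict = r"\d+(?!\d|円)"
print(re.search(strict, yen_text))             # None
print(re.search(strict, dollar_text).group())  # 1000
```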

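Non-Greedy Matching

A non-greedy (lazy) quantifier such as *? matches as few characters as possible while still letting the overall pattern succeed.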

import re
pattern = r"<.*?>"
text = '<a href="http://example.com">example</a>'
match = re.search(pattern, text)
print(match.group())  # <a href="http://example.com">
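
For comparison, here is a short sketch that runs the greedy and non-greedy versions of the pattern on the same string. The greedy .* runs to the last > in the text, while .*? stops at the first one.

```python
import re

text = '<a href="http://example.com">example</a>'

# Greedy: from the first "<" to the last ">", i.e. the whole string.
greedy = re.search(r"<.*>", text).group()
print(greedy)

# Non-greedy: each tag is matched separately.
tags = re.findall(r"<.*?>", text)
print(tags)  # ['<a href="http://example.com">', '</a>']
```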