Regex compile (compiling regular expressions)

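Below is a project-level guide, "Python Regex Compilation and Best Practices". It covers:

how best to compile regular expressions
performance optimization
recommended usage patterns for a project
common mistakes to avoid
a complete, ready-to-use utils.regex helper module

Everything here targets Python 3.12 and is written to type-check under BasedPyright, so it can be dropped into your project.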

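✅ 1. Why "compile" a regular expression?

How Python's re module behaves:

| Style | Compiled on every call? | Performance |
| --- | --- | --- |
| re.search(pattern, text) | Not usually: re keeps an internal cache of recently compiled patterns, but every call still has to look the pattern up (and recompiles it if it has been evicted) | ❌ Slower, because of the per-call lookup |
| compiled = re.compile(pattern); compiled.search(text) | ❌ No (the compiled object is reused directly) | ✅ Faster |

In loops, hot code paths, bulk data processing, HTTP crawlers, and scheduled jobs, the recommendation is:

compile the pattern up front
reuse the compiled object
avoid calling re.search(pattern, text) inside the loop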
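
To see how much the difference matters on your own machine, a small timeit sketch like the one below can help. The sample pattern, text, and iteration count are arbitrary assumptions, and the absolute numbers will vary; the point is only that the module-level call pays for a cache lookup on every iteration while the precompiled object does not.

```python
import re
import timeit

PATTERN = r"\d+"                  # sample pattern (assumption)
TEXT = "order 12345 shipped"      # sample text (assumption)
COMPILED = re.compile(PATTERN)


def via_module() -> bool:
    # re.search looks PATTERN up in re's internal cache on every call
    return re.search(PATTERN, TEXT) is not None


def via_compiled() -> bool:
    # the compiled object is used directly, with no lookup
    return COMPILED.search(TEXT) is not None


t_module = timeit.timeit(via_module, number=100_000)
t_compiled = timeit.timeit(via_compiled, number=100_000)
print(f"re.search:       {t_module:.3f}s")
print(f"compiled.search: {t_compiled:.3f}s")
```

If the two timings are close for your workload, readability is a perfectly good reason to choose either style.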

✅ 2. Recommended ways to write compiled patterns

Single pattern

import re

PAT_STOCK_CODE = re.compile(r"^(00|60)\d+")

Capturing group

PAT_EQUAL = re.compile(r"=(.+)$")

Strict pattern for Chinese-format dates

PAT_DATE_CN = re.compile(r"(\d{4})年(\d{2})月(\d{2})日")

With flags

PAT_MULTI = re.compile(r"foo.*bar", re.DOTALL)
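
A short usage sketch for the four patterns above. The sample strings are made up for illustration; the patterns are copied unchanged.

```python
import re

PAT_STOCK_CODE = re.compile(r"^(00|60)\d+")
PAT_EQUAL = re.compile(r"=(.+)$")
PAT_DATE_CN = re.compile(r"(\d{4})年(\d{2})月(\d{2})日")
PAT_MULTI = re.compile(r"foo.*bar", re.DOTALL)

m = PAT_STOCK_CODE.match("600519.SH")
code = m.group(0) if m else None       # the whole match
prefix = m.group(1) if m else None     # the first capture group

m = PAT_EQUAL.search("token=abc=def")
value = m.group(1) if m else None      # greedy: everything after the first "="

m = PAT_DATE_CN.search("2025年12月04日")
ymd = m.groups() if m else None        # all capture groups as a tuple

# With re.DOTALL the dot also matches the newline between "foo" and "bar"
crosses_newline = PAT_MULTI.search("foo\nbar") is not None
```

Always check the match object for None before calling group(); search and match return None when nothing matches.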

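✅ 3. Common flags and when to use them (project level)

| Flag | Meaning | Typical use |
| --- | --- | --- |
| re.I | Ignore case | logins, email addresses |
| re.M | Multi-line: ^ and $ also match at the start and end of each line | parsing logs in bulk |
| re.S | Dot matches newline as well (. matches any character) | grabbing HTML blocks |
| re.X | Verbose: whitespace and # comments are allowed inside the pattern | making long regexes readable |
| re.A | ASCII-only matching for \w, \d, \s, \b and their inverses | input that must contain only ASCII digits or letters; this is a change in meaning, not primarily a speed-up |
| re.U | Unicode matching | already the default for str patterns in Python 3, so the flag is redundant; Chinese text and emoji work without it |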
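
The effect of the flags is easiest to see side by side. The sketch below uses made-up sample strings; the one-off re.findall / re.search calls are for demonstration only, not a pattern to copy into a hot loop.

```python
import re

text = "2025 ２０２５"  # ASCII digits followed by full-width digits

unicode_digits = re.findall(r"\d+", text)        # default: \d matches any Unicode decimal digit
ascii_digits = re.findall(r"\d+", text, re.A)    # re.A: \d matches [0-9] only

log = "INFO start\nERROR disk full"
without_m = re.findall(r"^ERROR", log)           # ^ matches only at the start of the string
with_m = re.findall(r"^ERROR", log, re.M)        # ^ also matches right after each newline

html = "<p>line one\nline two</p>"
without_s = re.search(r"<p>.*</p>", html)        # . stops at the newline, so no match
with_s = re.search(r"<p>.*</p>", html, re.S)     # . crosses the newline
```

The first pair is the one that tends to surprise people: without re.A, a pattern such as \d{6} accepts full-width digits too.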

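✅ 4. Regex best practices (project conventions)

4.1 Rule 1: every regex must be a raw string r""

In an ordinary string literal, "\d" is an invalid escape sequence, and Python 3.12 reports it with a SyntaxWarning.

❌ Wrong

```py
"\d+"
```

✅ Right

```py
r"\d+"
```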

4.2 Rule 2: inside a loop, use a precompiled pattern instead of re.search / re.sub with a pattern string

❌ Wrong

for item in items:
    if re.search(r"\d+", item): ...

✅ Right

PAT_NUM = re.compile(r"\d+")
for item in items:
    if PAT_NUM.search(item): ...

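4.3 Rule 3: don't use regex for simple tasks (prefer str methods)

For example:

Replacing a fixed substring

s.replace("abc", "def")

Splitting on a fixed character

s.split(",")

Reach for regex only when you need:

dynamic matching
alternation (logical OR)
handling of mixed kinds of characters
extraction of structured text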
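
A few more str methods that often remove the need for a regex, as a minimal sketch with made-up sample values:

```python
code = "600519"
line = "a,b,c"

# startswith accepts a tuple, which covers a simple "00 or 60" prefix test
starts_ok = code.startswith(("00", "60"))

# "in" is enough for a fixed-substring test
has_st = "ST" in "*ST Example"

parts = line.split(",")
swapped = "abcabc".replace("abc", "def")
```

Once the prefix test also has to check the length or the character class of the rest of the string, a compiled pattern becomes the clearer option again.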

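4.4 Rule 4: complex regexes must be split across lines and commented (re.X)

Example: matching YYYY-MM-DD

PAT_DATE = re.compile(
    r"""
    ^                       # start of string
    (\d{4})                 # year
    -                       # separator
    (0[1-9]|1[0-2])         # month
    -                       # separator
    (0[1-9]|[12]\d|3[01])   # day
    $                       # end of string
    """,
    re.X,
)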

✅ 5. Common patterns (frequently used in projects)

5.1 Extracting URL query parameters

PAT_KV = re.compile(r"[?&](\w+)=([^&]+)")

5.2 Stock codes starting with 00 or 60

PAT_STOCK = re.compile(r"^(00|60)\d{4}$")

5.3 Stripping HTML tags

PAT_TAG = re.compile(r"<[^>]+>")

5.4 Chinese-format dates such as "2025年12月04日"

PAT_DATE_CN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
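
Here is how these patterns behave on some sample input (the URL and strings are made up). For query strings, the standard library's urllib.parse is shown next to the regex, since it also handles percent-decoding and repeated keys, which the simple pattern does not.

```python
import re
from urllib.parse import parse_qs, urlsplit

PAT_KV = re.compile(r"[?&](\w+)=([^&]+)")
PAT_STOCK = re.compile(r"^(00|60)\d{4}$")
PAT_TAG = re.compile(r"<[^>]+>")

url = "https://example.com/search?q=regex&page=2"
pairs = PAT_KV.findall(url)                  # list of (key, value) tuples
query = parse_qs(urlsplit(url).query)        # dict of key -> list of values

is_code = PAT_STOCK.match("600519") is not None
is_other = PAT_STOCK.match("300750") is not None

plain = PAT_TAG.sub("", "<p>Hello <b>world</b></p>")
```

PAT_TAG is fine for quick clean-up of simple markup; for arbitrary HTML a real parser is the safer choice.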

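✅ 6. Regex helper functions (suggested location: utils/regex.py)

import re
from typing import Any


def compile_pattern(pattern: str, flags: int | re.RegexFlag = 0) -> re.Pattern[str]:
    """Single entry point for compiling a regex."""
    return re.compile(pattern, flags)


def regex_search(pattern: re.Pattern[str], text: str) -> str | None:
    """Return capture group 1 of the first match, or None if nothing matches.

    The pattern must define at least one capturing group.
    """
    m = pattern.search(text)
    return m.group(1) if m else None


def regex_split(pattern: re.Pattern[str], text: str) -> list[str]:
    """Split text on the pattern."""
    return pattern.split(text)


def regex_findall(pattern: re.Pattern[str], text: str) -> list[Any]:
    """Find all matches (strings, or tuples when the pattern has several groups)."""
    return pattern.findall(text)


def regex_sub(pattern: re.Pattern[str], repl: str, text: str) -> str:
    """Replace every match of the pattern with repl."""
    return pattern.sub(repl, text)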
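
Two details of the underlying re API are worth knowing before wrapping it: the shape of findall's result depends on how many capture groups the pattern has, and group(1) raises IndexError when the pattern has none. A minimal sketch with made-up patterns and text:

```python
import re

no_group = re.compile(r"\d+")
one_group = re.compile(r"id=(\d+)")
two_groups = re.compile(r"(\w+)=(\d+)")

text = "id=7 n=42"

whole_matches = no_group.findall(text)    # no groups: list of whole matches
group_values = one_group.findall(text)    # one group: list of that group's text
group_tuples = two_groups.findall(text)   # several groups: list of tuples

m = no_group.search(text)
assert m is not None
try:
    m.group(1)                            # the pattern defines no group 1
    raised = False
except IndexError:
    raised = True
```

This is why a helper that always returns group(1) should only be given patterns that contain a capturing group.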

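✅ 7. In practice: compiled patterns commonly used in a project

7.1 Example: filtering out unwanted titles (as used in your project)

# The alternatives are Chinese market-news phrases, kept verbatim because they are the text being matched.
TITLE_EXCLUDE = re.compile(
    r"板块局部异动|持续活跃|持续下挫|持续走强|触发二次临停|概念股表现活跃",
)

7.2 Converting Chinese-format dates

DATE_CN = re.compile(r"(\d{4})年(\d{2})月(\d{2})日")

Usage:

DATE_CN.sub(r"\1-\2-\3", s)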
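
The backreference form works when month and day are always two digits. If the input may contain single-digit months or days (as the \d{1,2} variant in section 5.4 allows), a replacement function can zero-pad the result. The names DATE_CN_LOOSE and to_iso below are hypothetical, chosen for this sketch.

```python
import re

# Looser variant that also accepts one-digit months and days
DATE_CN_LOOSE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def to_iso(text: str) -> str:
    """Rewrite every Chinese-format date in text as zero-padded YYYY-MM-DD."""

    def repl(m: re.Match[str]) -> str:
        year, month, day = (int(g) for g in m.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"

    return DATE_CN_LOOSE.sub(repl, text)
```

Pattern.sub accepts either a replacement string or a callable that receives the match object, so the same compiled pattern serves both cases.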

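✅ 8. Performance best practices

8.1 Compile only once

Define the pattern at module level:

PAT_CODE = re.compile(r"^\d+$")

8.2 For large texts, use finditer (it yields matches lazily, so it needs less memory than findall)

# Use an unanchored pattern here: PAT_CODE is anchored with ^...$ and can match at most once.
PAT_NUM = re.compile(r"\d+")
for m in PAT_NUM.finditer(text):
    ...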
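
A minimal sketch of what finditer gives you, using a made-up sample string: each item is a match object, so the matched text and its position are both available, and a generator expression can consume the matches without building an intermediate list.

```python
import re

PAT_NUM = re.compile(r"\d+")
text = "a1 b22 c333"

# Each match object carries the matched text and its (start, end) span
found = [(m.group(), m.span()) for m in PAT_NUM.finditer(text)]

# Consuming the iterator directly keeps only one match in memory at a time
total = sum(int(m.group()) for m in PAT_NUM.finditer(text))
```

findall would return only the strings; use finditer when positions are needed or when the text is large.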

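8.3 Avoid heavy backtracking (prefer +?, .*?, or better, a negated character class)

❌ Greedy .* runs to the end of the text and then backtracks, which is slow on long input

.*</div>

✅ Restrict what the quantifier can match

[^<]+

Lazy quantifiers such as +? and .*? stop at the first possible match, but they still advance one character at a time; a negated character class is usually the tighter fix.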
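
The three styles also differ in what they match, not only in speed. A small sketch on a made-up HTML fragment:

```python
import re

html = "<div>a</div><div>b</div>"

# Greedy: .* swallows everything up to the LAST </div>
greedy = re.findall(r"<div>.*</div>", html)

# Lazy: .*? stops at the FIRST </div> it can reach
lazy = re.findall(r"<div>.*?</div>", html)

# Negated class: [^<]* cannot run past the next "<" at all
bounded = re.findall(r"<div>[^<]*</div>", html)
```

The lazy and bounded forms return the same two blocks here; the bounded form simply gives the engine fewer positions to try, at the cost of not matching blocks that contain nested tags.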

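Compiling a list with re.compile

What you probably want to do:

Compile a list of strings (several regex patterns) into a regex object in one go and match against all of them together.

Python's re.compile() accepts a single pattern, so a list cannot be passed to it directly.

There are two good ways to handle this:

merge the patterns into one regex and compile that (recommended)
compile each pattern separately into its own regex object

Each is described below.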

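Method 1: merge the patterns into one large regex (recommended)

Suitable when:

a match on any one of several keywords is enough
the same matching is repeated many times
performance matters
you are filtering titles, triggering on keywords, or matching against a blocklist

For example: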

patterns = [
    "持续活跃",
    "持续下挫",
    "反复走强",
]

🔥 Compile once:

import re

def compile_pattern_list(patterns: list[str]) -> re.Pattern[str]:
    # re.escape makes every entry match literally
    combined = "|".join(map(re.escape, patterns))
    return re.compile(combined)

Usage:

TITLE_EXCLUDE = compile_pattern_list(patterns)

if TITLE_EXCLUDE.search(title):
    print("matched an exclusion keyword")

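🔥 Advantages

compiled only once, so the per-call cost is low
the regex object is reusable
simple to use (pattern.search())
handles large numbers of patterns (dozens or hundreds are fine)
no clashes with special characters (because re.escape is applied)

⚠️ Notes

If the entries are real regex patterns rather than literal keywords, do not escape them; join them as they are:

combined = "|".join(patterns)

An empty list yields the empty pattern, which matches every string, so check for that case before compiling.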
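
The two caveats above can be handled in the helper itself. The sketch below is one possible variant, not the only way to do it: the names compile_keywords and NEVER_MATCH are hypothetical, and sorting longest-first is an extra design choice so that a longer keyword wins over a shorter one it contains when you use findall or sub.

```python
import re

NEVER_MATCH = re.compile(r"(?!)")  # empty negative lookahead: can never match


def compile_keywords(keywords: list[str]) -> re.Pattern[str]:
    """Compile literal keywords into one alternation, safely."""
    # Drop empty strings and duplicates; put longer keywords first
    unique = sorted({k for k in keywords if k}, key=lambda k: (-len(k), k))
    if not unique:
        return NEVER_MATCH  # an empty pattern would match everything
    return re.compile("|".join(re.escape(k) for k in unique))


# Why escaping matters: "*ST" is not a valid regex on its own
try:
    re.compile("*ST")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

pat = compile_keywords(["ST", "*ST"])
hits = pat.findall("*ST Alpha, ST Beta")
```

With an empty list, compile_keywords([]).search(title) returns None for every title, which is usually what a blocklist with no entries should do.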

Method 2: compile each pattern separately (keeps the patterns independent)

Suitable when:

each pattern needs different handling
the patterns cannot be merged
you need to know which pattern matched

import re

patterns = [
    r"^00\d+",
    r"^60\d+",
    r"ST",
]

compiled_list = [re.compile(p) for p in patterns]

for pat in compiled_list:
    if pat.search(text):
        print("matched:", pat.pattern)

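⏱ Performance comparison (important)

The table gives rough expectations, not measurements; benchmark with your own patterns and text.

| Number of patterns | Merged | Compiled separately |
| --- | --- | --- |
| 3 | about the same | about the same |
| 20 | fast | slower |
| 100 | clearly faster (one call into the regex engine) | a loop of 100 searches |
| 500 | still fast | noticeably slower |

If your project has:

a lot of text to filter
many patterns (>20)
hot call sites (loops, async code, crawlers)

👉 use merged compilation (method 1)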
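
A benchmark along the following lines can confirm the trend for your own data. Everything in it is an assumption made for the sketch: the synthetic keywords, the text length, the worst-case placement of the hit, and the iteration count. Results depend heavily on the keywords and the text, so treat the output as a measurement of this one setup.

```python
import re
import timeit

keywords = [f"kw{i:03d}" for i in range(200)]   # synthetic keywords (assumption)
text = "x" * 2_000 + keywords[-1]               # only the last keyword occurs

merged = re.compile("|".join(map(re.escape, keywords)))
separate = [re.compile(re.escape(k)) for k in keywords]


def hit_merged(s: str) -> bool:
    # One search over the text with all alternatives
    return merged.search(s) is not None


def hit_separate(s: str) -> bool:
    # Up to len(separate) searches, driven by a Python-level loop
    return any(p.search(s) for p in separate)


t_merged = timeit.timeit(lambda: hit_merged(text), number=200)
t_separate = timeit.timeit(lambda: hit_separate(text), number=200)
print(f"merged:   {t_merged:.4f}s")
print(f"separate: {t_separate:.4f}s")
```

Both functions must agree on every input; checking that first is a cheap way to make sure the benchmark compares like with like.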

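Method 3: cache the compiled list (project-level best practice)

Put this in utils/regex.py:

import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_patterns(patterns: tuple[str, ...]) -> re.Pattern[str]:
    """Compile a tuple of literal keywords into one large regex (cached)."""
    combined = "|".join(map(re.escape, patterns))
    return re.compile(combined)

Usage:

patterns = ("持续活跃", "跌势扩大", "开盘走强")

TITLE_EXCLUDE = compile_patterns(patterns)

if TITLE_EXCLUDE.search(title):
    ...

Advantages:

a tuple is hashable, unlike a list, so it can serve as the cache key
repeated calls with the same tuple do not recompile
the saving matters when the function is called repeatedly at run time; a module-level constant is compiled only once either way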
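
The caching behaviour can be checked directly with cache_info(). The sketch repeats the helper so that it runs on its own; the sample keywords are placeholders.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compile_patterns(patterns: tuple[str, ...]) -> re.Pattern[str]:
    return re.compile("|".join(map(re.escape, patterns)))


first = compile_patterns(("alpha", "beta"))
second = compile_patterns(("alpha", "beta"))   # same tuple: served from the cache
same_object = first is second
info = compile_patterns.cache_info()           # hit and miss counters so far

# A list is unhashable, so it cannot be used as a cache key
try:
    compile_patterns(["alpha", "beta"])  # type: ignore[arg-type]
    list_ok = True
except TypeError:
    list_ok = False
```

Note that the cache key is the exact tuple: the same keywords in a different order are compiled and cached as a separate entry.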

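🎯 Worked project example (based on your earlier title_exclude_list)

Your original code:

title_exclude_list = [...many keywords...]
title_exclude_string = "|".join(title_exclude_list)
title_exclude = re.compile(title_exclude_string)

Suggested refactor:

from utils.regex import compile_patterns

TITLE_EXCLUDE = compile_patterns(tuple(title_exclude_list))

# inside your title-filter function
if TITLE_EXCLUDE.search(title):
    return False

entries are escaped automatically, so they match literally (keep the plain join if the list contains real regex syntax)
faster on repeated calls
cleaner code
the compiled pattern is cached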