Python Web Scraping in Practice: Scraping the Maoyan Movies TOP 100 with Requests and Regular Expressions

1. Crawler Architecture

Without a heavyweight framework such as Scrapy, a lightweight crawler typically runs through four core stages:

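Send the request: use requests to imitate a browser.
Parse the content: use re (regular expressions) to extract the target fields from the HTML string.
Persist the data: convert the extracted text into JSON and save it to a local file.
Optimize performance: use multiprocessing to fetch pages in parallel.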

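2. Core Steps in Detail

2.1 Fetching the Page

Modern websites usually run basic anti-scraping checks. Besides User-Agent, it sometimes helps to send Host (and, as the code in section 3 does, Referer) so that the request looks more like one from a real browser.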
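
As a minimal sketch of how such a header set could be assembled, the helper below derives Host and Referer from the target URL so the three headers stay consistent with each other. The name build_headers is hypothetical and not part of the original script; the resulting dict would be passed as requests.get(url, headers=headers).

```python
from urllib.parse import urlsplit

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(url, user_agent=USER_AGENT):
    """Hypothetical helper: build browser-like headers for the given URL."""
    parts = urlsplit(url)
    return {
        "User-Agent": user_agent,
        "Host": parts.netloc,
        # Use the same page without its query string as the Referer.
        "Referer": f"{parts.scheme}://{parts.netloc}{parts.path}",
    }


headers = build_headers("https://www.maoyan.com/board/4?offset=10")
print(headers["Host"])     # www.maoyan.com
print(headers["Referer"])  # https://www.maoyan.com/board/4
```

Deriving the values from the URL is only one design choice; hard-coding them, as the full script does, works equally well for a single site.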

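2.2 Anatomy of the Regular Expression

Regular expressions are powerful but hard to read. In this project, the pattern matches the nested structure inside each <dd> tag: every .*? skips ahead non-greedily to the next landmark in the markup, and every parenthesized group captures one field.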
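
To make this concrete, here is a small sketch that runs a shortened version of the pattern against a hand-written fragment. The HTML in SAMPLE is an assumption shaped like one <dd> entry, not markup copied from the live page, and only three of the seven fields are captured. It also shows why re.S is needed and how greedy and non-greedy quantifiers differ.

```python
import re

# Hypothetical fragment; the real page markup may differ.
SAMPLE = """<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/1" title="Example Film">Example Film</a></p>
  <p class="releasetime">上映时间：1993-01-01</p>
</dd>"""

PATTERN = (
    r'<dd>.*?board-index.*?>(\d+)</i>'   # rank: digits inside the <i> tag
    r'.*?name"><a.*?>(.*?)</a>'          # title: text of the link
    r'.*?releasetime">(.*?)</p>'         # release line, label prefix included
)

# With re.S, "." also matches newlines, so the pattern can span lines.
with_dotall = re.findall(PATTERN, SAMPLE, re.S)
# Without it, ".*?" stops at the first line break and nothing matches.
without_dotall = re.findall(PATTERN, SAMPLE)

print(with_dotall)     # [('1', 'Example Film', '上映时间：1993-01-01')]
print(without_dotall)  # []

# Greedy ".*" runs to the last closing tag; non-greedy ".*?" stops at the first.
score = '<i>9.</i><i>5</i>'
greedy = re.findall(r'<i>(.*)</i>', score)
lazy = re.findall(r'<i>(.*?)</i>', score)
print(greedy)  # ['9.</i><i>5']
print(lazy)    # ['9.', '5']
```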

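2.3 Comparing Concurrent and Sequential Crawling

Single process: pages are fetched one after another, so the total time is roughly N × t for N pages that each take about t seconds.
Multiple processes: pages are fetched in parallel across CPU cores, so in the ideal case the total time approaches N × t / P for P worker processes, ignoring process startup cost and any rate limiting by the site.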
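
The comparison can be worked through with a small idealized estimate. The function name and the numbers below (10 pages, 0.5 seconds per page, 4 workers) are assumptions chosen for illustration, not measurements from the site; real runs will also be affected by network variance and throttling.

```python
import math


def estimate_seconds(pages, seconds_per_page, workers=1):
    """Idealized wall-clock estimate: pages are handled in rounds of `workers`."""
    rounds = math.ceil(pages / workers)
    return rounds * seconds_per_page


single = estimate_seconds(pages=10, seconds_per_page=0.5)
pooled = estimate_seconds(pages=10, seconds_per_page=0.5, workers=4)
print(single, pooled)  # 5.0 1.5
```

With 4 workers, the 10 pages need three rounds rather than 2.5, which is why the estimate rounds up.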

3. Complete Code

import requests
from requests.exceptions import RequestException
import re
import json
from multiprocessing import Pool


def get_one_page(url):
    """Fetch the HTML source of a single page; return None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Host': 'www.maoyan.com',
        'Referer': 'https://www.maoyan.com/board/4'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(f"Request failed: {e}")
        return None


def parse_one_page(html):
    """
    Parse the page with a regular expression.
    Tip: in real projects, if the HTML structure is very complex,
    consider BeautifulSoup or lxml instead.
    """
    # Raw strings keep backslashes such as \d from being read as string escapes.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>'       # rank
        r'.*?data-src="(.*?)"'                   # poster image
        r'.*?name"><a.*?>(.*?)</a>'              # title
        r'.*?star">(.*?)</p>'                    # starring
        r'.*?releasetime">(.*?)</p>'             # release time
        r'.*?integer">(.*?)</i>'                 # score, integer part
        r'.*?fraction">(.*?)</i>.*?</dd>',       # score, fractional part
        re.S
    )
    items = pattern.findall(html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # Drop the leading label from the text (3 and 5 characters respectively).
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }


def write_to_file(content):
    """Append one record to the JSON Lines file."""
    with open('result.jsonl', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = f'https://www.maoyan.com/board/4?offset={offset}'
    html = get_one_page(url)
    if html:
        for item in parse_one_page(html):
            print(f"Scraped rank {item['index']}: {item['title']}")
            write_to_file(item)


if __name__ == '__main__':
    # Create the process pool; tune its size to the number of CPU cores.
    pool = Pool()
    offsets = [i * 10 for i in range(10)]
    pool.map(main, offsets)
    pool.close()
    pool.join()
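
Once a crawl has finished, the output can be read back line by line. The sketch below is one possible reader and is not part of the original script: read_jsonl is a hypothetical helper, and the sample records are made up for the demo. It skips a truncated final line and sorts by rank, since worker processes append their results in no particular order.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one dict per valid line; skip blank or truncated lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # A crash in the middle of a write can leave a partial line.
                continue


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"index": "2", "title": "Another Film"}\n')
        f.write('{"index": "1", "title": "Example Film"}\n')
        f.write('{"index": "3", "tit')  # simulated interrupted write
    records = list(read_jsonl(path))

# Ranks are stored as strings, so convert before sorting.
records.sort(key=lambda r: int(r["index"]))
print([r["title"] for r in records])  # ['Example Film', 'Another Film']
```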

4. Key Technical Points

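The re.S flag: this is what makes the pattern work. HTML contains many line breaks, and by default . does not match a newline. With re.S, . matches every character including newlines, so a single pattern can span multiple lines.
JSON Lines format: the code opens the file in a (append) mode and writes one JSON object per line. This suits crawlers better than a standard JSON array, because if the crawl crashes midway, the records already written stay valid instead of being invalidated by a missing closing bracket.
Limits on multiprocessing: large sites such as Maoyan enforce rate limits as an anti-scraping measure. In production, consider adding time.sleep(1) inside main so that requests do not arrive fast enough to get your IP banned.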
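
One way to apply the throttling advice is sketched below. It is a variation on the plain time.sleep(1) suggestion, not the original author's code: polite_pause and crawl are hypothetical names, and the random jitter is an added assumption meant to make the request rhythm less regular. The demo replaces the real sleep and fetch with stand-ins so it runs instantly and offline.

```python
import random
import time


def polite_pause(base=1.0, jitter=0.5, sleep=time.sleep):
    """Sleep for `base` plus up to `jitter` extra seconds; return the delay used."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


def crawl(offsets, fetch, pause=polite_pause):
    """Call fetch(offset) for each offset, pausing between consecutive requests."""
    results = []
    for i, offset in enumerate(offsets):
        if i:  # no pause needed before the first request
            pause()
        results.append(fetch(offset))
    return results


delays = []
pages = crawl(
    [0, 10, 20],
    fetch=lambda offset: f"offset={offset}",  # stand-in for main(offset)
    # Record the delay but skip the real sleep in this demo.
    pause=lambda: delays.append(polite_pause(sleep=lambda s: None)),
)
print(pages)        # ['offset=0', 'offset=10', 'offset=20']
print(len(delays))  # 2
```

In the multiprocess version, the same pause could go at the top of main. Each worker would then wait independently, so the combined request rate still grows with the pool size.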

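5. Summary

The Requests + re combination is primitive, but it is a good way to understand how crawlers work under the hood. In this project we covered:

How to construct valid HTTP request headers.
How to extract fields using greedy and non-greedy regex matching.
How to use multiple processes to get the most out of the hardware.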

Reminder: when scraping any website, follow its robots.txt, respect copyright, and never use the data for illegal commercial purposes.