Beautiful Soup Basics Tutorial
Official links:
- Home page: https://www.crummy.com/software/BeautifulSoup/
- Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- PyPI page: https://pypi.org/project/beautifulsoup4/
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.
This document covers Beautiful Soup version 4.14.2. The examples in this documentation were written for Python 3.8.
You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that all support for it was dropped on December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
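The diagnose() function mentioned above lives in the bs4.diagnose module and can be run directly on problem markup. A minimal sketch (the exact report text varies with which parsers you have installed):

```python
from bs4.diagnose import diagnose

# Prints a report showing how each installed parser handles the markup,
# which is what the discussion group will ask to see.
diagnose("<p>Some <b>unclosed markup")
```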
When reporting an error in this documentation, please mention which translation you’re reading.
This document is written like an instruction manual, but you can also read traditional API documentation generated from the Beautiful Soup source code. If you want details about Beautiful Soup’s internals, or a feature not covered in this document, try the API documentation.
Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# 'The Dormouse's story'
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page’s tags:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Another common task is extracting all the text from a page:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on.
Below is a practical, concise Beautiful Soup quick reference covering the most common operations: parsing, finding elements, CSS selectors, extracting text and attributes, traversing the tree, and modifying it. The snippets can be copied and adapted directly.
🌱 BeautifulSoup Quick Reference (Python)
pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
html = "<html><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, "lxml")  # requires lxml; use "html.parser" for the stdlib parser
print(soup.h1.text)
soup.find("div")
soup.find("div", id="main")
soup.find("div", class_="item")
soup.find_all("a")
soup.find_all("a", limit=5)
soup.select("div.item a")
soup.select(".price")
soup.select("#main > ul > li")
soup.select_one(".title").text
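The find/select calls above can be exercised on a small document. A self-contained sketch (the markup, ids, and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

doc = """
<div id="main">
  <ul>
    <li class="item"><a href="/a">A</a> <span class="price">1</span></li>
    <li class="item"><a href="/b">B</a> <span class="price">2</span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

print(soup.find("div", id="main")["id"])               # main
print([a["href"] for a in soup.select("div#main a")])  # ['/a', '/b']
print([p.get_text() for p in soup.select(".price")])   # ['1', '2']
print(soup.select_one("li.item a").text)               # A
```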
Text
element.text.strip()
element.get_text()
Attributes
element["href"]
element.get("src")
tag.parent
tag.parents
tag.children
tag.descendants
tag.next_sibling
tag.previous_sibling
tag = soup.find("h1")
tag.string = "New Title"
new_div = soup.new_tag("div", **{"class": "box"})
soup.body.append(new_div)
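Putting the modification snippets together into one runnable example (the markup and strings are made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><h1>Old</h1></body>", "html.parser")

# Replace the text inside an existing tag
soup.h1.string = "New Title"

# Create a new tag (dict unpacking because "class" is a Python keyword)
new_div = soup.new_tag("div", **{"class": "box"})
new_div.string = "hello"
soup.body.append(new_div)

print(soup)
# <body><h1>New Title</h1><div class="box">hello</div></body>
```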
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
for a in soup.select("a"):
    print(a.text, a["href"])
rows = soup.select("table tr")
for row in rows:
    cols = [td.text.strip() for td in row.select("td, th")]
    print(cols)
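The table-scraping loop above works the same on inline markup, which makes it easy to test. A self-contained sketch with a made-up table:

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in soup.select("table tr"):
    # "td, th" picks up both header and data cells in one pass
    rows.append([cell.get_text(strip=True) for cell in tr.select("td, th")])

print(rows)  # [['Name', 'Age'], ['Alice', '30']]
```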
soup.find_all(string=lambda text: "keyword" in text)
for tag in soup(["script", "style"]):
    tag.decompose()
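The decompose() loop above is the usual way to strip script and style content before calling get_text(); calling the soup object directly is shorthand for find_all(). A self-contained sketch on made-up markup:

```python
from bs4 import BeautifulSoup

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hi</p><script>var x=1;</script></body></html>")
soup = BeautifulSoup(page, "html.parser")

for tag in soup(["script", "style"]):  # soup(...) == soup.find_all(...)
    tag.decompose()                    # removes the tag and everything inside it

print(soup.get_text())  # Hi
```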
import requests
from bs4 import BeautifulSoup
html = requests.get("https://news.ycombinator.com").text
soup = BeautifulSoup(html, "lxml")
for item in soup.select(".titleline > a"):
    print(item.text, item["href"])