In earlier lessons, we used the third-party library requests to fetch network resources and covered some front-end basics. Next, we look at how to parse HTML code and extract useful information from a page. Previously, we tried extracting page content with capturing groups in regular expressions, but writing a correct regular expression is often a headache. To address this, we first need a deeper understanding of how an HTML page is structured, and on that basis we can study other ways to parse a page.
Open any website in the browser, then choose "View Page Source" from the right-click menu, and you can see the HTML code corresponding to the page.
Line 1 of the code is the document type declaration. The <html> tag on line 2 is the start tag of the root tag of the whole page, and the last line is the end tag of the root tag, </html>. Under the <html> tag there are two child tags, <head> and <body>. The content placed under the <body> tag is shown in the browser window, and this part is the main body of the web page. The content placed under the <head> tag is not shown in the browser window, but it contains important metadata of the page, usually called the head of the page. The rough code structure of an HTML page is shown below.

    <!doctype html>
    <html>
        <head>
            <!-- Page metadata, such as character encoding, title, keywords, media queries, and so on -->
        </head>
        <body>
            <!-- The main body of the page, the content shown in the browser window -->
        </body>
    </html>

Tags, Cascading Style Sheets (CSS), and JavaScript are the three main elements that make up an HTML page. Tags carry the content to be shown on the page, CSS is responsible for rendering the page, and JavaScript controls the interactive behavior of the page. To parse an HTML page, we can use XPath syntax. XPath was originally a query syntax for XML, and it can extract the content or the attributes of tags according to the hierarchy of HTML tags. We can also use CSS selectors to locate page elements, just as CSS does when it renders them.
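To make the tag hierarchy described above concrete, the following minimal sketch walks a small page with the standard library's html.parser module and records the path from the root tag to every tag it meets. The page content (the meta, title, and h1 tags), the PathCollector class, and the VOID_TAGS set are assumptions made for illustration and are not part of the original lesson.

```python
from html.parser import HTMLParser

# A hypothetical page that follows the skeleton shown above.
PAGE = """<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Demo</title>
</head>
<body>
    <h1>Hello</h1>
</body>
</html>"""

# Tags that never have an end tag, so they are not pushed onto the stack.
VOID_TAGS = {'meta', 'link', 'br', 'img', 'input', 'hr'}


class PathCollector(HTMLParser):
    """Record the path from the root tag to every start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.paths = []  # for example '/html/head/title'

    def handle_starttag(self, tag, attrs):
        self.paths.append('/' + '/'.join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


collector = PathCollector()
collector.feed(PAGE)
collector.close()
for path in collector.paths:
    print(path)
```

Each printed path is a chain of parent and child tags starting at html, which is the same kind of hierarchy that XPath path expressions describe.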
XPath is a syntax for finding information in XML documents. XML, meaning eXtensible Markup Language, is similar to HTML because both are markup languages that carry data with tags. The difference is that XML tags are extensible and can be customized, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in an XML document. The nodes here include elements, attributes, text, namespaces, processing instructions, comments, root nodes, and so on. Below, we use one example to explain how to use XPath to parse a page.

    <?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
        <book>
            <title lang="eng">Harry Potter</title>
            <price>29.99</price>
        </book>
        <book>
            <title lang="zh">Learning XML</title>
            <price>39.95</price>
        </book>
    </bookstore>

For the XML file above, we can use the XPath syntax below to get nodes in the document.

| Path Expression | Result |
|---|---|
| /bookstore | Select the root element bookstore. Note: a path that starts with a slash (/) is always an absolute path to an element. |
| //book | Select all book elements, no matter where they are in the document. |
| //@lang | Select all attributes named lang. |
| /bookstore/book[1] | Select the first book element that is a child of bookstore. |
| /bookstore/book[last()] | Select the last book element that is a child of bookstore. |
| /bookstore/book[last()-1] | Select the second-to-last book element that is a child of bookstore. |
| /bookstore/book[position()<3] | Select the first two book elements that are children of bookstore. |
| //title[@lang] | Select all title elements that have an attribute named lang. |
| //title[@lang='eng'] | Select all title elements whose lang attribute has the value eng. |
| /bookstore/book[price>35.00] | Select all book elements under bookstore whose price element has a value greater than 35.00. |
| /bookstore/book[price>35.00]/title | Select the title elements of all book elements under bookstore whose price element has a value greater than 35.00. |

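
Several of the path expressions in the table above can be tried without installing anything. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree, which supports just a limited subset of XPath and is not the lxml library this lesson uses later. Paths in ElementTree are relative to the element they are called on, and numeric comparisons such as price>35.00 are not available, so that row is emulated with an ordinary Python filter. The variable names are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="zh">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>"""

# fromstring() returns the root element, here <bookstore>.
root = ET.fromstring(BOOKSTORE_XML)

# Called on the root, 'book[1]' plays the role of /bookstore/book[1].
first_title = root.find('book[1]/title').text
# 'book[last()]' plays the role of /bookstore/book[last()].
last_title = root.find('book[last()]/title').text

# './/title[@lang]' corresponds to //title[@lang].
titles_with_lang = [t.text for t in root.findall('.//title[@lang]')]

# ".//title[@lang='eng']" corresponds to //title[@lang='eng'].
english_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# Emulates /bookstore/book[price>35.00]/title with a plain Python filter,
# because ElementTree's XPath subset has no numeric comparison.
expensive_titles = [
    book.findtext('title')
    for book in root.findall('book')
    if float(book.findtext('price')) > 35.00
]

print(first_title, last_title)
print(titles_with_lang, english_titles, expensive_titles)
```

With a full XPath engine such as lxml, each expression in the table can be passed to the xpath() method as written, including the absolute paths and the price comparison.
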
XPath also supports wildcard usage, as shown below.

| Path Expression | Result |
|---|---|
| /bookstore/* | Select all child elements of the bookstore element. |
| //* | Select all elements in the document. |
| //title[@*] | Select all title elements that have at least one attribute. |

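
The wildcard rows above can be explored in the same spirit. This is again a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; its documented XPath subset does not list the [@*] predicate, so that row is emulated by checking each element's attribute dictionary. The variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="zh">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)

# '*' called on the root corresponds to /bookstore/*: its direct children.
child_tags = [el.tag for el in root.findall('*')]

# './/*' selects every descendant of the root. Unlike //* it does not
# include the root element itself.
descendant_tags = [el.tag for el in root.findall('.//*')]

# Emulates //title[@*]: keep title elements whose attribute dict is not empty.
titles_with_attrs = [el.text for el in root.iter('title') if el.attrib]

print(child_tags)
print(descendant_tags)
print(titles_with_attrs)
```
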
To select nodes along several paths at once, combine the path expressions with the | operator, as shown below.

| Path Expression | Result |
|---|---|
| //book/title \| //book/price | Select all title and price elements of the book elements. |
| //title \| //price | Select all title and price elements in the document. |
| /bookstore/book/title \| //price | Select all title elements of book elements under bookstore, and all price elements in the document. |

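
The effect of combining two paths can be sketched as follows. The standard library's xml.etree.ElementTree has no | operator, so the hypothetical helper union() below emulates it by collecting the matches of each path and returning them in document order, which is how engines such as lxml typically return a combined result. Treat it as an illustration of the idea, not as how lxml implements it.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="zh">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)


def union(element, *paths):
    """Emulate path1 | path2: merge the matches, keeping document order."""
    wanted = set()
    for path in paths:
        wanted.update(element.findall(path))
    # iter() walks the tree in document order, so the result is ordered too.
    return [el for el in element.iter() if el in wanted]


# Emulates //book/title | //book/price
selected = union(root, './/book/title', './/book/price')
selected_text = [el.text for el in selected]
print(selected_text)
```
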
Note: The examples above come from the XPath tutorial on the Runoob website. Interested readers can consult the original text.
Of course, if you are not familiar with XPath syntax, you can look up the XPath of an element in the browser developer tools, as shown below. The picture below shows how to view the XPath of a movie title on a Douban movie detail page in Chrome Developer Tools.
To implement XPath parsing, the third-party library lxml is needed. You can use the command below to install lxml.

    pip install lxml

Below, we use XPath parsing to rewrite the earlier code that fetches the Douban Movie Top250.

    from lxml import etree
    import requests

    for page in range(1, 11):
        resp = requests.get(
            url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
            headers={'User-Agent': 'BaiduSpider'}
        )
        # Build an element tree from the HTML text of the response
        tree = etree.HTML(resp.text)
        # Extract movie titles from the page through XPath syntax
        title_spans = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]')
        # Extract movie ratings from the page through XPath syntax (one span per li)
        rank_spans = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]')
        # Pair each title with its rating and print them
        for title_span, rank_span in zip(title_spans, rank_spans):
            print(title_span.text, rank_span.text)

For developers who are familiar with CSS selectors and JavaScript, getting page elements through CSS selectors may be the simpler choice, because JavaScript running in the browser can already use the querySelector() and querySelectorAll() methods of the document object to get page elements by CSS selector. In Python, the third-party libraries beautifulsoup4 and pyquery do the same job. Beautiful Soup can parse HTML and XML documents, repair documents that contain errors such as unclosed tags, and build a tree structure in memory for the page being parsed, which encapsulates the work of extracting data from the page. You can use the command below to install Beautiful Soup.

    pip install beautifulsoup4

Below is the code rewritten with bs4 to get the titles and ratings of the Douban Movie Top250 movies.

    import bs4
    import requests

    for page in range(1, 11):
        resp = requests.get(
            url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
            headers={'User-Agent': 'BaiduSpider'}
        )
        # Create a BeautifulSoup object
        soup = bs4.BeautifulSoup(resp.text, 'lxml')
        # Extract span tags containing movie titles from the page through CSS selectors
        title_spans = soup.select('div.info > div.hd > a > span:nth-child(1)')
        # Extract span tags containing movie ratings from the page through CSS selectors
        rank_spans = soup.select('div.info > div.bd > div > span.rating_num')
        for title_span, rank_span in zip(title_spans, rank_spans):
            print(title_span.text, rank_span.text)

For more about Beautiful Soup, you can refer to its official documentation.
Below we make a simple comparison of the three parsing methods.

| Parsing Method | Module | Speed | Difficulty |
|---|---|---|---|
| Regular expression parsing | re | Fast | Hard |
| XPath parsing | lxml | Fast | Medium |
| CSS selector parsing | bs4 or pyquery | Uncertain | Easy |
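
The difference in difficulty can be felt on a small example. The sketch below extracts the same data twice, once with a regular expression and once with a path expression. The markup in SNIPPET is a hypothetical, well-formed fragment invented for this comparison, and the path-based half uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it only works because the fragment is valid XML. Real pages usually need lxml or Beautiful Soup.

```python
import re
import xml.etree.ElementTree as ET

# A hypothetical, well-formed fragment; real pages are rarely this tidy.
SNIPPET = """<ol>
    <li><span class="title">Movie A</span><span class="rating_num">9.7</span></li>
    <li><span class="title">Movie B</span><span class="rating_num">9.6</span></li>
</ol>"""

# Regular expression: the pattern is tied to the exact text of the markup,
# so any change in attribute order or whitespace breaks it.
pattern = re.compile(
    r'<span class="title">(.*?)</span><span class="rating_num">(.*?)</span>'
)
by_regex = pattern.findall(SNIPPET)

# Path expression: describes the structure of the document instead.
root = ET.fromstring(SNIPPET)
by_path = [
    (li.findtext("span[@class='title']"), li.findtext("span[@class='rating_num']"))
    for li in root.findall('li')
]

print(by_regex)
print(by_path)
```

Both approaches return the same pairs here, but the path-based version keeps working if, for instance, whitespace is added between the two span tags, while the regular expression would need to be rewritten.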

