
Parsing HTML Pages with Python

In earlier lessons we used the third-party library requests to fetch network resources and covered some basic frontend knowledge. Next, we look at how to parse HTML and extract useful information from a page. Previously we extracted page content with capturing groups in regular expressions, but writing a correct regular expression is tedious and error-prone. To address this, we first need a deeper understanding of how an HTML page is structured, and on that basis study other ways to parse a page.

Structure of an HTML Page

Open any website in the browser, then choose "View Page Source" from the right-click menu, and you can see the HTML code corresponding to the page.

In that code, line 1 is the document type declaration. The <html> tag on line 2 is the start tag of the root element of the page, and the last line, </html>, is its end tag. The <html> element has two children, <head> and <body>. Content placed under <body> is displayed in the browser window and forms the main body of the page. Content placed under <head> is not displayed, but it holds important metadata about the page and is usually called the head of the page. The rough structure of an HTML page is shown below.

<!doctype html>
<html>
    <head>
        <!-- Page metadata, such as character encoding, title, keywords, media queries, and so on -->
    </head>
    <body>
        <!-- The main body of the page, the content shown in the browser window -->
    </body>
</html>
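
To make the nesting described above concrete, here is a minimal sketch that walks a small page with the standard library's html.parser module and records how deeply each tag is nested. The sample page, the `OutlineParser` class, and the `VOID_TAGS` set are illustrative assumptions, not part of the original lesson.

```python
from html.parser import HTMLParser

# A small, made-up page that follows the skeleton shown above.
PAGE = """<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Demo page</title>
    </head>
    <body>
        <h1>Hello</h1>
        <p>Visible text</p>
    </body>
</html>"""

# Void elements have no end tag, so they must not change the nesting depth.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class OutlineParser(HTMLParser):
    """Record every start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


parser = OutlineParser()
parser.feed(PAGE)
for depth, tag in parser.outline:
    # Indent each tag by its depth to show the tree shape.
    print("    " * depth + tag)
```

The printed outline shows html at the top level, head and body one level below it, and the metadata and visible content one level further down.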

Tags, Cascading Style Sheets (CSS), and JavaScript are the three main building blocks of an HTML page. Tags carry the content to be shown on the page, CSS is responsible for rendering the page, and JavaScript controls the page's interactive behavior. To parse an HTML page, we can use XPath syntax. XPath was originally a query syntax for XML; it extracts tag content or tag attributes according to the hierarchy of HTML tags. We can also use CSS selectors to locate page elements, in the same way that CSS uses them to style elements.

XPath Parsing

XPath is a syntax for finding information in XML documents. XML (eXtensible Markup Language) is similar to HTML in that both are markup languages that carry data with tags. The difference is that XML tags are extensible and can be customized, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in an XML document. XPath distinguishes seven kinds of nodes: elements, attributes, text, namespaces, processing instructions, comments, and the root (document) node. Below, we use an example to explain how to use XPath to parse a page.

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>

For the XML file above, we can use the XPath syntax below to get nodes in the document.

Path Expression	Result
`/bookstore`	Select the root element `bookstore`. Note: a path that starts with a slash `/` is always an absolute path to an element.
`//book`	Select all `book` elements, no matter where they are in the document.
`//@lang`	Select all attributes named `lang`.
`/bookstore/book[1]`	Select the first `book` element that is a child element of `bookstore`.
`/bookstore/book[last()]`	Select the last `book` element that is a child element of `bookstore`.
`/bookstore/book[last()-1]`	Select the second-to-last `book` element that is a child element of `bookstore`.
`/bookstore/book[position()<3]`	Select the first two `book` elements that are child elements of `bookstore`.
`//title[@lang]`	Select all `title` elements that have an attribute named `lang`.
`//title[@lang='eng']`	Select all `title` elements whose `lang` attribute value is `eng`.
`/bookstore/book[price>35.00]`	Select all `book` elements under `bookstore` whose `price` element value is greater than `35.00`.
`/bookstore/book[price>35.00]/title`	Select all `title` elements of `book` elements under `bookstore`, where the `price` element value is greater than `35.00`.
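
The path expressions in the table above can be tried directly against the sample document. The sketch below assumes the third-party library lxml (installed later in this lesson with pip install lxml) is available; the variable names are illustrative.

```python
from lxml import etree

# The bookstore document from above. It is passed as bytes because
# lxml rejects str input that carries an XML encoding declaration.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>"""

root = etree.fromstring(XML)

# Positional predicates: XPath positions start at 1, not 0.
first_title = root.xpath('/bookstore/book[1]/title')[0].text
last_title = root.xpath('/bookstore/book[last()]/title')[0].text

# Selecting attributes returns their values as strings.
langs = root.xpath('//@lang')

# Filter on an attribute value, then on a child element's numeric value.
eng_titles = [t.text for t in root.xpath("//title[@lang='eng']")]
expensive = [t.text for t in root.xpath('/bookstore/book[price>35.00]/title')]

print(first_title, last_title)
print(langs, eng_titles, expensive)
```

Note that xpath() always returns a list for node-selecting expressions, even when only one node matches, which is why the single-title lookups index with [0].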

XPath also supports wildcard usage, as shown below.

Path Expression	Result
`/bookstore/*`	Select all child elements of the `bookstore` element.
`//*`	Select all elements in the document.
`//title[@*]`	Select all `title` elements that have attributes.

If you want to select multiple nodes, you can use the ways shown below.

Path Expression	Result
`//book/title | //book/price`	Select all `title` and `price` elements of the `book` elements.
`//title | //price`	Select all `title` and `price` elements in the document.
`/bookstore/book/title | //price`	Select all `title` elements of `book` elements under `bookstore`, and all `price` elements in the document.
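
As a quick check of the wildcard and union expressions, the following sketch runs a few of them on a compact copy of the bookstore document. It again assumes lxml is installed; the variable names are illustrative.

```python
from lxml import etree

# Compact copy of the bookstore document used in this section.
XML = b"""<bookstore>
    <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
    <book><title lang="zh">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# Wildcard: every direct child element of bookstore.
children = [el.tag for el in root.xpath('/bookstore/*')]

# Wildcard: every element in the document, including the root.
all_elements = root.xpath('//*')

# Attribute wildcard: title elements that carry at least one attribute.
with_attrs = [el.text for el in root.xpath('//title[@*]')]

# Union: the | operator merges the node sets selected by both paths.
union_tags = [el.tag for el in root.xpath('//book/title | //book/price')]

print(children)
print(len(all_elements))
print(with_attrs)
print(union_tags)
```

A union contains each matching node only once, even if both sides of the | operator select it.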

Note: The examples above come from the XPath tutorial on the Runoob website, which interested readers can consult for more detail.

If you are not familiar with XPath syntax, you can look up the XPath of an element in the browser developer tools. In Chrome Developer Tools, for example, right-click the element in the Elements panel and choose Copy, then Copy XPath. The picture below shows this for a movie title on a Douban movie detail page.

To implement XPath parsing, the third-party library lxml is needed. You can use the command below to install lxml.

pip install lxml

Below, we rewrite the earlier Douban Movie Top250 code to use XPath parsing.

from lxml import etree
import requests

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'BaiduSpider'}
    )
    tree = etree.HTML(resp.text)
    # Extract movie titles from the page through XPath syntax
    title_spans = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]')
    # Extract movie ratings from the page through XPath syntax
    # (one rating span per li, in the same order as the titles)
    rank_spans = tree.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]')
    for title_span, rank_span in zip(title_spans, rank_spans):
        print(title_span.text, rank_span.text)
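
Pairing two flat result lists with zip only works if every list item contains both a title and a rating. A more defensive variant is to select each li first and then run relative XPath expressions (starting with a dot) inside it. The sketch below uses a made-up HTML fragment whose structure and class names are assumptions chosen to resemble the page, so it runs without network access.

```python
from lxml import etree

# Hypothetical markup for illustration only; the real page may differ.
HTML = """
<div id="content">
  <ol>
    <li><div class="info">
      <div class="hd"><a href="#"><span>Movie A</span><span>Alias A</span></a></div>
      <div class="bd"><div><span class="rating_num">9.7</span></div></div>
    </div></li>
    <li><div class="info">
      <div class="hd"><a href="#"><span>Movie B</span></a></div>
      <div class="bd"><div></div></div>
    </div></li>
    <li><div class="info">
      <div class="hd"><a href="#"><span>Movie C</span></a></div>
      <div class="bd"><div><span class="rating_num">8.9</span></div></div>
    </div></li>
  </ol>
</div>
"""

tree = etree.HTML(HTML)
movies = []
for item in tree.xpath('//div[@id="content"]/ol/li'):
    # string(...) returns the text of the first matching node,
    # or an empty string when nothing matches.
    title = str(item.xpath('string(.//div[@class="hd"]/a/span[1])'))
    rating = str(item.xpath('string(.//span[@class="rating_num"])')) or None
    movies.append((title, rating))

for title, rating in movies:
    print(title, rating)
```

Here the second item has no rating, and it is reported as None instead of shifting the ratings of the items after it.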

CSS Selector Parsing

For developers who are familiar with CSS selectors and JavaScript, locating page elements through CSS selectors may be the simpler choice, because JavaScript running in the browser can already do this with the querySelector() and querySelectorAll() methods of the document object. In Python, the third-party libraries beautifulsoup4 and pyquery offer the same capability. Beautiful Soup parses HTML and XML documents, repairs documents that contain errors such as unclosed tags, and builds an in-memory tree for the page, wrapping the work of extracting data from it behind a simple interface. You can use the command below to install Beautiful Soup.

pip install beautifulsoup4

Below is the code rewritten with bs4 to get the titles and ratings of the Douban Movie Top250 movies.

import bs4
import requests

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'BaiduSpider'}
    )
    # Create a BeautifulSoup object (the 'lxml' parser requires the lxml package)
    soup = bs4.BeautifulSoup(resp.text, 'lxml')
    # Extract span tags containing movie titles from the page through CSS selectors
    title_spans = soup.select('div.info > div.hd > a > span:nth-child(1)')
    # Extract span tags containing movie ratings from the page through CSS selectors
    rank_spans = soup.select('div.info > div.bd > div > span.rating_num')
    for title_span, rank_span in zip(title_spans, rank_spans):
        print(title_span.text, rank_span.text)
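
The same selectors can be tried offline. The sketch below runs them on a made-up fragment that reuses the class names from the selectors above; the markup itself is an assumption, and it uses Python's built-in 'html.parser' backend instead of 'lxml' so that only beautifulsoup4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
HTML = """
<ol>
  <li><div class="info">
    <div class="hd"><a href="#"><span>Movie A</span><span>Alias A</span></a></div>
    <div class="bd"><div><span class="rating_num">9.7</span></div></div>
  </div></li>
  <li><div class="info">
    <div class="hd"><a href="#"><span>Movie B</span></a></div>
    <div class="bd"><div></div></div>
  </div></li>
</ol>
"""

soup = BeautifulSoup(HTML, 'html.parser')
rows = []
# select() returns every match; select_one() returns the first match or None.
for info in soup.select('div.info'):
    title_span = info.select_one('div.hd > a > span:nth-child(1)')
    rating_span = info.select_one('span.rating_num')
    title = title_span.get_text(strip=True)
    rating = rating_span.get_text(strip=True) if rating_span else None
    rows.append((title, rating))

for title, rating in rows:
    print(title, rating)
```

Selecting each div.info first and then querying inside it keeps every title paired with its own rating, even when an item has no rating.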

For more knowledge about BeautifulSoup, you can refer to its official documentation.

Summary

Below we make a simple comparison of the three parsing methods.

Parsing Method	Corresponding Module	Speed	Difficulty
Regular expression parsing	`re`	Fast	Hard
XPath parsing	`lxml`	Fast	Medium
CSS selector parsing	`bs4` or `pyquery`	Depends on the underlying parser	Easy
