Data Scraping with BeautifulSoup and Selenium in Python | Jim Zhang's blog
2023-02-27

Installation

The package BeautifulSoup is named after a song from the tenth chapter of Alice's Adventures in Wonderland by Lewis Carroll:

Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!
...
Chapter 10, Alice's Adventures in Wonderland, Lewis Carroll.

That is, the library "turns the rotten into the magical": it makes messy HTML easy to work with. Installing BeautifulSoup and Selenium is easy; a single command suffices:

$ pip install beautifulsoup4 selenium

Please note that the package name is beautifulsoup4, not BeautifulSoup: pip install BeautifulSoup installs the obsolete BeautifulSoup 3.

After the installation is complete, we can verify it in Python's interactive console, for example:

Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> import selenium
>>>

If there is no error message, it means that the installation is successful.
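We can also check which version was installed; a small sketch, relying on the `__version__` attribute that bs4 exposes:

```python
import bs4

# bs4 stores the installed version as a string, e.g. "4.11.1"
print(bs4.__version__)
```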

BeautifulSoup

Local HTML file

First, we can test with a local HTML file. The content of this file is as follows:

<!--test.html-->
<html>
    <head>
        <title>This is a test.</title>
    </head>
    <body>
        <h1>Test</h1>
        <p>This is a test.</p>
        <p>This is the second test.</p>
    </body>
</html>

It corresponds to the following DOM (Document Object Model) tree:

Document
└── html
    ├── head
    │   └── title
    └── body
        ├── h1
        ├── p
        └── p

This tree structure tells us the relationships between the elements in the HTML file:

  • Document is the root node of the DOM tree, and the html element is its child node.
  • The html is the root element of the HTML file, and the head and body elements are its child nodes.
  • The head element holds the document's metadata, and the title element is its child node.
  • The body element holds the visible content, and the h1 and two p elements are its child nodes.
  • title, h1 and p are all leaf nodes.
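The relationships listed above can be verified with BeautifulSoup itself; a minimal sketch, parsing the same markup from a string rather than a file:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>This is a test.</title></head>"
        "<body><h1>Test</h1><p>This is a test.</p>"
        "<p>This is the second test.</p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# Walk up the tree: title -> head -> html
print(soup.title.parent.name)         # head
print(soup.title.parent.parent.name)  # html

# Walk down the tree: body's element children are h1, p, p
children = [c.name for c in soup.body.children if getattr(c, 'name', None)]
print(children)                       # ['h1', 'p', 'p']
```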

The BeautifulSoup package can parse the DOM tree of the HTML file, for example:

from bs4 import BeautifulSoup

with open('test.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup.prettify())

The output is as follows:

<html>
 <head>
  <title>
   This is a test.
  </title>
 </head>
 <body>
  <h1>
   Test
  </h1>
  <p>
   This is a test.
  </p>
  <p>
   This is the second test.
  </p>
 </body>
</html>

The parsed document is now stored in the variable soup, and we can read the file's content through it; in particular, we can navigate to child elements by attribute access. For instance:

print(soup.html.head.title)

The output is as follows:

<title>This is a test.</title>

As we can see, soup.html.head.title resolves to the title element of the HTML file. This works here because title is unique; attribute access like this cannot pick out one element among several that share the same tag.
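For instance, attribute access on a repeated tag silently returns only the first match; a small sketch, parsing the markup from a string for self-containment:

```python
from bs4 import BeautifulSoup

html = "<p>This is a test.</p><p>This is the second test.</p>"
soup = BeautifulSoup(html, 'html.parser')

# soup.p is shorthand for soup.find('p'): only the first <p> is returned
print(soup.p)  # <p>This is a test.</p>
```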

In this situation, we can use the find_all() method to get all the elements with a given tag. For example:

print(soup.find_all('p'))

The output is as follows:

[<p>This is a test.</p>, <p>This is the second test.</p>]
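Since find_all() returns a list, individual matches can be selected by index; a quick sketch, again parsing from a string:

```python
from bs4 import BeautifulSoup

html = "<p>This is a test.</p><p>This is the second test.</p>"
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')
print(len(paragraphs))  # 2
print(paragraphs[1])    # <p>This is the second test.</p>
```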

To get the text content of an element, we can use the text attribute. For example:

print(soup.title.text)

The output is as follows:

This is a test.
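Combining find_all() with the text attribute extracts the text of every matching element; a small sketch:

```python
from bs4 import BeautifulSoup

html = "<p>This is a test.</p><p>This is the second test.</p>"
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <p> element in document order
texts = [p.text for p in soup.find_all('p')]
print(texts)  # ['This is a test.', 'This is the second test.']
```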

Remote HTML file / web page

Now let's try to get the content of a remote HTML file. Before BeautifulSoup can parse its DOM tree, we first need to download the page, which we can do with the requests package. For example: