Installation
The package BeautifulSoup is named after a song in the tenth chapter of Alice's Adventures in Wonderland by Lewis Carroll:
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!
...
Chapter 10, Alice's Adventures in Wonderland, Lewis Carroll.
That is, "turning the rotten into the magical". Installing BeautifulSoup and Selenium is easy and takes a single command:
$ pip install BeautifulSoup4 selenium
Please note that the command is pip install BeautifulSoup4, not pip install BeautifulSoup, because the latter installs the older BeautifulSoup3.
After the installation is complete, we can check whether it succeeded in the interactive Python console, for example:
Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> import selenium
>>>
If there is no error message, it means that the installation is successful.
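The same check can also be scripted. The sketch below is one possible approach, not part of the original walkthrough: it uses the standard library's importlib.metadata to look up the installed version of each distribution. The helper name installed_versions is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            result[name] = None
    return result


if __name__ == "__main__":
    # Distribution names as used on the pip command line above.
    for name, ver in installed_versions(["beautifulsoup4", "selenium"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

A value of None for either package means the pip command above still needs to be run in the environment that this interpreter uses.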
BeautifulSoup
Local html file
First, we can use a local html file for testing. The content of this file is as follows:
<!--test.html-->
<html>
<head>
<title>This is a test.</title>
</head>
<body>
<h1>Test</h1>
<p>This is a test.</p>
<p>This is the second test.</p>
</body>
</html>
It corresponds to the DOM (Document Object Model) tree as follows:
Document
└── html
    ├── head
    │   └── title
    └── body
        ├── h1
        ├── p
        └── p
The tree-like structure is called the DOM tree. The DOM tree tells us the relationship between the elements in the HTML file:
- Document is the root node of the DOM tree, and the html element is its child node.
- The html element is the root element of the HTML file, and the head and body elements are its child nodes.
- The head element is the head element of the HTML file, and the title element is its child node.
- The body element is the body element of the HTML file, and the h1, p and p elements are its child nodes.
- In this simplified tree, which leaves out the text inside each element, title, h1 and p are all leaf nodes.
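To make the nesting concrete, here is a minimal sketch that prints the same element outline using only the standard library's html.parser module. The class TagTreeBuilder and the function tag_outline are names invented for this illustration. The sketch assumes every opening tag has a matching closing tag, as in test.html, so void elements such as br would throw off the depth count.

```python
from html.parser import HTMLParser

# Same markup as test.html, inlined so the example is self-contained.
HTML = """<html>
<head>
<title>This is a test.</title>
</head>
<body>
<h1>Test</h1>
<p>This is a test.</p>
<p>This is the second test.</p>
</body>
</html>"""


class TagTreeBuilder(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []  # list of (depth, tag_name) in document order

    def handle_starttag(self, tag, attrs):
        self.rows.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # assumes the closing tag matches the last open tag


def tag_outline(html):
    """Return the element tree as text, two spaces of indent per level."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return "\n".join("  " * depth + tag for depth, tag in builder.rows)


print(tag_outline(HTML))
```

Each level of indentation in the printed outline corresponds to one parent-child step in the tree above.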
The BeautifulSoup package can parse the DOM tree of the HTML file, for example:
from bs4 import BeautifulSoup
with open('test.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup.prettify())
The output is as follows:
<html>
 <head>
  <title>
   This is a test.
  </title>
 </head>
 <body>
  <h1>
   Test
  </h1>
  <p>
   This is a test.
  </p>
  <p>
   This is the second test.
  </p>
 </body>
</html>
Then, we can read the content of the HTML file through the variable soup. Furthermore, we can access the child elements of soup. For instance:
print(soup.html.head.title)
The output is as follows:
<title>This is a test.</title>
We can see that soup.html.head.title is the title element of the HTML file. However, the title element is the only one of its kind, so it is easy to reach this way. The same approach does not work when several elements share a tag, because attribute-style access such as soup.p returns only the first match.
In this situation, we can use the find_all() method to get all the elements with the same tag. For example:
print(soup.find_all('p'))
The output is as follows:
[<p>This is a test.</p>, <p>This is the second test.</p>]
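As a small illustrative sketch, the result of find_all() can be indexed and iterated like a list, while find() and attribute-style access return only the first match. The markup is inlined here as a string, which is an assumption made so the example does not depend on test.html being on disk.

```python
from bs4 import BeautifulSoup

# Inlined copy of the body of test.html (assumption: same markup as above).
html = """<html>
<head><title>This is a test.</title></head>
<body>
<h1>Test</h1>
<p>This is a test.</p>
<p>This is the second test.</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")  # every <p> element, in document order
print(len(paragraphs))           # how many <p> elements were found
print(paragraphs[1])             # pick one by position, here the second

first = soup.find("p")           # only the first <p> element
print(first == soup.p)           # attribute-style access finds the same element
```

Picking an element by position is fragile when the page layout changes, so on real pages it is usually safer to narrow the search by attributes first.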
To get the text content of an element, we can use the text attribute. For example:
print(soup.title.text)
The output is as follows:
This is a test.
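The text attribute combines naturally with find_all(). The following sketch, again with the markup inlined as an assumption, collects the text of every p element and then uses get_text() with a separator to flatten the whole body into one line.

```python
from bs4 import BeautifulSoup

# Inlined copy of the body of test.html (assumption: same markup as above).
html = """<html>
<head><title>This is a test.</title></head>
<body>
<h1>Test</h1>
<p>This is a test.</p>
<p>This is the second test.</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Text of each <p> element, one list entry per element.
paragraph_texts = [p.text for p in soup.find_all("p")]
print(paragraph_texts)

# All text inside <body>, stripped of surrounding whitespace and
# joined with single spaces.
body_text = soup.body.get_text(" ", strip=True)
print(body_text)
```

Passing strip=True is useful here because the line breaks between tags would otherwise show up as extra whitespace in the result.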
Remote html file / web page
Now let's try a remote HTML file. Before BeautifulSoup can parse the DOM tree, we need to download the content of the page, which we can do with the requests package. For example: