Installation
The BeautifulSoup package takes its name from a poem in the tenth chapter of Alice's Adventures in Wonderland by Lewis Carroll:
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!
...
Chapter 10, Alice's Adventures in Wonderland, Lewis Carroll.
In other words, "turning the rotten into the magical". Installing BeautifulSoup and Selenium is easy and takes a single command:
$ pip install BeautifulSoup4 selenium
Note that the command is pip install BeautifulSoup4, not pip install BeautifulSoup: the latter name refers to the older BeautifulSoup3.
Once the installation finishes, we can check that it succeeded from the interactive Python console, for example:
Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> import selenium
>>>
If there is no error message, it means that the installation is successful.
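The same check can also be scripted. The sketch below is one possible approach, not part of the original walkthrough: it uses the standard-library importlib.metadata module, and the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# These are the distribution names used by pip, not the import names (bs4, selenium).
for name in ("beautifulsoup4", "selenium"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

A version string next to each name indicates that pip installed the package into the environment running the script.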
BeautifulSoup
Local html file
First, we can use a local html file for testing. The content of this file is as follows:
<!--test.html-->
<html>
<head>
<title>This is a test.</title>
</head>
<body>
<h1>Test</h1>
<p>This is a test.</p>
<p>This is the second test.</p>
</body>
</html>
It corresponds to the DOM (Document Object Model) tree as follows:
Document
└── html
    ├── head
    │   └── title
    └── body
        ├── h1
        ├── p
        └── p
The tree-like structure is called the DOM tree. The DOM tree tells us the relationship between the elements in the HTML file:
- Document is the root node of the DOM tree, and the html element is its child node.
- The html element is the root element of the HTML file, and the head and body elements are its child nodes.
- The head element holds the document's metadata, and the title element is its child node.
- The body element holds the visible content, and the h1, p and p elements are its child nodes.
- title, h1 and p are the leaf elements of this tree: they contain only text.
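These parent-child relationships can be recovered programmatically. The following is a minimal sketch that assumes the markup of test.html is inlined as a string so that it runs on its own; the helper outline is a hypothetical name, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup, Tag

# Same markup as test.html, inlined so the sketch is self-contained.
HTML = """<html>
<head>
<title>This is a test.</title>
</head>
<body>
<h1>Test</h1>
<p>This is a test.</p>
<p>This is the second test.</p>
</body>
</html>"""


def outline(node, depth=0):
    """Return (depth, tag name) pairs for all element descendants, in document order."""
    rows = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes such as newlines
            rows.append((depth, child.name))
            rows.extend(outline(child, depth + 1))
    return rows


soup = BeautifulSoup(HTML, "html.parser")
tree = outline(soup)
for depth, name in tree:
    print("    " * depth + name)
```

The indentation of the printed names mirrors the nesting in the diagram: html at the top, head and body one level down, and title, h1 and the two p elements below them.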
The BeautifulSoup package can parse the DOM tree of the HTML file, for example:
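from bs4 import BeautifulSoup

# Parse the local file with Python's built-in HTML parser.
with open('test.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the parsed document, one tag per line, indented by depth.
print(soup.prettify())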
The output is as follows:
<html>
 <head>
  <title>
   This is a test.
  </title>
 </head>
 <body>
  <h1>
   Test
  </h1>
  <p>
   This is a test.
  </p>
  <p>
   This is the second test.
  </p>
 </body>
</html>
We can now read the content of the HTML file through the variable soup, and navigate to its child elements by tag name. For example:
print(soup.html.head.title)
The output is as follows:
<title>This is a test.</title>
Here soup.html.head.title is the title element of the HTML file. Because title is the only element with that tag, attribute-style access retrieves it unambiguously. The same approach cannot select a specific element when several elements share a tag: soup.p, for instance, returns only the first p element.
In this situation, we can use the find_all() method to get all the elements with the same tag. For example:
print(soup.find_all('p'))
The output is as follows:
[<p>This is a test.</p>, <p>This is the second test.</p>]
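To pick one particular element out of that result, the returned list can be indexed. The sketch below assumes a shortened copy of the test markup inlined as a string; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test markup, inlined so the sketch is self-contained.
HTML = "<body><h1>Test</h1><p>This is a test.</p><p>This is the second test.</p></body>"
soup = BeautifulSoup(HTML, "html.parser")

paragraphs = soup.find_all("p")  # list-like collection of every p element
first = soup.p                   # attribute access: only the first match
second = paragraphs[1]           # indexing selects a specific match

print(len(paragraphs))
print(first)
print(second)
```

Attribute access is convenient for unique tags, while find_all() with an index (or a loop) handles repeated ones.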
To get the text content of an element, we can use its text attribute. For example:
print(soup.title.text)
The output is as follows:
This is a test.
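The text attribute combines naturally with find_all(). As a rough sketch, again assuming a shortened copy of the test markup inlined as a string, the text of every paragraph can be collected in one pass; get_text(strip=True) is a related method that also trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test markup, inlined so the sketch is self-contained.
HTML = "<body><h1>Test</h1><p>This is a test.</p><p>This is the second test.</p></body>"
soup = BeautifulSoup(HTML, "html.parser")

# Collect the text of each p element, in document order.
texts = [p.text for p in soup.find_all("p")]

# get_text(strip=True) returns the same text with leading/trailing whitespace removed.
heading = soup.h1.get_text(strip=True)

print(texts)
print(heading)
```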
Remote html file / web page
Now let's parse a remote HTML file. Before BeautifulSoup can parse the DOM tree, we need to download the content of the page, which we can do with the requests package. For example: