Introduction to web scraping and examples in Python with BeautifulSoup4 & Selenium

SangGyu An · Published in CodeX · 8 min read · Mar 19, 2022


Since the advent of the Internet, the scope of data has expanded over time, and its amount has grown exponentially. That's because, unlike in the past, anyone can upload any type of data to the Internet. One of the benefits of this is that anyone can also download that data, and one way of doing so is web scraping.


What is web scraping?

As the name suggests, it is a data collection method that automatically reads off and saves information from a web page. You may have heard some people use web crawling interchangeably with web scraping. The two work similarly, but they are used with different intentions.

You generally use web scraping when you know in advance which data you want to extract from a web page. In contrast, web crawling downloads the entire web page and archives it, so you use web crawling when you want the whole page for later use. But how is web scraping possible in the first place? That question leads to the structure of a web page.

How a web page is formed

Since this isn't the main topic of the post, I will only provide a brief explanation. If you already know the basics, you can skip to the next section.

Everything that you see on a web page is written in HTML, CSS, and JavaScript code. Among those, HTML constructs the content of a web page, CSS styles that content, and JavaScript adds interactivity to the page.

Web page with only html / Web page with html & css

Since most people’s main focus is the content of a web page, web scraping generally cares only about the HTML code of a web page.

What does HTML consist of?

The main component of HTML code is the tag, and there are many types of them. Most of them consist of an opening part and a closing part that look like <tag_name> and </tag_name>, respectively. Each tag has a unique purpose, and all of them combined define a web page’s content. It isn’t necessary to know all of their purposes, but knowing a few common ones will be helpful.

body tag: the parent tag of all content shown on a web page
div tag: sets off a division of a web page
p tag: contains a paragraph of text
a tag: contains a hyperlink to another page
li tag: represents an item in a list

To see the list of tags, visit Way2Tutorial.

What’s inside the tags

We can divide that into two parts. First, the items inside <> (other than the tag name itself) are called attributes, and they define the characteristics of a tag. class, id, and href fall into this category. To explain these briefly: a class applies the same style to every element that carries it, so the same class can appear on multiple elements. Unlike a class, an id is a unique identifier of an element on a web page, so each id should exist only once per page, although most websites are lenient about this. And href holds a link to another web page. If you want to know more about attributes, visit w3schools.

Second, whatever sits between <> and </> is the actual content a tag holds. It can be either text or other tags, which is how tags become nested.

An example of how tags look like
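To make this concrete, here is a minimal, made-up HTML fragment (wrapped in a Python string, since we will be working with HTML from Python later in this post). The tag names are real HTML; the class name, id, and URLs are invented for illustration:

# A made-up HTML fragment showing tags, attributes, and nesting
html = """
<body>
  <div class="news-list" id="top-stories">
    <p>Latest articles:</p>
    <ul>
      <li><a href="https://example.com/article-1">Article 1</a></li>
      <li><a href="https://example.com/article-2">Article 2</a></li>
    </ul>
  </div>
</body>
"""

Here the div tag carries class and id attributes, the a tags carry href attributes, and the li tags sit nested inside the ul tag.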

How does web scraping work?

As I mentioned above, web scraping is reading information off a website. So the basic process is:

1. Find the base URL
2. Open a web page
3. Read its HTML file
4. Locate elements
5. Extract them
6. Save data

The details of this process differ depending on what and where you are trying to extract. In this post, I will explain two methods with examples: using BeautifulSoup4 and Selenium.

Using BeautifulSoup4 in Python

Let’s say you want to extract articles from the National Institutes of Health (NIH) website. You first need to find where the articles are located and get the URL of that page.

Once you’ve identified the base URL, you use the requests package in Python to get the web page. However, you can’t use the returned result directly, because its content comes back as raw bytes, which are very hard to extract information from.

Using only the requests library
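A minimal sketch of this step might look like the following; the URL is the NIH news-releases page assumed for this example:

import requests

# Assumed base URL for the NIH news-releases page
url = "https://www.nih.gov/news-events/news-releases"
response = requests.get(url)

print(type(response.content))   # <class 'bytes'> -- raw bytes, awkward to search
print(response.content[:200])   # first 200 bytes of the raw HTML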

Thus, you should parse the content with BeautifulSoup() from the bs4 package. The benefit of doing this is that it allows you to locate elements with tag, class, or id. You can also move up and down between nested tags to reach your target tags.

Using the requests and bs4 libraries for retrieving web page content
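Parsing turns those bytes into a navigable tree of tags. A sketch, continuing from the previous snippet:

from bs4 import BeautifulSoup

# Parse the raw bytes into a tree we can search by tag, class, or id
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.get_text())    # e.g. the page title, now trivially reachable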

So before you start extracting data, you need to look into the source code of the web page and figure out what uniquely identifies your target elements in terms of tag, class, or id. Do they have a unique class name? Are they children of a certain tag? Are they inside a unique tag? Using “right-click → Inspect” on the web page will help you find answers to such questions.

How to use Inspect on a web page

When you find a unique way to identify the tags, you pass it to find() or find_all() as a parameter. In my case, using the tag and class names was sufficient.

Extracting tags I want from a web page
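A sketch of that extraction step; the class name below is hypothetical, so check the real one with Inspect before running this:

# "teaser-clickable" is a hypothetical class name -- find the real one with Inspect
rows = []
for article in soup.find_all("div", class_="teaser-clickable"):
    link = article.find("a")                  # headline link inside each teaser
    rows.append([link.get_text(strip=True),   # visible headline text
                 link["href"]])               # the href attribute of the <a> tag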

Lastly, you save your extracted data in a format of your choice, which can be CSV, JSON, SQL, etc.

Save extracted data in a csv file
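For example, using the csv module from the standard library to store the rows collected above:

import csv

# Write the extracted rows to a CSV file with a header
with open("nih_articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link"])
    writer.writerows(rows)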

What if BeautifulSoup4 is not enough?

Depending on the situation, you might need to interact with a web page, such as scrolling or clicking a button, to grab the information you want. Or sometimes, BeautifulSoup4 simply does not work because a web page needs to run its JavaScript code to render its HTML. In these cases, instead of using the requests and bs4 packages, you need to actually open the web page and interact with it. And Selenium can be a good choice for this.

An example with WashingtonPost web page that shows BeautifulSoup does not work
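A sketch of what that failure looks like; both the section URL and the class name are assumptions for illustration:

import requests
from bs4 import BeautifulSoup

# Assumed section URL and hypothetical class name
response = requests.get("https://www.washingtonpost.com/technology/")
soup = BeautifulSoup(response.content, "html.parser")

headlines = soup.find_all("div", class_="story-headline")
print(len(headlines))  # 0 -- the stories only appear after JavaScript runs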

Using Selenium in Python

Because Selenium opens an actual web page, it works a little differently from BeautifulSoup4. You need to set up a webdriver first. And since a cloud environment like Google Colab has no display for the browser to appear on, you have to use the following settings there.

The webdriver option to use selenium in Google Colab
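A sketch of that setup; it assumes Chrome and a matching chromedriver are already installed in the environment:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")               # run without showing a browser window
options.add_argument("--no-sandbox")             # drop the per-tab security sandbox
options.add_argument("--disable-dev-shm-usage")  # use /tmp instead of /dev/shm

# Assumes chromedriver is installed and discoverable on the PATH
driver = webdriver.Chrome(options=options)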

With the ‘--headless’ option, you run your code without visually showing a web page; you need this because Google Colab doesn’t support a GUI. The next option you should add is ‘--no-sandbox’, which makes the webdriver remove the security layer that prevents each tab from affecting other tabs. Lastly, the ‘--disable-dev-shm-usage’ option lets Chrome write shared memory files into /tmp instead of /dev/shm, which is often too small and can make Chrome crash.

After setting up the driver, you call get() on the driver object to open a web page of your choice. Unlike requests with BeautifulSoup4, you have an actual web page, so you don’t need to parse it and can locate elements right away. Similar to bs4, you can use tag, class, or id names with find_element() or find_elements(). If these are not sufficient, you can also use XPath, an XML path expression for finding any element on a web page.

Opening the web page and extracting tags
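A sketch of opening the page and locating elements; the URL, class name, and XPath are assumptions:

from selenium.webdriver.common.by import By

driver.get("https://www.washingtonpost.com/technology/")  # assumed URL

# Hypothetical class name and XPath -- verify both with Inspect
headlines = driver.find_elements(By.CLASS_NAME, "story-headline")
first = driver.find_element(By.XPATH, "//div[@class='story-headline']//a")
print(len(headlines), first.text)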

As mentioned previously, you can also interact with a web page. In my case, I also want to grab information that is shown after I click the “Load more” button on the WashingtonPost page.

Load more button’s relative xpath
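Locating the button might look like this; the XPath is hypothetical and would come from inspecting the real page:

# Hypothetical relative XPath for the "Load more" button
load_more = driver.find_element(By.XPATH, "//button[contains(text(), 'Load more')]")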

All I need to do is locate the button and send it the Enter-key value so that my code clicks it instead of me. Then, because it takes time to process the JavaScript code associated with the load button, I should wait until the page loads. And there are two ways to do that.

First, there is the implicit wait, which keeps retrying an element lookup for up to a specified time before it continues or throws an error. One of its downsides is that it applies globally to every lookup and gives you no control over what exactly you are waiting for. So the second method, called the explicit wait, comes in. This method waits until an expected condition is met or a specified time runs out: it stops waiting as soon as the condition is met, or throws an error once the time limit is reached. The expected condition can be that a specific element is present, clickable, located, etc. More conditions can be found in the Selenium documentation.

Using click and explicit wait with selenium
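Putting it together, a sketch of the click plus an explicit wait; the XPath and class name remain hypothetical:

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

before = len(driver.find_elements(By.CLASS_NAME, "story-headline"))

# Explicit wait until the button is clickable, then "click" it with Enter
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Load more')]"))
)
button.send_keys(Keys.ENTER)

# Explicit wait again, this time until extra stories have actually appeared
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CLASS_NAME, "story-headline")) > before
)
headlines = driver.find_elements(By.CLASS_NAME, "story-headline")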

Lastly, when you are done web scraping, you save the data you extracted like before.

Advantages and disadvantages of web scraping

With web scraping, you can collect almost any information from almost any website. On top of that, because it is an automated process, you can save a lot of time and money. However, you should use it with precaution.

Because web scraping relies heavily on the structure of a web page, and that structure can change at any time, your code can break at any time. Thus, you need to monitor the process regularly to make sure your code still works. Also, you are using other websites’ resources to collect data. If you blindly send lots of requests, you can overload their servers, distort their traffic insights, or make your activity look suspicious, which can ultimately get your IP banned. So it’s best to use an API if a website provides one. When it does not, use web scraping with a limited request rate, a proxy, and a mindset that respects others’ work.

Reference

[1] Perez, Martin. “Web Scraping vs Web Crawling: What’s the Difference?: ParseHub.” ParseHub Blog, ParseHub Blog, 24 Mar. 2021, https://www.parsehub.com/blog/web-scraping-vs-web-crawling/.

[2] Phoenix, James. “What Are the Advantages and Disadvantages of Web Scraping Data?” Just Understanding Data, 18 Feb. 2021, https://understandingdata.com/the-advantages-disadvantages-of-web-scraping-data/.

[3] Rungta, Krishna. “Implicit, Explicit and Fluent Wait in Selenium WebDriver.” Guru99, 12 Mar. 2022, https://www.guru99.com/implicit-explicit-waits-selenium.html.

[4] Rungta, Krishna. “Xpath in Selenium: How to Find & Write Text, Contains, or, And.” Guru99, 12 Mar. 2022, https://www.guru99.com/xpath-selenium.html.

[5] Way2Tutorial. All HTML Tags List, https://way2tutorial.com/html/tag/index.php.
