Beginner’s guide to Web Scraping with Cheerio


The internet is filled with vast amounts of valuable data. However, much of it isn’t available in a structured format we can download and use. To handle that, we can use web scraping: the process of extracting data from websites that don’t provide it in a structured format out of the box, for example, through a REST API. Typically, it involves fetching a web page and analyzing its content programmatically.

Such data can be used for monitoring prices, analyzing competitors, tracking trends, and more. While this information is often publicly available on the internet, scraping it may still violate the Terms of Service of some websites. Therefore, you should proceed with caution.

Scraping using Cheerio

One approach to scraping is to fetch the raw HTML content of a website and parse it using the Cheerio library, which has an intuitive set of functions that mimic jQuery.
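As a small illustration of that jQuery-like style, the sketch below parses an inline HTML string instead of a real page. The markup and the variable names are made up for the example.

```typescript
import * as cheerio from 'cheerio';

// A made-up HTML snippet standing in for a downloaded page.
const html = `
  <ul id="languages">
    <li class="language"><a href="/wiki/JavaScript">JavaScript</a></li>
    <li class="language"><a href="/wiki/TypeScript">TypeScript</a></li>
  </ul>
`;

// cheerio.load parses the markup and returns a jQuery-like query function.
const $ = cheerio.load(html);

// CSS selectors and chained calls work much like they do in jQuery.
const firstLanguage = $('#languages .language').first().text();
const firstLink = $('#languages a').first().attr('href');
const languageCount = $('.language').length;

console.log(firstLanguage, firstLink, languageCount);
```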

Fetching and parsing the page

First, we need to fetch the contents of a web page and parse it with Cheerio. To test it, let’s use the JavaScript page on Wikipedia.

fetchWebsiteContent.ts
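A minimal sketch of what such a function could look like. It assumes Node.js 18 or newer, where fetch is available globally. The check for a failed response is an addition for this example, not necessarily part of the original code.

```typescript
import * as cheerio from 'cheerio';

// Downloads the raw HTML of a page and hands it to Cheerio for parsing.
async function fetchWebsiteContent(url: string) {
  const response = await fetch(url);

  // Assumption for this sketch: treat any non-2xx response as an error.
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with status ${response.status}`);
  }

  const html = await response.text();

  // The returned function can be queried with CSS selectors.
  return cheerio.load(html);
}
```

Calling it with https://en.wikipedia.org/wiki/JavaScript would resolve to a query function for the JavaScript page on Wikipedia.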

To use the above function, we must provide the URL of the page we want to fetch.

index.ts

Reading the contents of the page

When we look at the JavaScript page on Wikipedia, we can see that it lists various information about JavaScript.

To scrape it, we must look closer at the website’s HTML. First, let’s focus on the name of the language.

It looks like we need to find an element with a specific class.

scrapeProgrammingLanguageInformation.ts
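A hedged sketch of this step. The class name mw-page-title-main and the surrounding markup are assumptions about Wikipedia's heading; the sample HTML is a simplified stand-in so the snippet runs without a network request.

```typescript
import * as cheerio from 'cheerio';

// Simplified stand-in for the page heading; the class name is an assumption.
const sampleHtml = `
  <h1 id="firstHeading">
    <span class="mw-page-title-main">JavaScript</span>
  </h1>
`;

// Finds the element by its class and returns its text content.
function scrapeLanguageName($: cheerio.CheerioAPI): string {
  return $('.mw-page-title-main').first().text().trim();
}

const $ = cheerio.load(sampleHtml);
const languageName = scrapeLanguageName($);

console.log(languageName);
```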

When using Cheerio, we typically use the $ variable name to mimic the jQuery API.

Above, we look through the page and find the element with that class. Then, we retrieve its text content.

More advanced queries

Now, let’s find the designer of the JavaScript language. When we take a look at the HTML from Wikipedia, we can see that it is stored in a table row.

To find out who designed the language, we can:

  1. Find the anchor element by looking for the “Designed by” content.
  2. Look for its closest table cell ancestor.
  3. Find that cell’s closest sibling.
  4. Extract the text content from that sibling.

When doing that, we need to watch out for the hard space (&nbsp;) included in the title.

scrapeProgrammingLanguageInformation.ts
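The steps above can be sketched as follows. The sample markup is a simplified assumption about how the row looks (a header cell with the anchor, followed by a data cell), and scrapeDesigner is a name picked for this example.

```typescript
import * as cheerio from 'cheerio';

// Simplified stand-in for one table row; the real markup may differ.
const sampleHtml = `
  <table class="infobox">
    <tr>
      <th><a href="/wiki/Software_design">Designed&nbsp;by</a></th>
      <td>Brendan Eich</td>
    </tr>
  </table>
`;

function scrapeDesigner($: cheerio.CheerioAPI): string | null {
  // 1. Find the anchor whose text is "Designed by". The parser decodes
  //    &nbsp; into the \u00a0 character, so we compare against that.
  const anchor = $('a')
    .filter((_, element) => $(element).text() === 'Designed\u00a0by')
    .first();

  // 2. Walk up to the closest header cell.
  // 3. Step to the data cell right next to it.
  const dataCell = anchor.closest('th').next('td');

  // 4. Extract the text, or return null when the row is missing.
  return dataCell.length > 0 ? dataCell.text().trim() : null;
}

const $ = cheerio.load(sampleHtml);
const designer = scrapeDesigner($);

console.log(designer);
```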

We need to do similar operations to extract the date of the first appearance and the current stable release.

scrapeProgrammingLanguageInformation.ts
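One way to avoid repeating the same lookup is a small helper that takes the row label as a parameter. This is a sketch under assumptions: the labels “First appeared” and “Stable release”, the header-cell layout, and the sample values are illustrative, and getRowValue is a hypothetical helper.

```typescript
import * as cheerio from 'cheerio';

interface ProgrammingLanguageInformation {
  designer: string | null;
  firstAppeared: string | null;
  stableRelease: string | null;
}

// Simplified stand-in for the table; labels and values are sample data.
const sampleHtml = `
  <table class="infobox">
    <tr><th><a href="#">Designed&nbsp;by</a></th><td>Brendan Eich</td></tr>
    <tr><th>First&nbsp;appeared</th><td>4 December 1995</td></tr>
    <tr><th><a href="#">Stable release</a></th><td>ECMAScript 2024</td></tr>
  </table>
`;

// Returns the text of the data cell that follows the given header label.
function getRowValue($: cheerio.CheerioAPI, label: string): string | null {
  const headerCell = $('th')
    // Replace hard spaces with regular ones before comparing labels.
    .filter((_, element) => $(element).text().replace(/\u00a0/g, ' ').trim() === label)
    .first();

  const value = headerCell.next('td').text().trim();
  return value.length > 0 ? value : null;
}

function scrapeProgrammingLanguageInformation(
  $: cheerio.CheerioAPI,
): ProgrammingLanguageInformation {
  return {
    designer: getRowValue($, 'Designed by'),
    firstAppeared: getRowValue($, 'First appeared'),
    stableRelease: getRowValue($, 'Stable release'),
  };
}

const information = scrapeProgrammingLanguageInformation(cheerio.load(sampleHtml));

console.log(information);
```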

Now, we can use our function to fetch the data about a particular programming language.

This can work for other programming languages on Wikipedia.

However, the crucial thing is that our code needs to be tailored to handle the specific HTML. For example, if Wikipedia changes the class name to something else, our code won’t work anymore.

Unfortunately, Wikipedia is not necessarily consistent. For example, the page for Bash does not have the “Designed by” section. Instead, it has the “Original author” and “Developer” sections.
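One possible way to cope with such differences is to read every row into a map and then try several labels in order. This is only a sketch: the fallback order, the helper names, and the simplified sample markup are assumptions, not something Wikipedia guarantees.

```typescript
import * as cheerio from 'cheerio';

// Simplified stand-in for a page without the "Designed by" row.
const sampleHtml = `
  <table class="infobox">
    <tr><th>Original author</th><td>Brian Fox</td></tr>
    <tr><th>Developer</th><td>Chet Ramey</td></tr>
  </table>
`;

// Collects label/value pairs from every row that has both cells.
function readInfoboxRows($: cheerio.CheerioAPI): Map<string, string> {
  const rows = new Map<string, string>();

  $('table.infobox tr').each((_, row) => {
    const label = $(row).find('th').first().text().replace(/\u00a0/g, ' ').trim();
    const value = $(row).find('td').first().text().trim();
    if (label && value) {
      rows.set(label, value);
    }
  });

  return rows;
}

// Returns the value of the first label that exists, or null if none do.
function pickFirst(rows: Map<string, string>, labels: string[]): string | null {
  for (const label of labels) {
    const value = rows.get(label);
    if (value !== undefined) {
      return value;
    }
  }
  return null;
}

const rows = readInfoboxRows(cheerio.load(sampleHtml));
const author = pickFirst(rows, ['Designed by', 'Original author', 'Developer']);

console.log(author);
```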

Fetching multiple pages

We can use Cheerio to traverse multiple pages by following anchor (<a>) elements. For example, let’s fetch the information about all curly bracket languages.

To do that, we first need to fetch the page that contains the URLs of the pages we want to scrape.

fetchMultipleProgrammingLanguagesData.ts

Now, we need to create an array of URLs where each one points to a page about a particular programming language.

fetchMultipleProgrammingLanguagesData.ts
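A sketch of turning the anchors on a list page into absolute URLs. The list markup, the selector, and the extractPageUrls helper are assumptions for illustration. Relative links are resolved against the Wikipedia origin with the standard URL constructor.

```typescript
import * as cheerio from 'cheerio';

const baseUrl = 'https://en.wikipedia.org';

// Made-up list markup standing in for the page that links to each language.
const sampleListHtml = `
  <ul class="languages">
    <li><a href="/wiki/C_(programming_language)">C</a></li>
    <li><a href="/wiki/JavaScript">JavaScript</a></li>
    <li><a href="/wiki/JavaScript">JavaScript (repeated link)</a></li>
    <li><a>Anchor without a link</a></li>
  </ul>
`;

function extractPageUrls($: cheerio.CheerioAPI, linkSelector: string): string[] {
  const urls = $(linkSelector)
    // Read the href attribute of every matching anchor.
    .map((_, element) => $(element).attr('href'))
    .get()
    .filter((href): href is string => typeof href === 'string')
    // Turn relative paths such as /wiki/JavaScript into absolute URLs.
    .map((href) => new URL(href, baseUrl).toString());

  // Remove duplicates so each page is fetched only once.
  return Array.from(new Set(urls));
}

const pageUrls = extractPageUrls(cheerio.load(sampleListHtml), '.languages a[href]');

console.log(pageUrls);
```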

Now, we should fetch the information about every programming language. To do that in parallel, we can use a function such as Promise.all.

fetchMultipleProgrammingLanguagesData.ts
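A sketch of the parallel step, assuming Promise.all is used. To keep the snippet runnable without network access, scrapeLanguagePage is a stand-in that derives a name from the URL. In a real script, it would fetch and scrape the page.

```typescript
// Hypothetical result shape; a real scraper would likely return more fields.
interface LanguageInformation {
  url: string;
  name: string;
}

// Stand-in for the function that fetches and scrapes a single page.
async function scrapeLanguagePage(url: string): Promise<LanguageInformation> {
  const name = decodeURIComponent(url.split('/').pop() ?? '');
  return { url, name };
}

// Starts every request right away and waits until all of them finish.
// The results come back in the same order as the input URLs.
function scrapeAllPages(urls: string[]): Promise<LanguageInformation[]> {
  return Promise.all(urls.map((url) => scrapeLanguagePage(url)));
}

const exampleUrls = [
  'https://en.wikipedia.org/wiki/JavaScript',
  'https://en.wikipedia.org/wiki/Java_(programming_language)',
];

scrapeAllPages(exampleUrls).then((results) => {
  console.log(results.map((result) => result.name));
});
```

Keep in mind that Promise.all rejects as soon as any single promise rejects, so one failing page makes the whole batch fail.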

With this approach, we can scrape multiple pages at once.

It’s crucial to keep in mind that not every Wikipedia page follows the same HTML structure. For example, the first item on the list doesn’t have the data we’re looking for, and we’re not handling it gracefully.
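One possible way to handle such pages more gracefully is to catch errors per page and drop results that lack the data. This is a sketch, not the approach used in the code above. The result shape, the helper names, and fakeScraper are hypothetical.

```typescript
// Hypothetical result shape used only for this sketch.
interface LanguageInformation {
  name: string;
  designer: string | null;
}

type PageScraper = (url: string) => Promise<LanguageInformation>;

// Wraps a scraper so that a failing page yields null instead of rejecting.
async function scrapeSafely(
  url: string,
  scrapePage: PageScraper,
): Promise<LanguageInformation | null> {
  try {
    return await scrapePage(url);
  } catch (error) {
    console.warn(`Skipping ${url}:`, error);
    return null;
  }
}

async function scrapeAvailablePages(
  urls: string[],
  scrapePage: PageScraper,
): Promise<LanguageInformation[]> {
  const results = await Promise.all(urls.map((url) => scrapeSafely(url, scrapePage)));

  // Drop failed pages and pages that lack the data we are after.
  return results.filter(
    (result): result is LanguageInformation => result !== null && result.designer !== null,
  );
}

// Fake scraper for demonstration: one page throws, one lacks the designer.
const fakeScraper: PageScraper = async (url) => {
  if (url.endsWith('/broken')) {
    throw new Error('Unexpected page structure');
  }
  if (url.endsWith('/list')) {
    return { name: 'List page', designer: null };
  }
  return { name: 'JavaScript', designer: 'Brendan Eich' };
};

scrapeAvailablePages(['/broken', '/list', '/wiki/JavaScript'], fakeScraper).then((pages) => {
  console.log(pages);
});
```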

Summary

In this article, we’ve explored using the Cheerio library for web scraping. To do that, we first fetched the HTML contents of the page and then parsed them using Cheerio.

Cheerio is a great tool for web scraping in many cases, but it does have some limitations. The key drawback is that it relies on the full HTML content being available when we fetch the page. However, many websites are interactive and use JavaScript to load and display data dynamically. Since Cheerio doesn’t execute JavaScript, it can’t scrape content that is rendered on the client side. If we want to do that, we should use tools such as Playwright.

Still, Cheerio is a great choice for web scraping if we aim for simplicity and don’t need to deal with dynamic content.
