The internet is filled with vast amounts of valuable data. However, much of that information isn’t available in a structured format we can simply download and use. This is where web scraping comes in. Web scraping is the process of extracting data from websites that don’t provide it in a structured format out of the box, for example through a REST API. Typically, it involves fetching a web page and analyzing its content programmatically.
Such data can be used for monitoring prices, analyzing competitors, tracking trends, and more. While this information is often publicly available on the internet, scraping it may still violate the Terms of Service of some websites. Therefore, you should proceed with caution.
Scraping using Cheerio
One approach to scraping is to simply fetch the raw HTML content of a website and parse it with the Cheerio library, which offers an intuitive set of functions that mimic jQuery.
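To get a feel for that API, here is a minimal sketch that loads a hard-coded HTML snippet and queries it with a CSS selector, just like we would with jQuery:

```typescript
import * as cheerio from 'cheerio';

// Load a hard-coded HTML string and query it the same way we would with jQuery.
const $ = cheerio.load('<ul><li>JavaScript</li><li>TypeScript</li></ul>');

$('li').each((_, element) => {
  console.log($(element).text()); // JavaScript, TypeScript
});
```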
Fetching and parsing the page
First, we need to fetch the contents of a web page and parse it with Cheerio. To test it, let’s use the JavaScript page on Wikipedia.
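A sketch of what that could look like is below, assuming Node 18+ so that fetch is available globally. The .infobox-title, .infobox-label, and .infobox-data selectors are assumptions based on Wikipedia’s current infobox markup and may need adjusting:

```typescript
import * as cheerio from 'cheerio';

async function scrapeJavaScriptPage() {
  // Fetch the raw HTML of the article.
  const response = await fetch('https://en.wikipedia.org/wiki/JavaScript');
  const html = await response.text();

  // Parse it with Cheerio so we can query it with CSS selectors.
  const $ = cheerio.load(html);

  // The infobox title holds the name of the language.
  const title = $('.infobox-title').text();

  // Find the infobox row labeled "Designed by" and read its value cell.
  const designedBy = $('.infobox-label')
    .filter((_, element) => $(element).text().trim() === 'Designed by')
    .next('.infobox-data')
    .text()
    .trim();

  return { title, designedBy };
}

scrapeJavaScriptPage().then(console.log);
```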
However, the crucial thing is that our code has to be tailored to the specific HTML structure of the page. For example, if Wikipedia changes the infobox-title class name to something else, our code won’t work anymore.
Unfortunately, Wikipedia is not necessarily consistent. For example, the page for Bash does not have the “Designed by” section. Instead, it has the “Original author” and “Developer” sections.
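One way to deal with such differences, sketched below, is a small helper that tries several possible labels and returns the first one that yields a value. The helper name and the label list are just illustrative:

```typescript
import type { CheerioAPI } from 'cheerio';

// Try each label in order and return the value of the first infobox row that matches.
function readAuthorField($: CheerioAPI, labels: string[]) {
  for (const label of labels) {
    const value = $('.infobox-label')
      .filter((_, element) => $(element).text().trim() === label)
      .next('.infobox-data')
      .text()
      .trim();
    if (value) {
      return value;
    }
  }
  return '';
}

// Usage: covers both the JavaScript and the Bash infobox layouts.
// readAuthorField($, ['Designed by', 'Original author', 'Developer']);
```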
Fetching multiple pages
We can use Cheerio to traverse multiple pages by following anchor (<a>) elements. For example, let’s fetch the information about all curly bracket languages.
To do that, we first need to fetch the page that contains the URLs of the pages we want to scrape.
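Below is a sketch of the whole flow. The category URL is an assumption (any page that links to the language articles would work), and the #mw-pages and infobox selectors are based on Wikipedia’s current markup:

```typescript
import * as cheerio from 'cheerio';

// Assumption: this category page links to the curly bracket language articles.
// Adjust the URL and selectors to match the pages you actually want to scrape.
const LIST_URL =
  'https://en.wikipedia.org/wiki/Category:Curly_bracket_programming_languages';

interface LanguageDetails {
  title: string;
  designedBy: string;
  firstAppearance: string;
  stableRelease: string;
}

// Fetch a page and return a Cheerio instance for it.
async function fetchPage(url: string) {
  const response = await fetch(url);
  return cheerio.load(await response.text());
}

// Read the value of an infobox row by its label, or an empty string if it is missing.
function readInfoboxField($: cheerio.CheerioAPI, label: string) {
  return $('.infobox-label')
    .filter((_, element) => $(element).text().trim() === label)
    .next('.infobox-data')
    .text()
    .trim();
}

// Scrape the details of a single language article.
async function scrapeLanguage(url: string): Promise<LanguageDetails> {
  const $ = await fetchPage(url);
  return {
    title: $('.infobox-title').text(),
    designedBy: readInfoboxField($, 'Designed by'),
    firstAppearance: readInfoboxField($, 'First appeared'),
    stableRelease: readInfoboxField($, 'Stable release'),
  };
}

async function scrapeAllLanguages() {
  const $ = await fetchPage(LIST_URL);

  // Wikipedia category pages list their member articles as links inside #mw-pages.
  const urls = $('#mw-pages a')
    .map((_, anchor) => new URL($(anchor).attr('href') ?? '', LIST_URL).href)
    .get();

  // Scrape the pages one by one to avoid flooding the server with requests.
  const languages: LanguageDetails[] = [];
  for (const url of urls) {
    languages.push(await scrapeLanguage(url));
  }
  console.log(languages);
}

scrapeAllLanguages();
```

Running it prints an array of objects similar to the (truncated) output below.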
```
[
  ...
  {
    ...
    firstAppearance: 'September 30, 2021; 3 years ago (2021-09-30)',
    stableRelease: '5.0.0.71\n / September 29, 2024; 4 months ago (2024-09-29)'
  },
  {
    title: 'B',
    designedBy: 'Ken Thompson',
    firstAppearance: '1969; 56 years ago (1969)[1]',
    stableRelease: ''
  },
  ...
]
```
It’s crucial to keep in mind that not every Wikipedia page follows the same HTML structure. For example, the first item on the list doesn’t have the data we’re looking for, and we’re not handling it gracefully.
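Building on the sketch above, one simple way to handle this more gracefully could be to drop entries that are missing the infobox title. The interface and function name here are just illustrative:

```typescript
interface LanguageDetails {
  title: string;
  designedBy: string;
  firstAppearance: string;
  stableRelease: string;
}

// Drop entries whose infobox title is missing, instead of keeping empty records.
function removeIncompleteEntries(languages: LanguageDetails[]) {
  return languages.filter((language) => language.title !== '');
}
```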
Summary
In this article, we’ve explored using the Cheerio library for web scraping. To do that, we first fetched the HTML contents of the page and then parsed them using Cheerio.
Cheerio is a great tool for web scraping in many cases, but it does have some limitations. The key drawback is that it relies on the full HTML content being available when we fetch the page. However, a lot of websites are interactive and depend on JavaScript to display the data on the screen. Since Cheerio doesn’t execute JavaScript, it can’t scrape content that is rendered on the client side. If we want to do that, we should use tools such as Playwright.
Cheerio is still a great choice for web scraping if we aim for simplicity and don’t need to deal with dynamic content.