Web Scraping with Playwright

The internet has a ton of valuable data, but it’s not always easy to grab in a usable format. We can use web scraping to pull data from websites that don’t make it easy to download. This gives us a way to analyze our competitors, monitor prices, track trends, and more.

In the previous article, we learned how to do web scraping using the Cheerio library. It has some limitations, though, because it relies on the entire HTML content being available when we fetch the page. Many applications are interactive and depend on JavaScript to display the data on the screen. To scrape them, we need a different approach.

Playwright is a popular tool commonly used for end-to-end tests. However, its use is not restricted to testing.

If you want to know more about testing with Playwright, check out JavaScript testing #17. Introduction to End-to-End testing with Playwright

Playwright allows us to scrape the data from websites by opening them in an actual web browser. Thanks to this, we are not limited to scraping static content.

Running Playwright for Web Scraping

First, let’s install and set up Playwright.

Let’s use Playwright to scrape a list of videos on YouTube for a given search query. The first step is to launch a browser and create a new, blank page.

scrape.ts
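A minimal sketch of this step could look as follows. It assumes Playwright was added to the project with npm install playwright and that the browsers were downloaded with npx playwright install. The openBlankPage name and the small BrowserLauncher, BrowserHandle, and PageHandle interfaces are hypothetical: they list only the calls used here, so the sketch can be exercised with a stand-in object, and Playwright's chromium object is expected to fit them.

```typescript
// Hypothetical minimal types: only the calls this sketch relies on.
// Playwright's `chromium` launcher is expected to satisfy BrowserLauncher.
interface PageHandle {
  url(): string;
}

interface BrowserHandle {
  newPage(): Promise<PageHandle>;
  close(): Promise<void>;
}

interface BrowserLauncher {
  launch(options?: { headless?: boolean }): Promise<BrowserHandle>;
}

// Launches a visible browser window and opens a new, blank page in it.
async function openBlankPage(
  launcher: BrowserLauncher,
): Promise<{ browser: BrowserHandle; page: PageHandle }> {
  // headless: false shows the browser window, which helps while developing.
  const browser = await launcher.launch({ headless: false });
  const page = await browser.newPage();
  return { browser, page };
}

// Possible usage with Playwright itself (not executed in this sketch):
//   import { chromium } from 'playwright';
//   const { browser, page } = await openBlankPage(chromium);
```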

Above, we avoid running the browser in headless mode for testing purposes. This way, we can see what the browser is doing.

Interacting with the page

Now, we can interact with the page and navigate to YouTube.

openYoutube.ts
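As a sketch, navigating comes down to a single goto call. The openYoutube name is taken from the file name above, while the NavigablePage interface is a hypothetical stand-in that Playwright's Page is expected to satisfy.

```typescript
// Hypothetical minimal type; Playwright's Page is expected to satisfy it.
interface NavigablePage {
  goto(url: string): Promise<unknown>;
}

const YOUTUBE_URL = 'https://www.youtube.com';

// Navigates the given page to the YouTube home page.
async function openYoutube(page: NavigablePage): Promise<void> {
  // By default, goto resolves once the page has fired its load event.
  await page.goto(YOUTUBE_URL);
}
```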

scrape.ts

The first thing we notice is that YouTube opens a popup asking us about the cookies.

To close it, we can use the Locators API built into Playwright, which allows us to find elements on the page. To understand what element we’re looking for, we can navigate to the Developer Tools in the browser that Playwright opens. When we do that, we notice that the reject button looks like this:

The above is a simplified representation of the button without the included attributes.

We can use the above knowledge to find the button and click it to close the popup.

openYoutube.ts
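One way to sketch this uses getByRole, a locator method that finds elements by their ARIA role and accessible name. The /reject/i pattern is an assumption about the button's accessible name, not something taken from YouTube's markup, and the text differs by language and region. The rejectCookies name and both interfaces are hypothetical.

```typescript
// Hypothetical minimal types; Playwright's Page and Locator are expected to fit.
interface ClickableLocator {
  click(): Promise<void>;
}

interface PageWithRoles {
  getByRole(role: 'button', options?: { name?: string | RegExp }): ClickableLocator;
}

// Assumption: the reject button's accessible name contains the word "Reject".
// Verify the real name in Developer Tools before relying on it.
const REJECT_BUTTON_NAME = /reject/i;

// Finds the reject button in the cookie popup and clicks it.
async function rejectCookies(page: PageWithRoles): Promise<void> {
  // click() waits for the button to appear and become actionable first.
  await page.getByRole('button', { name: REJECT_BUTTON_NAME }).click();
}
```

If more than one button matches the pattern, Playwright's strict mode makes click throw, so the pattern may need to be narrowed.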

Similarly, we can use the Locators API to find videos using a given query.

searchForVideos.ts
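A possible sketch of the search step fills the search input and presses Enter. The getByPlaceholder, fill, and press methods are part of the Locators API, but the "Search" placeholder text is an assumption about YouTube's markup, and the interfaces are hypothetical stand-ins. The searchForVideos name comes from the file name above.

```typescript
// Hypothetical minimal types; Playwright's Page and Locator are expected to fit.
interface InputLocator {
  fill(value: string): Promise<void>;
  press(key: string): Promise<void>;
}

interface SearchablePage {
  getByPlaceholder(text: string | RegExp): InputLocator;
}

// Assumption: the search input is identified by the placeholder "Search".
const SEARCH_PLACEHOLDER = 'Search';

// Types the query into the search input and submits it with Enter.
async function searchForVideos(page: SearchablePage, query: string): Promise<void> {
  const searchInput = page.getByPlaceholder(SEARCH_PLACEHOLDER);
  await searchInput.fill(query);
  await searchInput.press('Enter');
}
```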

scrape.ts

Scraping the data

To get the data of each video, we first need to notice that each one is represented by an anchor element.

When we perform an action on a single element, like clicking it, Playwright automatically waits for that element to appear and become actionable. However, when we want to work on multiple elements, Playwright doesn’t automatically wait for any of them to appear. Because of that, we should explicitly wait for the elements to appear.

getVideosData.ts
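Waiting for the first match is one way to wait explicitly. The sketch below relies on the locator's first and waitFor methods; waitFor waits for the element to become visible by default. The a#video-title selector is an assumption about how YouTube marks up its results, and the waitForVideos name and the interfaces are hypothetical.

```typescript
// Hypothetical minimal types; Playwright's Page and Locator are expected to fit.
interface WaitableLocator {
  first(): WaitableLocator;
  waitFor(): Promise<void>;
}

interface PageWithLocators {
  locator(selector: string): WaitableLocator;
}

// Assumption: each search result exposes its link as an anchor with this id.
const VIDEO_LINK_SELECTOR = 'a#video-title';

// Returns a locator for all video links once at least one of them is visible.
async function waitForVideos(page: PageWithLocators): Promise<WaitableLocator> {
  const videoLinks = page.locator(VIDEO_LINK_SELECTOR);
  // Creating a locator does not wait; waitFor on the first match does.
  await videoLinks.first().waitFor();
  return videoLinks;
}
```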

Now, to get the data about each video, we need a method that evaluates code inside the page, such as evaluateAll. It allows us to run a piece of code in the browser instead of Node.js to access the attributes of each element.

getVideosData.ts
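The sketch below assumes the locator's evaluateAll method is used, which passes all matching elements to a callback that runs in the browser. EvaluableLocator and AnchorLike are simplified, hypothetical stand-ins for Playwright's Locator and the DOM anchor element, and getRawVideos is a hypothetical name.

```typescript
// Simplified stand-ins (hypothetical), not Playwright's real typings.
interface AnchorLike {
  textContent: string | null;
  href: string;
}

interface RawVideo {
  title: string | null;
  href: string;
}

interface EvaluableLocator {
  evaluateAll(pageFunction: (elements: AnchorLike[]) => RawVideo[]): Promise<RawVideo[]>;
}

// Reads the text and the href of every matched anchor inside the page.
async function getRawVideos(videoLinks: EvaluableLocator): Promise<RawVideo[]> {
  // The callback is serialized and executed in the browser, so it must not
  // reference variables from the surrounding Node.js scope.
  return videoLinks.evaluateAll((anchors) =>
    anchors.map((anchor) => ({ title: anchor.textContent, href: anchor.href })),
  );
}

// With Playwright, the same callback can be passed to a real locator
// (the selector is an assumption; not executed in this sketch):
//   const raw = await page.locator('a#video-title').evaluateAll(
//     (anchors: HTMLAnchorElement[]) =>
//       anchors.map((anchor) => ({ title: anchor.textContent, href: anchor.href })),
//   );
```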

The last step is to create an array that contains the title and URL of each video.

getVideosData.ts
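Building that array can be sketched as a plain function over the values read from the page. The RawVideo and Video shapes and the toVideos name are hypothetical. Trimming titles and skipping entries without a title or a link is a design choice of this sketch, not something the steps above require.

```typescript
// Hypothetical shapes: the raw values read from the page and the cleaned result.
interface RawVideo {
  title: string | null;
  href: string;
}

interface Video {
  title: string;
  url: string;
}

// Converts raw anchor data into an array of { title, url } objects.
function toVideos(rawVideos: RawVideo[]): Video[] {
  const videos: Video[] = [];
  for (const { title, href } of rawVideos) {
    const trimmedTitle = title?.trim() ?? '';
    // Skip anchors that carry no usable title or link.
    if (trimmedTitle === '' || href === '') {
      continue;
    }
    videos.push({ title: trimmedTitle, url: href });
  }
  return videos;
}
```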

scrape.ts

Navigating to multiple pages

We can use Playwright to open multiple different pages concurrently. Let’s use that to get the details of each video. First, let’s modify our function to handle various pages on YouTube.

openYoutube.ts
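A sketch of the modified function might accept the URL to open and default to the home page. Because browser.newPage() creates each page in a fresh browser context, cookies rejected on one page are not shared with the next, so the popup may appear again and is handled on every call. The /reject/i pattern is an assumption about the button's accessible name, and the interfaces are hypothetical stand-ins.

```typescript
// Hypothetical minimal types; Playwright's Page and Locator are expected to fit.
interface ClickableLocator {
  click(): Promise<void>;
}

interface YoutubePage {
  goto(url: string): Promise<unknown>;
  getByRole(role: 'button', options?: { name?: string | RegExp }): ClickableLocator;
}

const YOUTUBE_URL = 'https://www.youtube.com';

// Opens any YouTube URL (the home page by default) and closes the cookie popup.
async function openYoutube(page: YoutubePage, url: string = YOUTUBE_URL): Promise<void> {
  await page.goto(url);
  // Assumption: the reject button's accessible name contains "Reject".
  await page.getByRole('button', { name: /reject/i }).click();
}
```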

So far, we have the information about each video in an array. It contains the title and URL of each video.

scrape.ts

Let’s create a function that navigates to each video to get more detailed information. What’s crucial is that we can do that in parallel by combining Promise.all and the map function.

getDetailedVideosData.ts
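The parallel part can be sketched on its own, assuming Promise.all is combined with map. The getDetailedVideosData name comes from the file name above. The VideoDetails shape with a single description field and the getDetails callback are hypothetical placeholders for whatever is read from each video page.

```typescript
interface Video {
  title: string;
  url: string;
}

// Hypothetical shape of the extra data collected per video.
interface VideoDetails {
  description: string;
}

type DetailedVideo = Video & VideoDetails;

// Fetches the details of all videos concurrently and merges them with the basic data.
async function getDetailedVideosData(
  videos: Video[],
  getDetails: (video: Video) => Promise<VideoDetails>,
): Promise<DetailedVideo[]> {
  // map starts every request right away; Promise.all waits for all of them
  // and keeps the results in the same order as the input array.
  return Promise.all(
    videos.map(async (video) => {
      const details = await getDetails(video);
      return { ...video, ...details };
    }),
  );
}
```

Since every request starts at once, a long list of videos opens many pages at the same time. Promise.all also rejects as soon as any single request fails.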

We open a new page for each video and use the Locators API to gather additional data.

getDetailedVideosData.ts
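Handling a single video could be sketched as below: open a page, navigate, read a value through a locator, and close the page again. Which details to collect is not specified, so the description field and the #description selector are assumptions. The getVideoDetails name and the interfaces are hypothetical, and the cookie popup is not handled here.

```typescript
interface Video {
  title: string;
  url: string;
}

interface VideoDetails {
  description: string;
}

// Hypothetical minimal types; Playwright's Browser, Page, and Locator are expected to fit.
interface TextLocator {
  first(): TextLocator;
  innerText(): Promise<string>;
}

interface VideoPage {
  goto(url: string): Promise<unknown>;
  locator(selector: string): TextLocator;
  close(): Promise<void>;
}

interface PageOpener {
  newPage(): Promise<VideoPage>;
}

// Assumption: the description is rendered inside an element with this id.
const DESCRIPTION_SELECTOR = '#description';

// Opens a dedicated page for one video and reads its description.
async function getVideoDetails(browser: PageOpener, video: Video): Promise<VideoDetails> {
  const page = await browser.newPage();
  try {
    await page.goto(video.url);
    // innerText() waits for the element to be present before reading it.
    const description = await page.locator(DESCRIPTION_SELECTOR).first().innerText();
    return { description: description.trim() };
  } finally {
    // Close the page even if navigation or the lookup fails.
    await page.close();
  }
}
```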

We must also remember to close the browser when it is not needed anymore.

scrape.ts
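One way to make sure the browser is always closed is a try/finally block, sketched below. The runAndClose helper and the ClosableBrowser interface are hypothetical; Playwright's Browser is expected to satisfy the interface through its close method.

```typescript
// Hypothetical minimal type; Playwright's Browser is expected to satisfy it.
interface ClosableBrowser {
  close(): Promise<void>;
}

// Runs the scraping task and closes the browser afterwards, whatever the outcome.
async function runAndClose(
  browser: ClosableBrowser,
  task: () => Promise<void>,
): Promise<void> {
  try {
    await task();
  } finally {
    // Runs on success and on failure, so no browser process is left behind.
    await browser.close();
  }
}
```

An error thrown by the task still reaches the caller; the finally block only guarantees that close runs first.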

Summary

In this article, we used the Playwright tool to scrape the data from YouTube. While this might be more resource-heavy than using libraries such as Cheerio, it has a lot of benefits. By operating within a real browser environment, we’re not limited to scraping static websites and can handle dynamic, JavaScript-driven content. Overall, using Playwright is a solid approach for scraping both static and dynamic websites, allowing us to analyze the content of various applications.
