This post will show you how to get data from a web page (also known as web scraping) with R and the rvest package. The analysis was performed to complement data obtained from the Spotify API: since that API did not provide all the information I was interested in, I decided to scrape the Bandcamp site.
The first step is to load the necessary packages: rvest for the web scraping and purrr for iterating.
Then, indicate the website of interest; in this case, Bandcamp.
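A minimal sketch of this setup (the variable name base_url is my own; the query format was inferred from the site's search bar):

```r
# rvest for web scraping, purrr for iterating over artists
library(rvest)
library(purrr)

# Bandcamp's search endpoint; appending an encoded query term
# reproduces what the site's search bar does
base_url <- "https://bandcamp.com/search?q="
```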
Here is the data frame I am going to use to query Bandcamp.
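The original data frame is not shown here; a hypothetical stand-in with the same shape (a single column of artist names) might look like this:

```r
# Hypothetical list of artists to look up on Bandcamp;
# in the actual analysis these came from the Spotify API
artists <- data.frame(
  artist = c("Radiohead", "Aphex Twin", "Boards of Canada"),
  stringsAsFactors = FALSE
)
```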
The next step is to create a function that performs the search on Bandcamp. A few additional tweaks were needed to make the function work: most of this part was worked out by playing with Bandcamp's search bar and noting how the URL of each search was built. Then, you need to inspect the web page to find the names of the sections you are interested in extracting. Finally, I did some data wrangling to clean the results and export them in a more homogeneous format.
The main rvest functions for web scraping are read_html, html_node and html_text. The first reads the HTML code of the indicated URL, the second extracts one node or section of the web page, and the third converts the extracted object into text.
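Putting these three functions together, a sketch of the search function could look like the following. The function name get_genre and the ".genre" selector are assumptions for illustration; the real selector has to be found by inspecting Bandcamp's current markup:

```r
library(rvest)

# Hypothetical scraper: searches Bandcamp for an artist and
# returns the genre tags found on the results page
get_genre <- function(artist) {
  # Build the search URL the way the site's search bar does
  url <- paste0("https://bandcamp.com/search?q=", URLencode(artist))

  page <- read_html(url)          # download and parse the HTML

  page %>%
    html_nodes(".genre") %>%      # extract the matching sections (selector assumed)
    html_text() %>%               # convert the nodes to text
    trimws()                      # light wrangling: strip whitespace
}
```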
Use map to apply the function to each artist, and wrap it with possibly as a tryCatch: if no genre is found for a given artist, it returns the message “Error in file” instead of stopping the loop.
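A self-contained sketch of this pattern, using a toy stand-in for the scraping function so it runs without network access:

```r
library(purrr)

# Toy stand-in for the Bandcamp scraper described above
get_genre <- function(artist) {
  if (artist == "unknown artist") stop("no genre found")
  c("electronic", "ambient")
}

# possibly() plays the role of tryCatch(): on error it returns
# the fallback value instead of halting the whole map() loop
safe_get_genre <- possibly(get_genre, otherwise = "Error in file")

genres_list <- map(c("Aphex Twin", "unknown artist"), safe_get_genre)
# The failed lookup yields "Error in file" rather than an error
```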
Then add the genres as a new list-column to the previous data frame, along with a counter (llist) indicating how many genres were associated with each artist.
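Assuming the results live in a list called genres_list, this step might look like the following (the column names other than llist are illustrative):

```r
library(purrr)

# Hypothetical data mirroring the previous steps
artists <- data.frame(artist = c("A", "B"), stringsAsFactors = FALSE)
genres_list <- list(c("rock", "indie"), "Error in file")

# Attach the genres as a list-column and count genres per artist
artists$genres <- genres_list
artists$llist  <- map_int(genres_list, length)
```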
Then eliminate the entries without a genre and unnest the genres list to obtain the final data frame.
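A sketch of this final cleanup with dplyr and tidyr (these packages are my assumption; the post does not name the tools used for this step):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical input mirroring the previous step
df <- tibble(
  artist = c("A", "B"),
  genres = list(c("rock", "indie"), "Error in file")
)

# Drop artists whose lookup failed, then unnest so each genre
# gets its own row in the final data frame
final_df <- df %>%
  filter(map_lgl(genres, ~ !identical(.x, "Error in file"))) %>%
  unnest(genres)
```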
A snapshot of the result:
Example of the data obtained after web scraping the Bandcamp site.