For my last post, I used a Python script to scrape data from a website. I used Python simply because I'm used to doing web scraping in Python. But I had heard that R has also gotten better at scraping, so I rewrote my script in R.
The package rvest is the R equivalent of BeautifulSoup in Python. It was created by Hadley Wickham and has been available since 2014. Under the hood it uses the packages 'httr' and 'xml2' to easily download and manipulate HTML content.
You can use rvest in the following way:
[code language=”r” wraplines=”true” collapse=”false”]
# install and load the package
install.packages("rvest")
library(rvest)

url <- "http://live.ultimate.dk/desktop/front/?eventid=2021049&language=nl"
data <- read_html(url)

# select the results table and its rows
resultsTable <- data %>% html_nodes("table.leaderboard_table_results")
rows <- resultsTable %>% html_nodes("tr")

for (i in seq_along(rows)) {
  tds <- rows[i] %>% html_nodes("td")
  print(tds[4] %>% html_text())
  print(tds[10] %>% html_text())
}
[/code]
A couple of things are good to know:
- Get the website content with read_html(&lt;url&gt;): this returns an xml_document.
- Select content from certain nodes with html_nodes("&lt;element&gt;.&lt;classname&gt;").
- Get attribute content from a node with html_attr("&lt;name&gt;").
- The pipe operator %&gt;% can be used to chain operations. Use it; it's very convenient.
If you are used to BeautifulSoup, rvest is easy to learn because it has a similar syntax.
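To make the comparison concrete, here is a minimal BeautifulSoup sketch of the same steps. It parses an inline HTML snippet instead of the live URL, and the table contents (names, times, links) are made-up sample data; only the class name is reused from the R example. Note that Python indexes from 0, whereas R indexes from 1.

```python
from bs4 import BeautifulSoup

# made-up sample data standing in for the downloaded page
html = """
<table class="leaderboard_table_results">
  <tr><td><a href="/runner/1">Alice</a></td><td>01:23:45</td></tr>
  <tr><td><a href="/runner/2">Bob</a></td><td>01:25:10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select nodes by element and class, like html_nodes("table.leaderboard_table_results")
table = soup.select_one("table.leaderboard_table_results")
for row in table.select("tr"):
    tds = row.select("td")
    # text content of a node, like html_text()
    print(tds[0].get_text(strip=True), tds[1].get_text(strip=True))
    # attribute content of a node, like html_attr("href")
    print(tds[0].a["href"])
```

For a real page you would replace the inline snippet with the response body of an HTTP request, but the selection logic stays the same.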
You can find the R and Python scripts that I wrote for web scraping below. I'm wondering which language you prefer for web scraping. Please let me know in a comment below.
3 thoughts on “Web scraping: R vs python”
Nice. Let me just add a ruby example (my favourite 😉 ), using the nokogiri library:
require 'nokogiri'
require 'open-uri'

url = "http://live.ultimate.dk/desktop/front/?eventid=2021049&language=nl"
data = Nokogiri::HTML(open(url))

data.css("table.leaderboard_table_results tr").each do |row|
  tds = row.css("td")
  p tds[4].text
  p tds[10].text
end
nice Maarten! I should learn Ruby:-)
Scraping the web with Python is much easier.
Thanks for the tips.
Comments are closed.