Automate Your Download of Baker Hughes Rig Count Data (in R)

As we turn to expanding the coding series into things like web scraping, I thought it’d be a good opportunity to show how to download Baker Hughes Weekly Rig Count Data automatically in R and to generate useful information from it.



This one is going to be a bit more intensive. For some reason, Baker Hughes publishes the rig count file in .xlsb format, which does not play well with most R import tools. One of the packages I show (RODBC) works under a Windows install, and I’m fairly certain it won’t work on other systems, so you would likely need to find another way there.

Before installing the packages, make sure your Windows ODBC driver for Access is up to date. I went with the x64 build as I’m working off a 64-bit version of R (and you likely are as well).


  1. rvest – Web Scraping Package from tidyverse
  2. httr – Package for working with HTTP requests
  3. downloader – Helper for downloading files
  4. RODBC – Working with databases in R
  5. dplyr – Data manipulation in R from tidyverse
  6. lubridate – Working with date data from the tidyverse
  7. tidyr – Make data “tidy” in R, part of the tidyverse
  8. glue – String manipulation in the tidyverse
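If you don’t have these yet, a one-time install sketch (CRAN package names as listed above):

```r
# One-time setup: install anything from the list that is missing
pkgs <- c("rvest", "httr", "downloader", "RODBC",
          "dplyr", "lubridate", "tidyr", "glue")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```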

Download Data

Downloading the data isn’t too difficult; it’s pulling it into an R dataframe that is more complex. Luckily, the packages I installed make it easier. With RODBC we can connect to the .xlsb file, though it does take a couple of minutes to bring into R as it is a pretty big file.

First, I locate the file I want. The file’s name (and therefore its link) will likely change over time, so I need to work through this step to identify the Excel file I’m after each week.

Find the file

#Load Packages
library(rvest)
library(httr)
library(downloader)
library(RODBC)
library(dplyr)
library(lubridate)
library(tidyr)
library(glue)

#Baker Hughes Site
url1 <- ''

#Read the html page info into R and search for links
df1 <- read_html(url1)
df1 <- df1 %>% html_nodes('a') %>% html_attrs()
dfy <- data.frame()

#Run a quick loop to add the link and name of the file for each row
i <- 1
while(i <= length(df1)){
  dfx <- data.frame(href = df1[[i]][1],
                    title = df1[[i]][3])
  dfy <- rbind(dfy, dfx)
  i <- i + 1
}

#Filter to the Pivot file; which is the one we are after
dfy <- dfy %>% filter(grepl('Pivot', title))
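As an aside, the while loop above can be written without explicit indexing. A small sketch, assuming every link node carries both href and title attributes (html_attrs() returns one named character vector per node), pulling them by name rather than by position:

```r
# Build the link table by attribute name instead of position;
# safer if the attribute order on the page ever changes.
extract_links <- function(attr_list) {
  data.frame(href  = vapply(attr_list, function(x) x[['href']], character(1)),
             title = vapply(attr_list, function(x) x[['title']], character(1)))
}
```

`dfy <- extract_links(df1)` would then replace the loop, followed by the same `grepl('Pivot', title)` filter.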


Now I can download the data to my computer and then pull it into an R dataframe. I will also delete the file after.

#Download file to computer
downloader::download(dfy$href, destfile = dfy$title, mode = 'wb')

#Make sure you install driver for Microsoft Access

#Find our directory
sysLoc <- getwd()

#Get the file name
wb <- glue::glue("{sysLoc}/{dfy$title}") 

#Use RODBC to connect and fetch the data from the Master Data tab
con2 <- odbcConnectExcel2007(wb)
data <- sqlFetch(con2, "Master Data")

#Close my connection
odbcClose(con2)

#Delete file
file.remove(wb)
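If you’re not on Windows, RODBC likely won’t be an option. One possible alternative (an assumption on my part; I haven’t used it in this post) is the readxlsb package, which parses .xlsb files directly without an ODBC driver:

```r
# Cross-platform sketch (assumes install.packages('readxlsb') has been run);
# readxlsb reads .xlsb directly, with the sheet name passed as the range
data <- readxlsb::read_xlsb(wb, range = "Master Data")
```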

Now I have a dataframe called data that I can work with.


It’s always worth investigating the data when you get the chance. You can use some combination of:

  1. str(data) – Column data types
  2. summary(data) – Numerical information for each column (min, max, mean, etc.)
  3. head(data) – First few rows of the table


First, let’s just look at how total rig count has moved by country over time.

tst1 <- data %>% filter(Location == 'Land') %>%
  group_by(Country, PublishDate) %>% 
  summarise(count = n()) %>% ungroup() %>%
  mutate(PublishDate = as.Date(PublishDate))

Figure 1: Baker Hughes Rig Count by Country
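The plotting code isn’t shown in this post; a minimal ggplot2 sketch (ggplot2 is an extra package here, not one of the eight listed above) that would produce a chart like Figure 1 from the tst1 dataframe built above:

```r
library(ggplot2)

# Line chart of land rig count over time, one line per country;
# expects the tst1 dataframe from above (Country, PublishDate, count)
plot_rig_count <- function(tst1) {
  ggplot(tst1, aes(x = PublishDate, y = count, colour = Country)) +
    geom_line() +
    labs(x = NULL, y = 'Rig Count',
         title = 'Baker Hughes Rig Count by Country')
}
```

Calling `plot_rig_count(tst1)` then draws the chart.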

A little bit insane that the rig count is roughly 10-20% of what it was in the early 2010s.

Rig Type

I will also look to see how rig types have moved over time.

tst1 <- data %>% filter(Location == 'Land') %>%
  group_by(Trajectory, PublishDate) %>% 
  summarise(count = n()) %>% ungroup() %>%
  mutate(PublishDate = as.Date(PublishDate))

Figure 2: Baker Hughes Rig Count by Well Type

The count hasn’t moved much in recent months as we’ve settled at a price below where most operators can generate sufficient corporate-level returns. You’d probably need a sustained move into the $50s before that starts occurring.

Basin Level

I will also look at the basin level.

tst1 <- data %>% filter(Location == 'Land') %>%
  group_by(Basin, PublishDate) %>% 
  summarise(count = n()) %>% ungroup() %>%
  mutate(PublishDate = as.Date(PublishDate))

Figure 3: Baker Hughes Rig Count by Basin

Interesting, but I’m going to look at Year-Over-Year change instead, and this time only focus on the horizontals in the US.

This code is a bit more complicated, but basically I am finding the last week reported in 2020, grabbing only 2019 and 2020 data from this same week, and then comparing the values between 2019 and 2020 by Basin.

tst1 <- data %>% filter(Location == 'Land') %>% 
  mutate(Date = as.Date(PublishDate)) %>% #Change date format
  filter(year(Date) == 2020) %>% select(Week) %>% #Filter to year 2020 and then select Week Column
  filter(Week == max(Week)) %>% distinct() %>% #Get maximum week in 2020
  left_join(data %>% filter(Location == 'Land')) %>% #Rejoin our data file
  mutate(Date = as.Date(PublishDate)) %>% #Change date format
  filter(year(Date)==2019|year(Date)==2020) %>% #Get Last two years
  filter(Trajectory == 'Horizontal') %>% #Only horizontals
  filter(Country == 'UNITED STATES') %>% #US Baby
  group_by(Year = year(Date), Basin) %>% #Group by the Year and Basin
  summarise(count = n()) %>% ungroup() %>% #Get total counts of rigs
  spread(Year, count) %>% arrange(desc(`2019`)) %>% #Pivot the table wider to a 2019 and 2020 column
  mutate(`2020` = replace(`2020`, is.na(`2020`), 0), 
         `2019` = replace(`2019`, is.na(`2019`), 0)) %>% #Replace NAs with 0
  mutate(Change = `2020` - `2019`) #Find difference

Figure 4: YOY Horizontal Rig Count Change by Basin
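As with Figure 1, a hedged ggplot2 sketch of what a Figure 4 style chart could look like, using the YOY tst1 dataframe built above (Basin and Change columns):

```r
library(ggplot2)

# Horizontal bar chart of YOY change by basin, sorted by size of change;
# expects the YOY tst1 dataframe from above (Basin, `2019`, `2020`, Change)
plot_yoy_change <- function(tst1) {
  ggplot(tst1, aes(x = reorder(Basin, Change), y = Change)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, y = 'YOY Change in Horizontal Rig Count')
}
```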

I like this view better as it really highlights just how each basin’s rig count has changed over the last year.


That’s it for this one. Once you get more comfortable coding and manipulating data, you can automatically download this data and build various scripts that work in a quicker, more dynamic fashion.
