PPP Energy Loans – Scraping HTML Tables in R

Inspired by a tweet from @WAR527, I thought I would build a simple web scraper to pull a table of energy PPP loans into R and generate an insight off of it.


Scraping the Data

First thing we do is look at the website and see the structure. Given that it is static HTML this should be fairly simple. There is also only one table, which makes our job even easier. Still, there are a few packages we need.


Packages

  • rvest – Helps scrape information from webpages
  • httr – Wrapper for curl package, largely to work with modern day API calls
  • dplyr – Data manipulation in R
  • zoo – Working with indexed data in R
  • highcharter – Awesome plotting library

Download Data

Scraping the site is actually quite easy. What we do is read the website structure, find the table we want (the first and only one), and then extract that table into a dataframe.

library(rvest)
library(httr)
library(dplyr)
library(zoo)
library(highcharter)

url1 <- 'https://pestakeholder.org/ppp-loans-flow-to-energy-companies-backed-by-deep-pocketed-private-equity/'
pg <- (read_html(url1) %>% html_nodes('table'))[[1]] %>% html_table(header=T)

And that’s it, we have our data in a dataframe (or basically a table like you would see in excel).


Data Investigation/Cleanup

Before I do any analysis, I need to look at the table data and see what form it is in, or if there are any missing values.

head(pg)

The data analyst in me notices a couple of things that could make things difficult. The appearance of blank spaces in the Loan Size column, and the multiple investors in the Active Investors column. I’ll deal with each separately


Loan Size

First off, I am going to plug those blanks in the Loan Size column, and then I will just take the midpoint of each one as my estimate of total loan amount. The na.locf function in zoo essentially accomplishes what a fill-down will in Excel, and will plug the Loan Size column with the last amount that occurred.

pg$`PPP Loan   Size`[pg$`PPP Loan   Size` == ''] <- NA
pg$`PPP Loan   Size` <- na.locf(pg$`PPP Loan   Size`)

pg$loan[pg$`PPP Loan   Size` == '$5-10 million'] <- 7500000
pg$loan[pg$`PPP Loan   Size` == '$2-5 million'] <- 3500000
pg$loan[pg$`PPP Loan   Size` == '$1-2 million'] <- 1500000
pg$loan[pg$`PPP Loan   Size` == '$350,000-1 million'] <- 675000

Active Investors

I’m also going to extract the individual operators, which are delimited by a comma. I will then make a gross assumption that they are all equally invested in each company (likely incorrect but probably close enough). You will also see the rep function called, which just means repeat the item a specific number of times. At the end, the dataframe will be reconstructed so that each company/investor combination is a row.

pg$`Active   Investors` <- gsub("\\s+", " ", pg$`Active   Investors`)

#Create a list for each row split by a comma
s <- strsplit(pg$`Active   Investors`, split = ",")

#Find total items in each list and put that count into each row of original dataframe
pg$investors <- sapply(s, length)
pg$investors[pg$investors == 0] <- 1

#Split loan amount between investors
pg$loan <- pg$loan/pg$investors

#rename columns to make it easier to work with
names(pg) <- c('loanSize', 'company', 'location', 'activeInvestors', 'loan', 'investors')

#Recreate the dataframe to have one row per investor
pg <- data.frame(company = rep(pg$company, sapply(s, length)), 
           location = rep(pg$location, sapply(s, length)),
           loan = rep(pg$loan, sapply(s, length)), investor = unlist(s))

#Remove trailing and beginning white spaces
pg$investor <- trimws(pg$investor, which = 'both')

Now are data is at a point to do some quick insights.


Graphs

I’ll now do a look at which areas were impacted the most, and then further dig into each investor.


Geography

I’ll save the charting code, as it’s quite extensive, but will show the process to look at each individual location or state.

locationList <- pg %>% group_by(location) %>% 
    summarise(total = sum(loan)) %>% ungroup() %>%
    arrange(desc(total))

#Get last two characters of each location (which is the state)
locationList$state <- substr(locationList$location, nchar(locationList$location)-1, 
                             nchar(locationList$location))

#Look at total PPE loans by state
state1 <- locationList %>% group_by(state) %>% 
    summarise(total = sum(total)) %>% ungroup() %>%
    arrange(desc(total)) 

Figure 1: PPE Loans by Location, $US Millions

Ouch, Texas is hit pretty hard with around $100 million total. Though you can click on each bar to see the pie chart, I’ll at least show Texas.

Figure 2: PPE Loans by City in Texas, $US Millions

Houston and the surrounding areas represent roughly 45%.


By Investor

Which PE backers were most exposed to PPE-taking companies? Similarly, we can just group by investor and sum it up.

investorList <- pg %>% group_by(investor) %>% summarise(total = sum(loan)) %>% ungroup() %>%
  arrange(desc(total))

I’ll just look at the top 15.

Figure 3: PPE Loans by Investor, $US Millions

Turnbridge and EIG lead the way, with some other well known investors up there as well. Were these guys opportunistic in instructing these companies to take down the loans or is it more of a reflection of asset quality? Not for me to opine on. Would be interesting to see what share of AUM these companies represent in the portfolio.

Summary


That’s it for this one. Hopefully you learned something. If you want to follow along as we do more coding, check out our coding series.

And give us a follow while you’re at it.


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: