Inspired by a tweet from @WAR527, I thought I would build a simple web scraper to pull a table of energy PPP loans into R and generate an insight off of it.
Scraping the Data
First thing we do is look at the website and see the structure. Given that it is static HTML this should be fairly simple. There is also only one table, which makes our job even easier. Still, there are a few packages we need.
- rvest – Helps scrape information from webpages
- httr – Wrapper for curl package, largely to work with modern day API calls
- dplyr – Data manipulation in R
- zoo – Working with indexed data in R
- highcharter – Awesome plotting library
Scraping the site is actually quite easy. What we do is read the website structure, find the table we want (the first and only one), and then extract that table into a dataframe.
library(rvest) library(httr) library(dplyr) library(zoo) library(highcharter) url1 <- 'https://pestakeholder.org/ppp-loans-flow-to-energy-companies-backed-by-deep-pocketed-private-equity/' pg <- (read_html(url1) %>% html_nodes('table'))[] %>% html_table(header=T)
And that’s it, we have our data in a dataframe (or basically a table like you would see in excel).
Before I do any analysis, I need to look at the table data and see what form it is in, or if there are any missing values.
The data analyst in me notices a couple of things that could make things difficult. The appearance of blank spaces in the Loan Size column, and the multiple investors in the Active Investors column. I’ll deal with each separately
First off, I am going to plug those blanks in the Loan Size column, and then I will just take the midpoint of each one as my estimate of total loan amount. The na.locf function in zoo essentially accomplishes what a fill-down will in Excel, and will plug the Loan Size column with the last amount that occurred.
pg$`PPP Loan Size`[pg$`PPP Loan Size` == ''] <- NA pg$`PPP Loan Size` <- na.locf(pg$`PPP Loan Size`) pg$loan[pg$`PPP Loan Size` == '$5-10 million'] <- 7500000 pg$loan[pg$`PPP Loan Size` == '$2-5 million'] <- 3500000 pg$loan[pg$`PPP Loan Size` == '$1-2 million'] <- 1500000 pg$loan[pg$`PPP Loan Size` == '$350,000-1 million'] <- 675000
I’m also going to extract the individual operators, which are delimited by a comma. I will then make a gross assumption that they are all equally invested in each company (likely incorrect but probably close enough). You will also see the rep function called, which just means repeat the item a specific number of times. At the end, the dataframe will be reconstructed so that each company/investor combination is a row.
pg$`Active Investors` <- gsub("\\s+", " ", pg$`Active Investors`) #Create a list for each row split by a comma s <- strsplit(pg$`Active Investors`, split = ",") #Find total items in each list and put that count into each row of original dataframe pg$investors <- sapply(s, length) pg$investors[pg$investors == 0] <- 1 #Split loan amount between investors pg$loan <- pg$loan/pg$investors #rename columns to make it easier to work with names(pg) <- c('loanSize', 'company', 'location', 'activeInvestors', 'loan', 'investors') #Recreate the dataframe to have one row per investor pg <- data.frame(company = rep(pg$company, sapply(s, length)), location = rep(pg$location, sapply(s, length)), loan = rep(pg$loan, sapply(s, length)), investor = unlist(s)) #Remove trailing and beginning white spaces pg$investor <- trimws(pg$investor, which = 'both')
Now are data is at a point to do some quick insights.
I’ll now do a look at which areas were impacted the most, and then further dig into each investor.
I’ll save the charting code, as it’s quite extensive, but will show the process to look at each individual location or state.
locationList <- pg %>% group_by(location) %>% summarise(total = sum(loan)) %>% ungroup() %>% arrange(desc(total)) #Get last two characters of each location (which is the state) locationList$state <- substr(locationList$location, nchar(locationList$location)-1, nchar(locationList$location)) #Look at total PPE loans by state state1 <- locationList %>% group_by(state) %>% summarise(total = sum(total)) %>% ungroup() %>% arrange(desc(total))
Figure 1: PPE Loans by Location, $US Millions
Ouch, Texas is hit pretty hard with around $100 million total. Though you can click on each bar to see the pie chart, I’ll at least show Texas.
Figure 2: PPE Loans by City in Texas, $US Millions
Houston and the surrounding areas represent roughly 45%.
Which PE backers were most exposed to PPE-taking companies? Similarly, we can just group by investor and sum it up.
investorList <- pg %>% group_by(investor) %>% summarise(total = sum(loan)) %>% ungroup() %>% arrange(desc(total))
I’ll just look at the top 15.
Figure 3: PPE Loans by Investor, $US Millions
Turnbridge and EIG lead the way, with some other well known investors up there as well. Were these guys opportunistic in instructing these companies to take down the loans or is it more of a reflection of asset quality? Not for me to opine on. Would be interesting to see what share of AUM these companies represent in the portfolio.
That’s it for this one. Hopefully you learned something. If you want to follow along as we do more coding, check out our coding series.
As annual proxy filings start making their way into the stratosphere, I thought it would be a good time to run through how to scrape them in R.
This one comes at us from an anonymous member of #EFT, who I will just call Mr EFT. Mineral Adventure Oklahoma is a weird place (my family is actually from there so can confirm). A lot of really crazy backwoods characters. Also a lot of fractured mineral interests floating around, so tracking them down can …
Inspired by the @StreetBomber story on the Chuck Yates Needs A Job podcast, I thought it may be fun to start a thread on everyone’s fun workplace stories. So I’ll give it a go first. To keep it going, just reach out to me on Twitter and we can discuss, or send me the story …
And give us a follow while you’re at it.