Scraping Comp Tables – R Coding Series

Last year I shared an insight on 2019 US shale executive compensation. In this tutorial, I'll give you a peek under the hood at how it's done: finding the data, locating the specific tables you want, and scraping them. As more operators release their annual proxies, I will revisit that insight later this month.

So let’s get to learnin’.


Where to Find the Data

Public operators typically file Form DEF 14A, the annual proxy statement, with the SEC. Buried within it is a summary of historical compensation for management, the board makeup, and the bonus payout hurdles; that last one I will likely address at a later date. The key takeaway for us is that the compensation summary is usually an HTML table, which we can scrape easily... as long as we can find it.

Of course, folks like Chesapeake and ExxonMobil like to use images instead of tables, so getting those would require more advanced PDF-scraping tools, which I won't go into in this one.

For our purposes, we like to use code to simplify this search. First off, let's fire up R and load the libraries; then we want to locate the CIK, the identifier the SEC uses for each company on its website. We will start the analysis with ConocoPhillips.

library(rvest) #Web Scraping from Tidyverse
library(finreportr) #General SEC Details
library(glue) #String manipulation
library(dplyr) #Tidy data cleaning
library(tidyr) #Tidy data in R
library(janitor) #Cleaning up my table
library(stringr) #More string manipulation

comp.ticker <- 'COP'
CIK <- (CompanyInfo(comp.ticker))$CIK

For ConocoPhillips, that is CIK 0001163165. You will notice that I used the finreportr package for this, which generally works fine. However, I often want to go back and look at acquired or now-defunct companies, and this package only covers active ones. In that scenario, you will need an alternative method of locating the CIK.
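If you hit that limitation, one workaround is the SEC's own ticker-to-CIK mapping file. This is only a sketch: it assumes the jsonlite package and the company_tickers.json file the SEC publishes (with its cik_str and ticker fields), and truly delisted names may still require a manual search on EDGAR.

#A hedged alternative to finreportr: look the CIK up in the SEC's ticker-to-CIK
#mapping file (URL and field names assumed from the file as currently published)
library(jsonlite)

cik_map <- jsonlite::fromJSON('https://www.sec.gov/files/company_tickers.json') %>%
  dplyr::bind_rows()

#Pad to the 10-digit format EDGAR expects
CIK <- sprintf('%010d', as.integer(cik_map$cik_str[cik_map$ticker == comp.ticker]))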

Next, we need to search the SEC website for all of the DEF 14A filings; but let's not go to our browser. We can easily do that search from R instead.

url1 <- glue::glue('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={CIK}&type=DEF+14A&dateb=&owner=exclude&count=40&search_text=')

#Locate the links to all the landing pages for annual DEF 14As for COP
pg <- (read_html(url1) %>% html_nodes('table') %>% .[3] %>% html_table())[[1]]

#Get the actual url for the links
pg$href <- paste0('https://www.sec.gov',(data.frame(links = (html_attr(html_nodes(read_html(url1), "a"), "href")), ticker = comp.ticker) %>%
                            filter(grepl('Archives', links)))$links)

#Get the year of the disclosure (which is 1 less than the filed date)
pg$YEAR <- as.numeric(substr(pg$`Filing Date`, 1, 4))-1

#Select the link and YEAR
pg <- pg %>% select(href, YEAR)

#Go through each row of the data frame, follow the link to each filing's landing page, and then locate the link to the individual document
pg <- pg %>% rowwise() %>% 
                         mutate(href = (read_html(href) %>% html_nodes('table') %>% .[1] %>%
                                            html_nodes('a') %>% html_attr('href'))[1]) %>%
                         ungroup() %>%
                         mutate(href = paste0('https://www.sec.gov', href)) %>%
                         mutate(ticker = comp.ticker)

We use glue to insert our predefined CIK into the string wherever {} appears. It may look like overkill for a single URL, but once you start packaging this all together and running loops, it comes in quite handy.
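To illustrate the point, here is a minimal sketch of the same glue template reused across several operators (the extra tickers are purely illustrative).

#A minimal sketch of reusing the glue template across several operators
#(tickers beyond COP are illustrative only)
tickers <- c('COP', 'DVN', 'PXD')

urls <- sapply(tickers, function(t) {
  cik <- (CompanyInfo(t))$CIK
  glue::glue('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=DEF+14A&dateb=&owner=exclude&count=40&search_text=')
})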

Then we search the initial results page for the URL of each year's DEF 14A landing page, and from there we search again for the link to the actual HTML document.

Now we have a list of all of the proxy filings, and the links to them.


Locate the Table

The next step is to actually find the table we want. If you count how many individual tables there are in a single filing, you will notice there are A LOT to parse through. But let's be lazy (smart).

After inspecting these visually many, many times, I can tell you that 95% of the time, the table we want contains the words PRINCIPAL and SALARY. So let's test that out on the first filing in our list.

#Select 2020 Def 14-A Filing for ConocoPhillips
x <- pg[1,]

#How many tables are there in the entire form?
length(read_html(x$href) %>% html_nodes('table'))

I see 641 individual tables, which is quite hefty. In case the syntax isn't clear: rvest lets us load the webpage (read_html) and then pull out every table node within it (html_nodes). Instead of analyzing each one, let's just find the table that contains our target words.

which((data.frame(txt1 = toupper(read_html(x$href[1]) %>% html_nodes('table') %>% html_text())) %>%
  mutate(txtFind = if_else(grepl('PRINCIPAL', txt1) & grepl('SALARY', txt1), 1, 0)))$txtFind == 1)[[1]]

which returns the position of the table containing our words. You'll also notice that I used toupper to capitalize all of the text, which is simpler than searching for every combination of upper and lower case.

Now that we know which table we want, I will pull it into a data frame and then use the janitor package to remove empty columns (which are super common in HTML tables).

x1 <-  (read_html(x$href) %>% html_nodes('table'))[[which((data.frame(txt1 = toupper(read_html(x$href[1]) %>% html_nodes('table') %>% html_text())) %>%
  mutate(txtFind = if_else(grepl('PRINCIPAL', txt1) & grepl('SALARY', txt1), 1, 0)))$txtFind == 1)[[1]]]] %>% html_table(header = F, fill = T) %>% 
  janitor::remove_empty(which = 'cols') 

html_table converts it into data frame form; to save yourself some headaches, use the fill = T option. I also set header = F here, because companies often spread their header information across multiple rows.
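If you want to see why header = F matters, peek at the top of the raw table (purely an exploratory step); the header text comes through as ordinary rows, which we deal with next.

#With header = F the multi-row header shows up as ordinary data rows;
#a quick peek shows why we collapse them into column names ourselves
head(x1, 3)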


Data Cleanup

Now for the step that always takes the longest in data science/exploration: cleaning up the data. HTML tables are generally not designed to plug straight into an analysis. I have spent countless hours trying to optimize the process, but it still takes patience and a lot of editing.

First, I will identify the header row, then combine the multiple header rows and clean up the column names. If you skip this step, a lot of the later tidying does not work as expected.

#Search for which row has either SALARY or POSITION within it
item1 <- which(apply(x1, 1, function(r) any(grepl('SALARY', toupper(r))|grepl('POSITION', toupper(r)))))

#Choose the last row where this occurs
item1 <- item1[length(item1)]
 
#Concatenate all rows through the last row just calculated by column and rename the columns
names(x1) <- str_squish((data.frame(t(x1[1:item1,]))%>% unite(id, everything(), sep = ' '))$id)

#Rename empty columns
names(x1)[names(x1) == ''] <- 'test'

#Add extra text to make each column name unique
names(x1) <- make.unique(names(x1))

#Remove all the header rows
x1 <- x1[(item1+1):nrow(x1),]

As it stands now, the table is a bit of a mess.

ConocoPhillips 2020 Executive Compensation

There are several repeated columns, and frankly some that are just empty. So first, let's remove empty strings, as well as various characters we do not want (stray unicode, dollar signs, commas). I could probably strip out all punctuation, but that carries the risk of mangling figures where someone reported compensation down to the penny.

x1 <- x1 %>%
   mutate(across(everything(), str_squish)) %>%
   mutate(across(everything(), function(x) gsub('\u200b', '',x, fixed = T))) %>%
   mutate(across(everything(), function(x) gsub('\u0097', '',x, fixed = T))) %>%
   mutate(across(everything(), function(x) gsub('\u0093', '',x, fixed = T))) %>%
   mutate(across(everything(), function(x) gsub('\u0094', '',x, fixed = T))) %>%
   mutate(across(everything(), function(x) gsub('\u0092', '',x, fixed = T))) %>%
   mutate(across(everything(), function(x) gsub('\u0086', '',x, fixed = T))) %>%
  mutate(across(everything(), function(x) gsub('$', '',x, fixed = T))) %>%
  mutate(across(everything(), function(x) gsub(',', '',x, fixed = T))) %>%
  mutate(across(everything(), function(x) gsub('–', '0',x, fixed = T))) %>%
   mutate(across(everything(), function(x) str_squish(gsub("\\s*\\([^\\)]+\\)","",as.character(x))))) %>%
  mutate(across(everything(), function(x) replace(x, x == '', NA))) %>%
    janitor::remove_empty(which = 'cols')

I am going to rename the first two columns to Name and Year. Nine times out of ten this works, though a few companies lay their tables out differently; those have to be handled separately, and I'll save that for another day.

The rest of the code extracts the name and position of each executive. This is a little hit and miss, since the title is usually reported below the name but is sometimes next to it. Regardless, it works well for this scenario.

I also gather all of the other columns (everything outside of Name, Position, and Year) into Item and Value columns, and then tag on the filing year from our initial data pull.
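Before running the cleanup below, a quick guard (not part of the original pipeline) can flag the filings where the second column does not actually hold the year:

#Not part of the main pipeline: flag filings where column 2 does not look like a year,
#i.e. the roughly one-in-ten cases that need manual handling
looks_like_year <- grepl('^(19|20)\\d{2}$', x1[[2]])
if (!any(looks_like_year, na.rm = TRUE)) {
  warning('Second column does not look like a year; this filing may need manual handling.')
}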

x1 <- x1 %>%
    rename(Name = 1, Year = 2) %>%
    mutate(ID = if_else(as.numeric(Year) >= lag(as.numeric(Year)), 1, 0)) %>%
    mutate(ID = replace(ID, is.na(ID), 1)) %>% mutate(ID = cumsum(ID)) %>%
    group_by(ID) %>% mutate(name = Name[1]) %>% ungroup() %>% 
    mutate(position1 = word(name, 2, 3, sep = fixed('\n'))) %>%
    mutate(name = word(name, 1, sep = fixed("\n"))) %>%
    mutate(position1 = replace(position1, is.na(position1) & !name == Name & !is.na(Name), Name[is.na(position1) & !name == Name & !is.na(Name)])) %>%
    mutate(position1 = replace(position1, is.na(position1), '')) %>%
    group_by(ID) %>% mutate(position1 = paste(position1, collapse = ' ')) %>% ungroup() %>%
    mutate(position1 = str_squish(position1), name = str_squish(name)) %>% subset(select = -c(Name, ID)) %>%
    gather(item, value, -c(name, position1, Year)) %>% mutate(value = as.numeric(value)) %>%
    mutate(item = gsub('\n', ' ', item), item = str_squish(gsub("\\s*\\([^\\)]+\\)","",as.character(item)))) %>%
    mutate(item = toupper(item), name = toupper(name), position1 = toupper(position1)) %>%
    rename(Name = name, Item = item, Value = value, Position = position1) %>%
    mutate(ticker = x$ticker, filingYr = x$YEAR) %>%
    mutate(Item = word(Item, 1, sep = fixed("."))) %>% filter(!is.na(Value))

If you want to dig deeper, work through the pipeline one line at a time so you can see what happens at every step. Regardless, we now have a tidy table that we can use for further analysis.
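As a quick sanity check on the tidy output (just an illustrative query), you can count how many line items were captured per executive and year:

#Quick sanity check on the tidy table: line items captured per executive-year
x1 %>%
  count(Name, Year, sort = TRUE) %>%
  head()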


Loop

The next thing we want to do is turn the above into a function, then loop through our links and combine everything into one table.

check1 <- function(x) tryCatch({
  #Pull the table that contains both PRINCIPAL and SALARY
  x1 <- (read_html(x$href) %>% html_nodes('table'))[[which((data.frame(txt1 = toupper(read_html(x$href[1]) %>% html_nodes('table') %>% html_text())) %>%
    mutate(txtFind = if_else(grepl('PRINCIPAL', txt1) & grepl('SALARY', txt1), 1, 0)))$txtFind == 1)[[1]]]] %>% html_table(header = F, fill = T) %>%
    janitor::remove_empty(which = 'cols')

  #Find the last header row and collapse the header rows into column names
  item1 <- which(apply(x1, 1, function(r) any(grepl('SALARY', toupper(r))|grepl('POSITION', toupper(r)))))
  item1 <- item1[length(item1)]

  names(x1) <- str_squish((data.frame(t(x1[1:item1,])) %>% unite(id, everything(), sep = ' '))$id)
  names(x1)[names(x1) == ''] <- 'test'
  names(x1) <- make.unique(names(x1))

  #Remove the header rows
  x1 <- x1[(item1+1):nrow(x1),]

  #Clean up the cell contents, then tidy the table
  x1 <- x1 %>%
    mutate(across(everything(), str_squish)) %>%
    mutate(across(everything(), function(x) gsub('\u200b', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('\u0097', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('\u0093', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('\u0094', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('\u0092', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('\u0086', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('$', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub(',', '', x, fixed = T))) %>%
    mutate(across(everything(), function(x) gsub('–', '0', x, fixed = T))) %>%
    mutate(across(everything(), function(x) str_squish(gsub("\\s*\\([^\\)]+\\)", "", as.character(x))))) %>%
    mutate(across(everything(), function(x) replace(x, x == '', NA))) %>%
    janitor::remove_empty(which = 'cols') %>%
    rename(Name = 1, Year = 2) %>%
    mutate(ID = if_else(as.numeric(Year) >= lag(as.numeric(Year)), 1, 0)) %>%
    mutate(ID = replace(ID, is.na(ID), 1)) %>% mutate(ID = cumsum(ID)) %>%
    group_by(ID) %>% mutate(name = Name[1]) %>% ungroup() %>%
    mutate(position1 = word(name, 2, 3, sep = fixed('\n'))) %>%
    mutate(name = word(name, 1, sep = fixed("\n"))) %>%
    mutate(position1 = replace(position1, is.na(position1) & !name == Name & !is.na(Name), Name[is.na(position1) & !name == Name & !is.na(Name)])) %>%
    mutate(position1 = replace(position1, is.na(position1), '')) %>%
    group_by(ID) %>% mutate(position1 = paste(position1, collapse = ' ')) %>% ungroup() %>%
    mutate(position1 = str_squish(position1), name = str_squish(name)) %>% subset(select = -c(Name, ID)) %>%
    gather(item, value, -c(name, position1, Year)) %>% mutate(value = as.numeric(value)) %>%
    mutate(item = gsub('\n', ' ', item), item = str_squish(gsub("\\s*\\([^\\)]+\\)", "", as.character(item)))) %>%
    mutate(item = toupper(item), name = toupper(name), position1 = toupper(position1)) %>%
    rename(Name = name, Item = item, Value = value, Position = position1) %>%
    mutate(ticker = x$ticker, filingYr = x$YEAR) %>%
    mutate(Item = word(Item, 1, sep = fixed("."))) %>% filter(!is.na(Value))

  return(x1)
},
error = function(e) {
  NULL
})

I create a function and wrap it in a tryCatch so that, if an error occurs, it does not stop but simply moves on to the next link.
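Before looping over everything, it can be worth spot-checking the function on a single filing (the row index here is arbitrary):

#Spot-check the function on one filing before running the full loop
#(the row index is arbitrary)
test1 <- check1(pg[2,])
head(test1)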

Now, I will just run the function across all of the links and populate a single table.

cop <- dplyr::bind_rows(lapply(split(pg, pg['href']), check1))
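If EDGAR starts refusing requests, a slightly slower version of the same loop, pausing between filings, tends to help; the one-second delay below is arbitrary, so check the SEC's fair-access guidance for the current limits.

#A more polite version of the same loop: pause between filings so EDGAR
#does not throttle us (the one-second delay is arbitrary)
cop <- dplyr::bind_rows(lapply(split(pg, pg$href), function(row1) {
  Sys.sleep(1)
  check1(row1)
}))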

We now have a dataset to do more advanced analysis.


Example

While I will not show the code here, I use the highcharts library to visualize some of the things you can do with the data.

First is our big ballers.

Figure 1: Top Ten Total Compensation – US Shale

Next is the trend in total compensation over time. It does appear to be broadly in line with share prices, which is not too surprising given how much of the compensation is share-based.

Figure 2: US Independent Compensation over Time – US Shale


Finale

And that is it. I hope you enjoyed this tutorial and found it useful. Stay tuned for the next insight on annual salary once the bulk of the filings are in.

Also Give Us A Follow!
