Impairments in Shale – Scraping SEC filings with R

As a continuation of my post on how to get Financial Statements in R, I thought I would delve into some of the other things you can do with the raw data beyond just trying to replicate a financial statement. In this insight we will pull down high-level impairment data to track the trend over time.


In the last insight, I dug deep into just how to use the various packages in R to grab the raw financial data from the SEC. I will not rehash that here. But not to fret, I will dig a little deeper into how to do it for multiple companies over differing time periods.


There are quite a few packages I need, which I have used extensively in other insights. I will list them here, but feel free to search around if you want to know more about them.

  • dplyr
  • lubridate
  • purrr
  • tidyr
  • rvest
  • httr
  • stringr
  • glue
  • edgarWebR
  • XBRL
  • data.table
  • highcharter

I am also going to create a custom function that measures the time between two dates. I do this to measure the duration between the initial and final reported dates in SEC filings, which, as you may be aware, come in various combos of 3, 6, 9, and 12 months. Which stinks, to be honest.

elapsed_months <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

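As a quick sanity check of the function (and a preview of why I add 1 to it later on: a filing covering January through June spans six calendar months, but the month fields only differ by five):

```r
# Repeated here so the snippet stands alone
elapsed_months <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

# A 10-Q covering 2020-01-01 through 2020-06-30:
m <- elapsed_months(as.Date('2020-06-30'), as.Date('2020-01-01'))
m      # 5
m + 1  # 6 -- the 6-month period as the SEC reports it
```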

First thing we want to do is define our company list. I will use the typical shale-focused operators, up to and including ConocoPhillips. The large integrateds will be ignored in this go-round. However, I also want to include several bankrupt and acquired operators (like Chesapeake) to try and more accurately gauge the impact.

These operators include Chesapeake, Encana (to grab previous Ovintiv data), Energen, Quicksilver, Extraction, EP Energy, RSP Permian, Exco, and Anadarko. Probably forgetting some but I’m lazily going off memory here.

opList1 <- data.frame(operators = c('APA', 'APC',  'AR',  'CDEV', 'CHAP', 'CHK', 'CLR',
             'CNX', 'COG', 'COP', 'CPE',  'CRK', 'CXO',  'DVN', 'EP', 'EOG',
             'EQT', 'EXCO',  'FANG', 'GPOR', 'GDP',  'HES', 'HPR', 'KWK',  'LPI', 'MGY', 'MR',
             'MRO', 'MTDR', 'MUR', 'NBL',  'OAS', 'OXY', 'ECA', 'OVV', 'PDCE', 'PE', 'PVAC', 'PXD', 'QEP', 
             'RRC', 'RSP', 'SBOW', 'SM', 'SWN',  'WLL', 'WPX', 'XEC', 'XOG'))

Of course, the problem is that the non-current operators don’t get pulled in with the typical R packages we discussed last time. So I will have to fake it a little bit. To do so, I need to go to the SEC website and search for these operators and find their CIK number.

The Central Index Key (CIK) is used on the SEC's computer systems to identify corporations and individuals who have filed disclosures with the SEC. To find a CIK for a company, fund, or individual, type in as much of the company name as you know.

So what I will do is perform a similar search to what the edgarWebR package does. The main thing we use for that package is the reference to the SEC page that holds all of the data for each report, so we will use glue to find that website.

To replicate what I'm doing, just go to the SEC's EDGAR company search, look up a company like CHK, and then find all reports that contain 10-. Copy that url. For example, the url I just found was:

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000895126&type=10-&dateb=&owner=exclude&count=40

You will notice the CIK= piece within the search url, so that is what we will replicate. I've gone ahead and found the CIK values for them.
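If you'd rather not hunt CIKs down by hand, the SEC also publishes a ticker-to-CIK mapping at https://www.sec.gov/files/company_tickers.json. A minimal sketch (assuming httr and jsonlite; note that delisted tickers like KWK won't appear in the current mapping, which is exactly why I hardcode them below):

```r
library(httr)
library(jsonlite)

# EDGAR urls want the zero-padded 10-digit form of the CIK
pad_cik <- function(cik) sprintf('%010d', as.integer(cik))

# Current ticker -> CIK mapping published by the SEC (live tickers only)
lookup_ciks <- function() {
  resp <- GET('https://www.sec.gov/files/company_tickers.json',
              user_agent('you@example.com'))  # SEC asks for a contact in the user agent
  tk <- fromJSON(content(resp, as = 'text', encoding = 'UTF-8'))
  tk <- do.call(rbind, lapply(tk, as.data.frame))
  tk$cik <- pad_cik(tk$cik_str)
  tk[, c('ticker', 'cik', 'title')]
}

pad_cik(895126)  # "0000895126" -- CHK
```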

opList1$cik <- NA
opList1$cik[opList1$operators == 'KWK'] <-  '0001060990'
opList1$cik[opList1$operators == 'CHK'] <-  '0000895126'
opList1$cik[opList1$operators == 'XOG'] <-  '0001655020'
opList1$cik[opList1$operators == 'ECA'] <-  '0001157806'
opList1$cik[opList1$operators == 'EP'] <-  '0001584952'
opList1$cik[opList1$operators == 'EGN'] <-  '0000277595'
opList1$cik[opList1$operators == 'RSP'] <-  '0001588216'
opList1$cik[opList1$operators == 'EXCO'] <-  '0000316300'
opList1$cik[opList1$operators == 'APC'] <-  '0000773910'

#Use glue to find target website for each CIK value
opList1$url1 <- glue::glue('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={opList1$cik}&type=10-&dateb=&owner=exclude&count=40&search_text=')

#Zero out urls for current companies as we can grab the filings using edgarWebR package
opList1$url1[is.na(opList1$cik)] <- NA

Now I have the list of operators, and a url to use in the event that edgarWebR will not pull the data for me.

Time to loop.


As I did in the last insight, I will use xbrlDoAll from the XBRL package to grab the raw data. But this time around, I will loop through all of the operators, pull data for annual filings and quarterly in 2020, and then search for any elementId that contains the character string Impair. I will also ensure that it comes from the actual statements as opposed to any of the side tables from the report.

The preferred method of looping in R is to use the apply family of functions, or in our case, lapply.
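For anyone new to the pattern, here is a toy version of the lapply-plus-tryCatch skeleton the big loop below uses (the negative input stands in for a ticker whose filings fail to parse):

```r
# lapply runs the function once per element; tryCatch swallows errors
# so one bad ticker doesn't kill the whole loop
safe_sqrt <- lapply(c(1, -1, 16), function(x) tryCatch({
  if (x < 0) stop('bad input')  # simulate a scrape/parse failure
  sqrt(x)
}, error = function(e) NULL))

safe_sqrt
# [[1]] 1, [[2]] NULL, [[3]] 4
```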

Fair Warning: Without using parallel computing, the following code may take an hour or so to run. If I wasn’t searching for a specific item in the XBRL, I would probably just download all of the raw data with each loop just to keep myself from running it over and over, though memory size can be an issue.

Please work through the previous tutorial or this will not work as expected.
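If the hour-long runtime bothers you, here is a hedged sketch of a parallel drop-in using the base parallel package. mclapply forks on macOS/Linux; on Windows you would need parLapply and a cluster instead, and `scrape_one_ticker` is a hypothetical wrapper around the loop body below:

```r
library(parallel)

# A common rule of thumb: leave one core free for the rest of the machine
n_cores <- max(1, detectCores() - 1)

# Drop-in replacement for the serial lapply below; one ticker per worker:
# econSummary <- mclapply(split(opList1, opList1[,'operators']),
#                         scrape_one_ticker,  # hypothetical wrapper function
#                         mc.cores = n_cores)
```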

econSummary <- lapply(split(opList1, opList1[,'operators']), function (tickers) tryCatch({
  comp.ticker <- tickers$operators
  if (!is.na(tickers$url1)) {
    chkURL <- tickers$url1
    pg <- (read_html(chkURL) %>% html_nodes('table') %>% .[3] %>% html_table())[[1]]
    links <- read_html(chkURL)
    links <- data.frame(links = (html_attr(html_nodes(links, "a"), "href")), ticker = comp.ticker) %>%
      filter(grepl('Archives', links))
    pg$href <- links$links
    pg$href <- paste0('https://www.sec.gov', pg$href)
    pg$`Filing Date` <- as.Date(pg$`Filing Date`)
    names(pg) <- c('type', 'format', 'description', 'filing_date', 'file_number', 'href')
    FilingsonEdgar <- pg
  } else {
    FilingsonEdgar <- edgarWebR::company_filings(x = comp.ticker, type = "10-")
  }

  FilingsonEdgar <- FilingsonEdgar  %>%
    filter(type == '10-K'|year(filing_date)==2020) %>%
    filter(!grepl('A', type))
  data1 <- lapply(split(FilingsonEdgar, FilingsonEdgar[,'href']), function (well) tryCatch({
    DocumentsonEdgar <- edgarWebR::filing_documents(x = well$href)
    link <- DocumentsonEdgar[DocumentsonEdgar[5] == 'XML'|
                               DocumentsonEdgar[5] == 'EX-101.INS', 4]
    xbrl.vars <- XBRL::xbrlDoAll(link, verbose=TRUE)
    roles1 <- xbrl.vars$role %>% filter(grepl('Statement', type)) %>%
      filter(!grepl('arenth', roleId)) %>% subset(select = -c(order)) %>%
      left_join(xbrl.vars$presentation, by = 'roleId')
    impairments <- xbrl.vars$fact %>%
      filter(grepl('Impair', elementId)) %>%
      filter(elementId %in% roles1$toElementId) %>%
      left_join(xbrl.vars$context) %>%
      filter(!is.na(startDate)) %>%
      left_join(xbrl.vars$label) %>%
      left_join(roles1 %>% select(elementId = toElementId, preferredLabel)) %>%
      filter(labelRole == preferredLabel) %>%
      filter(!is.na(fact)) %>%
      filter(!duplicated(paste0(fact, startDate, endDate))) %>%
      group_by(startDate, endDate, fact) %>% filter(as.numeric(decimals) == max(as.numeric(decimals))) %>%
      ungroup() %>%
      mutate(company = comp.ticker) %>%
      select(company,elementId,label = labelString, startDate, endDate, fact) %>%
      mutate(filingDate = well$filing_date) 
  },
  error = function(e) {
    NULL
  }))

  df1 <- data.table::rbindlist(data1) 
  df1 <- df1 %>% distinct()
  df1 <- df1 %>% group_by(startDate, endDate) %>% filter(filingDate == min(filingDate)) %>% ungroup()
  df1$months <- elapsed_months(df1$endDate, df1$startDate)+1
  df1 <- df1 %>% arrange(endDate, months)
  df1$yr <- year(df1$endDate)
  df1 <- df1 %>% group_by(yr) %>% filter(months == max(months)) %>% ungroup()
  df1 <- df1 %>% select(company, elementId, yr, fact)

  df1
},
error = function(e) {
  NULL
}))

Looks complicated? Yeah, it is a little bit, but here is what I’m doing:

  1. Use lapply to create a loop for each stock ticker. I also use tryCatch to define what happens when an error occurs; in this case it just records a NULL value and goes to the next ticker. Before I figured out how to do this I would get super frustrated because the loops would break each time an error occurred.
  2. Create an if-else statement that uses edgarWebR to find all the 10-Q/K page links if it’s an existing company, or uses a custom html scraper to do so if it’s a past company.
  3. Filter to either a 10-K or 10-Q if the filing year was 2020.
  4. Loop again through each company's filing list.
  5. For each loop, I will initially find any of the tables that are the base Financial Statements. I call this roles1.
  6. Search for Impair within each elementId from the facts table.
  7. Filter so that this matches equivalent toElementId‘s from the roles1 table.
  8. Join context table, which contains our date data.
  9. Remove any value where there is no start date, as I’m looking for duration-type values as opposed to point-in-time like the balance sheet.
  10. Remove duplicates and attach labels as per the last insight we did.
  11. End the inner loop, use data.table to combine all the data, and then filter so that each instance of a period/elementId is the originally reported data. Because I’m including acquired companies I felt like this was a better representation. If you’re doing only current companies, probably worth it to use the last reported date instead as it’s more of a pro-forma view.
  12. Use the maximum reported monthly period for each year (usually 12, but will use the maximum period for the current year too, so in our case through Q2 2020).
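For orientation, xbrlDoAll returns a list of data frames, and the loop above leans on just a few of them. The table names below are per the XBRL package docs; the role comments are a quick summary of how this loop uses them:

```r
# Tables returned by XBRL::xbrlDoAll (per the XBRL package docs)
xbrl_tables <- c('element', 'role', 'label', 'presentation', 'definition',
                 'calculation', 'context', 'unit', 'fact', 'footnote')

#  fact               -- the reported numbers, keyed by elementId + contextId
#  context            -- startDate/endDate (durations) or a single instant per contextId
#  label              -- human-readable labels per elementId and labelRole
#  role/presentation  -- which statement or disclosure table each element sits on
```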

Final Cleanup

I am going to be a bit lazy here. You could get a bit more advanced and look through each specific item and determine which values you want to use. My assumption is that for each company/year, the largest value is the one that encompasses the sum total of all impairments. You will notice that a lot of the data is actually the sub-components, and then there is one line that represents the sum.
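To make that assumption concrete, here is a toy illustration with made-up labels and values (base R so it stands alone):

```r
# One company/year with sub-component impairments plus a total line;
# the assumption is that the largest fact is the all-in total
toy <- data.frame(
  company = 'CHK', yr = 2015,
  label = c('Oil and gas impairment', 'Other impairment', 'Total impairments'),
  fact  = c(1.8e10, 2e9, 2.0e10)
)

# keep the row holding the max fact within each company/year group
keep <- toy[toy$fact == ave(toy$fact, toy$company, toy$yr, FUN = max), ]
keep$label
# "Total impairments"
```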

df <- data.table::rbindlist(econSummary) %>% mutate(fact = as.numeric(fact))

#Rename ECA to OVV
df$company[df$company == 'ECA'] <- 'OVV'

#Find largest value by year/company/elementId
df <- df %>% group_by(company, yr, elementId) %>% filter(fact == max(fact)) %>%
  ungroup() %>% distinct()

#Filter to 2010+
df <- df %>% filter(yr >= 2010)

#Find largest elementId for each company/year
df <- df %>% group_by(company, yr) %>% filter(fact == max(fact)) %>% ungroup() %>%
  filter(!duplicated(paste0(company, yr, fact)))

This is our dataset.


Now we have a dataset of impairments by year for each operator. Let’s plot it up with highcharter.

Annual Total

How does it look on just a summary basis?

yrly <- df %>% group_by(yr) %>% summarise(impairments = sum(fact)) %>% ungroup()

# 'cols' is a custom colour palette defined elsewhere in my setup
highchart() %>%
  hc_add_series(yrly, type = 'column',
                hcaes(x = yr, y = as.integer(impairments/1000000000)), 
                name = 'Total Impairments', showInLegend = F) %>% 
  hc_tooltip(
    pointFormat = str_c(
      "<b>Total Impairments: </b>US${point.y} billion"),
    useHTML = TRUE
  ) %>%
  hc_yAxis(min = 0,gridLineColor = 'transparent',
           title = list(text = 'Total Impairments<br>US$ Billions', 
                        style = list(fontFamily = 'Arial', color = 'black', fontSize = '18px')),
           labels = list( style = list(fontFamily = 'Arial', color = 'black', fontSize = '14px'))) %>%
  hc_xAxis(labels = list( style = list(fontFamily = 'Arial', color = 'black', fontSize = '14px'))) %>%
  hc_title(text = 'US Shale Total Impairments - 2010+', align = 'left', 
                     style = list(fontFamily = 'Arial', color = cols[12])) %>%
  hc_subtitle(text = 'Source: SEC', align = 'left', 
           style = list(fontFamily = 'Arial')) 

Figure 1: Annual Impairments in the Shale space, $US Billions

Ouch. 2015 was a stinker, especially if you lived through it, but halfway through 2020 we are well on our way to eclipsing it. Deloitte mentioned something along the lines of $300 billion, and with the way the SEC defines the price deck used for asset value, then who knows, maybe so.


Let’s look at the guys who have written off the most over time. I’m also going to get a bit fancy on the highcharts as well.

Seriously, highcharts is awesome, and there are so many things you can do with it.

company <- df %>% group_by(company) %>% summarise(total = sum(fact)) %>% ungroup() %>% arrange(desc(total))
company$text <- paste0(company$company, ' - US$', round(company$total/1000000000,1), ' Billion')

donutdata2 <- df %>% 
  select(company, yr, fact) %>% 
  nest(-company) %>% 
  mutate(
    data = map(data, mutate_mapping, hcaes(x = yr, y = round(fact/1000000000,1)), drop = TRUE),
    data = map(data, list_parse)
  ) %>%
  rename(ttdata = data) %>% 
  left_join(company)

hchart(donutdata2, "pie",
  hcaes(name = company, y = round(total/1000000000, 1)),
  innerSize = 500) %>% 
  hc_title(text = 'US Specific Company Impairments - 2010+', align = 'left', 
           style = list(fontFamily = 'Arial', color = cols[12])) %>%
  hc_subtitle(text = 'Source: SEC', align = 'left', 
              style = list(fontFamily = 'Arial')) %>%
  hc_tooltip(
    useHTML = TRUE,
    headerFormat = "<center><b>Total Impairments: </b><br>US${point.y} billion</center>",
    pointFormatter = tooltip_chart(
      accesor = "ttdata",
      hc_opts = list(
        chart = list(type = "column"),
        credits = list(enabled = FALSE)
      ),
      height = 300,
      width = 350
    ),
    positioner = JS(
      "function () {
        /* one of the most important parts! */
        xp =  this.chart.chartWidth/2 - this.label.width/2
        yp =  this.chart.chartHeight/2 - this.label.height/2
        return { x: xp, y: yp };
      }"
    ),
    shadow = FALSE,
    borderWidth = 0,
    backgroundColor = "transparent",
    hideDelay = 1000
  )

Figure 2: Company Specific Impairments in the Shale space, $US Billions

Slightly surprised to see DVN that high ($30+ billion), but given the heavy investments in places like the Barnett and the decline in gas prices, it may make some sense. It's not like the Felix acquisition in the Scoop/Stack worked out that well. The combined amount from Anadarko (APC) and OXY is through the roof. Imagine if I had included CRC!


Thanks for reading. This is not necessarily representative of the true change in asset value, as we are using high-level methods, but it is definitely indicative of the trend. Definitely interested to see the quantum of impairments at year-end.

Our goal is to show you how to be much better at integrating code into your workflow, and this is an example of some of the things you can do with the raw SEC data. I will likely dig into unsecured debt in the next one.

Please follow our other posts to learn more, or follow our blog.
