If you’re like me, you like to explore how to automate some of your workflows, and also save a few bucks in the process. In this post I will walk through downloading financial statements using the R programming language. Given the news of a potential WPX merger, I suppose I will focus on the opposite party, Devon Energy.
Side note, I am using the free version of RStudio Cloud for the first time. While it is size limited on the free tier, you don’t have to install anything on your machine and it works for this workflow.
Financial Statement Sources
For anyone who has worked in corporate valuation, or for you investors that actually like to look at financials instead of following animal spirits, a wealth of information is buried within corporate financial statements. The problem? These documents are large, the data/tables are all over the place, and there isn’t really an easy way to parse through a TON of different filings and look at how things trend over time.
So how do analysts typically work through this? Their companies have licenses to things like EIKON, FactSet, Bloomberg Terminal, etc. A single user license can cost $10–20k/year or more, so for us peons without corporate backstopping, other methods need to be found. Personally, I like to spend somewhere around $0/year, so let’s work through an alternative approach.
We will focus on something known as eXtensible Business Reporting Language, or XBRL.
XBRL is an attempt to make financial statements digital and more standardized. While that is a worthwhile goal, anyone who has worked with it has realized it is anything but standardized. Regardless, most companies (if not all, at this point) submit their filings to the SEC in XBRL format. Buried deep within is even more data than actually surfaces in the financial statements.
There is a standard taxonomy as well to define how to link all of this various data together, and it is about as convoluted as you can imagine.
All over the place. Basically, each box is a separate data frame in R, and the arrows/underlined rows tell us the common elements between them.
I will condense it down as best I can and let you know what is important:
- Role: Typically defines the various tables within the financial statements. If you select type of ‘Statement’ that will give you the standard Balance Sheet, Income Statement, and Cash Flow Statements. If you like to dig deeper, you can also select ‘Disclosure’ type, which will have non-standardized tables, but can include further disclosures of debt, etc.
- Presentation: This will basically give you the structure of the table you select. Key things I typically use are the roleId (which links to the Role table), preferredLabel (not shown above), and then fromElementId and toElementId. Think of fromElementId as a parent-node (like Current Assets), and then toElementId as the children (like Cash and Cash Equivalents).
- Calculation: Similar to Presentation table, but we can use it to determine which of the elements are going to be ‘Summation’ items, or basically derived categories that we are used to seeing. Think like Cash Flow From Operations, Current Assets, Total Equity, etc.
- Fact: This contains the raw numeric data. The columns I use are elementId, contextId, unitId, fact, and decimals. You’ll notice there is no start or end date in this file, so these numbers are just floating in space. The link to our Presentation table is from toElementId to elementId.
- Context: This is where all of our date data can be found. The link between this table and Fact is contextId. The dimension and value columns are also very important. Basically, if you want to replicate what’s in the actual statement, you want all Dimensions with a null value (don’t ask me why). If you want sub-components of the presented data, you want everything else. There is a bit of finesse here too, as these subcomponents will have the same elementId as their parent value when joined to the Fact table. Which is all well and fine, but when we join the Label table to this, all the subcomponents will have the same name. You can sometimes do some text wrangling to figure out what these labels should be, but it’s not very consistent. My preference for the subcategories is just to concatenate the elementId onto the value columns for a label, as it is often pretty self-explanatory.
- Label: This tells us what the standard label is for each elementId. You will join this to the Fact table, then to the Presentation table, and then you can filter so that preferredLabel = labelRole.
- Unit: Standard(ish) units for each elementId. Joins to Fact table by unitId.
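Before diving into the real filing, here is a minimal sketch of how the Fact, Context, and Label tables link together. The data frames below are toy stand-ins that mimic the shape of the real `xbrl.vars` tables (the element IDs, context IDs, and values are made up for illustration):

```r
library(dplyr)

# Toy Fact table: raw numbers, no dates
fact <- tibble::tibble(
  elementId = c("us-gaap_Revenues", "us-gaap_Revenues"),
  contextId = c("C_2020Q2", "C_2019Q2"),
  unitId    = c("U_USD", "U_USD"),
  fact      = c("1000000", "2000000")
)

# Toy Context table: dates live here; NA dimension = headline statement values
context <- tibble::tibble(
  contextId  = c("C_2020Q2", "C_2019Q2"),
  startDate  = c("2020-04-01", "2019-04-01"),
  endDate    = c("2020-06-30", "2019-06-30"),
  dimension1 = c(NA_character_, NA_character_)
)

# Toy Label table: human-readable names for each elementId
label <- tibble::tibble(
  elementId   = "us-gaap_Revenues",
  labelRole   = "terseLabel",
  labelString = "Total revenues"
)

# Join the numbers to their dates, then to their labels
stmt <- fact %>%
  left_join(context, by = "contextId") %>%
  filter(is.na(dimension1)) %>%
  left_join(label, by = "elementId")

stmt %>% select(labelString, startDate, endDate, fact)
```

The same join keys (`contextId`, `elementId`) drive the full workflow below; the real tables just have many more rows and columns.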
Seems like a lot, and it is. But I will work through an example to show it in action.
Getting the Raw Data
As always, someone has done most of the work already as far as packages go. Of course, the quirk is that some of these packages for XBRL aren’t being actively updated so we need to do a little bit of additional work.
Here is what we will need for this workflow:
- dplyr – Data manipulation in the tidyverse
- tidyr – Pivot-type operations in the tidyverse
- lubridate – Working with date data in the tidyverse
- stringr – String operations in the tidyverse
- devtools – The current build of the edgarWebR package does not work with the current version of R, so we need to use the devtools package to install it straight from github.
- edgarWebR – edgarWebR provides an interface to access the SEC’s EDGAR system for company financial filings.
- XBRL – Allows us to download XBRL files from the SEC and puts it into a list of data frames.
- scales – This is installed alongside one of the other packages; we will just use it to convert numbers to currency values.
If you have none of this installed, the below works in the most recent version of R on Windows and Linux.
```r
install.packages('dplyr')
install.packages('tidyr')
install.packages('lubridate')
install.packages('stringr')
install.packages('devtools')
install.packages('XBRL')
devtools::install_github("mwaldstein/edgarWebR")

library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
library(edgarWebR)
library(XBRL)
library(scales)
```
Locate the Files
Before we can download them, we actually have to locate the files we want, which is where edgarWebR comes in. Since we are focused on Devon, we can use the following to find the last several years/quarters of 10-Q/K’s (though you can also find various other filings too).
```r
comp.ticker <- 'DVN'
FilingsonEdgar <- edgarWebR::company_filings(x = comp.ticker, type = "10-")

# Look at the first six rows of data
head(FilingsonEdgar)
```
Figure 1: First 6 Rows – Devon Quarterly/Annual Filings from EdgarWebR
We are just going to look at the most recent quarter (though you can go back as far as you want), so we’ll need the href column for the first row. That takes us to the landing page of that filing, and the file we want is linked from that page. To find the XML file, we just need to pull the links on the page and select the proper one (it took me a while, but these selections work for all 10-Q/K’s).
```r
DocumentsonEdgar <- edgarWebR::filing_documents(x = FilingsonEdgar$href[1])
link <- DocumentsonEdgar[DocumentsonEdgar$type == 'XML' |
                           DocumentsonEdgar$type == 'EX-101.INS', 4]
```
To download the files, it used to be as simple as:
```r
xbrl.vars <- XBRL::xbrlDoAll(link, verbose = TRUE)
```
What will happen is a new folder called xbrl.Cache will be created within your working directory. This stores several of the files needed to interpret the data. However, for the last couple of years something has gone wrong with downloading some of the .xsd files we need, so we will see an error like this:
```
Downloading to cache dir...trying URL 'https://www.sec.gov/Archives/edgar/data/1090012/000156459020036568/https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd'
Error in fileFromCache(file) :
  Error in download.file(file, cached.file, quiet = !verbose) :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/1090012/000156459020036568/https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd'
In addition: Warning message:
In download.file(file, cached.file, quiet = !verbose) :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/1090012/000156459020036568/https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd':
  HTTP status was '404 Not Found'
```
Basically, the URL being constructed is malformed: the SEC archive path gets glued onto the front of the taxonomy URL. So, annoyingly, we need to go fetch the missing .xsd file ourselves:
Paste the taxonomy link (the part of the URL beginning with https://xbrl.sec.gov) into your browser, right-click within the text, and save it directly into your xbrl.Cache folder. After a few companies/years you’ll eventually have all the ones you need and won’t have to do it again, at least until they change things up again. One other note: make sure to save it without the .xml extension or it won’t work (i.e., the end of the file name should say .xsd).
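If you’d rather not do the browser round-trip, here is a hedged sketch of the same fix in code. `fix_cached_xsd` is a hypothetical helper (not part of the XBRL package): it strips the bogus SEC Archives prefix off the mangled URL and downloads the real taxonomy file into the xbrl.Cache folder, assuming the package looks the file up in the cache by its base file name (which matches the manual fix above). The `bad_url` is the example from the error message; swap in whichever URL your run fails on.

```r
fix_cached_xsd <- function(bad_url, cache.dir = "xbrl.Cache") {
  # The real URL is everything from the second 'https://' onward
  good_url <- sub("^.*(https://xbrl\\.sec\\.gov/.*\\.xsd)$", "\\1", bad_url)
  dir.create(cache.dir, showWarnings = FALSE)
  download.file(good_url, file.path(cache.dir, basename(good_url)), quiet = TRUE)
  invisible(good_url)
}

bad_url <- paste0(
  "https://www.sec.gov/Archives/edgar/data/1090012/000156459020036568/",
  "https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd"
)
# fix_cached_xsd(bad_url)  # uncomment to actually download
```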
```r
xbrl.vars <- XBRL::xbrlDoAll(link, verbose = TRUE)
```
For this workflow it will happen 7 total times (annoyingly). To save some time, the files needed are:
Once you run it the final time, you will have a list of dataframes called xbrl.vars.
Interpreting the Data
Find the Table
Let’s inspect the Roles table first. For this first attempt, let’s try and replicate the Income Statement; the Balance Sheet is actually the easiest, but where’s the fun in that?
I’m going to filter by ‘Statement’ type and remove any roleId with ‘Parenthetical’ in the name, as I’ve found those are useless.
```r
roles <- xbrl.vars$role %>%
  filter(type == 'Statement') %>%
  filter(!grepl('Parenthetical', roleId))
```
Figure 2: Devon Financial Statements in XBRL
We are going to select the first row, as that is the Income Statement, or in this example the Statement of Comprehensive Earnings.
Build our Table
This is a big piece of code, so I will explain it below.
```r
# Roles table: select the Income Statement role
roles1 <- roles[1, ]

# Calculation table: find the summation elements for this role
sums <- xbrl.vars$calculation %>%
  filter(grepl('summ', arcrole)) %>%
  filter(roleId %in% roles1$roleId)
sums <- sums %>%
  filter(toElementId %in% sums$fromElementId)

# Presentation table: build out the statement
pres1 <- xbrl.vars$presentation %>%
  filter(roleId %in% roles1$roleId) %>%
  mutate(order1 = seq(1, n(), 1)) %>%
  mutate(elementId = toElementId) %>%
  left_join(xbrl.vars$fact) %>%
  left_join(xbrl.vars$context %>% filter(is.na(dimension1))) %>%
  filter(!is.na(endDate)) %>%
  left_join(xbrl.vars$label) %>%
  filter(labelRole == preferredLabel) %>%
  arrange(order, order1) %>%
  subset(select = -c(arcrole)) %>%
  left_join(sums %>% select(toElementId, arcrole)) %>%
  select(roleId, arcrole, labelString, startDate, endDate,
         fromElementId, elementId, unitId, fact, decimals, order, order1) %>%
  distinct() %>%
  group_by(roleId, arcrole, labelString, fromElementId, elementId,
           startDate, endDate, order, order1) %>%
  filter(as.numeric(decimals) == max(as.numeric(decimals))) %>%
  ungroup() %>%
  subset(select = -c(decimals)) %>%
  filter(!duplicated(paste0(startDate, endDate, fact, elementId))) %>%
  mutate(date1 = paste0(startDate, ' to ', endDate)) %>%
  subset(select = -c(startDate, endDate)) %>%
  mutate(fact = scales::dollar(as.numeric(fact) / 1000)) %>%
  spread(date1, fact) %>%
  arrange(order, order1) %>%
  select(labelString, arcrole, contains(' to ')) %>%
  mutate(labelString = replace(labelString, !is.na(arcrole),
                               paste0('<b>', labelString[!is.na(arcrole)], '</b>'))) %>%
  subset(select = -c(arcrole))

names(pres1)[1] <- 'In Thousands'
```
What did I do here? A lot. Let’s go through it.
- Select the table I want into a new dataframe called roles1
- The sums table finds the elements in the Calculation table that are summations, like Total Revenue or Net Income.
- Starting with my Presentation table, as it holds my order and the id columns I need, and filtering to the Income Statement.
- Create a sequence of numbers called order1 to hold my order in place. There already is an order column, but it is not always a reliable value for ordering.
- Create a new column called elementId which is equal to toElementId that lets me join my fact, context, and label tables to my frame.
- I join my Fact table and remove any fact lines, which are my values, that show NA.
- I join my Context table to my fact table, which automatically searches for common columns (contextId). As I mentioned, we want to also filter out anything that actually has any value in our dimension1 field. In another iteration of this post, I will work through those that do.
- Removing empty endDate rows.
- I also join my Role and Label tables. I filter so that preferredLabel == labelRole.
- Arrange data and then join the sums table I created earlier.
- The remainder is sorting and selecting the variables I want. One thing you do need to do is group by your remaining variables (everything except decimals) and keep the row with the highest decimals value. Sometimes you’ll get the same figure twice, rounded to two different precisions; this keeps the one closest to actual and removes the duplicate. There is a standard order column in the table, so I use that for sorting as well. Finally, I do some text manipulation to bold the summation rows and to combine the start and end dates into a single column.
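The decimals step is subtle enough to deserve a tiny self-contained illustration. The data below is toy data with hypothetical values: the same fact appears twice, once rounded to millions (decimals = -6) and once to thousands (decimals = -3), and we keep the more precise one:

```r
library(dplyr)

facts <- tibble::tibble(
  elementId = c("us-gaap_Revenues", "us-gaap_Revenues"),
  endDate   = c("2020-06-30", "2020-06-30"),
  fact      = c("394000000", "394123000"),
  decimals  = c("-6", "-3")   # rounded to millions vs. thousands
)

# Within each element/date group, keep the row with the highest
# decimals value, i.e. the least-rounded figure
best <- facts %>%
  group_by(elementId, endDate) %>%
  filter(as.numeric(decimals) == max(as.numeric(decimals))) %>%
  ungroup()

best$fact  # the value rounded to thousands survives
```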
So now the table basically exists, but we need to put it into presentation form. I didn’t mention this package earlier, but DT lets me build some nicely customizable tables.
```r
DT::datatable(pres1,
              rownames = FALSE,
              escape = FALSE,
              extensions = c('Scroller'),
              options = list(dom = 'Bfrtip', paging = FALSE, scrollX = TRUE,
                             scrollY = TRUE, info = FALSE, ordering = FALSE,
                             searching = FALSE),
              caption = htmltools::tags$caption(
                style = 'caption-side: bottom; text-align: center;',
                'Table: ', htmltools::em('DVN Q2 2020 Income Statement')),
              class = 'cell-border stripe')
```
And here it is:
Figure 3: Devon Q2 2020 Income Statement from XBRL
And there we have it. You can basically do this same workflow for any one of the tables within the Roles table. When you get sufficiently advanced, you can just loop through each company/year and build out entire historicals, though as I mentioned it can get complicated trying to match labels across different periods.
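If you want to go down the looping road, here is a hedged sketch. `pick_xbrl_link()` is a hypothetical helper that just reuses the link-selection logic from earlier; the loop itself hits the SEC repeatedly, so it is shown commented out (and you would still run into the cached-.xsd issue along the way):

```r
library(dplyr)

# Hypothetical helper: pick the XBRL instance document out of a
# filing_documents() result, assuming it has 'type' and 'href' columns
pick_xbrl_link <- function(docs) {
  docs$href[docs$type %in% c("XML", "EX-101.INS")][1]
}

# Toy stand-in for one filing_documents() result
docs <- tibble::tibble(
  type = c("10-Q", "EX-101.INS", "GRAPHIC"),
  href = c("main.htm", "dvn-10q.xml", "logo.jpg")
)
pick_xbrl_link(docs)

# The actual loop over every filing (network-heavy, so commented out):
# all.filings <- edgarWebR::company_filings("DVN", type = "10-")
# xbrl.list <- lapply(all.filings$href, function(h) {
#   link <- pick_xbrl_link(edgarWebR::filing_documents(h))
#   Sys.sleep(1)  # be polite to EDGAR
#   XBRL::xbrlDoAll(link, verbose = FALSE)
# })
```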
With time I will expand into other R libraries more dedicated to financial analysis, like tidyquant and quantmod, and also move further into XBRL to show what else can be done with it. In the end, you have all of the raw financial data for each period sitting within the xbrl.vars variable.
To follow along as we dig deeper into tips and tricks with R, follow my blog.