How to Interpret Frac Focus Data in the Bakken

If I can’t find completion data easily, the best spot to grab it is usually Frac Focus, or better known as “The national hydraulic fracturing chemical disclosure registry”. It has only been around since 2011, so it will not have ALL of the data, but it will be enough for our purposes.

In this post I will show how to gather data and interpret the data using R.

If you don’t care about coding, there are still some cool graphs and maps if you scroll down.

Follow My Blog

Data Gathering

First, there are a few packages we are going to use.

  1. dplyr (and plyr but don’t load that package)
  2. tidyr
  3. lubridate
  4. readr
  5. downloader
  6. highcharter – This requires a license to display (which I have).
  7. leaflet
  8. sf


Next we will download the data and unzip it. I also want to locate the csv files (as of now there are 20+) that contain the basic chemical data.

options(stringsAsFactors = FALSE)
options(scipen = 999)

#Download the Data
    dest='', mode='wb')

#Unzip data
#remove zip file

#Locate CSV Files
csv_files <- list.files(path=getwd(), pattern = '*.csv')
csv_files <- csv_files[grepl('FracFocusR', csv_files)]

The first two lines tell us that text needs to be read in as a character type instead of factor, which will really bog down our machine. The second moves away from scientific notation, which gets annoying because it reads in API number as a numeric point. I usually convert it to character afterwards.

Load Data to R

Now that I know where the files are, I can use plyr to download the data into a dataframe. This process might take a few minutes.

my_data <- plyr::ldply(.data=csv_files, .fun = readr::read_csv)

Now that we have our data, we can begin exploring it.

Data Cleanup


First, I want to examine the columns and see what they are/the data type.


What do we get? A bit of a jumpled output, but we see that we have 5 million rows and 39 columns. A good portion of these we can throw away, so I will just show the data I will keep, as they are usually a meaningless code, or just a column with no data.

Output Summary

ColumnData Example
JobStartDate  chr “5/1/1955 12
JobEndDate  chr “5/1/1955 12
APINumber  num 42317372620000 49009219470000 …
StateNumber  int 42 49 49 49 49 42 42 42 42 37 …
OperatorName  chr “Pioneer Natural Resources”
WellName  chr “Rogers 42 #5” “WILLIAM VALENTINE
Latitude  num 32.3 43 43 42.9 42.9 …
Longitude  num -102 -106 -106 -105 -105 …
Projection  chr “NAD27” “NAD27” “NAD27” “NAD27”
TotalBaseWaterVolume  num NA NA NA NA NA NA NA NA NA NA
TotalBaseNonWaterVolume num NA NA NA NA NA NA NA NA NA NA
StateName  chr “Texas” “WYOMING” “WYOMING”
CountyName  chr “Martin” “CONVERSE” “CONVERSE”
FederalWell  chr “False” “False” “False” “False” …
IndianWell  chr “False” “False” “False” “False” …
TradeName  chr “” “” NA NA …
Supplier  chr “” “” NA NA …
Purpose  chr “” “” “N/A” “N/A” …
IngredientName  chr “” “” “” “” …
PercentHighAdditive  num NA NA NA NA NA NA NA NA NA NA
Table 1: Data Table Variables – Frac Focus


To make the data more manageable, I will take a look at only North Dakota and Montana. There are two state column names, so you can either do some searching, or you can use the StateCode column, which is 33 for North Dakota and 25 for Montana. I will also be removing the APINumber column and replacing it with the character version in a new column called API. I’ll also delete the useless columns, as well as rename some states (as this is inconsistently tagged if you dig deeper).

my_data <- my_data %>%
  mutate(API = as.character(APINumber)) %>%
  subset(select = -c(APINumber)) %>% 
  filter(StateNumber == 33|StateNumber == 25)

my_data$StateName[my_data$StateNumber == 33] <- 'North Dakota'
my_data$StateName[my_data$StateNumber == 25] <- 'Montana'

my_data <- my_data %>% 
  subset(select = -c(UploadKey, StateNumber, CountyNumber,
  Source, DTMOD, PurposeKey, SystemApproach,
  IsWater, PurposePercentHFJob, PurposeIngredientMSDS,
  IngredientKey, CASNumber, IngredientComment,
  IngredientMSDS, MassIngredient, ClaimantCompany,

The other thing I want to do is remove any wells where the total fluid volumes are nonexistent.

my_data <- my_data %>% 
  filter(!|TotalBaseWaterVolume > 0)

Map Filter

I am also interested in the Bakken in particular, so let me map the data points to visually filter out locations. I will use leaflet. We also have to adjust the projections, as it varies, to all be the same coordinate system. I will be changing to WGS84 and adding in wells identified as being on Federal lands.

my_data$coords <- paste0(my_data$Longitude, ',', my_data$Latitude)

nad27 <- my_data %>% 
  select(API, Longitude, Latitude, Projection, coords) %>%
  distinct() %>% filter(Projection == 'NAD27')

nad83 <- my_data %>% 
  select(API, Longitude, Latitude, Projection, coords) %>%
  distinct() %>% filter(Projection == 'NAD83')

wgs84 <- my_data %>% 
  select(API, Longitude, Latitude, Projection, coords) %>%
  distinct() %>% filter(Projection == 'WGS84')

#Reproject NAD27
df = st_as_sf(nad27 %>% select(API, Longitude, Latitude, coords) %>% 
                distinct(), coords = c("Longitude", "Latitude"), 
              crs = 4267) %>% st_transform(4326)

#Reproject NAD83
df1 = st_as_sf(nad83 %>% select(API, Longitude, Latitude, coords) %>% 
                distinct(), coords = c("Longitude", "Latitude"), 
              crs = 4269) %>% st_transform(4326)
df2 = st_as_sf(wgs84 %>% select(API, Longitude, Latitude, coords) %>% 
                distinct(), coords = c("Longitude", "Latitude"), 
              crs = 4326)

df <- rbind(df, df1, df2)
rm(df1, df2)

#Create duplicate point
df$geometry1 <- df$geometry

#Trick R into thinking two points are a line.
#Makes it easier to plot in leaflet

df <- df %>% rowwise() %>% 
  mutate(line = c(geometry, geometry1) %>%
     st_combine() %>% 
     st_cast("LINESTRING")) %>% 

#Add Federal Acreage
df <- df %>%
  left_join(my_data %>% select(API, FederalWell) %>% distinct()) %>%
  mutate(FederalWell = replace(FederalWell, FederalWell == "True", "red"),
         FederalWell = replace(FederalWell, FederalWell == "False", "green"))

leaflet(options = 
          leafletOptions(zoomControl = FALSE)) %>%
  addTiles() %>%
  addPolylines(data = df$line, 
               fillColor = 'transparent', 
               color = df$FederalWell, 
               fillOpacity = 0.5, 
               popup = df$coords) %>%
  leaflet::addLegend(position = 'topright', colors = c('red', 'green'),
                     labels = c('Fed', 'Fee'), title = 'Wells')

Figure 1: Frac Focus North Dakota/Montana Locations

By clicking on individual points, I can see what the lat long coordinates are. To subset just Bakken area, I find I need to select Latitude > 46.6 and Longitude between -105.54 and -101.65.

my_data <- my_data %>% 
  filter(Latitude >= 45.6) %>%
  filter(Longitude <= -101.65) %>%
  filter(Longitude >= -105.54)

Quick Stats

First thing I’ll do is look at some quick stats. I’m going to use my favorite charting library, Highcharts, to plot it up.

Well Counts

stateCount <- my_data %>% 
  select(StateName, API) %>% 
  distinct() %>% 
  group_by(StateName) %>%

Figure 2: Frac Focus North Dakota/Montana Well Counts

Federal Well Counts

fedCount <- my_data %>% 
  select(FederalWell, API) %>% 
  distinct() %>% 
  group_by(FederalWell) %>%

Figure 3: Frac Focus North Dakota/Montana Federal Well Count

For those of us that are concerned with a Federal frac ban, we see that around 12.5% of the wells are concentrated on Federal acreage.

Tribal Well Counts

tribalCount <- my_data %>% 
  select(IndianWell, API) %>% 
  distinct() %>% 
  group_by(IndianWell) %>%

Figure 4: Frac Focus North Dakota/Montana Tribal Well Count

North Dakota is known for having tribal lands within the Bakken region, and the data confirms.


How does our operator list look? This should be our original operator, so if you want a quick way to figure that out, this is a good source.

operator <- my_data %>% 
  select(OperatorName, API) %>% 
  distinct() %>% 
  group_by(OperatorName) %>% 
  count() %>% arrange(desc(n))

Figure 5: Frac Focus North Dakota/Montana Operator Well Count

Seems to be farily respresentative coverage of the group.

Vertical Depth

While we don’t have reservoir names, we can figure out some of them by using text searching. Historically, many Three Forks well names have the string “TF” in them, so I will search for these locations. I can locate these wells, and then create a contour map of vertical depth.

tst1 <- my_data %>% filter(grepl('TF', WellName)) %>% 
   filter(! %>% 
   select(API,TVD, Latitude, Longitude) %>%

This results in roughly 1500 wells, which is sufficient enough for mapping purposes.

Figure 6: Three Forks TVD from Frac Focus


The main things I want to grab from this data is total frac fluid used, proppant, and then see if I can split the proppant into different subcomponents. This data is quite extensive and you can spend forever getting as granular as you like (20/40, Ottawa, etc.), but I will just be showing a few things you can do.


To get the water used, you just need the TotalWaterBaseVolume. However, in the broader scheme, I care about four columns, TotalWaterBaseVolume, Purpose, IngredientName, and PercentHFJob. With these I can calculate total proppant.

water <- my_data %>%
  filter(grepl('BASE', toupper(Purpose))|
        grepl('CARRIER', toupper(Purpose))|
        toupper(Purpose) == 'WATER') %>%
  filter(nchar(Purpose) <= 100) %>% 
  filter(!grepl('Oil', Purpose)) %>% 
  group_by(API) %>% 
  filter(PercentHFJob == max(PercentHFJob)) %>%
  ungroup() %>% filter(PercentHFJob >= 70)%>%
  filter(PercentHFJob < 100)

water2 <- my_data %>% filter(!API %in% water$API) %>%
     filter(grepl('WATER', toupper(IngredientName))) %>%
     group_by(API) %>% 
     filter(PercentHFJob == max(PercentHFJob)) %>%
     ungroup() %>% filter(PercentHFJob >= 70)%>%
  filter(PercentHFJob < 100)

water <- water %>% rbind(water2)

What you need to do is look for Base or Carrier fluid, as this usually corresponds to to the TotalWaterBaseVolume column. I then remove one of the results, and then filter so that I find the value with the highest possible PercentHFJob. I then want to eliminate low values too. This leaves me with a sample set of around 8722 wells, which is somewhere around 75% of the well sample.

To fill it out (since I am missing a good portion), I am going to just look for Water in the IngredientName and perform a similar filter, hoping that this is sufficient. My total well count is now up to 10,291, or ~88%.


From a simplicity perspective, we will just assume that the remaining slurry is majority sand + other random chemicals/additives. For the most part, it is usually just sand.

water <- water %>% 
  mutate(Sand = 100-PercentHFJob) %>%
  mutate(proppant = TotalBaseWaterVolume/(PercentHFJob/100)*8.33*(Sand/100))%>% 
  filter(proppant > 500000) %>%
  filter(proppant < 40000000)

This leaves me with approximately 10,046 wells with proppant and water data, or a pretty large portion of the wells in the state. That is pretty much the last step for basic proppant calculation. If I wanted to get more granular, I have to define text rules for individual proppant lines and it is very time consuming and arduous, so outside the scope of this insight.


We can do other things like text searching and aggregating to estimate various things. For example, we can scope out who has been supplying sand over time on the frac jobs.

Figure 7: Total Sand Deliveries (Estimate) from Frac Focus Data

Or which operators have been using most sand per well…..

Figure 8: Total Sand by Operator

Thanks for reading. Hope you enjoyed this summary and maybe learned a thing or two.

4 thoughts on “How to Interpret Frac Focus Data in the Bakken”

  1. I am having a tough time recreating your Three Forks TVD map. Is the process you used similar to how you mapped well density in the Permian frac ban post?

    1. Hey Brett let me find some time today to track it down. The processes aren’t similar actually; the one just builds a heat map by clustering points. This one is what you need to do to use values for each point instead.

      1. Brett, you’re actually correct. I showed how to get the data so you can take a look, but I purposefully left out how to do the contour map as it was pretty complicated this early on in the posts. I’ll promise to work up a whole new one on doing it this week.

      2. Sounds great!

        Thanks again for all the time you have put into these posts. I can’t say that I knew I wanted to learn R a few weeks ago, but now that I have been working through your examples, I have really enjoyed it. Looking forward to the next one!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: