If I can’t find completion data easily, the best spot to grab it is usually FracFocus, formally known as “the national hydraulic fracturing chemical disclosure registry”. It has only been around since 2011, so it will not have ALL of the data, but it will be enough for our purposes.
In this post I will show how to gather and interpret the data using R.
If you don’t care about coding, there are still some cool graphs and maps if you scroll down.
First, there are a few packages we are going to use.
- dplyr (and plyr, but don’t load that package — it masks dplyr functions)
- highcharter – this requires a license to display (which I have)
- downloader, readr, sf, and leaflet also show up along the way
Next we will download the data and unzip it. I also want to locate the csv files (as of now there are 20+) that contain the basic chemical data.
```r
options(stringsAsFactors = FALSE)
options(scipen = 999)

#Download the Data
downloader::download('https://fracfocusdata.org/digitaldownload/fracfocuscsv.zip',
                     dest = 'dataset.zip', mode = 'wb')

#Unzip data
unzip('dataset.zip')

#remove zip file
unlink('dataset.zip')

#Locate CSV Files
csv_files <- list.files(path = getwd(), pattern = '*.csv')
csv_files <- csv_files[grepl('FracFocusR', csv_files)]
```
The first option makes text read in as character type instead of factor, which would really bog down our machine. The second moves away from scientific notation, which gets annoying because the API number is read in as a numeric. I usually convert it to character afterwards.
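To see what `scipen` is protecting against, here is a toy example (the API value is hypothetical, just for illustration):

```r
api <- 42317372620000          # a 14-digit API number read in as numeric

options(scipen = 0)            # R's default
print(api)                     # prints in scientific notation

options(scipen = 999)          # heavily penalize scientific notation
print(api)                     # 42317372620000

# Regardless of print settings, I carry the ID as character afterwards:
api_chr <- format(api, scientific = FALSE)
```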
Load Data to R
Now that I know where the files are, I can use plyr to load them into a single dataframe. This process might take a few minutes.
```r
library(readr)

my_data <- plyr::ldply(.data = csv_files, .fun = readr::read_csv)
unlink(csv_files)
```
Now that we have our data, we can begin exploring it.
First, I want to examine the columns and see what they are/the data type.
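A quick way to do that with base R (assuming the dataframe from above is still named `my_data`):

```r
# Row/column counts plus a condensed type listing for every column
dim(my_data)                                 # ~5 million rows, 39 columns
str(my_data, list.len = ncol(my_data), vec.len = 2)
```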
What do we get? A bit of a jumbled output, but we see that we have 5 million rows and 39 columns. A good portion of these can be thrown away, since they are usually a meaningless code or a column with no data, so I will just show the columns I will keep.
```
JobStartDate             chr  "5/1/1955 12..."
JobEndDate               chr  "5/1/1955 12..."
APINumber                num  42317372620000 49009219470000 ...
StateNumber              int  42 49 49 49 49 42 42 42 42 37 ...
OperatorName             chr  "Pioneer Natural Resources" ...
WellName                 chr  "Rogers 42 #5" "WILLIAM VALENTINE" ...
Latitude                 num  32.3 43 43 42.9 42.9 ...
Longitude                num  -102 -106 -106 -105 -105 ...
Projection               chr  "NAD27" "NAD27" "NAD27" "NAD27" ...
TVD                      num  NA NA NA NA NA NA NA NA NA NA ...
TotalBaseWaterVolume     num  NA NA NA NA NA NA NA NA NA NA ...
TotalBaseNonWaterVolume  num  NA NA NA NA NA NA NA NA NA NA ...
StateName                chr  "Texas" "WYOMING" "WYOMING" ...
CountyName               chr  "Martin" "CONVERSE" "CONVERSE" ...
FederalWell              chr  "False" "False" "False" "False" ...
IndianWell               chr  "False" "False" "False" "False" ...
TradeName                chr  "" "" NA NA ...
Supplier                 chr  "" "" NA NA ...
Purpose                  chr  "" "" "N/A" "N/A" ...
IngredientName           chr  "" "" "" "" ...
PercentHighAdditive      num  NA NA NA NA NA NA NA NA NA NA ...
PercentHFJob             num  NA NA NA NA NA NA NA NA NA NA ...
```
To make the data more manageable, I will look at only North Dakota and Montana. There are two state name columns, so you can either do some searching, or you can use the StateNumber column, which is 33 for North Dakota and 25 for Montana. I will also remove the APINumber column and replace it with a character version in a new column called API. I’ll also delete the useless columns and standardize the state names (these are tagged inconsistently if you dig deeper).
```r
library(dplyr)   # powers the pipes from here on

my_data <- my_data %>%
  mutate(API = as.character(APINumber)) %>%
  subset(select = -c(APINumber)) %>%
  filter(StateNumber == 33 | StateNumber == 25)

my_data$StateName[my_data$StateNumber == 33] <- 'North Dakota'
my_data$StateName[my_data$StateNumber == 25] <- 'Montana'

my_data <- my_data %>%
  subset(select = -c(UploadKey, StateNumber, CountyNumber, Source, DTMOD,
                     PurposeKey, SystemApproach, IsWater, PurposePercentHFJob,
                     PurposeIngredientMSDS, IngredientKey, CASNumber,
                     IngredientComment, IngredientMSDS, MassIngredient,
                     ClaimantCompany, DisclosureKey))
```
The other thing I want to do is remove any wells where the total fluid volumes are nonexistent.
```r
#Keep only wells with a reported, nonzero water volume
my_data <- my_data %>%
  filter(!is.na(TotalBaseWaterVolume) & TotalBaseWaterVolume > 0)
```
I am particularly interested in the Bakken, so let me map the data points with leaflet to visually filter out locations. We also have to standardize the projections, since they vary, onto a single coordinate system. I will convert everything to WGS84 and flag wells identified as being on Federal lands.
```r
library(sf)
library(leaflet)

my_data$coords <- paste0(my_data$Longitude, ',', my_data$Latitude)

nad27 <- my_data %>%
  select(API, Longitude, Latitude, Projection, coords) %>%
  distinct() %>%
  filter(Projection == 'NAD27')

nad83 <- my_data %>%
  select(API, Longitude, Latitude, Projection, coords) %>%
  distinct() %>%
  filter(Projection == 'NAD83')

wgs84 <- my_data %>%
  select(API, Longitude, Latitude, Projection, coords) %>%
  distinct() %>%
  filter(Projection == 'WGS84')

#Reproject NAD27 (EPSG:4267) to WGS84 (EPSG:4326)
df <- st_as_sf(nad27 %>% select(API, Longitude, Latitude, coords) %>% distinct(),
               coords = c("Longitude", "Latitude"), crs = 4267) %>%
  st_transform(4326)
rm(nad27)

#Reproject NAD83 (EPSG:4269)
df1 <- st_as_sf(nad83 %>% select(API, Longitude, Latitude, coords) %>% distinct(),
                coords = c("Longitude", "Latitude"), crs = 4269) %>%
  st_transform(4326)
rm(nad83)

#WGS84 is already in the target CRS
df2 <- st_as_sf(wgs84 %>% select(API, Longitude, Latitude, coords) %>% distinct(),
                coords = c("Longitude", "Latitude"), crs = 4326)
rm(wgs84)

df <- rbind(df, df1, df2)
rm(df1, df2)

#Create duplicate point
df$geometry1 <- df$geometry

#Trick R into thinking two points are a line.
#Makes it easier to plot in leaflet
df <- df %>%
  rowwise() %>%
  mutate(line = c(geometry, geometry1) %>%
           st_combine() %>%
           st_cast("LINESTRING")) %>%
  ungroup()

#Add Federal Acreage
df <- df %>%
  left_join(my_data %>% select(API, FederalWell) %>% distinct()) %>%
  mutate(FederalWell = replace(FederalWell, FederalWell == "True", "red"),
         FederalWell = replace(FederalWell, FederalWell == "False", "green"))

leaflet(options = leafletOptions(zoomControl = FALSE)) %>%
  addTiles() %>%
  addPolylines(data = df$line, fillColor = 'transparent',
               color = df$FederalWell, fillOpacity = 0.5,
               popup = df$coords) %>%
  leaflet::addLegend(position = 'topright', colors = c('red', 'green'),
                     labels = c('Fed', 'Fee'), title = 'Wells')
```
Figure 1: Frac Focus North Dakota/Montana Locations
By clicking on individual points, I can see the lat/long coordinates. To subset just the Bakken area, I find I need to select Latitude >= 45.6 and Longitude between -105.54 and -101.65.
```r
my_data <- my_data %>%
  filter(Latitude >= 45.6) %>%
  filter(Longitude <= -101.65) %>%
  filter(Longitude >= -105.54)
```
The first thing I’ll do is look at some quick stats, plotted with my favorite charting library, Highcharts.
```r
stateCount <- my_data %>%
  select(StateName, API) %>%
  distinct() %>%
  group_by(StateName) %>%
  count()
```
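The chart behind Figure 2 is a simple highcharter column chart; a minimal version (the published figure adds theming on top of this) might look like:

```r
library(highcharter)

# One column per state, height = distinct well count
hchart(stateCount, "column", hcaes(x = StateName, y = n)) %>%
  hc_title(text = "Frac Focus ND/MT Well Counts") %>%
  hc_yAxis(title = list(text = "Wells"))
```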
Figure 2: Frac Focus North Dakota/Montana Well Counts
Federal Well Counts
```r
fedCount <- my_data %>%
  select(FederalWell, API) %>%
  distinct() %>%
  group_by(FederalWell) %>%
  count()
```
Figure 3: Frac Focus North Dakota/Montana Federal Well Count
For those of us who are concerned about a Federal frac ban, we see that around 12.5% of the wells sit on Federal acreage.
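That share falls straight out of `fedCount`; a quick sanity check:

```r
# Convert the raw counts into percentages of the total
fed_share <- fedCount %>%
  ungroup() %>%
  mutate(pct = round(n / sum(n) * 100, 1))
# The FederalWell == "True" row lands around 12.5%
```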
Tribal Well Counts
```r
tribalCount <- my_data %>%
  select(IndianWell, API) %>%
  distinct() %>%
  group_by(IndianWell) %>%
  count()
```
Figure 4: Frac Focus North Dakota/Montana Tribal Well Count
North Dakota is known for having tribal lands within the Bakken region, and the data confirm it.
How does our operator list look? These should be the original operators, so if you want a quick way to figure that out, this is a good source.
```r
operator <- my_data %>%
  select(OperatorName, API) %>%
  distinct() %>%
  group_by(OperatorName) %>%
  count() %>%
  arrange(desc(n))
```
Figure 5: Frac Focus North Dakota/Montana Operator Well Count
Seems to be fairly representative coverage of the group.
While we don’t have reservoir names, we can figure out some of them with text searching. Historically, many Three Forks well names contain the string “TF”, so I will search for those. Once I locate these wells, I can create a contour map of true vertical depth.
```r
tst1 <- my_data %>%
  filter(grepl('TF', WellName)) %>%
  filter(!is.na(TVD)) %>%
  select(API, TVD, Latitude, Longitude) %>%
  distinct()
```
This results in roughly 1,500 wells, which is sufficient for mapping purposes.
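I won’t walk through the full contour-map code, but here is a minimal sketch of the gridding step, using akima for interpolation and ggplot2 for contours (the package choices are mine; the original map may have been built differently):

```r
library(akima)     # bivariate interpolation onto a regular grid
library(ggplot2)

# Interpolate scattered TVD points onto a grid, averaging duplicates
grd <- with(tst1, interp(x = Longitude, y = Latitude, z = TVD,
                         duplicate = "mean"))

# interp() returns z as a matrix; flatten it alongside the grid axes
grd_df <- expand.grid(Longitude = grd$x, Latitude = grd$y)
grd_df$TVD <- as.vector(grd$z)

ggplot(grd_df, aes(Longitude, Latitude, z = TVD)) +
  geom_contour_filled(na.rm = TRUE) +
  labs(fill = "TVD (ft)")
```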
Figure 6: Three Forks TVD from Frac Focus
The main things I want to grab from this data are total frac fluid used and proppant, and then I want to see if I can split the proppant into subcomponents. This data is quite extensive and you can spend forever getting as granular as you like (20/40, Ottawa, etc.), but I will just show a few things you can do.
To get the water used, you just need TotalBaseWaterVolume. In the broader scheme, though, I care about four columns: TotalBaseWaterVolume, Purpose, IngredientName, and PercentHFJob. With these I can calculate total proppant.
```r
#Pass 1: find the base/carrier fluid row for each well
water <- my_data %>%
  filter(grepl('BASE', toupper(Purpose)) |
           grepl('CARRIER', toupper(Purpose)) |
           toupper(Purpose) == 'WATER') %>%
  filter(nchar(Purpose) <= 100) %>%
  filter(!grepl('Oil', Purpose)) %>%
  group_by(API) %>%
  filter(PercentHFJob == max(PercentHFJob)) %>%
  ungroup() %>%
  filter(PercentHFJob >= 70) %>%
  filter(PercentHFJob < 100)

#Pass 2: backfill wells that didn't match by searching IngredientName
water2 <- my_data %>%
  filter(!API %in% water$API) %>%
  filter(grepl('WATER', toupper(IngredientName))) %>%
  group_by(API) %>%
  filter(PercentHFJob == max(PercentHFJob)) %>%
  ungroup() %>%
  filter(PercentHFJob >= 70) %>%
  filter(PercentHFJob < 100)

water <- water %>% rbind(water2)
```
What you need to do is look for Base or Carrier fluid in Purpose, as this usually corresponds to the TotalBaseWaterVolume column. I then keep only the row with the highest PercentHFJob for each well, and eliminate implausibly low values (and exact 100s). This leaves me with a sample of around 8,722 wells, or roughly 75% of the well set.
To fill it out (since I am missing a good portion), I look for Water in the IngredientName and apply a similar filter, hoping that this is sufficient. My total well count is now up to 10,291, or ~88%.
For simplicity, we will assume the remainder of the slurry is mostly sand plus other random chemicals/additives. For the most part, it is usually just sand.
```r
water <- water %>%
  mutate(Sand = 100 - PercentHFJob) %>%
  #Water weighs ~8.33 lbs/gal, so total slurry mass is
  #water gallons / water fraction * 8.33; proppant is the
  #non-water fraction of that mass
  mutate(proppant = TotalBaseWaterVolume / (PercentHFJob / 100) * 8.33 * (Sand / 100)) %>%
  filter(proppant > 500000) %>%
  filter(proppant < 40000000)
```
This leaves me with approximately 10,046 wells with proppant and water data, a pretty large portion of the wells in the state. That is pretty much the last step of the basic proppant calculation. Getting more granular means defining text rules for individual proppant lines, which is time-consuming and arduous, so it is outside the scope of this post.
Text searching and aggregation can take us further still. For example, we can scope out who has been supplying sand over time on these frac jobs.
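I’ll skip the full Figure 7 code, but the rollup underneath looks roughly like this. It is only a sketch: it attributes a well’s whole proppant estimate to each sand supplier listed on that well, so a production version needs allocation rules and supplier-name cleanup first.

```r
# Sand suppliers per well, joined to the per-well proppant estimate,
# then summed by quarter of the job end date
sand_by_supplier <- my_data %>%
  filter(grepl('SAND', toupper(IngredientName))) %>%
  select(API, Supplier) %>%
  distinct() %>%
  inner_join(water %>% select(API, JobEndDate, proppant), by = 'API') %>%
  mutate(qtr = as.Date(cut(as.Date(JobEndDate, format = '%m/%d/%Y'),
                           'quarter'))) %>%
  group_by(qtr, Supplier) %>%
  summarise(sand_lbs = sum(proppant), .groups = 'drop')
```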
Figure 7: Total Sand Deliveries (Estimate) from Frac Focus Data
Or which operators have been using the most sand per well…
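Figure 8 is a similar rollup on the operator side; a sketch:

```r
# Average proppant per well by operator, using the estimate from above
sand_by_operator <- water %>%
  group_by(OperatorName) %>%
  summarise(wells = n_distinct(API),
            avg_sand_lbs = mean(proppant, na.rm = TRUE),
            .groups = 'drop') %>%
  arrange(desc(avg_sand_lbs))
```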
Figure 8: Total Sand by Operator
Thanks for reading. Hope you enjoyed this summary and maybe learned a thing or two.