It seems a member of our EFT community couldn’t just sit by and let me infect you guys with R fever when Python can do much of the same workflow, so I present to you the Python version of the first one, written by @fraclost!
You should see a screen that allows you to write code into each of the cells. Once you write one of the lines, just push run up top. Ctrl+Enter works as well within each cell.
Load some packages
Python is Open Source, which means there are A LOT of packages built by people to do really cool analysis. The main ones people use for analysis are numpy and pandas, as well as matplotlib for graphing (though plotly is pretty good too). I know there is a highcharts library as well sitting around somewhere; just haven’t used it yet.
```python
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
```
Press CTRL+Enter or Run at the top to run each line. From a REPRODUCIBLE CODE standpoint, Jupyter is a running notebook that allows you to save progress.
If you’ve followed me so far, you know I dislike relying on paid-for data when we have a wealth of data from individual states (where do you think IHS and Enverus get their data!). For this reason, I can share this stuff with you problem-free. For this example, I’ll share monthly production data for all of Antero’s wells in the Northeast that I pulled from the state sites.
```python
prod = pd.read_csv('https://github.com/xbrl-data/class/raw/master/prodAntero.csv')
```
This should run through Q1 2020. Investigating your data in Python is a little different than in R. To see the column types, call prod.dtypes.

dtypes tells you the type of each column of data. Basically, is it a number, an integer, a date, a string, or a few other random things? Pretty useful. For example, I want to convert Date to a datetime classification.
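Here is a minimal sketch of what dtypes reports before and after the conversion, using a tiny made-up frame (the column names and values are hypothetical stand-ins for the Antero data):

```python
import pandas as pd

# toy stand-in for the production data (hypothetical values)
toy = pd.DataFrame({'Date': ['2020-01-01', '2020-02-01'], 'gas': [100, 120]})

print(toy.dtypes)  # Date shows up as generic 'object' (i.e. strings)

toy['Date'] = pd.to_datetime(toy['Date'])
print(toy.dtypes)  # Date is now datetime64[ns], so date math works
```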
```python
prod['Date'] = pd.to_datetime(prod['Date'])
```
The other thing I notice when running dtypes is that there is a random variable called Unnamed: 0. I probably screwed up the save, but I’m guessing it’s the row number. We can get rid of it using the drop function from pandas.
```python
# Drop the stray index column
prod = prod.drop('Unnamed: 0', axis=1)
```
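As an aside, that Unnamed: 0 column is what you get when a CSV was saved with its row index included. You can avoid the drop entirely by telling read_csv to treat the first column as the index. A small sketch with an inline toy CSV (the data itself is made up):

```python
import io
import pandas as pd

# a toy CSV saved with its row index, which re-reads as "Unnamed: 0"
csv_text = ",API,gas\n0,A,100\n1,B,120\n"

# reading normally picks up the stray column...
with_stray = pd.read_csv(io.StringIO(csv_text))

# ...while index_col=0 treats it as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
```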
describe tells us various things about each column and is quite helpful for getting stats.
It actually gives you some good data (non-null counts, distributions, etc.). A lot of noise right now, but particularly valuable when looking for outliers.
head gives you the first rows of data so you can investigate. Also very helpful.
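For concreteness, here is what those two calls look like on a tiny made-up frame (hypothetical values, standing in for prod):

```python
import pandas as pd

# hypothetical stand-in for the production data
toy = pd.DataFrame({'API': ['A', 'A', 'B'], 'gas': [100, 120, 90]})

stats = toy['gas'].describe()  # count, mean, std, min, quartiles, max
first_rows = toy.head(2)       # just the first two rows
```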
Let’s do something cool!
All right, we’ve learned a little bit, but time to show some real value real quick. Similar to R, we will be grouping and adding new variables.
```python
# create a column "Months Online" and count per distinct API
prod['Months Online'] = prod.groupby('API').cumcount() + 1
# generate the first production date for each API
prod['First Prod'] = prod.groupby('API')['Date'].transform('min')
# convert the full date format to just a year
prod['First Prod'] = prod['First Prod'].apply(lambda x: x.year)
# generate a cumulative amount of oil through time
prod['Cumulative Oil'] = prod.groupby('API')['oil'].cumsum()
# generate a cumulative amount of gas through time
prod['Cumulative Gas'] = prod.groupby('API')['gas'].cumsum()
# generate a cumulative MCFE (oil * 6 + gas) through time
prod['Cumulative MCFE'] = (prod.groupby('API')['oil'].cumsum() * 6) + prod.groupby('API')['gas'].cumsum()
```
Essentially, we are just taking each well (API) and then adding a cumulative month, a first production year, and then cumulative volumes.
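The key trick is that cumcount and cumsum both restart for each group. A minimal sketch with two hypothetical wells makes the behavior obvious:

```python
import pandas as pd

# two hypothetical wells: A with three months of gas, B with two
toy = pd.DataFrame({
    'API': ['A', 'A', 'A', 'B', 'B'],
    'gas': [10, 20, 30, 5, 5],
})

# month counter restarts at 1 for each well
toy['Months Online'] = toy.groupby('API').cumcount() + 1
# running total restarts for each well
toy['Cumulative Gas'] = toy.groupby('API')['gas'].cumsum()

# Months Online: 1, 2, 3, 1, 2
# Cumulative Gas: 10, 30, 60, 5, 10
```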
Now we are going to use it here to visualize our data. I want to look at cumulative production over time, binned by first production year. I’m also going to filter out any wells that came online before 2010.
```python
# generate a summary data set for plotting: average cumulative MCFE per month,
# by year of first production (well vintage)
summaryDF = prod.groupby(['First Prod', 'Months Online'], as_index=False, sort=False)[['Cumulative MCFE']].mean()
# cut off older wells prior to 2010
summaryDF = summaryDF[summaryDF['First Prod'] > 2009]
# remove trailing months where well counts start to thin
summaryDF = summaryDF.groupby('First Prod', as_index=False).apply(
    lambda x: x[x['Months Online'] < x['Months Online'].quantile(.75)]
)
```
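That last line is the least obvious one: within each vintage it drops any month past the 75th percentile of Months Online, so the late, thinly populated tail doesn’t distort the averages. A toy sketch with one hypothetical vintage of eight months:

```python
import pandas as pd

# one hypothetical vintage with eight months online
toy = pd.DataFrame({'First Prod': [2015] * 8, 'Months Online': range(1, 9)})

# keep only rows below each vintage's 75th percentile of Months Online
trimmed = toy.groupby('First Prod', as_index=False).apply(
    lambda x: x[x['Months Online'] < x['Months Online'].quantile(.75)]
)

# the 75th percentile of 1..8 is 6.25, so months 7 and 8 get trimmed
```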
Now we can plot the results with matplotlib.
```python
fig, ax = plt.subplots()
summaryDF.groupby('First Prod').plot(x='Months Online', y='Cumulative MCFE', ax=ax, legend=True)
# fix issue with incorrect legend showing
ax.legend(summaryDF['First Prod'].unique())

# graph beautification: grid lines, labels, etc.
def millions(x, pos):
    return '%1.0fMM' % (x * 1e-6)

# call the formatting function above for our y axis
formatter = FuncFormatter(millions)
ax.yaxis.set_major_formatter(formatter)
plt.grid(True, which='major', color='#666666', linestyle='-')
ax.set_xlabel('Months Online')
ax.set_ylabel('Cumulative MCFE')
plt.xlim(0, 108)
plt.ylim(0, 6500000)
plt.show()
```
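If the FuncFormatter bit looks mysterious: matplotlib calls your function with each raw tick value (and its position) and uses whatever string you return as the label. You can sanity-check the helper on its own, outside of any plot:

```python
from matplotlib.ticker import FuncFormatter

def millions(x, pos):
    # format a raw axis value as whole millions, e.g. 3000000 -> '3MM'
    return '%1.0fMM' % (x * 1e-6)

formatter = FuncFormatter(millions)

print(millions(3_000_000, None))  # '3MM'
```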
And here is the result!
And that’s it. Consider me sufficiently chastised for not including the lovely Python version in the first place!