Oil & Gas Coding with Python (Part 1)

It seems a member of our EFT community couldn’t just sit by and let me infect you guys with R fever when Python can do much of the same workflow, so I present to you the Python version of the first one, written by @fraclost!


Getting Started

Install Anaconda, then open a Jupyter Notebook from within Anaconda. Honestly, I installed mine years ago, so I think it all comes packaged up and ready to go.

You should see a screen that lets you write code into individual cells. Once you've written a line, just push Run at the top; Ctrl+Enter works as well within each cell.

Figure 1: My Jupyter Notebook Screen

Load some packages

Python is open source, which means there are A LOT of packages built by people to do really cool analysis. The main ones people use for analysis are numpy and pandas, along with matplotlib for graphing (though plotly is pretty good too). I know there is a Highcharts library sitting around somewhere as well; I just haven't used it yet.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

Press Ctrl+Enter or Run at the top to run each cell. From a REPRODUCIBLE CODE standpoint, Jupyter is a running notebook that lets you save your progress.

Download Data

If you’ve followed me so far, you know I dislike relying on paid-for data when we have a wealth of data from individual states (where do you think IHS and Enverus get their data?). For this reason, I can share this stuff with you problem-free. For this example, I’ll share monthly production data for all of Antero’s wells in the Northeast that I pulled from their sites.

prod = pd.read_csv('https://github.com/xbrl-data/class/raw/master/prodAntero.csv')

Investigate Data

This should run through Q1 2020. Investigating your data in Python is a little bit different than in R. To see the column descriptions:

dtypes

prod.dtypes

dtypes will tell you the type of each column of data. Basically, is it a number, an integer, a date, an object (text), or a few other random things. Pretty useful. For example, I want to convert the Date column to a datetime type.

prod['Date'] = pd.to_datetime(prod['Date'])
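
If you want to confirm the conversion took, just re-check the column's type (a quick optional sanity check):

# confirm the Date column now carries a datetime64 dtype
prod['Date'].dtype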

The other thing I notice when running dtypes is that there is a random variable called Unnamed: 0. I probably screwed up the save, but I’m guessing it’s the row number. We can get rid of it using the drop function from pandas.

#Drop random column
prod = prod.drop('Unnamed: 0', axis = 1)
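
As an aside, if that stray column really is just the saved row index, you could also avoid it at read time instead; a minimal alternative sketch, assuming the first column of the CSV is the old index:

# alternative: read the first CSV column as the index so 'Unnamed: 0' never appears
prod = pd.read_csv('https://github.com/xbrl-data/class/raw/master/prodAntero.csv', index_col=0)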

describe

describe tells us various things about each column and is quite helpful for getting stats.

prod.describe()

It actually gives you some good data (non-null counts, distributions, etc.). A lot of noise right now, but it's particularly valuable when looking for outliers.
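
For example, once describe flags a suspicious max, you can pull the rows above a high percentile and eyeball them. A minimal sketch using the gas column; the 99th-percentile cutoff is just an arbitrary choice:

# inspect potential outliers: months where reported gas sits above the 99th percentile
gas_cutoff = prod['gas'].quantile(0.99)
prod[prod['gas'] > gas_cutoff].head()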

head

head gives you the first rows of data so you can investigate. Also very helpful.

prod.head()

Let’s do something cool!

All right, we’ve learned a little bit, but time to show some real value real quick. Similar to R, we will be grouping and adding new variables.

# create a column "Months Online" by counting rows per distinct API
prod['Months Online'] = prod.groupby('API').cumcount()+1

# generate the first production year for each API
prod['First Prod'] = prod.groupby('API')['Date'].transform('min')

# converting the full date format to just a year
prod['First Prod'] = prod['First Prod'].apply(lambda x: x.year)

# generate a cumulative amount of oil through time
prod['Cumulative Oil'] = prod.groupby('API')['oil'].cumsum()

# generate a cumulative amount of gas through time
prod['Cumulative Gas'] = prod.groupby('API')['gas'].cumsum()

# generate a cumulative MCFE through time
prod['Cumulative MCFE'] = (prod.groupby('API')['oil'].cumsum() * 6) + prod.groupby('API')['gas'].cumsum()

Essentially, we are just taking each well (API) and then adding a cumulative month, a first production year, and then cumulative volumes.
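
If you want to sanity-check those new columns, peek at a single well and watch the cumulatives build up month by month; a quick optional check, grabbing whichever API happens to come first in the file:

# pull one well to confirm Months Online counts up and the cumulative columns accumulate
first_api = prod['API'].iloc[0]
prod[prod['API'] == first_api][['Date', 'Months Online', 'Cumulative Gas', 'Cumulative MCFE']].head(12)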

Now we are going to use it here to visualize our data. I want to look at cumulative production over time, binned by first production year. I’m also going to filter out any wells that came online before 2010.

# generate a summary data set for plotting: average cumulative MCFE per month, by year of first production (well vintage)
summaryDF = prod.groupby(['First Prod', 'Months Online'], as_index=False, sort=False)[['Cumulative MCFE']].mean()

# cut off older wells that came online prior to 2010
summaryDF = summaryDF[summaryDF['First Prod'] > 2009]

# remove trailing months once well counts start to thin
summaryDF = summaryDF.groupby('First Prod', as_index=False).apply(lambda x: x[x['Months Online'] < x['Months Online'].quantile(.75)])

Now we can plot the results with matplotlib.

fig, ax = plt.subplots()
summaryDF.groupby('First Prod').plot(x='Months Online', y='Cumulative MCFE', ax=ax, legend=True )

# fix issue with incorrect legend showing
ax.legend(summaryDF['First Prod'].unique())

# graph beautification, grid lines, labels, etc...
def millions(x, pos):
    return '%1.0fMM' % (x*1e-6)

# Call the formatting function above for our Y axis
formatter = FuncFormatter(millions)
ax.yaxis.set_major_formatter(formatter)

plt.grid(True, which='major', color='#666666', linestyle='-')
ax.set_xlabel('Months Online')
ax.set_ylabel('Cumulative MCFE')
plt.xlim(0,108)
plt.ylim(0,6500000)
plt.show()

And here is the result!

Figure 2: Cumulative MCFE over time for Antero

Summary

And that’s it. Consider me sufficiently chastised for not including the lovely Python version in the first place!
