How to start using Pandas immediately for Earth Data Analysis (codes included)
This tutorial gives a brief description of scientific computing using Pandas by introducing Series, DataFrame, Pandas common operations, methods, conditional operations, groupby operations, and reading and writing data
Pandas (stands for Panel Data library) is an open-source library for Python that comes with many tools that make the task of data analysis several folds easier. It is built upon the Numpy Library in Python. For details on Numpy, please check my previous post:
In this tutorial, we will focus on two main data structures Pandas has - Series and DataFrames. We will see how we can perform groupby
operations and read/write tabular data directly using Pandas.
Pandas Series
Pandas series is similar to numpy array but it can have ”named index”. Naming indexes helps a lot in data analysis.
Similar posts
Let’s create a series:
Create Pandas Series: Method 1 (detailed way)
import numpy as np
import pandas as pd
np.random.seed(0)
labels = ['x', 'y', 'z' ]
arr = np.random.randn(3)
This gives a numpy array:
[1.76405235 0.40015721 0.97873798]
Now, let’s create a dictionary using the above labels and array:
mydict = {lab:val for lab, val in zip(labels, arr)}
print(mydict)
=>
{'x': 1.764052345967664, 'y': 0.4001572083672233, 'z': 0.9787379841057392}
We can simply use this dictionary to create a Pandas series
myseries = pd.Series(data = mydict)
print(myseries)
=>
x 1.764052
y 0.400157
z 0.978738
dtype: float64
Create Pandas Series: Method 2
We can also create Pandas series directly:
myseries2 = pd.Series(data = arr, index = labels)
print(myseries2)
=>
x 1.764052
y 0.400157
z 0.978738
dtype: float64
Here, our data was all float (also interpreted by Pandas). But we can mix different data types together and Pandas will still be fine.
Extract information from Pandas Series
We can use the index to extract the information from the series.
print(myseries2['x'])
=>
1.764052345967664
Operations on Pandas Series
Let us define two series and perform operations on them
labels1 = ['x', 'y', 'z' ]
arr1 = np.random.randn(3)
labels2 = ['xx', 'y', 'z' ]
arr2 = np.random.randn(3)
myseries1 = pd.Series(data = arr1, index = labels1)
myseries2 = pd.Series(data = arr2, index = labels2)
print("myseries1: \n",myseries1)
print("myseries2: \n",myseries2)
=>
myseries1:
x -0.187184
y 1.532779
z 1.469359
dtype: float64
myseries2:
xx 0.154947
y 0.378163
z -0.887786
dtype: float64
Now, let’s sum the two above series
myseries = myseries1 + myseries2
print(myseries)
=>
x NaN
xx NaN
y 1.910942
z 0.581573
dtype: float64
You can notice that the pandas is quite forgiving. Even though there was no data for xx
in myseries1
and x
in myseries2
, Pandas did not throw errors but fill the missing values with NaN for us.
Pandas DataFrames
When we combine multiple series with common index, then we get DataFrame.
Create DataFrame
from random matrix
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat)
print(mydf)
⇒
0 1 2
0 0.955057 0.190794 1.978757
1 2.605967 0.683509 0.302665
2 1.693723 -1.706086 -1.159119
3 -0.134841 0.390528 0.166905
4 0.184502 0.807706 0.072960
Now, let us try to name the index and columns
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf)
This will return
C1 C2 C3
M 0.386030 2.084019 -0.376519
N 0.230336 0.681209 1.035125
X -0.031160 1.939932 -1.005187
Y -0.741790 0.187125 -0.732845
Z -1.382920 1.482495 0.961458
Extract Series from Pandas DataFrame
print(mydf['C1'])
⇒
M 0.386030
N 0.230336
X -0.031160
Y -0.741790
Z -1.382920
Name: C1, dtype: float64
columnList = ['C1', 'C2']
print(mydf[columnList])
⇒
C1 C2
M 0.386030 2.084019
N 0.230336 0.681209
X -0.031160 1.939932
Y -0.741790 0.187125
Z -1.382920 1.482495
Add new column to the existing Pandas Dataframe
Now, let us add a new column to the existing dataframe.
mydf['C12'] = mydf['C1'] + mydf['C2']
print(mydf)
C1 C2 C3 C12
M 0.386030 2.084019 -0.376519 2.470049
N 0.230336 0.681209 1.035125 0.911546
X -0.031160 1.939932 -1.005187 1.908772
Y -0.741790 0.187125 -0.732845 -0.554665
Z -1.382920 1.482495 0.961458 0.099575
Remove the existing column or index from Pandas Dataframe
mydf.drop('C12', axis=1, inplace=True) #by default axis=0
print(mydf.head())
C1 C2 C3
M 0.386030 2.084019 -0.376519
N 0.230336 0.681209 1.035125
X -0.031160 1.939932 -1.005187
Y -0.741790 0.187125 -0.732845
Z -1.382920 1.482495 0.961458
Two things to notice here. Pandas by default take the index names to drop and it does not permanently remove the index or column unless specified. If we want to remove the index then we provide axis=0
with the index name.
mydf.drop('Z', inplace=True) #by default axis=0
print(mydf.head())
⇒
C1 C2 C3
M 0.386030 2.084019 -0.376519
N 0.230336 0.681209 1.035125
X -0.031160 1.939932 -1.005187
Y -0.741790 0.187125 -0.732845
loc
and iloc
method to extract values from the Pandas Dataframe
We can use loc
or iloc
method to extract the row from the DataFrame:
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
print(mydf.loc['M']) #uses the column name
print(mydf.iloc[0]) #uses index
C1 C2 C3
M -0.971393 -1.522333 1.133703
N 0.528187 0.393461 -0.630507
X -1.398290 -0.219311 -0.045676
Y 0.012421 0.093628 1.240813
Z -1.097693 -1.908009 -0.380104
C1 -0.971393
C2 -1.522333
C3 1.133703
Name: M, dtype: float64
C1 -0.971393
C2 -1.522333
C3 1.133703
Name: M, dtype: float64
We can also extract multiple rows by using a list of rows:
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
print(mydf.loc[['M', 'X']])
print(mydf.iloc[[0, 2]])
C1 C2 C3
M -1.666059 -2.736995 1.522562
N 0.178009 -0.626805 -0.391089
X 1.743477 1.130018 0.897796
Y 0.330866 -1.063049 -0.125381
Z -0.945588 2.029544 -1.046358
C1 C2 C3
M -1.666059 -2.736995 1.522562
X 1.743477 1.130018 0.897796
C1 C2 C3
M -1.666059 -2.736995 1.522562
X 1.743477 1.130018 0.897796
How can we extract the columns?
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
print(mydf.loc[['M', 'N'],'C1'])
print(mydf.iloc[[0, 1],0])
C1 C2 C3
M -1.549671 0.435253 1.259904
N -0.447898 0.266207 0.412580
X 0.988773 0.513833 -0.928205
Y 0.846904 -0.298436 0.029141
Z 0.889031 -1.839261 0.863596
M -1.549671
N -0.447898
Name: C1, dtype: float64
M -1.549671
N -0.447898
Name: C1, dtype: float64
Conditional Operations on Pandas DataFrames
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
C1 C2 C3
M 1.035125 -0.031160 1.939932
N -1.005187 -0.741790 0.187125
X -0.732845 -1.382920 1.482495
Y 0.961458 -2.141212 0.992573
Z 1.192241 -1.046780 1.292765
Now, we want to find all the values in the DataFrame that are positive and mask all others. This can be easily done using:
mydf = mydf[mydf>0]
print(mydf.head())
C1 C2 C3
M 1.035125 NaN 1.939932
N NaN NaN 0.187125
X NaN NaN 1.482495
Y 0.961458 NaN 0.992573
Z 1.192241 NaN 1.292765
We can do the same thing for just a particular column:
mydf = mydf[mydf['C2']>0]
print(mydf.head())
C1 C2 C3
M 0.649148 0.358941 -1.080471
N 0.902398 0.161781 0.833029
Y -0.708954 0.586847 -1.621348
Z 0.677535 0.026105 -1.678284
We can use the value_counts()
method (for series) for counting the number of C2>0. There are many other ways we can retrieve such information.
We can also use more than one condition. Please note that in this case I regenerated the mydf
using the numpy seed of 0.
np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
C1 C2 C3
M 1.764052 0.400157 0.978738
N 2.240893 1.867558 -0.977278
X 0.950088 -0.151357 -0.103219
Y 0.410599 0.144044 1.454274
Z 0.761038 0.121675 0.443863
mydf2 = mydf[(mydf['C1']>0) & (mydf['C2']>1)]
print(mydf2.head())
C1 C2 C3
N 2.240893 1.867558 -0.977278
Pandas conditioning operations uses “&” and “ | ”. |
Create new index for the dataframe
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
C1 C2 C3
M 1.764052 0.400157 0.978738
N 2.240893 1.867558 -0.977278
X 0.950088 -0.151357 -0.103219
Y 0.410599 0.144044 1.454274
Z 0.761038 0.121675 0.443863
Now, we create a new column with new index values
my_new_idx = "MM NN XX YY ZZ".split()
mydf['newidx'] = my_new_idx
print(mydf.head())
C1 C2 C3 newidx
M 1.764052 0.400157 0.978738 MM
N 2.240893 1.867558 -0.977278 NN
X 0.950088 -0.151357 -0.103219 XX
Y 0.410599 0.144044 1.454274 YY
Z 0.761038 0.121675 0.443863 ZZ
Let’s make the column “newidx” as the index (we will lose our original index):
mydf.set_index('newidx', inplace=True)
print(mydf.head())
newidx C1 C2 C3
MM 1.764052 0.400157 0.978738
NN 2.240893 1.867558 -0.977278
XX 0.950088 -0.151357 -0.103219
YY 0.410599 0.144044 1.454274
ZZ 0.761038 0.121675 0.443863
Pandas Common Methods and attributes
df.info
mydf.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, M to Z
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 C1 5 non-null float64
1 C2 5 non-null float64
2 C3 5 non-null float64
dtypes: float64(3)
memory usage: 160.0+ bytes
df.dtypes
mydf.dtypes
C1 float64
C2 float64
C3 float64
dtype: object
df.describe()
mydf.describe()
C1 C2 C3
count 5.000000 5.000000 5.000000
mean 1.225334 0.476415 0.359276
std 0.754437 0.801795 0.947389
min 0.410599 -0.151357 -0.977278
25% 0.761038 0.121675 -0.103219
50% 0.950088 0.144044 0.443863
75% 1.764052 0.400157 0.978738
max 2.240893 1.867558 1.454274
Apply (apply
) method
np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
def square(number):
return number**2
print(mydf['C1'].apply(square))
C1 C2 C3
M 1.764052 0.400157 0.978738
N 2.240893 1.867558 -0.977278
X 0.950088 -0.151357 -0.103219
Y 0.410599 0.144044 1.454274
Z 0.761038 0.121675 0.443863
M 3.111881
N 5.021602
X 0.902668
Y 0.168591
Z 0.579178
Name: C1, dtype: float64
Sort DataFrame by columns
np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())
print(mydf.sort_values(by='C2'))
C1 C2 C3
M 1.764052 0.400157 0.978738
N 2.240893 1.867558 -0.977278
X 0.950088 -0.151357 -0.103219
Y 0.410599 0.144044 1.454274
Z 0.761038 0.121675 0.443863
C1 C2 C3
X 0.950088 -0.151357 -0.103219
Z 0.761038 0.121675 0.443863
Y 0.410599 0.144044 1.454274
M 1.764052 0.400157 0.978738
N 2.240893 1.867558 -0.977278
Pandas Groupby Operations
Sometimes in Pandas dataframe we want to combine some columns based on some criteria. Groupby method can do this task elegantly by performing “split”, “apply”, and “combine” operations under the hood. It will be more clear with some examples.
np.random.seed(0)
mymat = np.random.randn(10,3)
mydf = pd.DataFrame(data=mymat, columns="C1 C2 C3".split())
mydf['names'] = "M N X Y Z".split() * 2
print(mydf)
C1 C2 C3 names
0 1.764052 0.400157 0.978738 M
1 2.240893 1.867558 -0.977278 N
2 0.950088 -0.151357 -0.103219 X
3 0.410599 0.144044 1.454274 Y
4 0.761038 0.121675 0.443863 Z
5 0.333674 1.494079 -0.205158 M
6 0.313068 -0.854096 -2.552990 N
7 0.653619 0.864436 -0.742165 X
8 2.269755 -1.454366 0.045759 Y
9 -0.187184 1.532779 1.469359 Z
mydf.groupby('names').mean()
names C1 C2 C3
M 1.048863 0.947118 0.386790
N 1.276980 0.506731 -1.765134
X 0.801854 0.356539 -0.422692
Y 1.340177 -0.655161 0.750016
Z 0.286927 0.827227 0.956611
We can replace the “mean” method with any other aggregate method such as sum, max, std, etc.
Another example with the “describe” method:
print(mydf.groupby('names').describe().transpose())
names M N X Y Z
C1 count 2.000000 2.000000 2.000000 2.000000 2.000000
mean 1.048863 1.276980 0.801854 1.340177 0.286927
std 1.011430 1.363178 0.209636 1.314622 0.670494
min 0.333674 0.313068 0.653619 0.410599 -0.187184
25% 0.691269 0.795024 0.727736 0.875388 0.049872
50% 1.048863 1.276980 0.801854 1.340177 0.286927
75% 1.406458 1.758937 0.875971 1.804966 0.523982
max 1.764052 2.240893 0.950088 2.269755 0.761038
C2 count 2.000000 2.000000 2.000000 2.000000 2.000000
mean 0.947118 0.506731 0.356539 -0.655161 0.827227
std 0.773520 1.924500 0.718274 1.130246 0.997801
min 0.400157 -0.854096 -0.151357 -1.454366 0.121675
25% 0.673638 -0.173682 0.102591 -1.054763 0.474451
50% 0.947118 0.506731 0.356539 -0.655161 0.827227
75% 1.220599 1.187145 0.610488 -0.255559 1.180003
max 1.494079 1.867558 0.864436 0.144044 1.532779
C3 count 2.000000 2.000000 2.000000 2.000000 2.000000
mean 0.386790 -1.765134 -0.422692 0.750016 0.956611
std 0.837141 1.114197 0.451803 0.995971 0.725135
min -0.205158 -2.552990 -0.742165 0.045759 0.443863
25% 0.090816 -2.159062 -0.582428 0.397887 0.700237
50% 0.386790 -1.765134 -0.422692 0.750016 0.956611
75% 0.682764 -1.371206 -0.262955 1.102145 1.212985
max 0.978738 -0.977278 -0.103219 1.454274 1.469359
Read and Write Data with Pandas
For details on how to read and write tabular data with the help of Pandas, please refer to the Pandas documentation here.
The most common ones you would like to familiarize yourself are - read_csv
, to_csv
,
Let us start by writing our dataframe into a file myexample.csv
np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
mydf.to_csv('myexample.csv', index=False)
cat myexample.csv
C1,C2,C3
1.764052345967664,0.4001572083672233,0.9787379841057392
2.240893199201458,1.8675579901499675,-0.977277879876411
0.9500884175255894,-0.1513572082976979,-0.10321885179355784
0.41059850193837233,0.144043571160878,1.454273506962975
0.7610377251469934,0.12167501649282841,0.44386323274542566
Now, lets read that file using the read_csv
mydf2 = pd.read_csv('myexample.csv')
print(mydf2.head())
C1 C2 C3
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
Disclaimer of liability
The information provided by the Earth Inversion is made available for educational purposes only.
Whilst we endeavor to keep the information up-to-date and correct. Earth Inversion makes no representations or warranties of any kind, express or implied about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services or related graphics content on the website for any purpose.
UNDER NO CIRCUMSTANCE SHALL WE HAVE ANY LIABILITY TO YOU FOR ANY LOSS OR DAMAGE OF ANY KIND INCURRED AS A RESULT OF THE USE OF THE SITE OR RELIANCE ON ANY INFORMATION PROVIDED ON THE SITE. ANY RELIANCE YOU PLACED ON SUCH MATERIAL IS THEREFORE STRICTLY AT YOUR OWN RISK.
Leave a comment