Home »
Python
Python Pandas Data Operations (With Examples)
Various Data Operations in Python Pandas: In this tutorial, we will learn about some of the very useful operations with explanations and examples.
By Sapna Deraje Radhakrishna, on February 03, 2020
Pandas support very useful operations which are illustrated below,
Consider the below dataFrame,
import numpy as np
import pandas as pd
df = pd.DataFrame({
'col1': [1, 2, 3, 4],
'col2': [444, 555, 666, 444],
'col3': ['abc', 'def', 'ghi', 'xyz']
})
print(df.head())
'''
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
'''
1. Finding unique values in a DataFrame
In order to find unique values from columns,
# returns numpy array of all unique values
print(df['col2'].unique() )
# Output: array([444, 555, 666])
# returns length / number of unique values
# in a numpy array
print(df['col2'].nunique())
# Output: 3
# if we want the table of the unique values
# and how many times they show up
print(df['col2'].value_counts() )
'''
Output:
444 2
555 1
666 1
Name: col2, dtype: int64
'''
2. Selecting data from a DataFrame
Consider the dataFrame,
Using the conditional selection, we could select data as follows,
print(df['col1']>2)
'''
Output:
0 False
1 False
2 True
3 True
Name: col1, dtype: bool
'''
print(df[(df['col1']>2)])
'''
Output:
col1 col2 col3
2 3 666 ghi
3 4 444 xyz
'''
print(df[df['col1']>2 & (df['col2']==44)])
'''
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
'''
3. Applied Methods
Consider a simple method,
def times2(x):
return x*2
We already are aware that we can grab a column and call a built-in function off of it. Such as below,
print(df['col1'].sum())
# Output: 10
Now, in order to apply the custom function, such as one defined above (times2), pandas provide an option to do that as well, as explained below,
print(df['col2'].apply(times2))
'''
Output:
0 888
1 1110
2 1332
3 888
Name: col2, dtype: int64
'''
Apply built-in functions,
print(df['col3'].apply(len))
'''
Output:
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
'''
Apply method will be more powerful, when combined with lambda expressions. For instance,
print(df['col2'].apply(lambda x: x*2))
'''
Output:
0 888
1 1110
2 1332
3 888
Name: col2, dtype: int64
'''
4. Getting column names, RangeIndex, and sort values
# returns the columns names
print(df.columns)
# Output: Index(['col1', 'col2', 'col3'], dtype='object')
#since this is a rangeindex, it actually reports
# start, stop and step values too
print(df.index)
# Output: RangeIndex(start=0, stop=4, step=1)
# sort by column
print(df.sort_values('col2'))
'''
Output:
col1 col2 col3
0 1 444 abc
3 4 444 xyz
1 2 555 def
2 3 666 ghi
'''
In the above result, note that the index values doesn't change, this is to ensure that the values is retained.
5. Checking null values in a DataFrame using isnul()
# isnull
print(df.isnull())
'''
Output
col1 col2 col3
0 False False False
1 False False False
2 False False False
3 False False False
'''
The isnull() will return a dataframe of booleans indicating whether or not the value was null or not. In the above, we get a boolean of all false because we have nulls in our dataframe.
6. Dropping NaN values in a DataFrame using dropna()
print(df.dropna())
'''
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
'''
7. Replacing NaN values with custom values in a DataFrame
df = pd.DataFrame({
'col1': [1, 2, 3, np.nan],
'col2': [np.nan, 555, 666, 444],
'col3': ['abc', 'def', 'ghi', 'xyz']
})
print(df)
'''
Output:
col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
'''
print(df.fillna('FILL'))
'''
Output:
col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz
'''
8. Using pivot table in a DataFrame
This methodology will be familiar for the Advanced Excel users. Consider a new dataFrame,
data = {
'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': ['one', 'one', 'two', 'two', 'one', 'one'],
'C': ['x', 'y', 'x', 'y', 'x', 'y'],
'D': [1, 3, 2, 5, 4, 1]
}
df = pd.DataFrame(data)
print(df)
'''
Output:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
'''
The pivot table, creates a multi index dataFrame. The pivot table takes three main arguments, the values, the index and the columns.
print(df.pivot_table(values='D',index=['A', 'B'],columns=['C']))
'''
Output:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN
'''