Home »
Python
Conditional selection in the DataFrame | Pandas DataFrame
Here, we are going to learn about the conditional selection in the Pandas DataFrame in Python, Selection Using multiple conditions, etc.
Submitted by Sapna Deraje Radhakrishna, on January 06, 2020
Conditional selection in the DataFrame
Consider the following example,
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(102)
df = pd.DataFrame(randn(5,4),['P','Q','R','S','T'],['A','B','C','D'])
print(df)
Output
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
R -0.130016 -2.238203 0.973165 -0.024185
S -0.484928 -1.109264 -0.558975 1.042387
T -1.712263 0.136120 -0.464444 0.050980
If we use < symbol on a DataFrame, like >0, the values in the dataFrame is compared against 0 and returned with True/False.
print(df > 0)
'''
Output:
A B C D
P True True True False
Q True True False False
R False False True False
S False False False True
T False True False True
'''
Now, assign the df>0 to a Boolean value called bool_df
bool_df = df > 0
print(bool_df)
'''
Output:
A B C D
P True True True False
Q True True False False
R False False True False
S False False False True
T False True False True
'''
Pass bool_df to df, in the below we can see that the values which were True have their original value and where it is False, we have a NAN. Using this approach, we can use the conditional selection in dataFrame.
print(df[bool_df])
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 NaN
Q 1.299748 0.331183 NaN NaN
R NaN NaN 0.973165 NaN
S NaN NaN NaN 1.042387
T NaN 0.136120 NaN 0.050980
'''
The above can be achieved in single line,
print(df[df>0])
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 NaN
Q 1.299748 0.331183 NaN NaN
R NaN NaN 0.973165 NaN
S NaN NaN NaN 1.042387
T NaN 0.136120 NaN 0.050980
'''
Instead of passing an entire dataFrame, pass only the row/column and instead of returning nulls what that's going to do is return only the rows/columns of a subset of the data frame where the conditions are True.
Take a look at the 'A' column, here the value against 'R', 'S', 'T' are less than 0 hence you get False for those rows,
print(df['A'])
'''
Output:
P 1.668068
Q 1.299748
R -0.130016
S -0.484928
T -1.712263
Name: A, dtype: float64
'''
print(df['A']>0)
'''
Output:
P True
Q True
R False
S False
T False
Name: A, dtype: bool
'''
Use this series of Boolean values corresponding to rows to filter out rows based off of a column values and that means if the series is passed into a dataFrame using bracket notation, the rows of A which have True value will only be returned (no null values returned).
print(df[df['A']>0])
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
'''
Consider some more examples of conditional selection, grab all rows from df where D<0
print(df)
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
R -0.130016 -2.238203 0.973165 -0.024185
S -0.484928 -1.109264 -0.558975 1.042387
T -1.712263 0.136120 -0.464444 0.050980
'''
Retrieving the subset dataFrame, step wise
print(df[df['D']<0]) #subset dataFrame
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
R -0.130016 -2.238203 0.973165 -0.024185
'''
result_df = df[df['D']<0]
print(result_df)
'''
Output
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
R -0.130016 -2.238203 0.973165 -0.024185
'''
print(result_df['B'])
'''
Output:
P 0.925862
Q 0.331183
R -2.238203
Name: B, dtype: float64
'''
Retrieving the subset dataFrame, in single step
print(df[df['D']<0]['B'])
'''
Output:
P 0.925862
Q 0.331183
R -2.238203
Name: B, dtype: float64
'''
Retrieving multiple columns from dataFrame
print(df[df['D']<0][['B','C']])
'''
Output:
B C
P 0.925862 1.057997
Q 0.331183 -0.509845
R -2.238203 0.973165
'''
Though it seems to be a little confusing to use one-liners, it is a preferred way, since using multiple steps the code takes more memory with each variable defined. However, until one is comfortable it is good to break it down to multiple steps.
Selection Using multiple conditions
The normal approach in python to implement multiple conditions is by using 'and' operator. However, if we use the 'and' operator in the pandas function we get an 'ValueError: The truth value of a Series is ambiguous.' Consider the below example
print(df[(df['D']<0) and (df['A']>0)])
'''
Output:
Traceback (most recent call last):
File "main.py", line 31, in <module>
print(df[(df['D']<0) and (df['A']>0)])
File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/pandas/core/generic.py", line 1555, in __nonzero__
self.__class__.__name__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
'''
The reason for the above error is, in python the 'and' operator can deal with the single instance of Boolean values and not multiple instances. The df['D']<0 results in multiple instances of Boolean value, as shown below,
print(df['D']<0)
'''
Output:
P True
Q True
R True
S False
T False
Name: D, dtype: bool
'''
In Pandas, in order to use and logical operation we have to use &
print(df[(df['D']<0) & (df['A']>0)])
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
'''
'or' operation, use '|'
print(df[(df['D']<0) | (df['A']>0)])
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
R -0.130016 -2.238203 0.973165 -0.024185
'''
Resetting the index
In order to reset the index, use the method reset_index() as explained in below example,
print(df.reset_index())
'''
Output:
index A B C D
0 P 1.668068 0.925862 1.057997 -0.920339
1 Q 1.299748 0.331183 -0.509845 -0.903099
2 R -0.130016 -2.238203 0.973165 -0.024185
3 S -0.484928 -1.109264 -0.558975 1.042387
4 T -1.712263 0.136120 -0.464444 0.050980
'''
In the above example, the index is reset to numerical values and the existing indexes are reset to a column 'index'.
The method reset_index() doesn't occur in place, unless we pass an argument (inplace=True), as explained in below example,
print(df)
'''
Output:
A B C D
P 1.668068 0.925862 1.057997 -0.920339
Q 1.299748 0.331183 -0.509845 -0.903099
R -0.130016 -2.238203 0.973165 -0.024185
S -0.484928 -1.109264 -0.558975 1.042387
T -1.712263 0.136120 -0.464444 0.050980
'''
print(df.reset_index(inplace=True))
'''
Output:
index A B C D
0 P 1.668068 0.925862 1.057997 -0.920339
1 Q 1.299748 0.331183 -0.509845 -0.903099
2 R -0.130016 -2.238203 0.973165 -0.024185
3 S -0.484928 -1.109264 -0.558975 1.042387
4 T -1.712263 0.136120 -0.464444 0.050980
'''
Setting the index
Create a new column as mentioned below,
new_index = 'KA KL AP TS MH'.split() #creates a list
print(new_index)
'''
Output:
['KA', 'KL', 'AP', 'TS', 'MH']
'''
df['states'] = new_index
print(df)
'''
Output:
index A B C D states
0 P 1.668068 0.925862 1.057997 -0.920339 KA
1 Q 1.299748 0.331183 -0.509845 -0.903099 KL
2 R -0.130016 -2.238203 0.973165 -0.024185 AP
3 S -0.484928 -1.109264 -0.558975 1.042387 TS
4 T -1.712263 0.136120 -0.464444 0.050980 MH
'''
Set the index, sets in-place (cannot be reverted)
print(df.set_index('states'))
'''
Output:
index A B C D
states
KA P 1.668068 0.925862 1.057997 -0.920339
KL Q 1.299748 0.331183 -0.509845 -0.903099
AP R -0.130016 -2.238203 0.973165 -0.024185
TS S -0.484928 -1.109264 -0.558975 1.042387
MH T -1.712263 0.136120 -0.464444 0.050980
'''