How to remove outliers in Python?
By Shivang Yadav Last updated : November 21, 2023
Outliers in Python
Outliers are observations that lie far away from the rest of the values in a dataset. They are often caused by errors in the program or in the data source. They need to be handled before any analysis is performed, because they can distort statistics such as the mean and the standard deviation and reduce the accuracy of the results.
Methods to identify outliers
Outliers must be identified before they can be processed. Identification means finding the values that fall outside the expected range; eliminating those values then makes the analysis of the remaining data more accurate.
Interquartile Range method
The IQR (interquartile range) method is an outlier identification method. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. A value is treated as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
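As a minimal sketch of this rule on a single column of numbers, the snippet below computes the two limits and lists the values outside them. The sample values and the names lower_fence and upper_fence are made up for illustration.

```python
import numpy as np

# Hypothetical sample: nine ordinary readings and one suspicious value.
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# Quartiles and interquartile range (NumPy's default linear interpolation).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits beyond which a value is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Limits: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```

Here only the value 102 falls outside the limits.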
Z-score method
The Z-score is another statistical measure that can be used to identify outliers. The Z-score of a value is the number of standard deviations by which it differs from the mean of the dataset. A value whose Z-score is far from 0 is treated as an outlier.
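The following sketch computes Z-scores by hand and compares them with scipy.stats.zscore. It assumes the population standard deviation (ddof=0), which is the SciPy default, and a cut-off of 3; the sample values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: twenty ordinary readings and one suspicious value.
values = np.array(
    [10, 12, 12, 13, 12, 11, 14, 13, 15, 11,
     12, 13, 12, 14, 11, 13, 12, 10, 14, 13, 102]
)

# Z-score = (value - mean) / standard deviation.
# ndarray.std() uses ddof=0 by default, the same as stats.zscore.
z_manual = (values - values.mean()) / values.std()
z_scipy = stats.zscore(values)

# Values whose absolute Z-score exceeds the assumed cut-off of 3.
flagged = values[np.abs(z_scipy) > 3]
print(f"Manual and SciPy results agree: {np.allclose(z_manual, z_scipy)}")
print(f"Flagged values: {flagged}")
```

Only the value 102 is flagged in this sample.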
How to Remove Outliers in Python?
Once identified, outliers need to be removed so that the remaining data is cleaner and the results are more reliable.
Z-score Method
The Z-score of each value in the dataset can be used as the criterion for removing outliers. With this method, values whose Z-scores lie within the -3 to +3 range are kept and all other values are removed.
Outliers: observations with a Z-score outside the -3 to +3 range.
With a cut-off of 3, the Z-score method is conservative, which means only extreme outliers are removed. On a small dataset it may remove nothing at all, because the outliers themselves inflate the mean and the standard deviation that the scores are based on.
Program to illustrate the removal of outliers in Python using Z-score
import numpy as np
import pandas as pd
import scipy.stats as stats
array = np.array(
    [
        [0.315865, 0.152790, -0.454003],
        [-0.083838, 0.213360, -0.200856],
        [0.655116, 0.085485, 0.042914],
        [14845370, -10798049, -19777283],
        [0.243121, 0.32123, -0.454993],
        [-0.088338, 0.213364, -0.211856],
        [0.165511, 0.085485, 0.042914],
        [14845370, -10798055, -19777183],
    ]
)
index_values = [1, 2, 3, 4, 5, 6, 7, 8]
column_values = ["A", "B", "C"]
dataValues = pd.DataFrame(data=array, index=index_values, columns=column_values)
print(f"The dataset is \n{dataValues}")
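# Absolute Z-score of every value, computed column by column
zScore = np.abs(stats.zscore(dataValues))
# Keep only the rows in which every column has a Z-score below 3
data_clean = dataValues[(zScore < 3).all(axis=1)]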
print(f"Value count in dataSet after removing outliers is \n{data_clean.shape}")
The output of the above program is:
The dataset is
               A             B             C
1   3.158650e-01  1.527900e-01 -4.540030e-01
2  -8.383800e-02  2.133600e-01 -2.008560e-01
3   6.551160e-01  8.548500e-02  4.291400e-02
4   1.484537e+07 -1.079805e+07 -1.977728e+07
5   2.431210e-01  3.212300e-01 -4.549930e-01
6  -8.833800e-02  2.133640e-01 -2.118560e-01
7   1.655110e-01  8.548500e-02  4.291400e-02
8   1.484537e+07 -1.079806e+07 -1.977718e+07
Value count in dataSet after removing outliers is
(8, 3)
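The shape (8, 3) shows that no row was removed, even though rows 4 and 8 are obvious outliers. A minimal sketch of why, assuming the population standard deviation (ddof=0) that scipy.stats.zscore uses by default: with n values, no absolute Z-score can exceed sqrt(n - 1), which is about 2.65 for n = 8, so the cut-off of 3 can never be reached in this dataset. The names column_a, max_z and bound are illustrative and not part of the program above.

```python
import numpy as np
import scipy.stats as stats

# Column A of the dataset used in the program above.
column_a = np.array(
    [0.315865, -0.083838, 0.655116, 14845370,
     0.243121, -0.088338, 0.165511, 14845370]
)

# Largest absolute Z-score actually present in the column.
max_z = np.abs(stats.zscore(column_a)).max()

# Upper bound on any absolute Z-score for n values when ddof=0.
bound = np.sqrt(len(column_a) - 1)

print(f"Largest |Z| in column A: {max_z:.3f}")
print(f"Largest possible |Z| for n = {len(column_a)}: {bound:.3f}")
```

The two huge values pull the mean and the standard deviation towards themselves, so their Z-scores stay near 1.73. Under the same assumption, a cut-off of 3 can only be exceeded when a column holds at least 11 values.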
Interquartile Range Method
The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. A value is treated as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
Program to illustrate the removal of outliers in Python using the Interquartile Range method
import numpy as np
import pandas as pd
import scipy.stats as stats
array = np.array(
    [
        [0.315865, 0.152790, -0.454003],
        [-0.083838, 0.213360, -0.200856],
        [0.655116, 0.085485, 0.042914],
        [14845370, -10798049, -19777283],
        [0.243121, 0.32123, -0.454993],
        [-0.088338, 0.213364, -0.211856],
        [0.165511, 0.085485, 0.042914],
        [14845370, -10798055, -19777183],
    ]
)
index_values = [1, 2, 3, 4, 5, 6, 7, 8]
column_values = ["A", "B", "C"]
dataValues = pd.DataFrame(data=array, index=index_values, columns=column_values)
print(f"The dataset is \n{dataValues}")
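# First and third quartiles of each column
Q1 = dataValues.quantile(q=0.25)
Q3 = dataValues.quantile(q=0.75)
# Interquartile range (Q3 - Q1) of each column
IQR = dataValues.apply(stats.iqr)
# Drop every row with at least one value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
data_clean = dataValues[
    ~((dataValues < (Q1 - 1.5 * IQR)) | (dataValues > (Q3 + 1.5 * IQR))).any(axis=1)
]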
print(f"Value count in dataSet after removing outliers is \n{data_clean.shape}")
The output of the above program is:
The dataset is
               A             B             C
1   3.158650e-01  1.527900e-01 -4.540030e-01
2  -8.383800e-02  2.133600e-01 -2.008560e-01
3   6.551160e-01  8.548500e-02  4.291400e-02
4   1.484537e+07 -1.079805e+07 -1.977728e+07
5   2.431210e-01  3.212300e-01 -4.549930e-01
6  -8.833800e-02  2.133640e-01 -2.118560e-01
7   1.655110e-01  8.548500e-02  4.291400e-02
8   1.484537e+07 -1.079806e+07 -1.977718e+07
Value count in dataSet after removing outliers is
(6, 3)
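The same filtering can be wrapped in a small reusable function. The sketch below is one possible way to do it, not part of the original program: the function name remove_outliers_iqr and its parameter k are hypothetical, and the IQR is computed directly as Q3 - Q1 instead of with stats.iqr.

```python
import pandas as pd


def remove_outliers_iqr(frame, k=1.5):
    """Return the rows of frame whose values all lie within the IQR limits."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row is kept only if every column is inside its own limits.
    within = ((frame >= lower) & (frame <= upper)).all(axis=1)
    return frame[within]


# Same values as the dataset in the program above.
dataValues = pd.DataFrame(
    [
        [0.315865, 0.152790, -0.454003],
        [-0.083838, 0.213360, -0.200856],
        [0.655116, 0.085485, 0.042914],
        [14845370, -10798049, -19777283],
        [0.243121, 0.32123, -0.454993],
        [-0.088338, 0.213364, -0.211856],
        [0.165511, 0.085485, 0.042914],
        [14845370, -10798055, -19777183],
    ],
    index=[1, 2, 3, 4, 5, 6, 7, 8],
    columns=["A", "B", "C"],
)

data_clean = remove_outliers_iqr(dataValues)
print(f"Rows kept: {list(data_clean.index)}")
print(f"Shape after removing outliers: {data_clean.shape}")
```

Rows 4 and 8 are dropped, which matches the (6, 3) shape reported above. A larger k, such as 3, keeps more rows and removes only the most extreme values.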