How to split data into 3 sets (train, validation and test)?
NumPy | Split data into 3 sets (train, validation, and test): In this tutorial, we will learn how to split a given dataset into three sets (training, validation, and testing) using NumPy and pandas in Python.
By Pranit Sharma Last updated : June 04, 2023
Problem Statement
Given a dataset stored in a pandas DataFrame, we have to split it into 3 sets (training, validation, and testing).
Solution approach
When creating a machine learning model or designing a machine learning algorithm, we usually split the data into three sets: the training set, the validation set, and the testing set.
The proportions of the sets are chosen by the user. A common choice is 60% of the data for the training set and 20% each for the validation and testing sets.
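The percentages above translate into two row positions at which the data is cut. The following is a minimal sketch of that arithmetic; the helper name `cut_points` is hypothetical and used only for illustration, and the cumulative fractions 0.6 and 0.8 correspond to a 60/20/20 split.

```python
def cut_points(n_rows, cumulative_fracs=(0.6, 0.8)):
    """Return the row positions at which to cut n_rows into three parts.

    cumulative_fracs holds the running totals of the fractions:
    0.6 ends the first part (60%), 0.8 ends the second part (60% + 20%).
    """
    return [int(frac * n_rows) for frac in cumulative_fracs]


def part_sizes(n_rows, cumulative_fracs=(0.6, 0.8)):
    """Return the number of rows that land in each of the three parts."""
    first_end, second_end = cut_points(n_rows, cumulative_fracs)
    return first_end, second_end - first_end, n_rows - second_end


print(cut_points(10))   # cut positions for 10 rows
print(part_sizes(10))   # sizes of the three parts for 10 rows
print(part_sizes(7))    # sizes are only approximate when n_rows is small
```

Because `int()` truncates, the parts only approximate the requested percentages when the number of rows is not a multiple of 5.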
How to split data into 3 sets (train, validation and test)?
To split the data into three sets, create a DataFrame holding all the data, shuffle its rows, and then call the numpy.split() function with the row positions at which the data should be cut. For a 60% / 20% / 20% split, these positions are 60% and 80% of the number of rows.
Let us understand with the help of an example.
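Before applying this to a DataFrame, it may help to see how numpy.split() treats a list of cut positions on a plain array. This is a small sketch on made-up data (the integers 0 to 9), not part of the original program.

```python
import numpy as np

# Ten made-up values standing in for ten rows of data
data = np.arange(10)

# Cutting at positions 6 and 8 yields data[:6], data[6:8] and data[8:]
first, second, third = np.split(data, [6, 8])

print(first)   # [0 1 2 3 4 5]
print(second)  # [6 7]
print(third)   # [8 9]
```

The second argument lists positions, not sizes, which is why a 60/20/20 split of 10 rows uses 6 and 8 rather than 6, 2 and 2.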
Python program to split data into 3 sets (train, validation, and test)
# Import numpy
import numpy as np
# Import pandas
import pandas as pd
# Creating a dataframe
df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))
# Set the maximum rows and columns
# to display/print all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Display original dataframe
print("Original DataFrame:\n", df, "\n")
# Splitting the data into 3 parts:
# shuffle the rows, then cut at 60% and 80% of the length
# to get 60% / 20% / 20% parts
train, test, validate = np.split(
    df.sample(frac=1, random_state=42),
    [int(0.6 * len(df)), int(0.8 * len(df))],
)
# Display different sets
print("Training set:\n", train, "\n")
print("Testing set:\n", test, "\n")
print("Validation set:\n", validate)
Output
Original DataFrame:
A B C D E
0 0.062043 0.305778 0.040534 0.344276 0.060514
1 0.705843 0.609687 0.070329 0.021927 0.714339
2 0.703366 0.613181 0.384509 0.005025 0.030347
3 0.627445 0.716861 0.802043 0.330570 0.479814
4 0.415682 0.620594 0.704717 0.606593 0.071703
5 0.508037 0.361807 0.904131 0.643761 0.824738
6 0.628795 0.163949 0.072226 0.984469 0.174503
7 0.338267 0.510505 0.608846 0.166929 0.657149
8 0.346381 0.082333 0.947476 0.812816 0.962484
9 0.979881 0.538592 0.433578 0.886863 0.468531
Training set:
A B C D E
8 0.346381 0.082333 0.947476 0.812816 0.962484
1 0.705843 0.609687 0.070329 0.021927 0.714339
5 0.508037 0.361807 0.904131 0.643761 0.824738
0 0.062043 0.305778 0.040534 0.344276 0.060514
7 0.338267 0.510505 0.608846 0.166929 0.657149
2 0.703366 0.613181 0.384509 0.005025 0.030347
Testing set:
A B C D E
9 0.979881 0.538592 0.433578 0.886863 0.468531
4 0.415682 0.620594 0.704717 0.606593 0.071703
Validation set:
A B C D E
3 0.627445 0.716861 0.802043 0.330570 0.479814
6 0.628795 0.163949 0.072226 0.984469 0.174503
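The same split can also be written with positional slicing instead of numpy.split(). This may be preferable on recent pandas versions, where calling numpy.split() on a DataFrame can emit a deprecation warning. The sketch below is one possible alternative, not the method used in the program above; the function name `split_three_ways` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def split_three_ways(df, train_frac=0.6, validate_frac=0.2, seed=42):
    """Shuffle df and slice it into train, validation and test parts.

    Hypothetical helper: whatever remains after the train and
    validation parts becomes the test part.
    """
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    n = len(shuffled)
    train_end = int(train_frac * n)
    validate_end = train_end + int(validate_frac * n)
    train = shuffled.iloc[:train_end]
    validate = shuffled.iloc[train_end:validate_end]
    test = shuffled.iloc[validate_end:]
    return train, validate, test


# Same shape of random data as in the program above
df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))
train, validate, test = split_three_ways(df)

print(len(train), len(validate), len(test))
```

The three parts share no rows, and together they contain every row of the original DataFrame exactly once.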