Training and Test Sets: Splitting Data using Java in Machine Learning
In this tutorial, we will learn how to perform cross-validation on a given dataset and then split the data into training and testing sets.
By Raunak Goswami Last updated : April 17, 2023
Prerequisite
Training Set
The purpose of the training set is, as the name suggests, to train our model. By feeding in the attributes and their corresponding target values, the model can identify a pattern in the training data, which it then uses to predict the target values of the test set.
Test Set
This set is used to check the accuracy of our model; as the name suggests, we use this dataset to test our results. It usually contains the independent attributes from which our model predicts the dependent (target) value. We then compare the predicted target values with the actual target values stored in the test set to compute evaluation measures such as RMSE, percentage accuracy, percentage error, and area under the curve. These measures tell us how well the model predicts the dependent values and, in turn, how useful the model is.
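As a quick illustration of one of these measures, here is a minimal sketch of computing RMSE from arrays of actual and predicted target values. The class name and the array values are made-up examples chosen only to show the formula, not results from the dataset below.

import java.util.Arrays;

public class RmseExample {
    // RMSE = sqrt(mean((actual - predicted)^2))
    public static double rmse(double[] actual, double[] predicted) {
        double sumSq = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq / actual.length);
    }

    public static void main(String[] args) {
        // Hypothetical target values and model predictions
        double[] actual    = {1530, 1297, 1335};
        double[] predicted = {1500, 1310, 1360};
        System.out.println("Actual:    " + Arrays.toString(actual));
        System.out.println("Predicted: " + Arrays.toString(predicted));
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}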
For detailed information about training and test sets, you can refer to my article about data splitting.
Another important technique that we are going to talk about is cross-validation, which we use to make the evaluation of our model more reliable. Suppose we have 100 rows of data and we take the first 20 as the test set and the rest as the training set. Since we need more data for training than for testing, this splitting ratio is fine, but it raises an uncertainty: what if the first 20 rows have completely different values from the rest of the data? One way to address this is to use a random function that randomly assigns rows to the training and test sets. This reduces the chance of a biased split, but it does not fully solve the problem: the randomized test set might still contain values that are not at all representative of the training set, or values that are exactly the same as those in the training set, which would hide overfitting of our model. You can refer to this article if you want to know more about overfitting and underfitting of the data.
Well, then how do we solve this issue? One way is to split the data K times into different training and testing sets, evaluate on each split, and average the results to get the most reliable estimate possible. But everything comes with a cost: since we repeatedly split the data into training and testing sets, cross-validation consumes some time. It is worth the wait, though, if we get a more accurate result. A minimal sketch of the seeded random split is shown below.
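To make the random splitting idea concrete, here is a minimal sketch in plain Java (independent of Weka) that shuffles row indices with a seeded random generator and then takes the first fifth as the test set. The rowCount value and the 80/20 split are assumptions for illustration only.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomSplitSketch {
    public static void main(String[] args) {
        int rowCount = 100;          // hypothetical number of rows
        int testSize = rowCount / 5; // 20% test, 80% train (approx. 1:4)

        // Collect the row indices 0..rowCount-1
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) indices.add(i);

        // Shuffle with a fixed seed so the split is reproducible
        Collections.shuffle(indices, new Random(1));

        // First testSize indices form the test set, the rest the training set
        List<Integer> testIdx  = indices.subList(0, testSize);
        List<Integer> trainIdx = indices.subList(testSize, rowCount);

        System.out.println("Test rows:  " + testIdx);
        System.out.println("Train rows: " + trainIdx.size() + " rows");
    }
}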
Image source: https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg
While writing the code, I will use a variable named folds (the K shown in the above figure), which signifies the number of folds, i.e., the number of times the cross-validation split is performed. For example, with K = 5 and 100 rows, each fold of 20 rows serves once as the test set while the remaining 80 rows train the model.
Java code for generating training and test sets in the ratio of 1:4 (approx.)
Below is the Java code for generating testing and training sets in the approximate ratio of 1:4, which is an optimal ratio for splitting the dataset. Splitting with 5 folds gives each test fold roughly one-fifth of the instances, hence the approximate 1:4 test-to-train ratio.
The dataset I have used can be copied from here (file name: "headbraina.arff"):
@relation headbrain-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1
@attribute 'Head Size(cm^3)' numeric
@attribute 'Brain Weight(grams)' numeric
@data
4512,1530
3738,1297
4261,1335
3777,1282
4177,1590
3585,1300
3785,1400
3559,1255
3613,1355
3982,1375
3443,1340
3993,1380
3640,1355
4208,1522
3832,1208
3876,1405
3497,1358
3466,1292
3095,1340
4424,1400
3878,1357
4046,1287
3804,1275
3710,1270
4747,1635
4423,1505
4036,1490
4022,1485
3454,1310
4175,1420
3787,1318
3796,1432
4103,1364
4161,1405
4158,1432
3814,1207
3527,1375
3748,1350
3334,1236
3492,1250
3962,1350
3505,1320
4315,1525
3804,1570
3863,1340
4034,1422
4308,1506
3165,1215
3641,1311
3644,1300
3891,1224
3793,1350
4270,1335
4063,1390
4012,1400
3458,1225
3890,1310
4166,1560
3935,1330
3669,1222
3866,1415
3393,1175
4442,1330
4253,1485
3727,1470
3329,1135
3415,1310
3372,1154
4430,1510
4381,1415
4008,1468
3858,1390
4121,1380
4057,1432
3824,1240
3394,1195
3558,1225
3362,1188
3930,1252
3835,1315
3830,1245
3856,1430
3249,1279
3577,1245
3933,1309
3850,1412
3309,1120
3406,1220
3506,1280
3907,1440
4160,1370
3318,1192
3662,1230
3899,1346
3700,1290
3779,1165
3473,1240
3490,1132
3654,1242
3478,1270
3495,1218
3834,1430
3876,1588
3661,1320
3618,1290
3648,1260
4032,1425
3399,1226
3916,1360
4430,1620
3695,1310
3524,1250
3571,1295
3594,1290
3383,1290
3499,1275
3589,1250
3900,1270
4114,1362
3937,1300
3399,1173
4200,1256
4488,1440
3614,1180
4051,1306
3782,1350
3391,1125
3124,1165
4053,1312
3582,1300
3666,1270
3532,1335
4046,1450
3667,1310
2857,1027
3436,1235
3791,1260
3302,1165
3104,1080
3171,1127
3572,1270
3530,1252
3175,1200
3438,1290
3903,1334
3899,1380
3401,1140
3267,1243
3451,1340
3090,1168
3413,1322
3323,1249
3680,1321
3439,1192
3853,1373
3156,1170
3279,1265
3707,1235
4006,1302
3269,1241
3071,1078
3779,1520
3548,1460
3292,1075
3497,1280
3082,1180
3248,1250
3358,1190
3803,1374
3566,1306
3145,1202
3503,1240
3571,1316
3724,1280
3615,1350
3203,1180
3609,1210
3561,1127
3979,1324
3533,1210
3689,1290
3158,1100
4005,1280
3181,1175
3479,1160
3642,1205
3632,1163
3069,1022
3394,1243
3703,1350
3165,1237
3354,1204
3000,1090
3687,1355
3556,1250
2773,1076
3058,1120
3344,1220
3493,1240
3297,1220
3360,1095
3228,1235
3277,1105
3851,1405
3067,1150
3692,1305
3402,1220
3995,1296
3318,1175
2720,955
2937,1070
3580,1320
2939,1060
2989,1130
3586,1250
3156,1225
3246,1180
3170,1178
3268,1142
3389,1130
3381,1185
2864,1012
3740,1280
3479,1103
3647,1408
3716,1300
3284,1246
4204,1380
3735,1350
3218,1060
3685,1350
3704,1220
3214,1110
3394,1215
3233,1104
3352,1170
3391,1120
Code
import java.io.File;
import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class TestTrainJava {
    public static void main(String[] args) throws Exception {
        // Load the dataset
        DataSource source = new DataSource("headbraina.arff");
        Instances dataset = source.getDataSet();

        // Set the class index to the last attribute (the target value)
        dataset.setClassIndex(dataset.numAttributes() - 1);

        int seed = 1;
        int folds = 5; // 5 folds => each test fold is ~1/5 of the data (approx. 1:4)

        // Randomize a copy of the data with a fixed seed for reproducibility
        Random rand = new Random(seed);
        Instances randData = new Instances(dataset);
        randData.randomize(rand);

        // Stratify only if the class attribute is nominal
        // (our class attribute is numeric, so this step is skipped here)
        if (randData.classAttribute().isNominal())
            randData.stratify(folds);

        // Perform the cross-validation splits
        for (int n = 0; n < folds; n++) {
            // Get the training and test instances for fold n
            Instances train = randData.trainCV(folds, n);
            Instances test = randData.testCV(folds, n);

            System.out.println("No of folds done = " + (n + 1));

            // Save the training set of this fold (the same file is
            // overwritten each iteration, so after the loop only the
            // final fold's sets remain on disk)
            ArffSaver trainSaver = new ArffSaver();
            trainSaver.setInstances(train);
            trainSaver.setFile(new File("trainheadbraina.arff"));
            trainSaver.writeBatch();

            // Save the test set of this fold
            ArffSaver testSaver = new ArffSaver();
            testSaver.setInstances(test);
            testSaver.setFile(new File("testheadbraina1.arff"));
            testSaver.writeBatch();
        }
    }
}
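If, beyond saving the folds, you also want to evaluate a model across the folds as described earlier, Weka's Evaluation class can run the whole K-fold cross-validation in one call. Below is a minimal sketch assuming the same headbraina.arff file and a LinearRegression model (chosen here only because the target attribute is numeric; any regressor would do).

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Load the same dataset and mark the last attribute as the target
        Instances dataset = new DataSource("headbraina.arff").getDataSet();
        dataset.setClassIndex(dataset.numAttributes() - 1);

        // Run 5-fold cross-validation with a seeded Random for reproducibility
        Evaluation eval = new Evaluation(dataset);
        eval.crossValidateModel(new LinearRegression(), dataset, 5, new Random(1));

        // The summary includes measures such as the root mean squared error
        System.out.println(eval.toSummaryString());
    }
}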
Output
After getting this output, go to the destination folder in which the training and testing datasets were saved, and you should see the following results.
Dataset generated for training the model
Dataset generated for testing the model