Why is training size important?

Training size plays a huge role in model accuracy. If we have too little data, the accuracy of our model may suffer badly.

The following graph shows the effect of training size on accuracy.


[Image: accuracy vs. training size]

Fig. Effect of training size on accuracy (1)

How to find out whether adding more data will help?

Strategy 1: If your accuracy is poor, it is quite likely that adding more data will yield better results.

Strategy 2: Since finding more data is tricky, an easier way is to extrapolate in the other direction: reduce the data you already have and check how much the performance drops. If the performance drops significantly, it is highly likely that adding more data will significantly increase the accuracy. For example, if you have 800 samples, you can reduce the data by a quarter to 600 and see how the model performs. In the figure below, we observe that the accuracy drops by only 2 percentage points, from 82% to 80%, which signals that the accuracy is leveling off. Hence, adding more data is unlikely to increase the accuracy significantly.
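Strategy 2 can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the toy one-feature data set, the nearest-centroid "model", and the helper names (make_data, fit_centroids, accuracy_at_fraction) are all hypothetical stand-ins for your own data and model. The point is only the procedure: keep the test set fixed, train once on all 800 samples and once on 600, and compare.

```python
import random

def make_data(n, seed):
    """Hypothetical toy data: one feature, two classes with shifted means."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        data.append((rng.gauss(2.0 * label, 1.0), label))
    rng.shuffle(data)
    return data

def fit_centroids(train):
    """Stand-in model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid has the right label."""
    correct = 0
    for x, y in test:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += (pred == y)
    return correct / len(test)

def accuracy_at_fraction(train, test, fraction, seed=0):
    """Train on a random `fraction` of the training data, score on `test`."""
    rng = random.Random(seed)
    k = int(len(train) * fraction)
    subset = rng.sample(train, k)  # plain random sample, not stratified
    return k, accuracy(fit_centroids(subset), test)

train = make_data(800, seed=1)
test = make_data(200, seed=2)   # the test set stays the same for both runs

n_full, acc_full = accuracy_at_fraction(train, test, 1.0)
n_reduced, acc_reduced = accuracy_at_fraction(train, test, 0.75)
drop = acc_full - acc_reduced

print(f"{n_full} samples: accuracy {acc_full:.3f}")
print(f"{n_reduced} samples: accuracy {acc_reduced:.3f}")
print(f"drop: {drop * 100:.1f} percentage points")
```

A small drop suggests the curve is already flat; a large drop suggests more data would still help. A single random subset can be noisy, so repeating the reduced run with several seeds and averaging is a reasonable precaution.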

How to know whether adding more features will help?

If we have a non-representative or irrelevant feature set, then no amount of additional data will significantly raise the accuracy. For example, when predicting weight, if the only feature we have is BMI (Body Mass Index), increasing the data size is very unlikely to increase accuracy. In contrast, adding the feature “height” will improve the model's accuracy significantly. One approach to finding out whether more features are needed is to compute the correlation of the features. (More strategies will be added later.)
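As a rough illustration of the correlation check, the sketch below computes the Pearson correlation between each candidate feature and the target. The data is synthetic and purely hypothetical: heights and BMIs are drawn at random and weight is derived from the BMI definition (weight = BMI x height squared). The pearson helper is written by hand so the example needs nothing beyond the standard library.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data (assumed distributions, not real measurements).
rng = random.Random(0)
heights = [rng.gauss(1.70, 0.10) for _ in range(500)]    # metres
bmis = [rng.gauss(24.0, 3.0) for _ in range(500)]        # kg / m^2
weights = [b * h ** 2 for b, h in zip(bmis, heights)]    # kg

for name, values in (("bmi", bmis), ("height", heights)):
    print(f"{name:>6} vs weight: r = {pearson(values, weights):+.2f}")
```

Correlation is only a first screen. It measures linear association one feature at a time, so a feature that looks weak on its own can still be valuable in combination with others.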

PS: The following factors must be taken into account for the data-addition value check (Strategy 2), i.e. when checking whether adding more data will raise accuracy.

Greater assurance / reliability: check whether the reduced data set is representative

When reducing the data size, stratified sampling must be ensured. For more reliability, statistical checks can be performed to see how similar the reduced data is to the actual data set, for example by comparing the mean, variance, and standard deviation, or by running Welch’s t-test (which tests for equal means without assuming equal variances).
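The two checks above can be sketched as follows. The helper names (stratified_split, welch_t) and the toy rows are hypothetical. One design choice is also an assumption: the kept subset is compared against the rows that were dropped rather than against the full data set, because Welch's t-test assumes two independent samples. The sketch returns only the t statistic and the Welch-Satterthwaite degrees of freedom; if SciPy is available, scipy.stats.ttest_ind(a, b, equal_var=False) performs the same test and also returns a p-value.

```python
import math
import random
import statistics

def stratified_split(rows, fraction, seed=0):
    """Keep `fraction` of each class; return (kept, dropped).

    Each row is a (feature, label) pair.
    """
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    kept, dropped = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        k = round(len(group) * fraction)
        kept.extend(group[:k])
        dropped.extend(group[k:])
    return kept, dropped

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical toy data: 800 rows, two balanced classes.
rng = random.Random(1)
rows = [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(800)]

kept, dropped = stratified_split(rows, 0.75)
kept_x = [x for x, _ in kept]
dropped_x = [x for x, _ in dropped]

print(f"kept: n={len(kept_x)} mean={statistics.mean(kept_x):.3f} "
      f"sd={statistics.stdev(kept_x):.3f}")
print(f"dropped: n={len(dropped_x)} mean={statistics.mean(dropped_x):.3f} "
      f"sd={statistics.stdev(dropped_x):.3f}")
t, df = welch_t(kept_x, dropped_x)
print(f"Welch t = {t:.3f}, df = {df:.1f}")
```

A t statistic close to zero is consistent with the reduced set having the same mean as the rest of the data for that feature. With several features, the check would be repeated per feature.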


Reference:

  1. https://www.youtube.com/watch?v=9w1Yi5nMNgw (02:45)