Why is training size important?
Training size strongly affects model accuracy. With too little data, accuracy can suffer badly.
The following graph shows the effect of training size on accuracy.
Fig. Effect of training size on accuracy (1)
How to find out whether adding more data will help?
Strategy 1: If your accuracy is poor, it is likely that adding more data will yield better results.
Strategy 2: Since finding more data is tricky, an easier way is to extrapolate: reduce the amount of data you already have and check how much performance drops. If performance drops significantly, it is highly likely that adding more data will significantly increase accuracy. For example, if you have 800 samples, you can reduce the data by a quarter, to 600, and see how the model performs. In the figure below, accuracy drops by only 2 percentage points, from 82% to 80%, which signals that accuracy is leveling off. Adding more data is therefore unlikely to increase accuracy significantly.
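Strategy 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a toy nearest-centroid classifier, and the helper names (make_data, fit_centroids, stratified_subsample) are hypothetical, not from any library. Swap in your own data and model.

```python
import random
from statistics import mean

def make_data(n, seed):
    """Synthetic two-class data with one feature (an assumption for illustration)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.5)  # overlapping classes
        rows.append((x, label))
    return rows

def fit_centroids(train):
    """Toy model: the mean feature value of each class."""
    return {c: mean(x for x, y in train if y == c) for c in (0, 1)}

def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda c: abs(x - model[c]))

def accuracy(model, rows):
    return sum(1 for x, y in rows if predict(model, x) == y) / len(rows)

def stratified_subsample(train, fraction, seed=0):
    """Keep `fraction` of each class so the class proportions are preserved."""
    rng = random.Random(seed)
    kept = []
    for c in (0, 1):
        rows = [r for r in train if r[1] == c]
        kept.extend(rng.sample(rows, round(len(rows) * fraction)))
    return kept

train = make_data(800, seed=1)   # the 800 samples you have
test = make_data(2000, seed=2)   # held-out data used only for scoring

full_acc = accuracy(fit_centroids(train), test)
reduced = stratified_subsample(train, 0.75)  # drop a quarter of the data
reduced_acc = accuracy(fit_centroids(reduced), test)

print(f"{len(train)} samples: accuracy {full_acc:.3f}")
print(f"{len(reduced)} samples: accuracy {reduced_acc:.3f}")
print(f"drop: {full_acc - reduced_acc:+.3f}")
```

A small drop suggests the curve is leveling off; a large drop suggests more data is worth collecting. Repeating the subsampling with several seeds gives a steadier estimate than a single run.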
How to know whether adding more features will help?
If the feature set is non-representative or irrelevant, adding more data will not yield significantly higher accuracy. For example, when predicting weight from BMI (Body Mass Index) alone, increasing the data size is very unlikely to improve accuracy, no matter how much data we add. By contrast, adding the feature “height” will improve the model significantly, because BMI is weight divided by height squared, so BMI and height together determine weight. One way to find out whether more features are needed is to compute the correlation of the features. (More strategies will be added later.)
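The BMI example can be checked with a short correlation sketch. The data here is synthetic (heights and BMIs are drawn from assumed normal distributions), and checking the leftover error of a BMI-only fit against a candidate feature is just one possible approach, not a definitive recipe.

```python
import random
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Synthetic people (assumed distributions, for illustration only).
rng = random.Random(0)
n = 1000
height = [rng.gauss(1.70, 0.10) for _ in range(n)]  # metres
bmi = [rng.gauss(24.0, 4.0) for _ in range(n)]      # kg / m^2
weight = [b * h ** 2 for b, h in zip(bmi, height)]  # BMI = weight / height^2

# Fit weight ~ intercept + slope * BMI by ordinary least squares.
mb, mw = mean(bmi), mean(weight)
slope = (sum((b - mb) * (w - mw) for b, w in zip(bmi, weight))
         / sum((b - mb) ** 2 for b in bmi))
intercept = mw - slope * mb

# Residual: the part of weight that the BMI-only model cannot explain.
residual = [w - (intercept + slope * b) for b, w in zip(bmi, weight)]

print(f"corr(BMI, weight):      {pearson(bmi, weight):.3f}")
print(f"corr(height, weight):   {pearson(height, weight):.3f}")
print(f"corr(height, residual): {pearson(height, residual):.3f}")
```

If a candidate feature correlates strongly with the residual, it carries information the current model is missing, and no amount of extra rows with the old features will supply it.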
PS: The following factors must be taken into account for the data-addition check in Strategy 2, i.e. when checking whether adding more data will increase accuracy.
Greater assurance / reliability: check whether the reduced data set is representative.
When reducing the data size, make sure the samples are stratified. For more reliability, statistical checks can be run to see how similar the reduced data is to the full data set: for example, Welch’s t-test (which tests for equal means without assuming equal variances), or a direct comparison of the mean, variance, and standard deviation.
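A representativeness check along these lines might look as follows. It is a minimal sketch using only the Python standard library: the feature column is synthetic, welch_t is a hypothetical helper, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical numeric feature column: 800 rows, reduced to 600.
rng = random.Random(0)
full = [rng.gauss(50.0, 10.0) for _ in range(800)]
reduced = rng.sample(full, 600)

t, df = welch_t(reduced, full)
# Normal approximation to the two-sided p-value (assumes df is large).
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))

print(f"mean:     full {mean(full):.2f}  reduced {mean(reduced):.2f}")
print(f"variance: full {variance(full):.2f}  reduced {variance(reduced):.2f}")
print(f"std dev:  full {stdev(full):.2f}  reduced {stdev(reduced):.2f}")
print(f"Welch t = {t:.3f}, df = {df:.1f}, approximate p = {p_approx:.3f}")
```

Because the reduced set is drawn from the full set, the two groups overlap and are not independent, so the p-value should be read as a rough diagnostic rather than a strict test. A large |t| or clearly different summary statistics is a sign that the reduced set is not representative and should be redrawn.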