Accuracy is the easiest metric to rely upon when building or evaluating a data mining model. It is the simple, single-number score we have been used to since our school days.
However, as simple as it is, it can be equally misleading. This is exactly the problem I faced while building a predictive model for the Kaggle "Product Classification" competition. Below I illustrate a simple scenario showing how accuracy as a performance criterion can mislead.
Accuracy fails especially for:
- Skewed class distributions
- Problems where one kind of error costs more than another, e.g. cancer detection, where we assign a high penalty to missed cancer cases, or person-of-interest identification as in the Enron case, where we would rather add an innocent person to the POI list than miss a culprit. The innocent ones can then be filtered out during a second round of manual investigation.
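Both failure modes above can be seen in one small sketch. Assuming a hypothetical 95/5 skewed binary data set (the numbers are illustrative, not from the competition), a dummy classifier that always predicts the majority class scores 95% accuracy while never detecting a single positive case:

```python
# 95 negatives, 5 positives (e.g. 5 real cancer cases)
y_true = [0] * 95 + [1] * 5
# "Always predict the majority class" dummy classifier
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the positive class: how many real positives were found?
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)  # 0.95 -- looks great on paper
print(recall)    # 0.0  -- misses every case we actually care about
```

The 95% accuracy says nothing about the errors that matter most here, which is exactly why a second metric is needed.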
Problem: Product Classification
Data set: The data set consisted of 94 attributes, with 9 possible categories for a product, and a heavily skewed class distribution across the categories.
As I went on developing the classifier model to classify products into their categories, I realized that the model was very good at predicting "Class_6".
When I switched the model from the class-balanced data set (i.e. one with an equal number of instances per class) to the original data set, I observed a significant spike in accuracy. This was rather confusing and led me to wonder what was actually going on.
Upon careful examination, I finally found the issue and experienced first hand the textbook lesson of why accuracy is not always a good metric to rely upon.
Original data set: It had a much higher number of "Class_6" instances. Since the model performed significantly better on "Class_6", the accuracy spiked as the proportion of Class_6 instances in the set grew.
More Class_6 instances -> higher accuracy
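The effect can be quantified as a weighted average of per-class accuracies. The rates below are made-up placeholders (not my actual competition numbers), assuming the model is strong on "Class_6" and mediocre elsewhere:

```python
# Hypothetical per-class accuracies, for illustration only
per_class_acc = {"Class_6": 0.95, "other": 0.50}

def overall_accuracy(frac_class_6):
    """Expected accuracy as a mix of per-class accuracies,
    weighted by the fraction of Class_6 in the data set."""
    return (frac_class_6 * per_class_acc["Class_6"]
            + (1 - frac_class_6) * per_class_acc["other"])

balanced = overall_accuracy(1 / 9)  # balanced set: 1 of 9 classes
skewed = overall_accuracy(0.40)     # original set: Class_6 dominates

print(round(balanced, 3))  # 0.55
print(round(skewed, 3))    # 0.68
```

Nothing about the model changed between the two numbers; only the class mix did. That is the entire "spike" in accuracy.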
In this case, using precision and recall made more sense, so from then on I gave them higher priority than accuracy.
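For a multi-class problem, precision and recall are computed per class from true positives, false positives, and false negatives. A minimal sketch (the labels below are made up for illustration):

```python
y_true = ["Class_6", "Class_6", "Class_6", "Class_1", "Class_1", "Class_2"]
y_pred = ["Class_6", "Class_6", "Class_1", "Class_6", "Class_1", "Class_6"]

def precision_recall(label, y_true, y_pred):
    """Per-class precision and recall via one-vs-rest counting."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall("Class_6", y_true, y_pred)
print(p, r)  # 0.5 0.666... -- half the Class_6 predictions are wrong
```

Unlike accuracy, these numbers stay honest under skew: flooding the data set with easy Class_6 instances cannot hide a poor precision or recall on the other classes.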