Accuracy is the easiest metric to rely upon when building or evaluating a data mining model. It is the simple, single-number score we have been used to since our school days.

However, as simple as it is, it can be equally misleading. I ran into this exact problem while building a predictive model for the Kaggle "Product Classification" competition. Below I illustrate a simple scenario showing how accuracy as a performance criterion can be misleading.

Accuracy fails, especially for:

  1. Skewed class distribution (see the sketch just after this list)
  2. When you favor one kind of result over another, e.g. cancer detection, where we assign a high penalty to missed cancers, or person-of-interest identification as in the Enron case, where we would rather add an innocent person to the POI list than miss a culprit. The innocent ones can then be filtered out during a second round of manual investigation.
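
To make point 1 concrete, here is a minimal sketch (not from the original post) of how a trivial majority-class predictor earns high accuracy on a skewed binary problem while detecting nothing of interest:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Skewed labels: ~95% class 0, ~5% class 1
rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=1000, p=[0.95, 0.05])

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))                    # ~0.95
print("minority recall:", recall_score(y_true, y_pred, pos_label=1))  # 0.0
```

Ninety-five percent accuracy, yet not a single minority-class case is caught.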

Problem: Product Classification

Data set: The data set consisted of 94 attributes, with 9 possible categories for a product, and an extensively skewed class distribution across the product categories.
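
To see the skew concretely, one can inspect the class distribution of the training data. A hypothetical sketch; the file name "train.csv" and the "target" column follow the usual Kaggle layout and are assumptions, not taken from the competition files themselves:

```python
import pandas as pd

train = pd.read_csv("train.csv")          # assumed file name
# Fraction of instances per category; reveals how skewed the classes are
print(train["target"].value_counts(normalize=True))
```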

As I went on developing the classifier model to classify products into their categories, I realized that the model I was building was especially good at predicting "Class_6".

When I switched the model from the "class-balanced" data set (i.e. one with an equal number of instances per class) to the "original" data set, I observed a significant spike in accuracy. This was rather confusing and led me to wonder what was actually going on.

Upon careful examination, I finally found the issue and got to experience first hand the theoretical explanation of "Why is accuracy not always a good metric to rely upon?"

Original data set: It had a much higher number of "Class_6" instances. Since our model performed significantly better on "Class_6", the accuracy spiked as the proportion of Class_6 instances in the data set increased.

More Class_6 instances -> higher accuracy
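
To see why, think of overall accuracy as a weighted average of per-class accuracies. The per-class numbers below are made up purely for illustration:

```python
def expected_accuracy(p_class6, acc_class6=0.9, acc_other=0.4):
    """Overall accuracy as a weighted average of per-class accuracies."""
    return p_class6 * acc_class6 + (1 - p_class6) * acc_other

# Balanced set: Class_6 is 1 of 9 classes (~11% of instances)
print(expected_accuracy(1 / 9))   # ~0.46

# Skewed set: Class_6 now 40% of instances, same model
print(expected_accuracy(0.40))    # 0.60
```

The model has not improved at all; the evaluation data simply contains more of the class it happens to get right.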

In this case, using precision and recall made more sense. From then on, I prioritized precision and recall over accuracy.
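
As a minimal sketch, scikit-learn's classification_report gives per-class precision and recall in one call; y_true and y_pred here are toy stand-ins for the real labels and model predictions:

```python
from sklearn.metrics import classification_report

y_true = ["Class_6", "Class_6", "Class_2", "Class_2", "Class_2", "Class_6"]
y_pred = ["Class_6", "Class_6", "Class_6", "Class_2", "Class_2", "Class_6"]

# Per-class precision and recall expose weaknesses that a single
# accuracy number (here 5/6, ~0.83) hides.
print(classification_report(y_true, y_pred))
```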


Ref: http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
