Why Data Pre-Processing?
Unclean data prevents the model from learning meaningful patterns, which leads to poor model accuracy. The data therefore needs to be cleaned prior to model training.
There are numerous data abnormalities that need to be handled. A few of them are discussed below.
Denormalized Data: Data with a non-zero mean is referred to here as denormalized data. The scale of such data is especially problematic for certain algorithms, for example when the values range between 649 and 1637. LSTM networks, for instance, are sensitive to the scale of their inputs, especially when the sigmoid or tanh activation functions are used.
How to deal with it? It is generally recommended to apply feature scaling so that the values fall in the range 0 to 1.
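As an illustration, scaling to the 0-to-1 range can be sketched with min-max scaling, x' = (x - min) / (max - min). This is a minimal sketch, not necessarily the exact procedure used here: the function names `fit_min_max` and `min_max_scale` and the sample values are hypothetical, and fitting the minimum and maximum on the training data only is an assumption.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the given values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Map values into [0, 1] using a previously fitted lo/hi pair."""
    span = hi - lo
    if span == 0:
        # Constant feature: there is no spread to scale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Illustrative values only, chosen inside the 649..1637 range mentioned above.
train = [649, 900, 1143, 1637]
lo, hi = fit_min_max(train)          # assumption: fit on training data only
scaled = min_max_scale(train, lo, hi)
print(scaled)
```

Reusing the same `lo` and `hi` for unseen data keeps both sets on a consistent scale, although unseen values outside the fitted range will then fall outside 0 to 1.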
Missing Data: There are numerous ways to deal with missing data, such as:
- Ignore records containing missing data. If there are too few records, ignoring them might not be an option.
- Replace missing data with the average of the observed values.
- Replace missing data with a value estimated by some other statistical method.
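The three options above can be sketched as follows. This is a minimal sketch under stated assumptions: missing values are represented as `None`, the helper names `drop_missing` and `fill_missing` and the sample data are hypothetical, and the median is used only as one example of "some other statistical method".

```python
from statistics import mean, median


def drop_missing(values):
    """Option 1: ignore records whose value is missing (None)."""
    return [v for v in values if v is not None]


def fill_missing(values, strategy=mean):
    """Options 2 and 3: replace missing values with a statistic of the observed ones."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("no observed values to compute a fill value from")
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Illustrative data only; None marks a missing record.
sold = [10, None, 14, 12, None, 100]

dropped = drop_missing(sold)                         # option 1
mean_filled = fill_missing(sold)                     # option 2: average
median_filled = fill_missing(sold, strategy=median)  # option 3: median (assumed example)
print(dropped, mean_filled, median_filled)
```

In this sample the single large value pulls the mean well above the typical values, while the median stays close to them, which is one reason a different statistic may be preferred over the plain average.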
Outliers: There are numerous ways to deal with outliers. Our data contains outliers, for example a sold quantity of 1637. In addition, time series data contains short-term fluctuations that can obscure the long-term trends or cycles. To smooth out short-term fluctuations and highlight long-term trends, we applied a moving average to the training data with a window of two values (the window size was selected through experimental testing). The moving average works by taking the average of the first fixed-size subset of the time series, then shifting that subset forward by one position and averaging again. This creates a new series in which each number is the average of the corresponding subset. We refrain, however, from applying the moving average to the test data, since we want to evaluate the model's true accuracy on the real data rather than on the smoothed data.
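The smoothing step described above can be sketched as a simple moving average. This is a minimal sketch under stated assumptions, not the exact code used here: the function name `moving_average` and the sample series are hypothetical, and a trailing window that produces one fewer value than the input (for a window of two) is assumed.

```python
def moving_average(series, window=2):
    """Average each run of `window` consecutive values in the series."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(series) < window:
        return []
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Illustrative data only; 1637 plays the role of the outlier.
train = [700, 720, 1637, 710, 730]
test = [705, 725]

smoothed_train = moving_average(train, window=2)
# The test data is intentionally left unsmoothed so that the model
# is evaluated on the real values.
print(smoothed_train)
```

With a window of two, the spike is damped but still visible in the smoothed series; a larger window would smooth more strongly at the cost of shortening the series and blurring genuine changes.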