Three pillars of Data Preprocessing

Νίκος Τσακίρης · Published in Analytics Vidhya · Feb 26, 2021 · 4 min read


Many of you might be familiar with the ‘curse of dimensionality’, a term describing the problems that arise from large numbers of attributes, which translate into a large number of dimensions. As a general rule, dimensionality should be reduced to the most efficient minimum, so that the computational cost remains under control while the extracted information stays sufficient for the problem at hand. Elaborating on this, consider that two features might each provide sufficient information on their own, yet offer little extra when combined, because they are highly correlated with each other. Hence the reasoning behind carefully selecting which attributes to examine, and how they correlate, so that no unnecessary complexity is introduced.
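To make that last point a bit more tangible, here is a minimal sketch (assuming a pandas DataFrame of numeric features) that builds a toy dataset in which one column is nearly a copy of another, computes the pairwise correlations, and drops one feature from every pair above a threshold. The column names and the 0.95 cutoff are illustrative assumptions, not anything prescribed above.

```python
import numpy as np
import pandas as pd

# Toy dataset: "feat_b" is almost a copy of "feat_a", so keeping both adds
# little information while increasing dimensionality.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "feat_a": a,
    "feat_b": 0.98 * a + rng.normal(scale=0.05, size=200),  # highly correlated with feat_a
    "feat_c": rng.normal(size=200),                         # unrelated feature
})

# Absolute pairwise Pearson correlations between the features.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once,
# then flag one feature from every pair above the threshold.
threshold = 0.95
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print("Redundant features:", to_drop)   # expected: ['feat_b']
reduced = df.drop(columns=to_drop)
```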

The greater the ratio of the number of training samples to the number of free parameters (e.g. synaptic weights), the better the classifier’s ability to generalize.

Datasets with many features result in large numbers of free parameters, and this easily becomes a problem when the number of samples N is small. Hence, in this situation it is wise to keep the number of features low, so that the statement above is satisfied.

Keep the denominator (the number of free parameters) low!

Naturally, the next question is which features are the best candidates, and this is where feature reduction (or feature selection) comes into play. Before we delve into the three main procedures, I’d like to add another dimension to the subject, no pun intended. Frankly, a good way to articulate the exact goal of preprocessing is this:

The main objective is to select those attributes so that the between-class distance is large, while the within-class variance is small.
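One common way to turn that statement into a number is a Fisher-score-style ratio computed per feature: between-class scatter in the numerator, within-class variance in the denominator. The sketch below is a rough illustration of that idea on toy two-class data; it is one reasonable formulation under those assumptions, not the exact criterion of any particular textbook or library.

```python
import numpy as np

def fisher_score(x, y):
    """Between-class scatter divided by within-class scatter for one feature.

    x is a 1-D array of feature values, y the corresponding class labels.
    Large scores mean the class means are far apart relative to the
    spread of the feature inside each class.
    """
    classes = np.unique(y)
    overall_mean = x.mean()
    between = sum((y == c).sum() * (x[y == c].mean() - overall_mean) ** 2 for c in classes)
    within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return between / within

# Toy two-class data: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([3.0, 0.0], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

scores = [fisher_score(X[:, j], y) for j in range(X.shape[1])]
print(scores)   # feature 0 should score far higher than feature 1
```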

Outlier Removal

An outlier is any data point that lies too far from the mean value of its corresponding random variable. Generally speaking, such points distort the results and offer little of value to the training procedure, so certain measures should be taken to deal with them:

If their number is low, we can get rid of them.

Otherwise, the engineer should choose cost functions that are robust to those outliers.

For example, the least-squares error is not robust at all, since squaring the residuals of the outliers produces even larger errors, letting them dominate the cost function.
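For the first option, simply discarding the outliers, a minimal sketch is shown below, assuming one-dimensional data and the common rule of thumb of flagging points more than three standard deviations from the mean; the threshold is an illustrative assumption and should be tuned to the problem.

```python
import numpy as np

def remove_outliers(x, n_std=3.0):
    """Split a 1-D array into points within n_std standard deviations of the mean, and the rest."""
    mean, std = x.mean(), x.std()
    keep = np.abs(x - mean) <= n_std * std
    return x[keep], x[~keep]

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(10.0, 2.0, size=500), [45.0, -30.0]])  # two injected outliers

clean, dropped = remove_outliers(data)
print(f"{data.size} -> {clean.size} points, removed: {dropped}")
```

For the second option, robust alternatives to the plain squared error exist, for instance the Huber loss, which grows only linearly for large residuals and therefore limits the influence of outliers.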

Data Normalization

Real-world datasets typically contain features on very different scales, and, as stated previously, features with larger numeric ranges tend to dominate the cost function compared to those with a smaller range. Therefore, the natural step is to bring all values into a single common reference frame. In practice this usually means standardizing each feature so that its mean is 0 and its standard deviation is 1 (note that this fixes the first two moments; it does not by itself make the data normally distributed).

Mean of feature k over N samples:

$$\bar{x}_k = \frac{1}{N}\sum_{i=1}^{N} x_{ik}$$

Standard deviation:

$$\sigma_k = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_{ik} - \bar{x}_k\right)^2}$$

Normalized data points:

$$\hat{x}_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_k}$$

It is worth noting that this procedure is only one of the linear scaling options, alongside e.g. scaling to the [-1, 1] range. Alternatively, one can use non-linear scaling functions such as softmax (logistic) scaling.
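To make the three flavours of scaling concrete, the sketch below applies standardization, linear scaling to [-1, 1], and a logistic (“softmax”) squashing to a toy two-feature matrix. The exact softmax-scaling formula varies between references, so the logistic map of the standardized values used here is just one reasonable assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two features on very different scales: one in [0, 1], one in [0, 1000].
X = rng.uniform(low=[0.0, 0.0], high=[1.0, 1000.0], size=(200, 2))

# Standardization: zero mean and unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Linear scaling of each feature to the [-1, 1] range.
X_minmax = 2.0 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) - 1.0

# Non-linear "softmax" (logistic) squashing of the standardized values into (0, 1).
X_soft = 1.0 / (1.0 + np.exp(-X_std))

print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))   # ~[0, 0] and [1, 1]
```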

Missing Data

Since we are talking about real-world problems, it is normal to expect a percentage of the total data to be missing, a phenomenon commonly observed in social-science or prognostic medical datasets. So, what do we do? The usual solution is called imputation, and it comes in three main flavours:

Replace missing values with zeroes,

replace them with a conditional mean value, E[missing | observed], or

replace missing values with the unconditional mean, calculated from the available observed values.

Of course, a simple way of dealing with missing values is to just drop the affected samples, but this can cause problems when the dataset is not large enough for such drastic measures, since it reduces the amount of information that can be extracted. All of these options are sketched below.
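Here is a minimal pandas sketch of the four options just discussed, assuming a toy frame in which a hypothetical "group" column plays the role of the observed variable that the conditional mean is computed on.

```python
import numpy as np
import pandas as pd

# Toy frame with missing "age" values; the hypothetical "group" column
# stands in for the observed variable conditioning the mean.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "age":   [23.0, np.nan, 35.0, np.nan, 31.0],
})

# 1. Replace missing values with zeros.
zero_filled = df["age"].fillna(0.0)

# 2. Replace them with a conditional mean, here E[age | group].
cond_filled = df.groupby("group")["age"].transform(lambda s: s.fillna(s.mean()))

# 3. Replace them with the unconditional mean of the observed values.
mean_filled = df["age"].fillna(df["age"].mean())

# Or simply drop the incomplete rows, at the cost of throwing information away.
dropped = df.dropna(subset=["age"])

print(pd.DataFrame({"zero": zero_filled, "cond": cond_filled, "uncond": mean_filled}))
```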

Conclusion

Data is king and will continue to grow exponentially, and it generally requires regulation and proper examination before we use it to train our machines. Consequently, we should know our tools and make the most of them, whether we are pre-processing, collecting, or simply…observing.
