netznana

Feature Scaling

A given data set's features will differ in magnitude and units. For example, a weight feature may range from 80 to 200 pounds, whereas an age feature may range from 30 to 80 years. This can cause issues when analyzing the data, especially with methods that rely on Euclidean distances, because the feature with the larger numeric spread dominates the distance. Such methods include k-means clustering and k-nearest-neighbor regression. Linear regression is also affected when it is regularized or fitted with gradient descent.
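To make the distance issue concrete, here is a minimal sketch using only the Python standard library. The three people and the helper names (`min_max`, `scale`) are made up for illustration, and the ranges are the ones from the example above. It shows that the nearest neighbor of a point can change once both features are put on the same scale.

```python
import math

# Hypothetical people as (weight in pounds, age in years).
a = (100.0, 40.0)
b = (130.0, 42.0)
c = (105.0, 65.0)

# Feature ranges taken from the example in the text.
WEIGHT_RANGE = (80.0, 200.0)
AGE_RANGE = (30.0, 80.0)


def min_max(value, low, high):
    """Map a value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


def scale(person):
    """Rescale both features of one person to [0, 1]."""
    weight, age = person
    return (min_max(weight, *WEIGHT_RANGE), min_max(age, *AGE_RANGE))


# Euclidean distances on the raw features.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Euclidean distances after rescaling each feature to [0, 1].
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print("raw:    nearest to a is", "b" if raw_ab < raw_ac else "c")
print("scaled: nearest to a is", "b" if scaled_ab < scaled_ac else "c")
```

On the raw numbers, c is the closer neighbor of a. After scaling, b is closer, because a 25-year age gap is half of the age range while a 30-pound weight gap is only a quarter of the weight range.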

Basically there are two ways of feature scaling. 1. Standardization 2. Normalization. In normalization, you rescale the data to lie between 0 and 1, while in standardization you center it around 0 and give it unit variance. When to use which? That depends on the context. In general, people use normalization techniques such as MinMaxScaler in scikit-learn when dealing with image processing. StandardScaler is the most popular standardization technique in scikit-learn.
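The two transformations can be sketched in plain Python as follows. This is an illustration of the arithmetic, not the scikit-learn implementation. The function names and the sample weights are made up, and the sketch assumes the feature is not constant (otherwise both functions would divide by zero). The population standard deviation is used, which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Standardization: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical weights in pounds.
weights = [80.0, 120.0, 160.0, 200.0]

normalized = min_max_scale(weights)
standardized = standardize(weights)

print(normalized)    # values between 0 and 1
print(standardized)  # values centered on 0 with unit variance
```

Note that standardized values are not confined to a fixed interval, while normalized values always stay inside 0 to 1 for the data they were fitted on.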

Now to a more interesting question: "My features have very different magnitudes. Should I scale them?" The answer is not always yes. It depends on what you want to do with the data and what the features mean to you. For example,

  1. Do the proportions between the features of the data mean something? Then DO NOT scale.

  2. Are you using methods that do not rely on Euclidean distances, like decision trees or random forests, to analyze the data? Then DO NOT scale.

In the first scenario, if the proportions between the features mean something to you, you should not scale, or you will lose this aspect of the data. In the second scenario, the impact of scaling is quite low. You may scale, but your analysis will not differ much, so the cost of scaling might not be worth the effort. This is because a decision tree, for example, works by branching your data on thresholds of individual features, and that branching is hardly affected by whether a feature has been rescaled. However, in a method that uses gradient descent, your algorithm will usually converge faster if your data is scaled.
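The point about decision trees can be checked with a toy example. The sketch below is not a full decision tree. It is a single threshold split (a "stump") on one feature, chosen by minimizing the summed squared error of the two branches, and all names and data in it are made up for illustration. Because min-max scaling keeps the order of the values, the rows end up in the same branches before and after scaling.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return the row indices sent to the left branch by the best split.

    Every boundary between two distinct sorted feature values is tried,
    and the one with the lowest total squared error is kept.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = None, None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if best_cost is None or cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left


def min_max_scale(values):
    """Rescale values to the interval [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Hypothetical feature (weight in pounds) and target values.
weights = [90.0, 110.0, 150.0, 170.0, 190.0]
target = [1.0, 1.2, 3.0, 3.1, 3.3]

split_raw = best_split(weights, target)
split_scaled = best_split(min_max_scale(weights), target)

print(sorted(split_raw), sorted(split_scaled))
```

Only the numeric value of the threshold changes with scaling. The grouping of the rows, which is what the tree actually learns, stays the same.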

