The Ultimate Guide to Feature Scaling for Machine Learning
Understanding Normalization and Standardization
Feature scaling is a preprocessing technique that standardizes the range of features in a dataset, transforming your data so that it is better suited for modelling.
Real-world datasets often contain features that vary in magnitude, range, and units, e.g. weight = 197 pounds and distance_ran = 4 miles. Without scaling, your model will treat "weight" as generally larger than "distance_ran" and give it more emphasis, even though the two are simply measurements in different units with different magnitudes.
For a machine learning model to interpret these features on the same scale, we need to perform feature scaling. The purpose is to ensure that all features contribute equally to the model and that no feature dominates simply because it has larger values; otherwise, the variation in feature magnitudes can lead to biased model performance or difficulties during the learning process.
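As a minimal sketch (with made-up values for the two features mentioned above), you can see the scale mismatch directly:
# made-up values: weight in pounds, distance_ran in miles
import numpy as np
weight = np.array([197.0, 150.0, 230.0, 120.0])
distance_ran = np.array([4.0, 2.0, 1.0, 9.5])
# the two features live on very different scales
print(weight.mean(), distance_ran.mean())    # ~174.25 vs ~4.125
print(np.ptp(weight), np.ptp(distance_ran))  # ranges: 110.0 vs 8.5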
Why do we need to scale our data?
While there are a number of reasons to scale your features, the simplest is that scaling leads to better-performing models.
Reasons to perform feature scaling:
- It improves model convergence: Feature scaling allows models, particularly gradient-descent based models, to find optimal parameters efficiently and converge more easily (see the sketch after this list).
- Prevents feature dominance: It ensures that no single feature is given more emphasis or dominates the others simply because of differences in units or magnitudes.
- Helps manage outliers: Scaling by itself does not remove outliers, and min-max scaling is in fact sensitive to them, but robust scaling variants can reduce the influence of extreme values on your model.
- Enhances Generalization: It helps your model generalize better, allowing it to perform well on unseen data.
- Improves algorithm performance: It improves the performance of models such as regression models and neural networks.
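To make the convergence point concrete, here is a minimal sketch, assuming made-up weight/distance data and a hand-rolled batch gradient descent (not any particular library's optimizer). The very same learning rate that blows up on the raw features converges once they are standardized:
import numpy as np

rng = np.random.default_rng(0)
# made-up data: one large-scale feature (pounds), one small-scale feature (miles)
weight = rng.uniform(100, 250, 200)
distance_ran = rng.uniform(1, 10, 200)
X = np.column_stack([weight, distance_ran])
y = 0.5 * weight + 3.0 * distance_ran + rng.normal(0, 1, 200)

def gd_final_error(X, y, lr=0.1, n_iter=500):
    # plain batch gradient descent on mean squared error
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return np.mean((X @ w - y) ** 2)

# same learning rate: the raw features diverge (expect overflow warnings / nan) ...
print(gd_final_error(X, y))
# ... but the standardized features converge to roughly the noise level
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(gd_final_error(X_std, y - y.mean()))
The raw features give the loss surface a very elongated shape, so a step size that is safe along one axis overshoots along the other; standardizing rounds out the surface.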
When should we scale our data?
Feature scaling is not necessary all the time or for all models, only for models that are sensitive to scale variations in the input features. The reason we do it so often is that most popular models, such as linear regression and logistic regression, are sensitive to differences in feature scale because of how they optimize their parameters and learn.
Here are some algorithms based on their learning nature:
- Gradient-Descent Based Algorithms: Machine learning algorithms that use gradient descent as their optimization technique, such as linear regression, logistic regression, and neural networks, require the data to be scaled.
- Distance-Based Algorithms: Distance-based algorithms such as k-nearest neighbours, clustering, and support vector machines are the most affected by the range of features (see the sketch after this list).
- Tree-Based and Probabilistic Algorithms: Decision trees (and tree ensembles) and Naive Bayes are fairly insensitive to the scale of features, so scaling isn't always necessary.
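Here is a minimal sketch (made-up people, and assumed feature ranges of 100 to 250 pounds and 0 to 10 miles) showing how the nearest neighbour of a point can flip once the features are scaled:
import numpy as np

# made-up points: [weight in pounds, distance_ran in miles]
a = np.array([197.0, 4.0])
b = np.array([195.0, 9.0])   # similar weight, very different running distance
c = np.array([180.0, 4.0])   # different weight, same running distance

# raw Euclidean distances: the weight axis dominates
print(np.linalg.norm(a - b))  # ~5.4  -> b looks closest to a
print(np.linalg.norm(a - c))  # 17.0

# min-max scale each feature using the assumed ranges above
def min_max(p):
    return np.array([(p[0] - 100) / 150, p[1] / 10])

print(np.linalg.norm(min_max(a) - min_max(b)))  # ~0.50
print(np.linalg.norm(min_max(a) - min_max(c)))  # ~0.11 -> now c is closest
Because k-nearest neighbours predicts from whoever is closest, this flip in neighbour ordering directly changes the prediction.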
Models that need Scaling:
- Linear Models
- Neural Networks
- Support Vector Machines
- K-Nearest Neighbor
- Principal Component Analysis
- Matrix Factorization
Models that may not need Scaling:
- Tree Models
- Naive Bayes
Methods of Performing Feature Scaling
Two common methods of feature scaling are:
- Normalization (Min-Max Scaling)
- Standardization (Standard Scaling)
Normalization
Normalization, also known as min-max scaling, is a data preprocessing technique used to adjust the values of features in a dataset to a common scale.
When normalizing features, values are shifted and rescaled so that they end up ranging from 0 to 1.
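In formula form, each value is rescaled as X' = (X - Xmin) / (Xmax - Xmin), where Xmin and Xmax are the minimum and maximum values of the feature.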
When the relationship between the ranges of your features is meaningful and you want to preserve this information, normalization is a better choice.
# import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# fit the scaler on the training data only
# (numerical_df_train / numerical_df_test are the numerical feature columns)
norm = MinMaxScaler().fit(numerical_df_train)
# transform the training data
df_train_norm = norm.transform(numerical_df_train)
# transform the testing data with the same fitted scaler
df_test_norm = norm.transform(numerical_df_test)
Standardization
Standardization is another scaling method, where values are centred around the mean with a unit standard deviation. This means that the mean of your feature becomes zero and the resulting distribution has a standard deviation of one.
Standardization is a good choice for features that have a Gaussian distribution.
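In formula form, each value is rescaled as X' = (X - μ) / σ, where μ is the mean of the feature and σ is its standard deviation.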
# import StandardScaler
from sklearn.preprocessing import StandardScaler
# fit the scaler on the training data only
scale = StandardScaler().fit(df_train_stand)
# transform the training data
train_stand = scale.transform(df_train_stand)
# transform the testing data with the same fitted scaler
test_stand = scale.transform(df_test_stand)
It is good practice to fit the scaler on the training data only and then use it to transform the testing data. This avoids data leakage during the model testing process.
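One convenient way to enforce this is to bundle the scaler and the model into a scikit-learn pipeline, so cross-validation refits the scaler on each training fold and no information from the held-out fold leaks into the scaling statistics. Here is a minimal sketch (the dataset and estimator are just examples):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# the scaler is refit on each training fold inside cross-validation
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())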
Here are some differences between normalization and standardization:
| Normalization | Standardization |
| --- | --- |
| Rescales values to a range between 0 and 1 | Centres data around the mean and scales it to a standard deviation of one |
| Sensitive to outliers, since a single extreme value stretches the min-max range | Less sensitive to outliers, though extreme values still shift the mean and standard deviation |
| A good choice when the range of your features is meaningful and you want to preserve it | A good choice for features with a roughly Gaussian distribution |
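To see these differences concretely, here is a minimal sketch applying both scalers to the same made-up feature containing one outlier:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# min-max squashes the four typical values into a tiny sliver near 0
print(MinMaxScaler().fit_transform(x).ravel())
# [0.     ~0.010  ~0.020  ~0.030  1.    ]

# standardization keeps more of their spread, though the outlier still shifts things
print(StandardScaler().fit_transform(x).ravel())
# approx [-0.54  -0.51  -0.49  -0.46  2.00]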
Note: Scaling the target variable is generally not required.
Considerations:
- Always consider the requirements and assumptions of the specific algorithm you are using. Some algorithms may perform well with either standardization or normalization, while others may be sensitive to the choice.
- Experiment with both standardization and normalization, and observe the impact on your model’s performance through techniques like cross-validation.
- Pay attention to outliers: min-max scaling is especially sensitive to extreme values, and standardization is affected too. In the presence of outliers, robust scalers or other data preprocessing techniques may be considered (see the sketch after this list).
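As one such option, here is a minimal sketch using scikit-learn's RobustScaler, which centres on the median and scales by the interquartile range (reusing the toy data from above):
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
# RobustScaler subtracts the median and divides by the interquartile range
print(RobustScaler().fit_transform(x).ravel())
# [-1.  -0.5  0.   0.5  48.5] -> typical values keep their spacing; only the outlier lands far away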
Some features in your dataset can be normalized while others are standardized; it depends on the nature of each feature, which you will have to identify in order to choose the appropriate scaling method.
Check out the full code on my GitHub.