Feature Engineering Techniques

📘 Python for Data Science · 📅 Nov 14, 2025

Feature Engineering is the process of transforming raw data into meaningful input features that help Machine Learning models perform better.

It is one of the MOST important skills in ML because:

  • Better features → better model accuracy

  • Helps extract patterns hidden in the data

  • Reduces noise, improves training efficiency

  • Required for real-world dirty datasets


Types of Feature Engineering

Feature engineering broadly contains:

  1. Handling Missing Values

  2. Encoding Categorical Variables

  3. Scaling & Normalization

  4. Feature Creation (New Features)

  5. Transformation of Variables

  6. Feature Extraction

  7. Dimensionality Reduction

  8. Outlier Handling

  9. Datetime Feature Engineering

  10. Text Feature Engineering

  11. Feature Selection Techniques

Now let's explore each in detail.


1. Handling Missing Values

✔ Mean/Median Imputation

df['age'] = df['age'].fillna(df['age'].mean())

✔ Mode Imputation (categorical)

df['city'] = df['city'].fillna(df['city'].mode()[0])

✔ Constant Imputation

df = df.fillna("Unknown")

✔ Advanced: KNN Imputer

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df[['age','salary']] = imputer.fit_transform(df[['age','salary']])

2. Encoding Categorical Variables

✔ One-Hot Encoding

df = pd.get_dummies(df, columns=['gender'], drop_first=True)

✔ Label Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])

✔ Ordinal Encoding (for ordered categories)

df['education'] = df['education'].map({"High School": 1, "Bachelor": 2, "Master": 3, "PhD": 4})

✔ Target Encoding (advanced)

Replace each category with the mean of the target variable for that category.
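A minimal sketch of target encoding with plain pandas, assuming an illustrative categorical column city and numeric target price (both column names are hypothetical):

```python
import pandas as pd

# Hypothetical example data: categorical 'city', numeric target 'price'
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "price": [10, 20, 30, 40, 50],
})

# Replace each category with the mean target value observed for that category
df["city_encoded"] = df["city"].map(df.groupby("city")["price"].mean())
print(df["city_encoded"].tolist())
```

In practice, compute the category means on the training split only (or use cross-fold means) so the encoding does not leak the target into the features.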


3. Scaling & Normalization

✔ Standardization (Z-Score)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age','salary']] = scaler.fit_transform(df[['age','salary']])

✔ Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['price']] = scaler.fit_transform(df[['price']])

✔ Robust Scaling (good for outliers)

from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
df[['income']] = rs.fit_transform(df[['income']])

4. Creating New Features

✔ Mathematical Features

df['bmi'] = df['weight'] / (df['height']/100)**2

✔ Interaction Features (feature crossing)

df['income_per_age'] = df['income'] / df['age']
df['area'] = df['length'] * df['width']

✔ Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df[['area']])

5. Transformation of Variables

✔ Log Transformation (for skewed data)

import numpy as np

df['salary_log'] = np.log(df['salary'] + 1)

✔ Square Root

df['sqrt_amount'] = np.sqrt(df['amount'])

✔ Box-Cox Transformation

from scipy.stats import boxcox

# Box-Cox requires strictly positive input, hence the +1
df['box'], _ = boxcox(df['variable'] + 1)

6. Feature Extraction

✔ From Text using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['review'])

✔ From Images (using deep learning)

  • CNN features (ResNet, VGG, MobileNet)

  • Pretrained embeddings


7. Dimensionality Reduction

✔ PCA (Principal Component Analysis)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(df)

✔ t-SNE (for visualization)

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(df)

8. Handling Outliers

✔ IQR Method

Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['price'] > Q1 - 1.5*IQR) & (df['price'] < Q3 + 1.5*IQR)]

✔ Capping Outliers

df['income'] = df['income'].clip(lower=df['income'].quantile(0.01),
                                 upper=df['income'].quantile(0.99))

9. Date-Time Feature Engineering

Assume column: date

df['date'] = pd.to_datetime(df['date'])

✔ Extract:

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.dayofweek
df['is_weekend'] = (df['weekday'] >= 5).astype(int)

✔ Time difference

df['days_since_signup'] = (pd.Timestamp.today() - df['signup_date']).dt.days

10. Text Feature Engineering

✔ Token Count

df['word_count'] = df['review'].apply(lambda x: len(x.split()))

✔ Sentiment Polarity

from textblob import TextBlob

df['sentiment'] = df['review'].apply(lambda x: TextBlob(x).sentiment.polarity)

✔ Remove stopwords

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # needed once
stop = set(stopwords.words('english'))
df['clean_text'] = df['review'].apply(
    lambda x: " ".join(w for w in x.split() if w.lower() not in stop)
)

11. Feature Selection Techniques

✔ Filter Methods

from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative features
best = SelectKBest(score_func=chi2, k=5)
X_new = best.fit_transform(X, y)

✔ Wrapper Method (RFE)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)

✔ Embedded Methods (Lasso)

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(X, y)

End-to-End Feature Engineering Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

df = pd.read_csv("employees.csv")

# Missing values
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Encoding
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])

# Date-time
df['hire_date'] = pd.to_datetime(df['hire_date'])
df['experience_years'] = (pd.Timestamp.today() - df['hire_date']).dt.days / 365

# Feature creation
df['income_per_age'] = df['salary'] / df['age']

# Scaling
scaler = StandardScaler()
df[['salary_scaled']] = scaler.fit_transform(df[['salary']])

print(df.head())

Conclusion

Feature Engineering is the heart of Machine Learning.

It improves:

✔ Model accuracy
✔ Data quality
✔ Training speed
✔ Pattern extraction

It includes:

  • cleaning

  • creating

  • encoding

  • scaling

  • transforming

  • selecting

  • extracting

Good ML models are built not by algorithms but by strong features.

