Handling Missing Data in Python

Python for Data Science | Nov 14, 2025

Handling Missing Data in Python (Complete Guide)

Missing data is one of the most common problems in real-world datasets.

Python (especially Pandas + Scikit-Learn) provides powerful tools to:

  • Detect missing data

  • Analyze missing patterns

  • Remove missing values

  • Fill missing values (imputation)

  • Use advanced ML-based imputers


Why Is Data Missing?

Typical causes:

  • User didn't fill in a form

  • System error while recording

  • Sensor failure

  • Corrupted data

  • Different data sources


Checking Missing Data

Import dataset:

import pandas as pd
df = pd.read_csv("data.csv")

✔ Check missing values in each column

df.isnull().sum()

✔ Check if any missing value exists

df.isnull().any()

✔ Percentage of missing values

(df.isnull().sum() / len(df)) * 100

✔ Visualizing missing data (Optional)

import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(df.isnull(), cbar=False)  # one cell per value; missing cells stand out
plt.show()
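To make the checks above concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration, not part of any real dataset). It collects the count and percentage of missing values into one summary table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, None],
    "salary": [50000, 62000, 58000, 71000],
})

missing_count = df.isnull().sum()        # missing cells per column
missing_pct = df.isnull().mean() * 100   # same as sum() / len(df) * 100
has_missing = df.isnull().any()          # True for columns with any gap

summary = pd.DataFrame({"count": missing_count, "percent": missing_pct})
print(summary)
print("Any missing value at all:", df.isnull().any().any())
```

Taking the mean of the boolean mask is a shortcut for dividing the count by the number of rows, so both forms give the same percentages.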

Types of Missing Data

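  1. MCAR – Missing Completely at Random: missingness is unrelated to any value in the dataset, observed or not.

  2. MAR – Missing at Random: missingness depends only on other observed columns.

  3. MNAR – Missing Not at Random: missingness depends on the unobserved value itself.

The right strategy depends on the type. Dropping rows is unbiased only under MCAR; under MAR, imputation that uses the other columns is usually the better choice; MNAR is the hardest case because the cause of the missingness is not in the data. In practice, most workflows rely on Pandas and ML-based techniques.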
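One informal way to probe the type of missingness is to compare an observed column between rows where another column is missing and rows where it is present. The sketch below uses made-up data in which income is missing only for young respondents; the data and the `income_missing` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing only for the youngest respondents.
df = pd.DataFrame({
    "age":    [22, 24, 23, 45, 50, 48, 52, 47],
    "income": [np.nan, np.nan, np.nan, 70, 82, 75, 90, 78],
})

# Indicator column: True where income is missing.
df["income_missing"] = df["income"].isnull()

# Compare an observed column (age) between the two groups.
age_when_missing = df.loc[df["income_missing"], "age"].mean()
age_when_present = df.loc[~df["income_missing"], "age"].mean()

print("Mean age, income missing:", age_when_missing)
print("Mean age, income present:", age_when_present)
```

A large gap between the two means suggests the missingness depends on age, which points away from MCAR. A check like this cannot detect MNAR, because that would require the unobserved values themselves.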


1. Removing Missing Data

✔ Remove rows with ANY missing value

df.dropna(inplace=True)

✔ Remove rows with missing values in specific columns

df.dropna(subset=['salary', 'age'], inplace=True)

✔ Remove columns with many missing values

df.dropna(axis=1, thresh=int(0.7 * len(df)), inplace=True)  # keep columns with at least 70% non-missing values
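The three removal options behave quite differently on the same data. The following sketch applies them to a small made-up DataFrame (column names and values are assumptions) and returns new frames instead of modifying in place, so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'notes' is 80% missing.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 29],
    "salary": [50000, 62000, np.nan, 71000, 45000],
    "notes":  [np.nan, np.nan, np.nan, np.nan, "ok"],
})

rows_any = df.dropna()                    # keep rows with no missing cell at all
rows_subset = df.dropna(subset=["age"])   # only 'age' has to be present

# thresh counts NON-missing values, so this keeps columns
# that are at least 70% filled.
min_non_null = int(0.7 * len(df))
cols_kept = df.dropna(axis=1, thresh=min_non_null)

print(rows_any.shape, rows_subset.shape, list(cols_kept.columns))
```

Note how dropping rows with any missing value leaves a single row here, because the mostly empty `notes` column disqualifies almost every row. Dropping sparse columns first and rows second usually keeps more data.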

2. Filling Missing Data (Imputation)

✔ 2.1 Fill with constant value

df['city'] = df['city'].fillna("Unknown")

✔ 2.2 Fill with mean (numeric)

df['age'] = df['age'].fillna(df['age'].mean())

✔ 2.3 Fill with median (good for skewed data)

df['income'] = df['income'].fillna(df['income'].median())

✔ 2.4 Fill with mode (categorical)

df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
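As a small worked example of the four fills above, the sketch below uses a made-up DataFrame (values chosen for illustration) where `income` contains one extreme value, which shows why the median is the safer choice for skewed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; income has one extreme value (1000).
df = pd.DataFrame({
    "age":    [20.0, 30.0, np.nan, 40.0, 30.0],
    "income": [30.0, 35.0, 40.0, np.nan, 1000.0],
    "gender": ["F", "M", "F", None, "F"],
    "city":   ["Pune", None, "Delhi", "Pune", None],
})

filled = df.assign(
    city=df["city"].fillna("Unknown"),                   # constant
    age=df["age"].fillna(df["age"].mean()),              # mean
    income=df["income"].fillna(df["income"].median()),   # median
    gender=df["gender"].fillna(df["gender"].mode()[0]),  # most frequent value
)

print(filled)
print("income mean vs median:", df["income"].mean(), df["income"].median())
```

Here the mean of `income` is pulled up to 276.25 by the single outlier, while the median stays at 37.5, much closer to the typical rows.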

✔ 2.5 Forward Fill (FFill)

Used for time-series data: each gap takes the last observed value before it.

df = df.ffill()

✔ 2.6 Backward Fill (BFill)

Each gap takes the next observed value after it.

df = df.bfill()
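Forward and backward fill each leave one kind of gap untouched: forward fill cannot fill a gap at the start, and backward fill cannot fill one at the end. The sketch below, on a made-up daily series, shows both, plus the `limit` argument for capping how far a value is carried.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps at the start, middle, and end.
ts = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2025-01-01", periods=6, freq="D"),
)

forward = ts.ffill()          # carry the last seen value forward
backward = ts.bfill()         # pull the next seen value backward
limited = ts.ffill(limit=1)   # fill at most one consecutive missing value
both = ts.ffill().bfill()     # forward first, then patch the leading gap

print(pd.DataFrame({"raw": ts, "ffill": forward, "bfill": backward,
                    "ffill_limit1": limited, "both": both}))
```

Chaining the two is a common way to end up with no gaps at all, though carrying values across long gaps can hide real changes in the series, which is where `limit` helps.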

3. Advanced Imputation Techniques

When a single column statistic is too crude, Scikit-Learn's imputers can estimate each missing value from the other columns.


✔ 3.1 K-Nearest Neighbors (KNN Imputer)

Uses nearest rows (similar samples) to fill missing values.

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])
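To see what the KNN imputer actually does, here is a minimal sketch on made-up data with two clear clusters. With `n_neighbors=2`, the missing salary of the 27-year-old is filled with the average salary of the two rows closest in age.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: a "young" cluster and an "older" cluster.
df = pd.DataFrame({
    "age":    [25.0, 26.0, 27.0, 50.0, 51.0],
    "salary": [30.0, 32.0, np.nan, 90.0, 95.0],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

# The two nearest rows by age are 26 and 25, so the gap becomes
# the mean of their salaries: (32 + 30) / 2 = 31.
print(imputed)
```

Because the imputer relies on distances, columns on very different scales can dominate the neighbor search. Scaling the features first is a common precaution, though whether it helps depends on the data.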

✔ 3.2 Iterative Imputer (ML-based)

Predicts each missing value from the other features.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer
imp = IterativeImputer()
df[['age', 'income']] = imp.fit_transform(df[['age', 'income']])
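The sketch below runs the iterative imputer on made-up data where income follows age almost perfectly, so the filled value should land near the linear trend. The data, `max_iter`, and `random_state` values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: income is exactly 2 * age where observed.
df = pd.DataFrame({
    "age":    [25.0, 30.0, 35.0, 40.0, 45.0, 50.0],
    "income": [50.0, 60.0, np.nan, 80.0, 90.0, 100.0],
})

# random_state makes repeated runs give the same result.
imp = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imp.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed)
```

The experimental import has to run before `IterativeImputer` is imported, otherwise Scikit-Learn raises an import error. By default the imputer fits a regularized linear model per column, so the filled value follows the trend rather than the column mean.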

✔ 3.3 SimpleImputer (basic imputation)

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
df[['age']] = imp.fit_transform(df[['age']])

Strategies:

  • mean

  • median

  • most_frequent

  • constant
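The strategies listed above can be compared on a small example. This sketch uses made-up data and shows `median` on a numeric column, then `most_frequent` and `constant` on a text column; the `"Unknown"` fill value is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; 'city' is kept as object dtype with NaN for gaps.
df = pd.DataFrame({
    "age":  [20.0, np.nan, 40.0, 60.0],
    "city": pd.Series(["Pune", "Delhi", np.nan, "Pune"], dtype=object),
})

# Numeric column: fill with the median of the observed values (40).
age_filled = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Text column: fill with the most frequent label ("Pune").
city_mode = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])

# Text column: fill with a fixed placeholder instead.
city_const = SimpleImputer(
    strategy="constant", fill_value="Unknown"
).fit_transform(df[["city"]])

print(age_filled.ravel(), city_mode.ravel(), city_const.ravel())
```

`mean` and `median` only work on numeric columns, while `most_frequent` and `constant` also accept text. Each call returns a NumPy array with one column per input column.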


4. Imputing Categorical Data (Advanced)

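✔ Using category encoders + KNN

First convert categories to numbers, leaving the missing entries as NaN so the imputer still has something to fill:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
known = df['city'].notna()  # encode only the labels that are present
codes = pd.Series(float('nan'), index=df.index)
codes[known] = le.fit_transform(df.loc[known, 'city'].astype(str))
df['city'] = codes  # numeric codes, NaN where the city was missing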

Then apply KNN imputer.
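Putting the two steps together, here is a minimal sketch of the full round trip: encode the known labels, impute the codes with KNN using a numeric helper column, then map the codes back to labels. The data and the `city_code` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: salary separates the two cities cleanly.
df = pd.DataFrame({
    "salary": [30.0, 31.0, 32.0, 90.0, 91.0, 92.0],
    "city": pd.Series(
        ["Agra", "Agra", np.nan, "Pune", "Pune", np.nan], dtype=object
    ),
})

le = LabelEncoder()
known = df["city"].notna()

# Encode only the known labels so that the gaps stay NaN.
codes = pd.Series(np.nan, index=df.index)
codes[known] = le.fit_transform(df.loc[known, "city"])

features = pd.DataFrame({"salary": df["salary"], "city_code": codes})
imputed = KNNImputer(n_neighbors=2).fit_transform(features)

# Round to the nearest valid code, then map codes back to labels.
city_codes = np.rint(imputed[:, 1]).astype(int)
df["city"] = le.inverse_transform(city_codes)

print(df)
```

One caveat: label codes impose an arbitrary order on the categories, and averaging codes across neighbors can land between two unrelated labels. This works best when the neighbors agree, as they do in this toy example.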


5. Interpolation (Best for Time-Series)

df['temperature'] = df['temperature'].interpolate()

Types:

  • linear

  • polynomial

  • spline

Example (the polynomial and spline methods require SciPy to be installed):

df['sales'] = df['sales'].interpolate(method='polynomial', order=2)
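The default linear method treats rows as equally spaced, which matters when timestamps are irregular. The sketch below, on made-up readings, shows plain linear interpolation and then compares it with `method='time'`, which weights by the actual gap between timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with gaps.
temperature = pd.Series(
    [20.0, np.nan, np.nan, 26.0, np.nan, 30.0],
    index=pd.date_range("2025-01-01", periods=6, freq="D"),
)
linear = temperature.interpolate()  # default method='linear'

# Unevenly spaced readings: 1 day, then 3 days between observations.
uneven = pd.Series(
    [10.0, np.nan, 18.0],
    index=pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-05"]),
)
by_position = uneven.interpolate()             # ignores the timestamps
by_time = uneven.interpolate(method="time")    # uses the timestamps

print(linear.tolist())
print(by_position.iloc[1], by_time.iloc[1])
```

For the uneven series, the positional version fills the gap with the midpoint (14), while the time-aware version gives 12 because the missing reading is only a quarter of the way between the two observations.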

6. Replace Missing Values with Group Statistics

Useful for grouped data such as city-wise, gender-wise, etc.

✔ Fill with group mean

df['income'] = df.groupby('city')['income'].transform(lambda x: x.fillna(x.mean()))

✔ Fill with group median

df['age'] = df.groupby('gender')['age'].transform(lambda x: x.fillna(x.median()))
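Group-based filling has one edge case worth knowing: a group with no observed values has no statistic to fill with, so its gaps remain. The sketch below uses made-up data to show this and adds an overall mean as a fallback, which is one possible choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; Goa has no observed income at all.
df = pd.DataFrame({
    "city":   ["Pune", "Pune", "Pune", "Delhi", "Delhi", "Goa"],
    "income": [40.0, np.nan, 60.0, 100.0, np.nan, np.nan],
})

# Fill each gap with the mean of its own city.
group_filled = df.groupby("city")["income"].transform(
    lambda x: x.fillna(x.mean())
)

# Goa's group mean is NaN, so fall back to the overall observed mean.
final = group_filled.fillna(df["income"].mean())

print(pd.DataFrame({"city": df["city"], "group_filled": group_filled,
                    "final": final}))
```

The Pune gap becomes 50 and the Delhi gap becomes 100, while the Goa row stays empty until the fallback is applied.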

7. Identify Missing Values Represented as Strings

Many datasets encode missing values as placeholder strings such as "N/A", "-", "None", or an empty string.

Replace them with real missing values first:

import numpy as np
df.replace(['N/A', 'NA', '-', 'None', ''], np.nan, inplace=True)
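Placeholder strings can also be handled while loading the file, which keeps numeric columns numeric from the start. The sketch below reads a small in-memory CSV (the contents and the placeholders `"-"`, `"missing"`, and `"unknown"` are made up for illustration) in two ways: declaring the placeholders up front, and cleaning up after loading.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content with custom placeholders for missing values.
csv_text = (
    "name,age,city\n"
    "Asha,31,Pune\n"
    "Ravi,-,unknown\n"
    "Meera,missing,Delhi\n"
)

# Option 1: declare the placeholders while reading.
df_a = pd.read_csv(
    io.StringIO(csv_text), na_values=["-", "missing", "unknown"]
)

# Option 2: load as-is, then clean column by column.
df_b = pd.read_csv(io.StringIO(csv_text))
# errors="coerce" turns anything non-numeric into NaN.
df_b["age"] = pd.to_numeric(df_b["age"], errors="coerce")
df_b["city"] = df_b["city"].replace("unknown", np.nan)

print(df_a.dtypes)
print(df_a.isna().sum())
```

In the second option, `age` is first read as text because of the placeholders, which is why the explicit numeric conversion is needed before any mean or median can be computed.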

8. Drop Missing Data Rows When Impact is Small

Rule of thumb:

✔ Remove row when:

  • < 5% of the data is missing

✔ Remove column when:

  • 60-70% of its values are missing
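The rule of thumb can be turned into a small helper. The function below is a hypothetical sketch, not a standard Pandas feature: its name and the default thresholds (5% for rows, 70% for columns) are assumptions based on the rough figures above, and should be tuned per dataset.

```python
import numpy as np
import pandas as pd

def drop_by_rule_of_thumb(df, row_limit=0.05, col_limit=0.70):
    """Hypothetical helper applying the rule-of-thumb thresholds."""
    # 1) Drop columns whose share of missing values exceeds col_limit.
    col_missing = df.isnull().mean()
    kept = df.loc[:, col_missing <= col_limit]

    # 2) Drop incomplete rows only if they are a small share of the data.
    incomplete = kept.isnull().any(axis=1)
    if incomplete.mean() < row_limit:
        kept = kept.loc[~incomplete]
    return kept

# Made-up data: 'a' has 1 gap in 40 rows, 'b' is 90% missing.
df = pd.DataFrame({"a": np.arange(40, dtype=float), "b": np.nan})
df.loc[0, "a"] = np.nan
df.loc[:3, "b"] = 1.0

cleaned = drop_by_rule_of_thumb(df)
print(df.shape, "->", cleaned.shape)
```

Columns are dropped before rows on purpose: a mostly empty column would otherwise make nearly every row look incomplete and block the row-level cleanup.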


9. Best Practices for Handling Missing Data

  • Always explore missing data before dropping

  • Use median for skewed data

  • Use mode for categorical data

  • Use KNN/Iterative imputer for better ML models

  • Time-series → Use forward fill, backward fill, interpolation

  • Avoid dropping columns unless missing > 70%


End-to-End Example

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("employees.csv")

# Replace string-based missing values
df.replace(['N/A', 'NA', 'None', '-'], np.nan, inplace=True)

# Columns that held placeholder strings were read as text; convert them back to numbers
num_cols = ['salary', 'age', 'experience']
df[num_cols] = df[num_cols].apply(pd.to_numeric)

# Fill numerical values with median
df['salary'] = df['salary'].fillna(df['salary'].median())

# Fill categorical values with mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# KNN Impute for age and experience
imputer = KNNImputer(n_neighbors=3)
df[['age', 'experience']] = imputer.fit_transform(df[['age', 'experience']])

print(df.head())

Conclusion

Handling missing data is one of the most crucial steps in data preprocessing.
Python offers:

  • Simple Pandas operations

  • Time-series friendly methods

  • Advanced ML-based imputations

Proper handling of missing data leads to better analytics, better machine learning performance, and more reliable insights.

