Handling Missing Data in Python (Complete Guide)
Missing data is one of the most common problems in real-world datasets.
Python (especially Pandas + Scikit-Learn) provides powerful tools to:
- Detect missing data
- Analyze missing patterns
- Remove missing values
- Fill missing values (imputation)
- Use advanced ML-based imputers
Why Is Data Missing?
Typical causes:
- A user didn't fill in a form
- A system error while recording
- Sensor failure
- Corrupted data
- Combining different data sources
Checking Missing Data
Import dataset:
import pandas as pd
df = pd.read_csv("data.csv")
Check missing values in each column
df.isnull().sum()
Check which columns contain at least one missing value
df.isnull().any()
Percentage of missing values in each column
(df.isnull().sum() / len(df)) * 100
Visualizing missing data (optional)
import seaborn as sns
sns.heatmap(df.isnull(), cbar=False)
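To make the checks above concrete, here is a minimal sketch on a small made-up DataFrame. The column names and values are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; columns and values are invented for illustration
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, None],
    "salary": [50000, 62000, 58000, 71000],
})

missing_count = df.isnull().sum()        # missing values per column
missing_pct = df.isnull().mean() * 100   # the same information as a percentage
has_missing = df.isnull().values.any()   # one True/False for the whole frame

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
print("Any missing values:", has_missing)
```

Putting the count and the percentage side by side makes it easier to spot which columns need attention first.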
Types of Missing Data
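- MCAR: Missing Completely at Random (missingness is unrelated to any values in the data)
- MAR: Missing at Random (missingness depends only on other observed columns)
- MNAR: Missing Not at Random (missingness depends on the missing value itself)
The right handling strategy depends on the type, but in practice most cleaning is done with the pandas and ML-based techniques below.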
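Why the type matters can be seen in a small simulation. This is only a sketch on synthetic data: the income distribution, the 30% missing rate, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic incomes; the distribution is an arbitrary assumption
income = pd.Series(rng.normal(50_000, 10_000, size=10_000))

# MCAR: every value has the same 30% chance of being missing
mcar = income.mask(rng.random(len(income)) < 0.3)

# MNAR: the value itself decides whether it is missing (top 30% of incomes hidden)
mnar = income.mask(income > income.quantile(0.7))

print("true mean:", round(income.mean()))
print("MCAR mean:", round(mcar.mean()))   # stays close to the true mean
print("MNAR mean:", round(mnar.mean()))   # clearly lower than the true mean
```

Under MCAR the observed values remain a fair sample, so simple statistics stay close to the truth. Under MNAR the observed values are systematically different, so filling with the observed mean would carry that bias into the data.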
1. Removing Missing Data
Remove rows with ANY missing value
df.dropna(inplace=True)
Remove rows with missing values in specific columns
df.dropna(subset=['salary', 'age'], inplace=True)
Remove columns with many missing values
df.dropna(axis=1, thresh=int(0.7 * len(df)), inplace=True)  # keep columns with at least 70% non-missing data
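The three removal options behave quite differently, which a tiny example can show. The DataFrame below is hypothetical, and the 70% cut-off is simply the threshold used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'notes' is a mostly empty column
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 22],
    "salary": [50000, 62000, np.nan, 71000, 48000],
    "notes":  [np.nan, np.nan, np.nan, "ok", np.nan],
})

rows_any = df.dropna()                       # drops every row with at least one NaN
rows_subset = df.dropna(subset=["salary"])   # only 'salary' has to be present

# thresh is the minimum number of non-missing values a column needs to survive
min_non_missing = int(0.7 * len(df))
cols_kept = df.dropna(axis=1, thresh=min_non_missing)

print(len(rows_any), len(rows_subset), list(cols_kept.columns))
```

Because the sparse `notes` column makes almost every row incomplete, a plain `dropna()` keeps only one row here. Dropping the sparse column first, or restricting the check with `subset`, preserves far more data.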
2. Filling Missing Data (Imputation)
2.1 Fill with constant value
# Assign the result back; fillna(inplace=True) on a selected column may not update df in recent pandas
df['city'] = df['city'].fillna("Unknown")
2.2 Fill with mean (numeric)
df['age'] = df['age'].fillna(df['age'].mean())
2.3 Fill with median (good for skewed data)
df['income'] = df['income'].fillna(df['income'].median())
2.4 Fill with mode (categorical)
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
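The reason the median is preferred for skewed data shows up clearly with a single outlier. The numbers below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme outlier and one missing value
income = pd.Series([30_000, 32_000, 35_000, 1_000_000, np.nan])

mean_filled = income.fillna(income.mean())       # fill value is pulled up by the outlier
median_filled = income.fillna(income.median())   # fill value ignores the outlier

print("mean fill:  ", mean_filled.iloc[-1])
print("median fill:", median_filled.iloc[-1])
```

The mean fill lands at 274,250, far above three of the four observed incomes, while the median fill of 33,500 sits among the typical values.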
2.5 Forward Fill (FFill)
Used for time-series data: each gap takes the last known value.
df.ffill(inplace=True)
2.6 Backward Fill (BFill)
Each gap takes the next known value.
df.bfill(inplace=True)
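A short sketch of how forward and backward fill treat gaps at different positions. The daily readings are hypothetical, and the `limit` argument is shown as an optional extra.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps at the start, middle, and end
idx = pd.date_range("2024-01-01", periods=6, freq="D")
temp = pd.Series([np.nan, 20.0, np.nan, np.nan, 23.0, np.nan], index=idx)

forward = temp.ffill()           # carry the last known value forward
backward = temp.bfill()          # pull the next known value backward
both = temp.ffill().bfill()      # ffill cannot fill a leading gap, so bfill finishes it
limited = temp.ffill(limit=1)    # fill at most one consecutive missing value

print(pd.DataFrame({"raw": temp, "ffill": forward, "bfill": backward,
                    "both": both, "limit1": limited}))
```

Forward fill leaves the first reading empty and backward fill leaves the last one empty, so chaining them is a common way to cover both ends. The `limit` argument helps avoid stretching one old value across a long outage.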
3. Advanced Imputation Techniques
For more accuracy, use Scikit-Learn imputers.
3.1 K-Nearest Neighbors (KNN Imputer)
Fills each missing value using the most similar rows (nearest neighbors).
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
df[['age','salary']] = imputer.fit_transform(df[['age','salary']])
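To see what "nearest rows" means in practice, here is a minimal sketch with a hand-checkable result. The data is invented, and `n_neighbors=2` is chosen only to keep the arithmetic simple.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: two clusters of employees, one salary missing
df = pd.DataFrame({
    "age":    [25, 27, 26, 50, 52],
    "salary": [30_000, 32_000, np.nan, 90_000, 95_000],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df),
                       columns=df.columns, index=df.index)

# The 26-year-old's nearest neighbours by age are the 25- and 27-year-olds,
# so the missing salary becomes the average of 30,000 and 32,000
print(imputed.loc[2, "salary"])
```

Since KNN relies on distances, features on very different scales can dominate the neighbour search. Standardizing numeric columns before imputing is often worth considering.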
3.2 Iterative Imputer (ML-based)
Predicts each missing value from the other features.
# IterativeImputer is still experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
imp = IterativeImputer()
df[['age','income']] = imp.fit_transform(df[['age','income']])
3.3 SimpleImputer (basic imputation)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
df[['age']] = imp.fit_transform(df[['age']])
Strategies:
- mean
- median
- most_frequent
- constant
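One practical advantage of scikit-learn imputers over a plain `fillna` is that the fill value can be learned on training data and reused on new data. The sketch below assumes a simple train/test split with made-up numbers.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature arrays; values are invented for illustration
X_train = np.array([[1.0], [3.0], [np.nan], [100.0]])
X_test = np.array([[np.nan], [5.0]])

imp = SimpleImputer(strategy="median")
imp.fit(X_train)                        # learn the median from training data only
X_test_filled = imp.transform(X_test)   # reuse that median on unseen data

print("learned median:", imp.statistics_[0])
print(X_test_filled.ravel())
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step, which matters when the imputer is part of a model evaluation.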
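4. Imputing Categorical Data (Advanced)
Using category encoders + KNN
First convert categories to numbers, leaving missing values as NaN:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df['city'].dropna())
# Map each known category to its code; missing entries stay NaN so the imputer can see them
codes = dict(zip(le.classes_, range(len(le.classes_))))
df['city'] = df['city'].map(codes)
Then apply the KNN imputer, round the imputed codes, and convert them back with le.inverse_transform.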
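The full round trip (encode, impute, decode) might look like the sketch below. The data is hypothetical, and rounding the imputed code to the nearest integer is an assumption that works cleanly here only because there are two well-separated groups.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: city is missing for two people
df = pd.DataFrame({
    "age":  [25, 26, 27, 50, 51, 52],
    "city": ["Delhi", "Delhi", None, "Pune", "Pune", None],
})

# Encode only the known categories so missing entries stay NaN
le = LabelEncoder()
le.fit(df["city"].dropna())
codes = dict(zip(le.classes_, range(len(le.classes_))))
encoded = df.assign(city=df["city"].map(codes))

# Impute on the numeric frame, then round codes and decode back to labels
filled = KNNImputer(n_neighbors=2).fit_transform(encoded)
city_codes = np.rint(filled[:, 1]).astype(int)
df["city"] = le.inverse_transform(city_codes)

print(df)
```

Label codes impose an artificial order on categories, so with more than two categories an averaged code can land on an unrelated label. In that case a `most_frequent` fill or one-hot encoding may be a safer choice.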
5. Interpolation (Best for Time-Series)
df['temperature'] = df['temperature'].interpolate()
Common methods:
- linear
- polynomial
- spline
Example:
# 'polynomial' and 'spline' need SciPy installed and an explicit order
df['sales'] = df['sales'].interpolate(method='polynomial', order=2)
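One detail worth knowing is that the default linear method treats rows as equally spaced. The sketch below uses made-up readings with an uneven date index and compares it with pandas' `method="time"`, which is not in the list above and is included here only for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical readings on unevenly spaced dates
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
temp = pd.Series([10.0, np.nan, 40.0], index=idx)

linear = temp.interpolate()                 # ignores the index, takes the midpoint
by_time = temp.interpolate(method="time")   # weights by elapsed time

print("linear: ", linear.iloc[1])
print("by time:", by_time.iloc[1])
```

Linear interpolation gives 25.0, the midpoint of the two neighbours. The time-aware version gives 17.5, because the missing date is one day after the first reading and three days before the last.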
6. Replace Missing Values with Group Statistics
Useful for grouped data such as city-wise, gender-wise, etc.
Fill with group mean
df['income'] = df.groupby('city')['income'].transform(lambda x: x.fillna(x.mean()))
Fill with group median
df['age'] = df.groupby('gender')['age'].transform(lambda x: x.fillna(x.median()))
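Group-based filling has one edge case: a group with no observed values at all has no statistic to fill with. The sketch below uses invented data, and the fallback to the overall mean is one possible choice rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Goa has no observed income at all
df = pd.DataFrame({
    "city":   ["Delhi", "Delhi", "Delhi", "Pune", "Pune", "Goa"],
    "income": [40_000, 60_000, np.nan, 20_000, np.nan, np.nan],
})

# Per-row group mean, aligned with the original index
group_mean = df.groupby("city")["income"].transform("mean")
df["income_filled"] = df["income"].fillna(group_mean)

# Groups with no data stay NaN, so fall back to the overall mean
df["income_filled"] = df["income_filled"].fillna(df["income"].mean())

print(df)
```

Delhi's gap is filled with 50,000 and Pune's with 20,000, while Goa falls back to the overall mean of 40,000.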
7. Identify Missing Values Represented as Strings
Many datasets encode missing values as strings such as "N/A", "-", "none", or "empty".
Replace them with real missing values first:
df.replace(['N/A', 'NA', '-', 'none', 'None', 'empty', ''], np.nan, inplace=True)  # requires: import numpy as np
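When the data comes from a CSV file, the same clean-up can happen at load time through the `na_values` parameter of `pd.read_csv`. The sketch below reads from an in-memory string so it runs without a file; the rows are made up.

```python
import io
import pandas as pd

# Hypothetical CSV content with several placeholder styles
csv_text = "name,age,city\nAsha,31,N/A\nRavi,-,Pune\nMeera,28,none\n"

# "N/A" is already in pandas' default list; the extra markers are added here
df = pd.read_csv(io.StringIO(csv_text), na_values=["-", "none", "empty"])

print(df)
print(df.dtypes)
```

Handling placeholders at load time also lets pandas infer a numeric dtype for `age`, which would otherwise be read as text because of the "-" entry.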
8. Drop Missing Data Rows When Impact is Small
Rule of thumb:
Remove rows when:
- Fewer than 5% of rows have missing values
Remove a column when:
- 60-70% of its values are missing
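The rule of thumb can be turned into a small helper. This is only a sketch: `suggest_drops` is a hypothetical function, and its 5% and 70% defaults are adjustable assumptions taken from the guideline above.

```python
import numpy as np
import pandas as pd

def suggest_drops(df, row_limit=0.05, col_limit=0.70):
    """Return (columns worth dropping, whether dropping incomplete rows is cheap).

    Hypothetical helper; the default limits follow the rule of thumb
    and should be tuned per dataset.
    """
    col_missing = df.isnull().mean()                     # fraction missing per column
    drop_cols = col_missing[col_missing > col_limit].index.tolist()
    remaining = df.drop(columns=drop_cols)
    incomplete = remaining.isnull().any(axis=1).mean()   # fraction of rows with any NaN
    return drop_cols, bool(incomplete < row_limit)

# Hypothetical frame: 'a' is 2% missing, 'b' is 80% missing, 'c' is complete
df = pd.DataFrame({
    "a": [np.nan] * 2 + [1.0] * 98,
    "b": [np.nan] * 80 + [1.0] * 20,
    "c": range(100),
})
drop_cols, rows_ok = suggest_drops(df)
print(drop_cols, rows_ok)
```

Checking columns first matters: once the sparse column `b` is set aside, only 2% of rows are incomplete, so dropping those rows costs very little.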
9. Best Practices for Handling Missing Data
- Always explore missing data before dropping anything
- Use the median for skewed data
- Use the mode for categorical data
- Use KNN or iterative imputers when model quality matters
- For time series, use forward fill, backward fill, or interpolation
- Avoid dropping columns unless more than 70% of their values are missing
End-to-End Example
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
df = pd.read_csv("employees.csv")
# Replace string-based missing values
df.replace(['N/A','NA','None','-'], np.nan, inplace=True)
# Fill numerical values with median (assign back instead of using inplace on a column)
df['salary'] = df['salary'].fillna(df['salary'].median())
# Fill categorical values with mode
df['city'] = df['city'].fillna(df['city'].mode()[0])
# KNN Impute for age and experience
imputer = KNNImputer(n_neighbors=3)
df[['age','experience']] = imputer.fit_transform(df[['age','experience']])
print(df.head())
Conclusion
Handling missing data is one of the most crucial steps in data preprocessing.
Python offers:
- Simple Pandas operations
- Time-series friendly methods
- Advanced ML-based imputations
Proper handling of missing data leads to better analytics, better machine learning performance, and more reliable insights.