Data Cleaning and Preprocessing
📅 Nov 14, 2025
Introduction
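Data cleaning is one of the most important steps in a data science workflow. Real-world data is often messy, incomplete, and inconsistent, and a model trained on it inherits those flaws. Cleaning the data first makes model results more accurate and easier to trust.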
1. Handling Missing Values
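df.isnull().sum()  # count missing values per column
df = df.fillna(df.mean(numeric_only=True))  # numeric: fill with the column mean
df = df.fillna("Unknown")  # categorical: fill remaining gaps with a placeholder
df = df.dropna()  # alternative to filling: drop rows that contain missing values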
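The snippets above can be tried end to end on a small table. This is a minimal sketch on made-up data; the column names `age` and `city` and the helper variables are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Paris", None, "Rome"],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Split columns by type so each group gets a suitable fill value.
numeric_cols = df.select_dtypes(include="number").columns
text_cols = df.columns.difference(numeric_cols)

filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
filled[text_cols] = filled[text_cols].fillna("Unknown")

# Alternative: discard every row that has at least one missing value.
dropped = df.dropna()
```

Filling keeps all rows but injects estimated values, while dropping keeps only observed values at the cost of losing rows. Which one is appropriate depends on how much data is missing.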
2. Removing Duplicates
df.drop_duplicates(inplace=True)  # drop rows that exactly repeat an earlier row, keeping the first
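It can help to count duplicates before removing them, and to decide whether a row counts as a duplicate only when every column matches or when a key column matches. A small sketch on made-up data (the `name` and `age` columns are assumptions):

```python
import pandas as pd

# Hypothetical toy data; row 2 repeats row 0 exactly.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Ana"],
    "age": [30, 41, 30, 31],
})

# Number of rows that are identical to an earlier row.
n_exact = df.duplicated().sum()

# Remove exact duplicates, keeping the first occurrence.
deduped = df.drop_duplicates()

# Treat rows with the same name as duplicates, regardless of other columns.
by_name = df.drop_duplicates(subset=["name"], keep="first")
```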
3. Handling Outliers
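Using the IQR (interquartile range) method, keep only the values that lie within 1.5 × IQR of the first and third quartiles: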
Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1
filtered_df = df[(df["age"] >= Q1 - 1.5*IQR) & (df["age"] <= Q3 + 1.5*IQR)]
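A worked example of the IQR filter, as a sketch on made-up ages where 120 is the obvious outlier. Capping values with `clip` instead of dropping rows is an alternative that is not part of the original snippet.

```python
import pandas as pd

# Hypothetical toy data; 120 is an implausible age.
df = pd.DataFrame({"age": [22, 25, 27, 30, 31, 33, 35, 120]})

Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only rows whose age lies inside [lower, upper].
filtered_df = df[df["age"].between(lower, upper)]

# Alternative (an assumption, not from the text above): cap extreme
# values at the bounds instead of removing the rows.
capped = df["age"].clip(lower=lower, upper=upper)
```

Filtering shrinks the dataset, while capping keeps every row but changes the extreme values.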
4. Encoding Categorical Values
- Label Encoding: maps each category to an integer
- One-Hot Encoding: creates one binary column per category
pd.get_dummies(df["gender"])  # one-hot encoding
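Both encodings can be compared side by side. The sketch below uses `pd.factorize` for label encoding, which is one possible choice among several, and made-up values for the `gender` column.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Label encoding: one integer code per category.
# sort=True assigns codes in alphabetical order of the categories.
codes, categories = pd.factorize(df["gender"], sort=True)
df["gender_label"] = codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
```

Label encoding implies an order between categories (here female < male), which some models may misread as meaningful. One-hot encoding avoids that at the cost of extra columns.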
5. Feature Scaling
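from sklearn.preprocessing import StandardScaler
numeric_df = df.select_dtypes(include="number")  # the scaler only accepts numeric columns
scaler = StandardScaler()  # rescales each column to mean 0 and standard deviation 1
scaled = scaler.fit_transform(numeric_df)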
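`fit_transform` returns a NumPy array, so column names are lost. A minimal sketch of scaling only the numeric columns and wrapping the result back into a DataFrame (the toy columns are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one non-numeric column.
df = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [1000.0, 2000.0, 3000.0],
    "city": ["Paris", "Rome", "Oslo"],
})

# Scale numeric columns only; text columns would raise an error.
numeric_cols = df.select_dtypes(include="number").columns

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)
```

After scaling, each column has mean 0 and standard deviation 1, where the standard deviation is the population one (`ddof=0`).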
6. Normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()  # rescales each column to the range [0, 1]
normalized = scaler.fit_transform(df.select_dtypes(include="number"))
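In practice the scaler is usually fitted on training data and then reused on new data, so both are transformed with the same minimum and maximum. A sketch under that assumption, with made-up `train` and `test` tables:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the split into train and test is illustrative.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"age": [25.0, 50.0]})

scaler = MinMaxScaler()

# Learn min and max from the training data, then rescale it to [0, 1].
train_scaled = scaler.fit_transform(train)

# Reuse the training min and max; do not refit on the test data.
test_scaled = scaler.transform(test)
```

Values outside the training range, such as 50 here, map outside [0, 1] with the default settings.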
Conclusion
Clean, well-preprocessed data generally leads to better accuracy and more reliable machine learning models.