Data Cleaning and Preprocessing

Data Science, Nov 14, 2025

Introduction

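Data cleaning is one of the most important steps in a data science workflow. Real-world data is often messy, incomplete, and inconsistent, and a model trained on such data inherits those problems. Cleaning and preprocessing the data first gives the model a more reliable basis.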

1. Handling Missing Values


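df.isnull().sum()                            # count missing values per column
df = df.fillna(df.mean(numeric_only=True))   # numeric columns: fill with the column mean
df = df.fillna("Unknown")                    # remaining (categorical) columns: fill with a placeholder
df = df.dropna()                             # alternative: drop rows that contain missing values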
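
As a minimal sketch, the calls above can be tried on a small made-up DataFrame. The column names `age` and `city` and the values are illustrative assumptions, not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Pune", None, "Delhi"],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Fill numeric gaps with the column mean, then label the missing category.
filled = df.fillna(df.mean(numeric_only=True))
filled = filled.fillna({"city": "Unknown"})

# Alternatively, drop every row that has at least one missing value.
dropped = df.dropna()
```

Filling keeps all three rows, while `dropna()` keeps only the two complete ones, so the choice depends on how much data can be spared.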
  

2. Removing Duplicates


df = df.drop_duplicates()   # drop rows that are exact copies of an earlier row
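
Duplicates are not always exact copies of a whole row. The sketch below, on made-up data with hypothetical `email` and `plan` columns, counts exact duplicates and then uses the `subset` and `keep` parameters of `drop_duplicates` to keep only the last row per email.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com", "a@example.com"],
    "plan": ["free", "free", "pro", "pro"],
})

# Number of rows that repeat an earlier row in every column.
n_exact = df.duplicated().sum()

# Remove exact duplicates only.
exact = df.drop_duplicates()

# Treat rows with the same email as duplicates and keep the last one seen.
by_email = df.drop_duplicates(subset=["email"], keep="last")
```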
  

3. Handling Outliers

Using the IQR (interquartile range) method, values more than 1.5 × IQR below Q1 or above Q3 are treated as outliers:


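Q1 = df["age"].quantile(0.25)   # first quartile
Q3 = df["age"].quantile(0.75)   # third quartile
IQR = Q3 - Q1                   # interquartile range

# keep only rows whose age lies within 1.5 * IQR of the quartiles
filtered_df = df[(df["age"] >= Q1 - 1.5*IQR) & (df["age"] <= Q3 + 1.5*IQR)]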
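
The same rule can be wrapped in a small helper so it is reusable across columns. This is a sketch under assumptions: `iqr_bounds` is a hypothetical helper name, and the ages are made-up values with one obvious outlier.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data: seven plausible ages and one suspicious value.
df = pd.DataFrame({"age": [22, 25, 27, 30, 31, 33, 35, 120]})

lower, upper = iqr_bounds(df["age"])

# Rows outside the limits are flagged; rows inside them are kept.
outliers = df[(df["age"] < lower) | (df["age"] > upper)]
filtered_df = df[df["age"].between(lower, upper)]
```

Inspecting `outliers` before dropping them is usually worthwhile, since an extreme value can be a data-entry error or a genuine observation.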
  

4. Encoding Categorical Values

  • Label Encoding: replaces each category with an integer code
  • One-Hot Encoding: creates one 0/1 indicator column per category

import pandas as pd

pd.get_dummies(df["gender"])   # one-hot encoding of the gender column
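
The two encodings listed above can be compared side by side. In this sketch the label encoding is done with pandas category codes rather than a dedicated encoder class, and the `gender` values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"gender": ["female", "male", "female", "other"]})

# Label encoding: categories are sorted, then numbered 0, 1, 2, ...
df["gender_label"] = df["gender"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, stored as integers.
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
```

Label encoding implies an order between the codes, which suits ordinal categories. One-hot encoding avoids that implied order at the cost of extra columns.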
  

5. Feature Scaling


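from sklearn.preprocessing import StandardScaler

num_cols = df.select_dtypes(include="number").columns   # scalers only accept numeric columns
scaler = StandardScaler()                                # rescales each column to mean 0, std 1
scaled = scaler.fit_transform(df[num_cols])              # returns a NumPy array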
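
To see what standardization does to the numbers, the same transformation can be written by hand in pandas. This is a sketch on made-up columns (`height_cm` and `income` are assumptions); it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import pandas as pd

# Hypothetical toy data: two columns on very different scales.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20000.0, 40000.0, 60000.0, 80000.0],
})

# z = (x - mean) / std, computed column by column.
standardized = (df - df.mean()) / df.std(ddof=0)
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates a distance-based or gradient-based model simply because of its units.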
  

6. Normalization


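from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                                                 # rescales each column to the range [0, 1]
normalized = scaler.fit_transform(df.select_dtypes(include="number"))   # numeric columns only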
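
Min-max normalization can also be written out directly, which makes the formula explicit. The sketch below uses made-up `age` and `score` columns and assumes every column has at least two distinct values, since a constant column would cause a division by zero here.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"age": [18, 30, 42, 66], "score": [0.5, 2.0, 3.5, 5.0]})

# x_norm = (x - min) / (max - min), computed column by column.
col_min, col_max = df.min(), df.max()
normalized = (df - col_min) / (col_max - col_min)
```

Each column now runs from 0 at its smallest value to 1 at its largest, with the values in between keeping their relative spacing.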
  

Conclusion

Clean, well-preprocessed data generally leads to more accurate and more reliable machine learning models.

