Data Cleaning and Preprocessing

📘 Data Science 👁 91 views 📅 Nov 14, 2025

Introduction

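Data cleaning is one of the most important steps in a data science workflow. Real-world data is often messy, incomplete, and inconsistent, and a model trained on such data inherits those flaws. Cleaning and preprocessing the data first makes the resulting models more accurate and more reliable.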

1. Handling Missing Values


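df.isnull().sum()                             # count missing values per column
df = df.fillna(df.mean(numeric_only=True))    # numeric: fill with the column mean
df = df.fillna("Unknown")                     # categorical: label the remaining gaps
df = df.dropna()                              # alternative: drop rows with missing values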
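
As a minimal sketch of the steps above, the example below builds a small DataFrame and fills its gaps. The column names and values are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in a numeric and one in a text column
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Pune", None, "Delhi"],
})

# Count missing values per column
missing_counts = df.isnull().sum()

# Fill numeric gaps with the column mean, then label the remaining
# (categorical) gaps with a placeholder
filled = df.fillna(df.mean(numeric_only=True)).fillna("Unknown")

print(missing_counts.to_dict())
print(filled)
```

Filling with the mean is only one option. For skewed columns, the median is often a safer choice because it is less affected by extreme values.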
  

2. Removing Duplicates


df = df.drop_duplicates()   # drop rows that exactly repeat an earlier row
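
It can help to count duplicates before dropping them, and to decide which columns define a "duplicate". The sketch below uses a hypothetical table of customer records. The `subset` and `keep` parameters are standard `drop_duplicates` options, while the data itself is invented.

```python
import pandas as pd

# Hypothetical customer records; rows 0 and 1 are exact copies
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "plan": ["free", "free", "free", "pro"],
})

# Count fully duplicated rows before removing them
n_exact = int(df.duplicated().sum())

# Drop exact duplicates (the first occurrence is kept by default)
deduped = df.drop_duplicates()

# Treat rows as duplicates when only "email" matches, keeping the last record
latest_per_email = df.drop_duplicates(subset=["email"], keep="last")

print(n_exact)
print(latest_per_email)
```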
  

3. Handling Outliers

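Using the interquartile range (IQR) method, values more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3) are treated as outliers: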


Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1

filtered_df = df[(df["age"] >= Q1 - 1.5*IQR) & (df["age"] <= Q3 + 1.5*IQR)]
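
The same rule can be wrapped in a small reusable helper. This is a sketch only: the function name `iqr_bounds`, the multiplier parameter `k`, and the sample ages are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages; 120 is an obvious outlier
df = pd.DataFrame({"age": [22, 25, 27, 30, 31, 33, 120]})

lower, upper = iqr_bounds(df["age"])

# Keep only the rows whose age lies within the limits (inclusive)
filtered_df = df[df["age"].between(lower, upper)]

print(lower, upper)
print(filtered_df["age"].tolist())
```

Here Q1 is 26 and Q3 is 32, so the limits are 17 and 41 and only the value 120 is removed.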
  

4. Encoding Categorical Values

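  • Label Encoding: replaces each category with an integer code
  • One-Hot Encoding: creates one binary column per category

import pandas as pd

df = pd.get_dummies(df, columns=["gender"])   # replace "gender" with one indicator column per category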
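
To compare the two encodings side by side, here is a minimal sketch on a hypothetical `size` column. It uses `pd.Categorical` for the label encoding, which is one of several ways to produce integer codes, and the category order is an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical ordinal feature
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Label encoding: map each category to an integer code.
# An explicit order keeps the codes meaningful for ordinal data.
order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=order, ordered=True).codes

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(df["size"], prefix="size", dtype=int)

print(df)
print(one_hot)
```

Label encoding implies an order between categories, so it suits ordinal features such as sizes. One-hot encoding avoids that implied order and is usually the safer choice for nominal features such as gender or city.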
  

5. Feature Scaling


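from sklearn.preprocessing import StandardScaler

num_cols = df.select_dtypes(include="number").columns   # scale numeric columns only
scaler = StandardScaler()                                # rescales each column to mean 0, std 1
df[num_cols] = scaler.fit_transform(df[num_cols])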
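
To see what standardization does to a single column, the sketch below applies the formula z = (x - mean) / std by hand using only the standard library. The helper name `standardize` and the sample values are illustrative. It uses the population standard deviation, which is what `StandardScaler` uses, and it does not handle a constant column (where the standard deviation is zero).

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (divides by n, not n - 1)
    return [(v - mean) / std for v in values]

# Hypothetical ages
ages = [20.0, 30.0, 40.0]
z = standardize(ages)

print(z)
print(fmean(z), pstdev(z))
```

After scaling, the column has a mean of 0 and a standard deviation of 1, so features measured in different units become comparable.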
  

6. Normalization


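from sklearn.preprocessing import MinMaxScaler

num_cols = df.select_dtypes(include="number").columns   # scale numeric columns only
scaler = MinMaxScaler()                                  # rescales each column to the range [0, 1]
df[num_cols] = scaler.fit_transform(df[num_cols])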
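
Min-max normalization maps the smallest value in a column to 0 and the largest to 1. The sketch below writes out that formula in plain Python. The helper name `min_max_scale` and the income figures are assumptions for illustration, and the sketch does not handle a constant column (where max equals min).

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so min maps to the low end and max to the high end."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]

# Hypothetical incomes
incomes = [20_000, 35_000, 50_000]
scaled = min_max_scale(incomes)

print(scaled)
```

Because the result depends on the minimum and maximum, a single extreme value can squeeze all other values into a narrow band, so it is usually worth handling outliers before normalizing.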
  

Conclusion

Clean and preprocessed data leads to better accuracy and more reliable machine learning models.

