Data Wrangling Using Python

📘 Python for Data Science 📅 Nov 14, 2025

Data wrangling is the process of transforming raw, messy, or incomplete data into a clean, usable form for analysis and machine learning.

It includes:

  • Data collection

  • Data cleaning

  • Data transformation

  • Data integration

  • Data reduction

  • Feature engineering

The main Python libraries used are:

  • Pandas → for data manipulation

  • NumPy → for numerical operations

  • Matplotlib/Seaborn → for visualization

  • Regex → for text cleaning

  • SQL → for structured storage


1. Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading Raw Data

Real-world data comes in different formats.

✔ CSV File

df = pd.read_csv("data.csv")

✔ Excel File

df = pd.read_excel("data.xlsx")

✔ JSON File

df = pd.read_json("data.json")

✔ Database (SQL)

import sqlite3
con = sqlite3.connect("db.sqlite")
df = pd.read_sql("SELECT * FROM customers", con)

3. Understanding the Data (Exploration)

Use these methods to inspect the raw dataset:

df.head()
df.tail()
df.info()
df.describe()
df.shape
df.dtypes
df.isnull().sum()

This helps identify:

  • Missing data

  • Incorrect types

  • Outliers

  • Incorrect formatting

  • Duplicates
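The checks above can be seen on a small made-up DataFrame (the column names and values here are purely illustrative):

```python
import pandas as pd
import numpy as np

# Toy dataset containing missing values and a duplicate row
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Bob", "Cleo"],
    "age": [25, np.nan, np.nan, 41],
})

print(df.shape)               # (4, 2)
print(df.isnull().sum())      # "age" has 2 missing values
print(df.duplicated().sum())  # 1 fully duplicated row
```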


4. Data Cleaning

✔ 4.1 Handling Missing Values

Find missing values

df.isnull().sum()

Remove rows with missing values

df.dropna(inplace=True)

Fill missing values

df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna("Unknown")
df.fillna(0, inplace=True)

(Note: assigning the result back, rather than calling fillna(inplace=True) on a single column, avoids the chained-assignment warning in recent pandas versions.)
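Here is a minimal sketch of both fill strategies on made-up data (values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Pune", None, "Delhi"],
})

# Numeric column: fill with the column mean; text column: fill with a placeholder
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna("Unknown")

print(df["age"].tolist())   # [20.0, 30.0, 40.0]
print(df["city"].tolist())  # ['Pune', 'Unknown', 'Delhi']
```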

✔ 4.2 Handling Duplicates

Find duplicates

df.duplicated().sum()

Remove duplicates

df.drop_duplicates(inplace=True)

✔ 4.3 Cleaning Inconsistent Data

Standardize text

df['name'] = df['name'].str.strip().str.title()

Remove special characters (Regex)

df['phone'] = df['phone'].str.replace('[^0-9]', '', regex=True)

Fix inconsistent categories

# "female" must be replaced before "male", or "female" would first become "femf"
df['gender'] = df['gender'].str.lower().str.replace("female", "f").str.replace("male", "m")
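The strip/title and regex cleanups can be sketched together on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  alice smith ", "BOB JONES"],
    "phone": ["+91 (98765) 43210", "98-76-54"],
})

# Standardize names: trim whitespace, use Title Case
df["name"] = df["name"].str.strip().str.title()

# Keep only digits in phone numbers
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True)

print(df["name"].tolist())   # ['Alice Smith', 'Bob Jones']
print(df["phone"].tolist())  # ['919876543210', '987654']
```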

5. Data Transformation

Data transformation improves data quality and prepares it for analysis.

✔ 5.1 Convert Data Types

df['date'] = pd.to_datetime(df['date'])
df['price'] = df['price'].astype(float)
df['category'] = df['category'].astype('category')
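A quick sketch of these conversions on made-up string data:

```python
import pandas as pd

# Dates and prices often arrive as strings
df = pd.DataFrame({"date": ["2025-01-15", "2025-02-01"], "price": ["10.5", "20"]})

df["date"] = pd.to_datetime(df["date"])
df["price"] = df["price"].astype(float)

print(df["date"].dt.year.tolist())  # [2025, 2025]
print(df["price"].sum())            # 30.5
```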

✔ 5.2 Creating New Columns (Feature Engineering)

Create derived features

df['total_amount'] = df['quantity'] * df['unit_price']

Extract from date

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
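Both kinds of derived features can be sketched on a tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [2, 3],
    "unit_price": [10.0, 5.0],
    "date": pd.to_datetime(["2025-03-14", "2024-12-01"]),
})

# Arithmetic feature and date-part features
df["total_amount"] = df["quantity"] * df["unit_price"]
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

print(df["total_amount"].tolist())  # [20.0, 15.0]
print(df["year"].tolist())          # [2025, 2024]
```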

✔ 5.3 Binning and Categorization

Numeric to category

df['age_group'] = pd.cut(df['age'], bins=[0,18,40,60,100], labels=['Teen','Adult','Middle','Senior'])
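With pd.cut, each value falls into the bin whose upper edge it does not exceed (bins are right-inclusive by default). A minimal sketch on illustrative ages:

```python
import pandas as pd

ages = pd.DataFrame({"age": [15, 30, 50, 75]})

# (0,18] -> Teen, (18,40] -> Adult, (40,60] -> Middle, (60,100] -> Senior
ages["age_group"] = pd.cut(
    ages["age"],
    bins=[0, 18, 40, 60, 100],
    labels=["Teen", "Adult", "Middle", "Senior"],
)

print(ages["age_group"].tolist())  # ['Teen', 'Adult', 'Middle', 'Senior']
```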

✔ 5.4 Scaling & Normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['salary_scaled']] = scaler.fit_transform(df[['salary']])
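If scikit-learn is not available, min-max scaling is just (x − min) / (max − min) and can be done in plain pandas; this sketch uses an illustrative salary column:

```python
import pandas as pd

df = pd.DataFrame({"salary": [30000.0, 50000.0, 70000.0]})

# Min-max scaling by hand: maps the minimum to 0 and the maximum to 1
lo, hi = df["salary"].min(), df["salary"].max()
df["salary_scaled"] = (df["salary"] - lo) / (hi - lo)

print(df["salary_scaled"].tolist())  # [0.0, 0.5, 1.0]
```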

6. Data Integration

Integration combines multiple datasets into one.

✔ 6.1 Merge DataFrames

df_merged = pd.merge(df1, df2, on='id', how='inner')

✔ 6.2 Concatenate

df_concat = pd.concat([df1, df2], axis=0)

✔ 6.3 Join

df_join = df1.join(df2.set_index('id'), on='id')
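An inner merge keeps only the keys present in both frames, as this sketch on two made-up tables shows:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Bob", "Cleo"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "sales": [100, 200, 300]})

# Only ids 2 and 3 appear in both frames, so the inner merge yields 2 rows
merged = pd.merge(df1, df2, on="id", how="inner")

print(merged.shape)           # (2, 3)
print(merged["id"].tolist())  # [2, 3]
```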

7. Data Aggregation & Grouping

✔ Group by a column

df.groupby('city')['sales'].sum()

✔ Multiple aggregations

df.groupby('city').agg({'sales':'sum', 'quantity':'mean'})
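The grouped aggregation can be sketched on illustrative sales rows; each city becomes one row of the result:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi"],
    "sales": [100, 200, 50],
    "quantity": [1, 3, 2],
})

# Sum sales but average quantity, per city
agg = df.groupby("city").agg({"sales": "sum", "quantity": "mean"})

print(agg.loc["Pune", "sales"])      # 300
print(agg.loc["Delhi", "quantity"])  # 2.0
```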

8. Data Reshaping

✔ Pivot Table

pivot = df.pivot_table(values='sales', index='city', columns='month', aggfunc='sum')

✔ Melt (Unpivot)

df_melt = pd.melt(df, id_vars=['id'], value_vars=['sales', 'profit'])
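Pivot and melt are inverses: pivot spreads one column into wide columns, melt folds them back to long form. A sketch on made-up sales rows:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi"],
    "month": [1, 2, 1],
    "sales": [100, 200, 50],
})

# Wide layout: one row per city, one column per month
pivot = df.pivot_table(values="sales", index="city", columns="month", aggfunc="sum")
print(pivot.loc["Pune", 1])  # 100

# Fold the wide layout back into long form (2 cities x 2 months = 4 rows)
long_df = pivot.reset_index().melt(id_vars="city", value_name="sales")
print(long_df.shape)  # (4, 3)
```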

9. Working with Large Datasets

When data is too big to load into memory:

✔ Read in chunks

for chunk in pd.read_csv("bigdata.csv", chunksize=50000):
    print(chunk.shape)
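To make the chunked pattern self-contained, this sketch first writes a small CSV (the filename "bigdata_demo.csv" is just for the example), then aggregates it a few rows at a time:

```python
import pandas as pd

# Create a small CSV on disk so the chunked read has something to process
pd.DataFrame({"x": range(10)}).to_csv("bigdata_demo.csv", index=False)

# Read 4 rows at a time, keeping only a running total in memory
total = 0
for chunk in pd.read_csv("bigdata_demo.csv", chunksize=4):
    total += chunk["x"].sum()

print(total)  # 45, i.e. 0 + 1 + ... + 9
```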

✔ Reduce memory usage

df['category'] = df['category'].astype('category')
df['id'] = df['id'].astype('int32')

✔ Use Dask for big data

import dask.dataframe as dd

df = dd.read_csv("bigdata.csv")

10. Data Visualization (Quick Overview)

✔ Bar Plot

sns.barplot(x='city', y='sales', data=df)

✔ Histogram

df['age'].hist()

✔ Heatmap

sns.heatmap(df.corr(numeric_only=True), annot=True)

11. Exporting Cleaned Data

Save to CSV

df.to_csv("clean_data.csv", index=False)

Save to Excel

df.to_excel("clean_data.xlsx", index=False)

End-to-End Example (Real Data Wrangling)

import pandas as pd

df = pd.read_csv("sales.csv")

# Clean
df.drop_duplicates(inplace=True)
df['date'] = pd.to_datetime(df['date'])
df['sales'] = df['sales'].fillna(df['sales'].mean())

# Transform
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Wrangle
monthly_sales = df.groupby(['year', 'month'])['sales'].sum().reset_index()

# Export
monthly_sales.to_csv("monthly_sales.csv", index=False)

Conclusion

Data wrangling is one of the most critical steps in data science.
It covers cleaning, transforming, merging, reshaping, reducing, and exporting data so it is ready for analytics or machine learning.

