Working with real-world data is a fundamental task in data science and analytics. Unlike clean datasets used for learning purposes, real data is often messy, incomplete, inconsistent, unstructured, and large. Python provides powerful libraries to handle these challenges effectively.
The main libraries used are:
- Pandas → Data loading, cleaning, analysis
- NumPy → Numerical operations
- Matplotlib / Seaborn → Visualization
- Scikit-learn → ML preprocessing
- Requests / BeautifulSoup → Web data extraction
This answer explains all major steps involved in real-world data processing.
-----------------------------------------
1. Importing Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
-----------------------------------------
2. Loading Real Data
Real data may come in different formats:
✔ CSV File
df = pd.read_csv("data.csv")
✔ Excel File
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
✔ JSON File
df = pd.read_json("data.json")
✔ Database (MySQL example)
import sqlalchemy
engine = sqlalchemy.create_engine("mysql+pymysql://root:123@localhost/dbname")
df = pd.read_sql("SELECT * FROM table_name", engine)
✔ Web API
import requests
data = requests.get("https://api.example.com/users").json()
df = pd.DataFrame(data)
✔ HTML Tables (Web Scraping)
df_list = pd.read_html("https://example.com/table-page")
df = df_list[0]
-----------------------------------------
3. Understanding the Data
After loading, we inspect the dataset to get an overview.
df.head()
df.tail()
df.info()
df.describe()
df.shape
df.columns
These help identify:
- Data types
- Missing values
- Outliers
- Basic statistics
- Number of rows and columns
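As a quick illustration of these inspection calls, here is a sketch on a small made-up DataFrame (the column names and values are invented for the example, not taken from a real file):

```python
import pandas as pd
import numpy as np

# Small made-up dataset standing in for a real file
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara", "Dan"],
    "Age": [25, np.nan, 41, 33],
    "City": ["pune", "Delhi ", "Pune", "delhi"],
})

print(df.shape)           # (4, 3) -> rows, columns
print(list(df.columns))   # ['Name', 'Age', 'City']
print(df.isnull().sum())  # missing values per column (Age has 1)
df.info()                 # dtypes and non-null counts
```

Note how the inspection immediately reveals the problems the cleaning step will fix: a missing Age and inconsistent City spellings.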
-----------------------------------------
4. Data Cleaning (Most Important Step)
Real datasets often contain problems like missing values, duplicates, wrong data types, inconsistent text, etc.
✔ Checking Missing Values
df.isnull().sum()
✔ Filling Missing Values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['City'] = df['City'].fillna(df['City'].mode()[0])
✔ Removing Rows with Missing Values
df.dropna(subset=['Salary'], inplace=True)
✔ Handling Duplicates
df.drop_duplicates(inplace=True)
✔ Converting Data Types
df['Date'] = pd.to_datetime(df['Date'])
df['Sales'] = df['Sales'].astype(float)
✔ Fixing Text Data
df['City'] = df['City'].str.strip().str.title()
✔ Handling Outliers (IQR Method)
Q1 = df['Amount'].quantile(0.25)
Q3 = df['Amount'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Amount'] >= Q1 - 1.5*IQR) & (df['Amount'] <= Q3 + 1.5*IQR)]
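The IQR filter above can be wrapped into a small reusable helper. This is a sketch; the column name "Amount" and the sample values are illustrative:

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where df[col] falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    mask = df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Example: 1000 is a clear outlier among small amounts
data = pd.DataFrame({"Amount": [10, 12, 11, 13, 9, 1000]})
clean = remove_outliers_iqr(data, "Amount")  # the 1000 row is dropped
```

Returning a filtered copy (rather than mutating in place) makes the helper easy to chain with other cleaning steps.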
-----------------------------------------
5. Exploratory Data Analysis (EDA)
EDA helps discover patterns, relationships, and anomalies in data.
✔ Summary Statistics
df.describe(include='all')
✔ Correlation Matrix
df.corr(numeric_only=True)
✔ Value Counts (Categorical)
df['City'].value_counts()
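Here is a runnable sketch of these EDA calls on a small invented dataset (columns and values are made up for illustration). Note that `numeric_only=True` is needed so the correlation skips text columns:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Pune"],
    "Age": [25, 32, 41, 29, 35],
    "Salary": [30000, 52000, 61000, 45000, 50000],
})

counts = df["City"].value_counts()  # Pune appears most often
corr = df.corr(numeric_only=True)   # correlation among numeric columns only

print(counts.idxmax())              # most frequent city
print(corr.loc["Age", "Salary"])    # strong positive correlation here
```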
-----------------------------------------
6. Data Visualization
Visualization makes patterns visible.
✔ Histogram (Distribution)
sns.histplot(df['Age'], kde=True)
✔ Bar Chart
sns.barplot(x='City', y='Sales', data=df)
✔ Scatter Plot (Relationship)
sns.scatterplot(x='Age', y='Salary', data=df)
✔ Box Plot (Outliers)
sns.boxplot(x='Gender', y='Salary', data=df)
✔ Heatmap (Correlation)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
-----------------------------------------
7. Feature Engineering
Creating new features improves data quality for ML models.
✔ Creating new columns
df['AgeGroup'] = pd.cut(df['Age'], bins=[0,18,35,60,100], labels=['Teen','Young','Adult','Senior'])
✔ Date operations
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
✔ Encoding categorical variables
df = pd.get_dummies(df, columns=['Gender', 'City'], drop_first=True)
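These feature-engineering steps can be seen end to end on a tiny made-up dataset (the ages and labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [15, 22, 45, 70],
    "Gender": ["M", "F", "F", "M"],
})

# Bin ages into labeled groups; the upper edge of each bin is inclusive
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 18, 35, 60, 100],
                        labels=["Teen", "Young", "Adult", "Senior"])

# One-hot encode Gender; drop_first removes one redundant column
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)

print(df.columns.tolist())  # ['Age', 'AgeGroup', 'Gender_M']
```

With `drop_first=True`, "F" is dropped and a single `Gender_M` column encodes both categories, which avoids redundant (perfectly collinear) features in ML models.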
-----------------------------------------
8. Handling Large Datasets
✔ Load in chunks
chunks = pd.read_csv("big.csv", chunksize=50000)
✔ Memory optimization
df['Age'] = df['Age'].astype('int16')
df['Income'] = df['Income'].astype('float32')
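The two techniques combine naturally: read the file in chunks and downcast each chunk as it arrives. Below is a self-contained sketch that first writes a sample CSV to a temp directory to stand in for a genuinely large file (the filename, column name, and sizes are made up for the demo):

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Write a sample CSV to simulate a large file
path = os.path.join(tempfile.gettempdir(), "big_demo.csv")
pd.DataFrame({"Income": np.arange(100_000, dtype=np.int64)}).to_csv(path, index=False)

# Process the file in chunks instead of loading it all at once
total = 0
n_chunks = 0
for chunk in pd.read_csv(path, chunksize=25_000):
    chunk["Income"] = chunk["Income"].astype("int32")  # downcast to save memory
    total += int(chunk["Income"].sum())
    n_chunks += 1

print(n_chunks, total)  # 4 chunks; total = sum of 0..99999
```

Note that `pd.read_csv(..., chunksize=...)` returns an iterator of DataFrames, so each chunk must be processed (aggregated, filtered, or appended to a store) inside the loop.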
-----------------------------------------
9. Saving Cleaned Data
✔ Save CSV
df.to_csv("cleaned_data.csv", index=False)
✔ Save Excel
df.to_excel("cleaned_data.xlsx", index=False)
✔ Save to SQL
df.to_sql('clean_data', engine, index=False, if_exists='replace')
-----------------------------------------
10. Real Data Workflow Example
A complete practical flow:
df = pd.read_csv("sales.csv")
# Clean data
df.drop_duplicates(inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
df['Amount'] = df['Amount'].fillna(df['Amount'].median())
# EDA
sns.histplot(df['Amount'])
sns.boxplot(x='Category', y='Amount', data=df)
# Feature engineering
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
# Save processed data
df.to_csv("final_sales_clean.csv", index=False)
Conclusion
Working with real data in Python involves:
- Loading data from various sources
- Understanding the structure
- Cleaning & preprocessing
- Exploration & visualization
- Feature engineering
- Storing cleaned data