Working with Real Data in Python

📘 Python for Data Science 👁 49 views 📅 Nov 14, 2025
⏱ Estimated reading time: 4 min

Working with real-world data is a fundamental task in data science and analytics. Unlike clean datasets used for learning purposes, real data is often messy, incomplete, inconsistent, unstructured, and large. Python provides powerful libraries to handle these challenges effectively.
The main libraries used are:

  • Pandas → Data loading, cleaning, analysis

  • NumPy → Numerical operations

  • Matplotlib / Seaborn → Visualization

  • Scikit-learn → ML preprocessing

  • Requests / BeautifulSoup → Web data extraction

This answer explains all major steps involved in real-world data processing.


-----------------------------------------

1. Importing Essential Libraries

import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns

-----------------------------------------

2. Loading Real Data

Real data may come in different formats:

✔ CSV File

df = pd.read_csv("data.csv")

✔ Excel File

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

✔ JSON File

df = pd.read_json("data.json")

✔ Database (MySQL example)

import sqlalchemy engine = sqlalchemy.create_engine("mysql+pymysql://root:123@localhost/dbname") df = pd.read_sql("SELECT * FROM table_name", engine)

✔ Web API

import requests data = requests.get("https://api.example.com/users").json() df = pd.DataFrame(data)

✔ HTML Tables (Web Scraping)

df_list = pd.read_html("https://example.com/table-page") df = df_list[0]

-----------------------------------------

3. Understanding the Data

After loading, we inspect the dataset to get an overview.

df.head() df.tail() df.info() df.describe() df.shape df.columns

Helps to identify:

  • Data types

  • Missing values

  • Outliers

  • Basic statistics

  • Number of rows and columns


-----------------------------------------

4. Data Cleaning (Most Important Step)

Real datasets often contain problems like missing values, duplicates, wrong data types, inconsistent text, etc.

✔ Checking Missing Values

df.isnull().sum()

✔ Filling Missing Values

df['Age'].fillna(df['Age'].median(), inplace=True) df['City'].fillna(df['City'].mode()[0], inplace=True)

✔ Removing Missing

df.dropna(subset=['Salary'], inplace=True)

✔ Handling Duplicates

df.drop_duplicates(inplace=True)

✔ Converting Data Types

df['Date'] = pd.to_datetime(df['Date']) df['Sales'] = df['Sales'].astype(float)

✔ Fixing Text Data

df['City'] = df['City'].str.strip().str.title()

✔ Handling Outliers (IQR Method)

Q1 = df['Amount'].quantile(0.25) Q3 = df['Amount'].quantile(0.75) IQR = Q3 - Q1 df = df[(df['Amount'] >= Q1 - 1.5*IQR) & (df['Amount'] <= Q3 + 1.5*IQR)]

-----------------------------------------

5. Exploratory Data Analysis (EDA)

EDA helps discover patterns, relationships, and anomalies in data.

✔ Summary Statistics

df.describe(include='all')

✔ Correlation Matrix

df.corr()

✔ Value Counts (Categorical)

df['City'].value_counts()

-----------------------------------------

6. Data Visualization

Visualization makes patterns visible.

✔ Histogram (Distribution)

sns.histplot(df['Age'], kde=True)

✔ Bar Chart

sns.barplot(x='City', y='Sales', data=df)

✔ Scatter Plot (Relationship)

sns.scatterplot(x='Age', y='Salary', data=df)

✔ Box Plot (Outliers)

sns.boxplot(x='Gender', y='Salary', data=df)

✔ Heatmap (Correlation)

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

-----------------------------------------

7. Feature Engineering

Creating new features improves data quality for ML models.

✔ Creating new columns

df['AgeGroup'] = pd.cut(df['Age'], bins=[0,18,35,60], labels=['Teen','Adult','Senior'])

✔ Date operations

df['Year'] = df['Date'].dt.year df['Month'] = df['Date'].dt.month

✔ Encoding categorical variables

df = pd.get_dummies(df, columns=['Gender', 'City'], drop_first=True)

-----------------------------------------

8. Handling Large Datasets

✔ Load in chunks

chunks = pd.read_csv("big.csv", chunksize=50000)

✔ Memory optimization

df['Age'] = df['Age'].astype('int16') df['Income'] = df['Income'].astype('float32')

-----------------------------------------

9. Saving Cleaned Data

Save CSV

df.to_csv("cleaned_data.csv", index=False)

Save Excel

df.to_excel("cleaned_data.xlsx", index=False)

Save to SQL

df.to_sql('clean_data', engine, index=False, if_exists='replace')

-----------------------------------------

10. Real Data Workflow Example

A complete practical flow:

df = pd.read_csv("sales.csv") # Clean data df.drop_duplicates(inplace=True) df['Date'] = pd.to_datetime(df['Date']) df['Amount'].fillna(df['Amount'].median(), inplace=True) # EDA sns.histplot(df['Amount']) sns.boxplot(x='Category', y='Amount', data=df) # Feature engineering df['Year'] = df['Date'].dt.year df['Month'] = df['Date'].dt.month # Save processed data df.to_csv("final_sales_clean.csv", index=False)

Conclusion

Working with real data in Python involves:

  1. Loading data from various sources

  2. Understanding the structure

  3. Cleaning & preprocessing

  4. Exploration & visualization

  5. Feature engineering

  6. Storing cleaned data




🔒 Some advanced sections are available for Registered Members
Register Now

Share this Post


← Back to Tutorials

Popular Competitive Exam Quizzes