
Guidelines for Using Python in Data Cleansing, Profiling, and Analysis

info

The following guidelines are recommended practices for cleansing, profiling, and analyzing data with Python; adapt them to your project's needs.

📦 Tools & Libraries

Before diving into the process, ensure the following Python libraries are installed:

pip install pandas numpy matplotlib seaborn missingno openpyxl scipy ydata-profiling

Note: the pandas-profiling package has been renamed to ydata-profiling; the old name is deprecated.

Additional Tools for Advanced Use:

  • pyjanitor: for cleaner data transformations
  • dask: for large datasets
  • dataprep: for automated data profiling
  • scikit-learn: for statistical analysis and modeling

1. Data Loading and Inspection

✅ Best Practices

  • Always inspect data types and structure first.
  • Handle large files in chunks if needed.

๐Ÿ” Sample Codeโ€‹

import pandas as pd

# Load data
df = pd.read_csv("data.csv")

# Preview data
print(df.head())
print(df.info())
print(df.describe())
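If a file is too large to load at once, the chunked reading mentioned in the best practices can be sketched like this (a small in-memory CSV stands in for a real file path):

```python
import io

import pandas as pd

# Simulate a large CSV; in practice, pass a file path to read_csv
csv_data = io.StringIO("id,sales\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 101)))

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time
total_sales = 0
for chunk in pd.read_csv(csv_data, chunksize=25):
    total_sales += chunk["sales"].sum()

print(total_sales)
```

Each chunk is an ordinary DataFrame, so any per-chunk cleansing step can run inside the loop.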

2. Data Cleansing

🎯 Goals

  • Handle missing values
  • Fix data types
  • Remove duplicates
  • Standardize text formats
  • Detect outliers (if needed)

🧼 Techniques

a. Missing Values

# Check missing values
print(df.isnull().sum())

# Fill missing values
df["age"] = df["age"].fillna(df["age"].median())

# Drop rows with missing emails
df.dropna(subset=["email"], inplace=True)

b. Duplicates

df.drop_duplicates(inplace=True)

c. Data Types

df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["category"] = df["category"].astype("category")

d. Text Normalization

df["name"] = df["name"].str.strip().str.lower().str.replace(r"[^a-z\s]", "", regex=True)

e. Outlier Detection

import numpy as np
from scipy.stats import zscore

df["zscore"] = zscore(df["salary"], nan_policy="omit")  # don't let NaNs poison every score
df = df[df["zscore"].abs() < 3].drop(columns="zscore")  # drop the helper column afterwards
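Z-scores assume a roughly normal distribution; when that assumption is shaky, an IQR fence is a common alternative. A sketch with made-up salary values:

```python
import pandas as pd

# Hypothetical data with one extreme value
df = pd.DataFrame({"salary": [40_000, 45_000, 50_000, 52_000, 55_000, 500_000]})

# Fences at 1.5 * IQR beyond the quartiles, the usual box-plot rule
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences
df_filtered = df[df["salary"].between(lower, upper)]
print(len(df_filtered))
```

Unlike the z-score cut-off, the IQR fences are driven by quartiles, so a single huge outlier cannot inflate the threshold that is supposed to catch it.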

3. Data Profiling

🎯 Goals

  • Understand data distribution
  • Identify relationships
  • Detect anomalies or imbalances

๐Ÿ› ๏ธ Tools & Methodsโ€‹

a. Using ydata-profiling (formerly pandas-profiling)

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Data Profiling Report", explorative=True)
profile.to_file("profiling_report.html")

b. Manual Profiling

# Categorical distribution
print(df["gender"].value_counts())

# Correlation matrix
print(df.corr(numeric_only=True))

# Null value heatmap
import missingno as msno
msno.matrix(df)

4. Data Analysis

🎯 Objectives

  • Identify trends, patterns, and insights
  • Prepare data for modeling
  • Segment data for business analysis

🧪 Examples

a. Grouping and Aggregation

sales_by_region = df.groupby("region")["sales"].sum().reset_index()

b. Time Series Analysis

df = df.set_index("date")
monthly_sales = df["sales"].resample("M").sum()  # "M" = month-end; pandas >= 2.2 prefers "ME"

c. Correlation & Relationships

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

d. Trend Visualization

sns.lineplot(data=monthly_sales)
plt.title("Monthly Sales Trend")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.show()

5. Documentation & Reporting

๐Ÿ“ Tipsโ€‹

  • Use Jupyter Notebooks or Markdown for data exploration notes.
  • Keep track of all data changes (data lineage).
  • Save cleaned datasets with versioning (e.g., data_cleaned_v1.csv).
  • Document assumptions and decisions.
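The versioned-save tip above can be automated; save_versioned below is a hypothetical helper, not a library function:

```python
from pathlib import Path

import pandas as pd

def save_versioned(df: pd.DataFrame, stem: str = "data_cleaned", folder: str = ".") -> Path:
    """Save df as <stem>_vN.csv, choosing the next unused version number."""
    folder = Path(folder)
    n = 1
    while (folder / f"{stem}_v{n}.csv").exists():
        n += 1  # bump until we find a free slot
    path = folder / f"{stem}_v{n}.csv"
    df.to_csv(path, index=False)
    return path

df = pd.DataFrame({"a": [1, 2, 3]})
# save_versioned(df)  # writes data_cleaned_v1.csv, then _v2 on the next call
```

For anything beyond quick iterations, prefer version control or a data-versioning tool over ad-hoc filenames.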

6. General Best Practices

  • Use version control (e.g., GitHub) for scripts and notebooks.
  • Automate repetitive tasks using functions or scripts.
  • Modularize code for reusability (e.g., clean_data() function).
  • Always validate assumptions with exploratory data analysis.
  • Ensure reproducibility (set random seeds, document the environment).
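A minimal sketch of the clean_data() function mentioned above; the steps included are illustrative, and a real pipeline would add the missing-value and type fixes from section 2:

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """One reusable entry point for the cleansing steps."""
    df = df.copy()                                   # never mutate the caller's frame
    df.columns = df.columns.str.strip().str.lower()  # standardize column names
    df = df.drop_duplicates()
    return df

# Illustrative raw data: a messy column name and a duplicate row
raw = pd.DataFrame({" Name ": ["Ann", "Ann", "Bob"]})
cleaned = clean_data(raw)
print(cleaned.columns.tolist(), len(cleaned))
```

Keeping every cleansing decision inside one function makes the pipeline easy to re-run and to unit-test.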

7. Using Jupyter Notebook

💡 What is Jupyter Notebook?

Jupyter Notebook is an open-source web application that allows you to create and share documents containing:

  • Live code
  • Visualizations
  • Narrative text (Markdown)
  • Equations (LaTeX)

🚀 Step 1: Install Jupyter Notebook

pip install jupyterlab notebook

🚀 Step 2: Launch Jupyter Notebook

jupyter notebook

Or, for the JupyterLab interface:

jupyter lab

🧠 Tips

  • Use cells logically: one function or analysis per cell.
  • Name your notebooks clearly (e.g., data_cleaning.ipynb).
  • Use Markdown cells to explain your thinking.
  • Restart the kernel if memory issues arise (Kernel → Restart).