# Guidelines for Using Python in Data Cleansing, Profiling, and Analysis
> **Note:** The following content is recommended guidance, not a strict requirement.
## Tools & Libraries

Before diving into the process, ensure the following Python libraries are installed:

```bash
pip install pandas numpy matplotlib seaborn missingno openpyxl scipy pandas-profiling
```

(Note: `pandas-profiling` has since been renamed `ydata-profiling`; the old name still installs, but new projects should prefer `ydata-profiling`.)
### Additional Tools for Advanced Use

- `pyjanitor`: for cleaner data transformations
- `dask`: for large datasets
- `dataprep`: for automated data profiling
- `scikit-learn`: for statistical analysis and modeling
## 1. Data Loading and Inspection

### Best Practices

- Always inspect data types and structure first.
- Handle large files in chunks if needed.
### Sample Code

```python
import pandas as pd

# Load data
df = pd.read_csv("data.csv")

# Preview data
print(df.head())
print(df.info())
print(df.describe())
```
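For the "load in chunks" practice above, a minimal sketch (the file contents, column names, and chunk size here are illustrative placeholders):

```python
import io
import pandas as pd

# Stand-in for a large file on disk; in practice pass a path
# such as "data.csv" to pd.read_csv.
csv_data = io.StringIO("id,sales\n1,100\n2,-5\n3,250\n4,80\n")

# Read the file in fixed-size chunks, filter each chunk before it
# accumulates in memory, then combine the surviving rows.
chunks = pd.read_csv(csv_data, chunksize=2)
df = pd.concat((chunk[chunk["sales"] > 0] for chunk in chunks), ignore_index=True)

print(len(df))  # rows that passed the filter
```

Filtering inside the loop keeps peak memory proportional to the chunk size rather than the full file.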
## 2. Data Cleansing

### Goals

- Handle missing values
- Fix data types
- Remove duplicates
- Standardize text formats
- Detect outliers (if needed)
### Techniques

#### a. Missing Values

```python
# Check missing values
print(df.isnull().sum())

# Fill missing values
df["age"] = df["age"].fillna(df["age"].median())

# Drop rows with missing emails
df.dropna(subset=["email"], inplace=True)
```
#### b. Duplicates

```python
df.drop_duplicates(inplace=True)
```
#### c. Data Types

```python
# errors="coerce" turns unparseable dates into NaT instead of raising
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["category"] = df["category"].astype("category")
```
#### d. Text Normalization

```python
# Trim whitespace, lowercase, and drop characters other than letters and spaces
df["name"] = df["name"].str.strip().str.lower().str.replace(r"[^a-z\s]", "", regex=True)
```

#### e. Outlier Detection

```python
from scipy.stats import zscore

# Keep rows whose salary lies within three standard deviations of the mean
df["zscore"] = zscore(df["salary"])
df = df[df["zscore"].abs() < 3]
```
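If the data are heavily skewed, a z-score cut-off can be misleading; an interquartile-range (IQR) fence is a common alternative. A self-contained sketch (the salary values are made up for illustration):

```python
import pandas as pd

# Toy data with one obvious outlier
df = pd.DataFrame({"salary": [40_000, 45_000, 48_000, 50_000, 52_000, 500_000]})

# Fence: anything beyond 1.5 * IQR from the quartiles is an outlier
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]

print(len(df))  # the 500_000 row is removed
```

Unlike the z-score rule, the IQR fence does not assume roughly normal data, so a single extreme value cannot inflate the cut-off that is supposed to catch it.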
## 3. Data Profiling

### Goals

- Understand data distribution
- Identify relationships
- Detect anomalies or imbalances
### Tools & Methods

#### a. Using `pandas-profiling`

```python
# pandas-profiling has been renamed ydata-profiling; on recent
# versions, import with: from ydata_profiling import ProfileReport
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Data Profiling Report", explorative=True)
profile.to_file("profiling_report.html")
```
#### b. Manual Profiling

```python
import missingno as msno

# Categorical distribution
print(df["gender"].value_counts())

# Correlation matrix
print(df.corr(numeric_only=True))

# Null value heatmap
msno.matrix(df)
```
## 4. Data Analysis

### Objectives

- Identify trends, patterns, and insights
- Prepare data for modeling
- Segment data for business analysis

### Examples

#### a. Grouping and Aggregation

```python
sales_by_region = df.groupby("region")["sales"].sum().reset_index()
```
#### b. Time Series Analysis

```python
# Requires a datetime column (see the to_datetime step above)
df.set_index("date", inplace=True)
monthly_sales = df["sales"].resample("M").sum()  # in pandas >= 2.2, prefer "ME"
```
#### c. Correlation & Relationships

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```
#### d. Trend Visualization

```python
sns.lineplot(data=monthly_sales)
plt.title("Monthly Sales Trend")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.show()
```
## 5. Documentation & Reporting

### Tips

- Use Jupyter Notebooks or Markdown for data exploration notes.
- Keep track of all data changes (data lineage).
- Save cleaned datasets with versioning (e.g., `data_cleaned_v1.csv`).
- Document assumptions and decisions.
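One lightweight way to track data lineage, as suggested above, is to record each cleaning step alongside its effect on the row count. A minimal sketch (the `log_step` helper and the step entries are hypothetical):

```python
from datetime import datetime, timezone

# Running record of what was done to the data and when
cleaning_log = []

def log_step(description, rows_before, rows_after):
    """Append one cleaning step to the lineage log (hypothetical helper)."""
    cleaning_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": description,
        "rows_before": rows_before,
        "rows_after": rows_after,
    })

# Example entries; the counts are illustrative
log_step("drop duplicate rows", 1_000, 950)
log_step("drop rows with missing email", 950, 940)

print(len(cleaning_log))
```

The log can be dumped to JSON next to each versioned CSV so every saved dataset carries its own history.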
## 6. General Best Practices

- Use version control (e.g., GitHub) for scripts and notebooks.
- Automate repetitive tasks using functions or scripts.
- Modularize code for reusability (e.g., a `clean_data()` function).
- Always validate assumptions with exploratory data analysis.
- Ensure reproducibility (set random seeds, document the environment).
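The "modularize code" point above can be sketched as a single reusable function that bundles the cleansing steps from section 2 (the column names and sample rows here are made up):

```python
import pandas as pd

def clean_data(df):
    """Bundle the cleansing steps into one reusable, side-effect-free pass."""
    df = df.drop_duplicates().dropna(subset=["email"]).copy()
    df["name"] = df["name"].str.strip().str.lower()
    return df

raw = pd.DataFrame({
    "name": ["  Alice ", "BOB", "BOB"],
    "email": ["a@example.com", "b@example.com", "b@example.com"],
})
cleaned = clean_data(raw)
print(list(cleaned["name"]))  # ['alice', 'bob']
```

Returning a new frame instead of mutating in place keeps the raw data intact, which makes the pipeline easier to test and rerun.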
## 7. Using Jupyter Notebook

### What is Jupyter Notebook?

Jupyter Notebook is an open-source web application that allows you to create and share documents containing:

- Live code
- Visualizations
- Narrative text (Markdown)
- Equations (LaTeX)
### Step 1: Install Jupyter Notebook

```bash
pip install jupyterlab notebook
```

### Step 2: Launch Jupyter Notebook

```bash
jupyter notebook
# equivalently: python -m notebook
```
### Tips

- Use cells logically: one function or analysis per cell.
- Name your notebooks clearly (e.g., `data_cleaning.ipynb`).
- Use Markdown cells to explain your thinking.
- Restart the kernel if memory issues arise (Kernel → Restart).