
Guidelines for Using Python in Data Cleansing, Profiling, and Analysis

info

The following guidelines are recommended practices for cleansing, profiling, and analyzing data with Python; adapt them to your project's needs.

📦 Tools & Libraries

Before diving into the process, ensure the following Python libraries are installed:

pip install pandas numpy matplotlib seaborn missingno openpyxl scipy ydata-profiling

Note: the pandas-profiling package has been renamed to ydata-profiling; the old name is deprecated.

Additional Tools for Advanced Use:

  • pyjanitor: for cleaner data transformations
  • dask: for large datasets
  • dataprep: for automated data profiling
  • scikit-learn: for statistical analysis and modeling

1. Data Loading and Inspection

✅ Best Practices

  • Always inspect data types and structure first.
  • Handle large files in chunks if needed.

๐Ÿ” Sample Codeโ€‹

import pandas as pd

# Load data
df = pd.read_csv("data.csv")

# Preview data
print(df.head())
print(df.info())
print(df.describe())
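If a file is too large to load at once, the chunked reading mentioned in the best practices can be sketched like this (a small in-memory CSV stands in for a real file path):

```python
import io

import pandas as pd

# Simulate a large CSV; in practice, pass a file path to read_csv
csv_data = io.StringIO("id,sales\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 101)))

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time
total_sales = 0
for chunk in pd.read_csv(csv_data, chunksize=25):
    total_sales += chunk["sales"].sum()

print(total_sales)
```

Each chunk is an ordinary DataFrame, so any per-chunk cleansing step can run inside the loop.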

2. Data Cleansing

🎯 Goals

  • Handle missing values
  • Fix data types
  • Remove duplicates
  • Standardize text formats
  • Detect outliers (if needed)

🧼 Techniques

a. Missing Values

# Check missing values
print(df.isnull().sum())

# Fill missing values
df["age"] = df["age"].fillna(df["age"].median())

# Drop rows with missing emails
df.dropna(subset=["email"], inplace=True)

b. Duplicates

df.drop_duplicates(inplace=True)

c. Data Types

df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["category"] = df["category"].astype("category")

d. Text Normalization

df["name"] = df["name"].str.strip().str.lower().str.replace(r"[^a-z\s]", "", regex=True)

e. Outlier Detection

import numpy as np
from scipy.stats import zscore

df["zscore"] = zscore(df["salary"], nan_policy="omit")  # don't let NaNs poison every score
df = df[df["zscore"].abs() < 3].drop(columns="zscore")  # drop the helper column afterwards
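Z-scores assume a roughly normal distribution; when that assumption is shaky, an IQR fence is a common alternative. A sketch with made-up salary values:

```python
import pandas as pd

# Hypothetical data with one extreme value
df = pd.DataFrame({"salary": [40_000, 45_000, 50_000, 52_000, 55_000, 500_000]})

# Fences at 1.5 * IQR beyond the quartiles, the usual box-plot rule
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences
df_filtered = df[df["salary"].between(lower, upper)]
print(len(df_filtered))
```

Unlike the z-score cut-off, the IQR fences are driven by quartiles, so a single huge outlier cannot inflate the threshold that is supposed to catch it.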

3. Data Profiling

🎯 Goals

  • Understand data distribution
  • Identify relationships
  • Detect anomalies or imbalances

๐Ÿ› ๏ธ Tools & Methodsโ€‹

a. Using ydata-profiling (formerly pandas-profiling)

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Data Profiling Report", explorative=True)
profile.to_file("profiling_report.html")

b. Manual Profiling

# Categorical distribution
print(df["gender"].value_counts())

# Correlation matrix
print(df.corr(numeric_only=True))

# Null value heatmap
import missingno as msno
msno.matrix(df)

4. Data Analysis

🎯 Objectives

  • Identify trends, patterns, and insights
  • Prepare data for modeling
  • Segment data for business analysis

🧪 Examples

a. Grouping and Aggregation

sales_by_region = df.groupby("region")["sales"].sum().reset_index()

b. Time Series Analysis

df = df.set_index("date")
monthly_sales = df["sales"].resample("M").sum()  # "M" = month-end; pandas >= 2.2 prefers "ME"

c. Correlation & Relationships

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

d. Trend Visualization

sns.lineplot(data=monthly_sales)
plt.title("Monthly Sales Trend")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.show()

5. Documentation & Reporting

๐Ÿ“ Tipsโ€‹

  • Use Jupyter Notebooks or Markdown for data exploration notes.
  • Keep track of all data changes (data lineage).
  • Save cleaned datasets with versioning (e.g., data_cleaned_v1.csv).
  • Document assumptions and decisions.
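The versioned-save tip above can be automated; save_versioned below is a hypothetical helper, not a library function:

```python
from pathlib import Path

import pandas as pd

def save_versioned(df: pd.DataFrame, stem: str = "data_cleaned", folder: str = ".") -> Path:
    """Save df as <stem>_vN.csv, choosing the next unused version number."""
    folder = Path(folder)
    n = 1
    while (folder / f"{stem}_v{n}.csv").exists():
        n += 1  # bump until we find a free slot
    path = folder / f"{stem}_v{n}.csv"
    df.to_csv(path, index=False)
    return path

df = pd.DataFrame({"a": [1, 2, 3]})
# save_versioned(df)  # writes data_cleaned_v1.csv, then _v2 on the next call
```

For anything beyond quick iterations, prefer version control or a data-versioning tool over ad-hoc filenames.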

6. General Best Practices

  • Use version control (e.g., GitHub) for scripts and notebooks.
  • Automate repetitive tasks using functions or scripts.
  • Modularize code for reusability (e.g., clean_data() function).
  • Always validate assumptions with exploratory data analysis.
  • Ensure reproducibility (set random seeds, document the environment).
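A minimal sketch of the clean_data() function mentioned above; the steps included are illustrative, and a real pipeline would add the missing-value and type fixes from section 2:

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """One reusable entry point for the cleansing steps."""
    df = df.copy()                                   # never mutate the caller's frame
    df.columns = df.columns.str.strip().str.lower()  # standardize column names
    df = df.drop_duplicates()
    return df

# Illustrative raw data: a messy column name and a duplicate row
raw = pd.DataFrame({" Name ": ["Ann", "Ann", "Bob"]})
cleaned = clean_data(raw)
print(cleaned.columns.tolist(), len(cleaned))
```

Keeping every cleansing decision inside one function makes the pipeline easy to re-run and to unit-test.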

7. Using Jupyter Notebook

💡 What is Jupyter Notebook?

Jupyter Notebook is an open-source web application that allows you to create and share documents containing:

  • Live code
  • Visualizations
  • Narrative text (Markdown)
  • Equations (LaTeX)

🚀 Step 1: Install Jupyter Notebook

pip install jupyterlab notebook

🚀 Step 2: Launch Jupyter Notebook

jupyter notebook

Or, for the JupyterLab interface:

jupyter lab

🧠 Tips

  • Use cells logically: one function or analysis per cell.
  • Name your notebooks clearly (e.g., data_cleaning.ipynb).
  • Use Markdown cells to explain your thinking.
  • Restart the kernel if memory issues arise (Kernel → Restart).