How to Clean Messy Excel Data Using Python
A step-by-step guide to cleaning common Excel data problems — duplicates, missing values, inconsistent formatting, merged cells — using pandas, with a working example on a real messy dataset.

Every data pipeline starts with the same problem: the data is messy.
Duplicated rows, missing values, inconsistent formatting, random whitespace, mixed date formats, merged cells — these are not exceptions. They are the normal state of Excel data in the real world.
This guide walks through cleaning all of these issues with pandas, step by step, using a dataset that has every common problem.
# The Messy Dataset
Here is what real messy data looks like. Recreate this table in Excel and save it as `messy_sales.xlsx`:
| Date | Region | Product | Qty | Revenue | Rep Name |
|---|---|---|---|---|---|
| 2026-03-01 | North | Widget Pro | 150 | 4500 | John Smith |
| 03/01/2026 | north | Widget Pro | 150 | 4500.00 | John Smith |
| 2026-03-02 | South | widget lite | | 1800 | Jane Doe |
| 2026-03-02 | South | Widget Lite | -5 | | jane doe |
| 2026-03-03 | NORTH | Widget Pro | 175 | 5250 | John smith |
| | West | Widget Pro | 200 | 6000 | Bob Wilson |
| 2026-03-04 | East | Widget Lite | 120 | N/A | |
| 2026-03-05 | North | Widget Pro | 160 | 4800.00 | John Smith |
Problems visible:
- Duplicate rows (rows 1 and 2)
- Inconsistent date formats (`2026-03-01` vs `03/01/2026`)
- Inconsistent casing (`north`, `NORTH`, `North`)
- Missing values (empty cells)
- Invalid values (`-5` quantity, `N/A` revenue)
- Leading/trailing whitespace
- Inconsistent product names (`widget lite` vs `Widget Lite`)
- Inconsistent rep names (`John Smith` vs `John smith`)
# Step 1: Load and Inspect
```python
import pandas as pd
import numpy as np

# Load
df = pd.read_excel("messy_sales.xlsx")

# Inspect
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst rows:\n{df.head(10)}")
```
Always inspect before cleaning. You need to know what you are dealing with.
# Data Quality Report
```python
def data_quality_report(df):
    """Generate a quick data quality summary."""
    report = pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.values,
        "non_null": df.count().values,
        "null_count": df.isnull().sum().values,
        "null_pct": (df.isnull().sum() / len(df) * 100).round(1).values,
        "unique": df.nunique().values,
        "duplicates": [df[col].duplicated().sum() for col in df.columns],
    })
    return report

print(data_quality_report(df).to_string(index=False))
```
```
  column  dtype  non_null  null_count  null_pct  unique  duplicates
    Date object         7           1      12.5       5           2
  Region object         8           0       0.0       5           3
 Product object         8           0       0.0       4           4
     Qty object         7           1      12.5       6           1
 Revenue object         6           2      25.0       5           1
Rep Name object         7           1      12.5       5           2
```
# Step 2: Standardise Column Names
```python
# Strip whitespace and normalise to snake_case
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace(r"[^\w]", "", regex=True)
)

print(df.columns.tolist())
# ['date', 'region', 'product', 'qty', 'revenue', 'rep_name']
```
# Step 3: Fix Whitespace
Whitespace is invisible and causes silent join failures:
```python
# Strip all string columns
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
    df[col] = df[col].astype(str).str.strip()

# Replace 'nan' strings (from NaN → str conversion) and other placeholders with NaN
df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})
```
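To see why stripping matters, here is a minimal illustration (not from the article's dataset) of the silent join failure mentioned above: a single trailing space makes two visually identical keys unequal, and an inner merge quietly drops the row.

```python
import pandas as pd

# A trailing space in the key column breaks the merge
sales = pd.DataFrame({"region": ["North "], "qty": [150]})
targets = pd.DataFrame({"region": ["North"], "target": [500]})

unmatched = sales.merge(targets, on="region")  # 0 rows — keys look equal but aren't
matched = (
    sales.assign(region=sales["region"].str.strip())
    .merge(targets, on="region")               # 1 row after stripping
)
```

No error is raised in either case, which is exactly what makes this bug hard to spot downstream.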
# Step 4: Standardise Text Casing
```python
# Title case for names and products
df["region"] = df["region"].str.title()
df["product"] = df["product"].str.title()
df["rep_name"] = df["rep_name"].str.title()

print(df[["region", "product", "rep_name"]].drop_duplicates())
```

```
  region     product   rep_name
   North  Widget Pro John Smith
   South Widget Lite   Jane Doe
    West  Widget Pro Bob Wilson
    East Widget Lite        NaN
```
Consistent. No more `north` vs `NORTH` vs `North`.
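One caveat: `.str.title()` mangles names like "McDonald" → "Mcdonald". When the set of valid categories is known, mapping to canonical values is safer than blanket re-casing — a sketch with a hypothetical canonical dictionary:

```python
import pandas as pd

# Hypothetical canonical values for a known category column
canonical = {"north": "North", "south": "South", "east": "East", "west": "West"}

regions = pd.Series(["north", "NORTH", " North ", "Sth"])  # "Sth" is an unknown value
mapped = regions.str.strip().str.lower().map(canonical)    # unknown values become NaN

unknown = regions[mapped.isna()]  # surface values that need manual review
```

The unmapped `"Sth"` surfaces as NaN instead of being silently "fixed", so typos get reviewed rather than hidden.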
# Step 5: Parse Dates
Mixed date formats are one of the most common issues:
```python
# With errors="coerce", unparseable values become NaT instead of raising;
# pandas infers the format, falling back to per-element parsing for mixed formats
df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=False)

# Check what could not be parsed
unparsed = df[df["date"].isna()]
if len(unparsed) > 0:
    print(f"Warning: {len(unparsed)} dates could not be parsed")
    print(unparsed)
```
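Recent pandas versions (2.0+) also accept `format="mixed"`, but letting pandas guess can silently swap day and month in ambiguous strings like `03/01/2026`. When you know which formats to expect, parsing each one explicitly is more predictable — a sketch assuming ISO and US formats:

```python
import pandas as pd

raw = pd.Series(["2026-03-01", "03/01/2026", "not a date"])

# Try each known format; values that don't match a format become NaT and fall through
iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
us = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
dates = iso.fillna(us)  # truly unparseable values stay NaT for review
```

Both valid rows resolve to the same date, and the garbage row stays NaT instead of being guessed at.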
# Step 6: Fix Numeric Columns
```python
# Convert to numeric, coercing errors (like 'N/A') to NaN
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Handle invalid values
df.loc[df["qty"] < 0, "qty"] = np.nan          # Negative quantities are invalid
df.loc[df["revenue"] < 0, "revenue"] = np.nan  # Negative revenue is invalid
```
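Real exports often wrap numbers in currency symbols, thousands separators, or accounting-style parentheses, none of which `to_numeric` parses on its own. A hedged sketch (the values here are hypothetical, not from the dataset above) that pre-cleans the strings first:

```python
import pandas as pd

revenue = pd.Series(["$4,500.00", "6000", "(250)", "N/A"])

cleaned = (
    revenue.str.strip()
    .str.replace(r"[$,]", "", regex=True)            # strip currency symbol + separators
    .str.replace(r"^\((.*)\)$", r"-\1", regex=True)  # accounting negatives: (250) -> -250
)
numbers = pd.to_numeric(cleaned, errors="coerce")    # "N/A" still coerces to NaN
```

The order matters: normalise the strings first, then coerce, so only genuinely invalid values end up as NaN.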
# Step 7: Handle Missing Values
Different strategies for different columns:
```python
# Strategy 1: Drop rows where critical fields are missing
critical_columns = ["date", "region", "product"]
before = len(df)
df = df.dropna(subset=critical_columns)
print(f"Dropped {before - len(df)} rows with missing critical fields")

# Strategy 2: Fill numeric columns with 0
df["qty"] = df["qty"].fillna(0).astype(int)

# Strategy 3: Fill text with 'Unknown'
df["rep_name"] = df["rep_name"].fillna("Unknown")

# Strategy 4: Leave revenue as NaN (better to have no data than wrong data)
# It will be excluded from calculations automatically
```
# Choosing a Strategy
| Column type | Strategy | When to use |
|---|---|---|
| ID / Key fields | Drop row | Cannot aggregate without identity |
| Dates | Drop or impute from context | Critical for time-series |
| Quantities | Fill with 0 or drop | Zero is a valid business value |
| Revenue | Leave as NaN | Better to exclude than guess |
| Text/names | Fill with "Unknown" | Preserves row for aggregation |
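Between "fill with a constant" and "leave as NaN" there is a middle ground: impute from similar rows. This is a judgment call the pipeline in this guide deliberately avoids for revenue, but a group-based fill is a common compromise — a sketch using a per-product median on hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget Pro", "Widget Pro", "Widget Pro", "Widget Lite"],
    "revenue": [4500.0, None, 5250.0, 1800.0],
})

# Fill each missing revenue with the median revenue of the same product
df["revenue"] = df["revenue"].fillna(
    df.groupby("product")["revenue"].transform("median")
)
```

Only use this when a "typical" value is genuinely acceptable for your analysis; for financial reporting, leaving NaN is usually the honest choice.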
# Step 8: Remove Duplicates
```python
# Exact duplicates
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} exact duplicates")

# Business-key duplicates (same date + region + product + rep)
before = len(df)
df = df.drop_duplicates(subset=["date", "region", "product", "rep_name"], keep="last")
print(f"Removed {before - len(df)} business-key duplicates")
```
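Before dropping business-key duplicates, it is worth inspecting them: `keep=False` flags every member of each duplicate group, and duplicates that disagree on the measure columns may signal an upstream bug rather than a simple re-entry. A sketch on hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2026-03-01", "2026-03-01", "2026-03-02"],
    "region": ["North", "North", "South"],
    "revenue": [4500, 4800, 1800],  # same key, different revenue — which is right?
})

# keep=False marks ALL rows in each duplicate group, not just the extras
conflicts = df[df.duplicated(subset=["date", "region"], keep=False)]
```

If `conflicts` disagrees on revenue, `keep="last"` is a guess; flag those rows for the data owner instead of silently picking one.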
# Step 9: Validate the Result
```python
def validate_clean_data(df):
    """Run validation checks on cleaned data."""
    checks = {
        "No null dates": df["date"].notna().all(),
        "No null regions": df["region"].notna().all(),
        "No null products": df["product"].notna().all(),
        "No negative quantities": (df["qty"] >= 0).all(),
        "No duplicate keys": not df.duplicated(subset=["date", "region", "product", "rep_name"]).any(),
        "Dates in valid range": df["date"].between("2020-01-01", "2030-12-31").all(),
        "Revenue non-negative (where present)": df["revenue"].dropna().ge(0).all(),
    }
    print("\nValidation Results:")
    all_passed = True
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}")
        if not passed:
            all_passed = False
    return all_passed

validate_clean_data(df)
```
```
Validation Results:
  ✓ No null dates
  ✓ No null regions
  ✓ No null products
  ✓ No negative quantities
  ✓ No duplicate keys
  ✓ Dates in valid range
  ✓ Revenue non-negative (where present)
```
# The Complete Cleaning Pipeline
"""
Data Cleaning Pipeline
Reads messy Excel data, cleans all common issues, validates, and exports.
"""
import pandas as pd
import numpy as np
from datetime import datetime
def clean_excel_data(input_path, output_path):
"""Full cleaning pipeline: load → clean → validate → export."""
# Load
df = pd.read_excel(input_path)
print(f"Loaded: {len(df)} rows, {len(df.columns)} columns")
# Standardise columns
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
# Strip whitespace from strings
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
df[col] = df[col].astype(str).str.strip()
# Replace placeholder values with NaN
df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})
# Standardise text casing
for col in ["region", "product", "rep_name"]:
if col in df.columns:
df[col] = df[col].str.title()
# Parse dates
if "date" in df.columns:
df["date"] = pd.to_datetime(df["date"], errors="coerce")
# Fix numeric columns
for col in ["qty", "revenue"]:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
df.loc[df[col] < 0, col] = np.nan
# Handle missing values
df = df.dropna(subset=["date", "region", "product"])
df["qty"] = df["qty"].fillna(0).astype(int)
df["rep_name"] = df["rep_name"].fillna("Unknown")
# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(
subset=["date", "region", "product", "rep_name"], keep="last"
)
# Sort
df = df.sort_values(["date", "region"]).reset_index(drop=True)
# Export
df.to_excel(output_path, index=False)
print(f"Cleaned: {len(df)} rows exported to {output_path}")
return df
if __name__ == "__main__":
clean_df = clean_excel_data("messy_sales.xlsx", "clean_sales.xlsx")
# Before and After
| Issue | Before | After |
|---|---|---|
| Total rows | 8 | 5 (duplicates + invalid removed) |
| Date formats | Mixed (2026-03-01, 03/01/2026) | Consistent (datetime64) |
| Region casing | north, NORTH, North | North |
| Missing quantities | Empty cells | 0 |
| Invalid revenue | N/A | NaN (excluded from calculations) |
| Duplicates | 2 exact + 1 business-key | 0 |
# Common Patterns Reference
| Problem | pandas solution |
|---|---|
| Strip whitespace | `df[col].str.strip()` |
| Fix casing | `df[col].str.title()` or `.str.lower()` |
| Parse dates | `pd.to_datetime(col, errors="coerce")` |
| Fix numbers | `pd.to_numeric(col, errors="coerce")` |
| Fill blanks | `df[col].fillna(value)` |
| Remove duplicates | `df.drop_duplicates(subset=[...])` |
| Replace values | `df.replace({"old": "new"})` |
| Drop invalid rows | `df.dropna(subset=[...])` |
| Validate range | `df[col].between(min, max)` |
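These patterns compose naturally with method chaining, which keeps a quick ad-hoc clean readable in a single expression — a minimal sketch on hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"region": [" north", "NORTH "], "qty": ["150", "-5"]})

clean = (
    df
    .assign(
        region=lambda d: d["region"].str.strip().str.title(),  # whitespace + casing
        qty=lambda d: pd.to_numeric(d["qty"], errors="coerce"),  # coerce bad numbers
    )
    .drop_duplicates()
)
```

For anything beyond a quick fix, prefer the explicit pipeline function above the chain: named steps are easier to log, test, and debug.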
# Next Steps
Clean data is the prerequisite for everything — dashboards, reports, analysis, machine learning. Without it, every downstream output is unreliable.
This cleaning pipeline is reusable. Point it at any Excel file, adjust the column names, and the same patterns apply.
For building dashboards from cleaned data, see How to Build a Data Dashboard Without Manual Excel Work. For designing the full pipeline architecture, see How to Design Data Pipelines for Reliable Reporting Systems.
Data & dashboard services include data cleaning as the first step in every pipeline build.
Get in touch to discuss cleaning and automating your data workflows.