How to Clean Messy Excel Data Using Python
A step-by-step guide to cleaning common Excel data problems — duplicates, missing values, inconsistent formatting, merged cells — using pandas, with a working example on a real messy dataset.

Every data pipeline starts with the same problem: the data is messy.
Duplicated rows, missing values, inconsistent formatting, random whitespace, mixed date formats, merged cells — these are not exceptions. They are the normal state of Excel data in the real world.
This guide walks through cleaning all of these issues with pandas, step by step, using a dataset that has every common problem.
# The Messy Dataset
Here is what real messy data looks like. Recreate this table in Excel and save it as `messy_sales.xlsx`:
| Date | Region | Product | Qty | Revenue | Rep Name |
|---|---|---|---|---|---|
| 2026-03-01 | North | Widget Pro | 150 | 4500 | John Smith |
| 03/01/2026 | north | Widget Pro | 150 | 4500.00 | John Smith |
| 2026-03-02 | South | widget lite | | 1800 | Jane Doe |
| 2026-03-02 | South | Widget Lite | -5 | | jane doe |
| 2026-03-03 | NORTH | Widget Pro | 175 | 5250 | John smith |
| | West | Widget Pro | 200 | 6000 | Bob Wilson |
| 2026-03-04 | East | Widget Lite | 120 | N/A | |
| 2026-03-05 | North | Widget Pro | 160 | 4800.00 | John Smith |
Problems visible:
- Duplicate rows (rows 1 and 2)
- Inconsistent date formats (`2026-03-01` vs `03/01/2026`)
- Inconsistent casing (`north`, `NORTH`, `North`)
- Missing values (empty cells)
- Invalid values (`-5` quantity, `N/A` revenue)
- Leading/trailing whitespace
- Inconsistent product names (`widget lite` vs `Widget Lite`)
- Inconsistent rep names (`John Smith` vs `John smith`)
# Step 1: Load and Inspect
```python
import pandas as pd
import numpy as np

# Load
df = pd.read_excel("messy_sales.xlsx")

# Inspect
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst rows:\n{df.head(10)}")
```
Always inspect before cleaning. You need to know what you are dealing with.
# Data Quality Report
```python
def data_quality_report(df):
    """Generate a quick data quality summary."""
    report = pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.values,
        "non_null": df.count().values,
        "null_count": df.isnull().sum().values,
        "null_pct": (df.isnull().sum() / len(df) * 100).round(1).values,
        "unique": df.nunique().values,
        "duplicates": [df[col].duplicated().sum() for col in df.columns],
    })
    return report

print(data_quality_report(df).to_string(index=False))
```
```
  column  dtype  non_null  null_count  null_pct  unique  duplicates
    Date object         7           1      12.5       5           2
  Region object         8           0       0.0       5           3
 Product object         8           0       0.0       4           4
     Qty object         7           1      12.5       6           1
 Revenue object         6           2      25.0       5           1
Rep Name object         7           1      12.5       5           2
```
# Step 2: Standardise Column Names
```python
# Strip whitespace and normalise to snake_case
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace(r"[^\w]", "", regex=True)
)

print(df.columns.tolist())
# ['date', 'region', 'product', 'qty', 'revenue', 'rep_name']
```
# Step 3: Fix Whitespace
Whitespace is invisible and causes silent join failures:
```python
# Strip all string columns
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
    df[col] = df[col].astype(str).str.strip()

# Replace 'nan' strings (from NaN → str conversion) and other placeholders with NaN
df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})
```
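To see why stripping matters, here is a minimal illustration (not from the article's dataset) of the silent join failure mentioned above: a single trailing space makes two visually identical keys unequal, and an inner merge quietly drops the row.

```python
import pandas as pd

# A trailing space in the key column breaks the merge
sales = pd.DataFrame({"region": ["North "], "qty": [150]})
targets = pd.DataFrame({"region": ["North"], "target": [500]})

unmatched = sales.merge(targets, on="region")  # 0 rows — keys look equal but aren't
matched = (
    sales.assign(region=sales["region"].str.strip())
    .merge(targets, on="region")               # 1 row after stripping
)
```

No error is raised in either case, which is exactly what makes this bug hard to spot downstream.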
# Step 4: Standardise Text Casing
```python
# Title case for names and products
df["region"] = df["region"].str.title()
df["product"] = df["product"].str.title()
df["rep_name"] = df["rep_name"].str.title()

print(df[["region", "product", "rep_name"]].drop_duplicates())
```

```
  region     product   rep_name
   North  Widget Pro John Smith
   South Widget Lite   Jane Doe
    West  Widget Pro Bob Wilson
    East Widget Lite        NaN
```
Consistent. No more `north` vs `NORTH` vs `North`.
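One caveat: `.str.title()` mangles names like "McDonald" → "Mcdonald". When the set of valid categories is known, mapping to canonical values is safer than blanket re-casing — a sketch with a hypothetical canonical dictionary:

```python
import pandas as pd

# Hypothetical canonical values for a known category column
canonical = {"north": "North", "south": "South", "east": "East", "west": "West"}

regions = pd.Series(["north", "NORTH", " North ", "Sth"])  # "Sth" is an unknown value
mapped = regions.str.strip().str.lower().map(canonical)    # unknown values become NaN

unknown = regions[mapped.isna()]  # surface values that need manual review
```

The unmapped `"Sth"` surfaces as NaN instead of being silently "fixed", so typos get reviewed rather than hidden.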
# Step 5: Parse Dates
Mixed date formats are one of the most common issues:
```python
# With errors="coerce", unparseable values become NaT instead of raising;
# pandas infers the format, falling back to per-element parsing for mixed formats
df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=False)

# Check what could not be parsed
unparsed = df[df["date"].isna()]
if len(unparsed) > 0:
    print(f"Warning: {len(unparsed)} dates could not be parsed")
    print(unparsed)
```
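Recent pandas versions (2.0+) also accept `format="mixed"`, but letting pandas guess can silently swap day and month in ambiguous strings like `03/01/2026`. When you know which formats to expect, parsing each one explicitly is more predictable — a sketch assuming ISO and US formats:

```python
import pandas as pd

raw = pd.Series(["2026-03-01", "03/01/2026", "not a date"])

# Try each known format; values that don't match a format become NaT and fall through
iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
us = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
dates = iso.fillna(us)  # truly unparseable values stay NaT for review
```

Both valid rows resolve to the same date, and the garbage row stays NaT instead of being guessed at.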
# Step 6: Fix Numeric Columns
```python
# Convert to numeric, coercing errors (like 'N/A') to NaN
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Handle invalid values
df.loc[df["qty"] < 0, "qty"] = np.nan          # Negative quantities are invalid
df.loc[df["revenue"] < 0, "revenue"] = np.nan  # Negative revenue is invalid
```
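Real exports often wrap numbers in currency symbols, thousands separators, or accounting-style parentheses, none of which `to_numeric` parses on its own. A hedged sketch (the values here are hypothetical, not from the dataset above) that pre-cleans the strings first:

```python
import pandas as pd

revenue = pd.Series(["$4,500.00", "6000", "(250)", "N/A"])

cleaned = (
    revenue.str.strip()
    .str.replace(r"[$,]", "", regex=True)            # strip currency symbol + separators
    .str.replace(r"^\((.*)\)$", r"-\1", regex=True)  # accounting negatives: (250) -> -250
)
numbers = pd.to_numeric(cleaned, errors="coerce")    # "N/A" still coerces to NaN
```

The order matters: normalise the strings first, then coerce, so only genuinely invalid values end up as NaN.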
# Step 7: Handle Missing Values
Different strategies for different columns:
```python
# Strategy 1: Drop rows where critical fields are missing
critical_columns = ["date", "region", "product"]
before = len(df)
df = df.dropna(subset=critical_columns)
print(f"Dropped {before - len(df)} rows with missing critical fields")

# Strategy 2: Fill numeric columns with 0
df["qty"] = df["qty"].fillna(0).astype(int)

# Strategy 3: Fill text with 'Unknown'
df["rep_name"] = df["rep_name"].fillna("Unknown")

# Strategy 4: Leave revenue as NaN (better to have no data than wrong data)
# It will be excluded from calculations automatically
```
# Choosing a Strategy
| Column type | Strategy | When to use |
|---|---|---|
| ID / Key fields | Drop row | Cannot aggregate without identity |
| Dates | Drop or impute from context | Critical for time-series |
| Quantities | Fill with 0 or drop | Zero is a valid business value |
| Revenue | Leave as NaN | Better to exclude than guess |
| Text/names | Fill with "Unknown" | Preserves row for aggregation |
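Between "fill with a constant" and "leave as NaN" there is a middle ground: impute from similar rows. This is a judgment call the pipeline in this guide deliberately avoids for revenue, but a group-based fill is a common compromise — a sketch using a per-product median on hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget Pro", "Widget Pro", "Widget Pro", "Widget Lite"],
    "revenue": [4500.0, None, 5250.0, 1800.0],
})

# Fill each missing revenue with the median revenue of the same product
df["revenue"] = df["revenue"].fillna(
    df.groupby("product")["revenue"].transform("median")
)
```

Only use this when a "typical" value is genuinely acceptable for your analysis; for financial reporting, leaving NaN is usually the honest choice.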
# Step 8: Remove Duplicates
```python
# Exact duplicates
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} exact duplicates")

# Business-key duplicates (same date + region + product + rep)
before = len(df)
df = df.drop_duplicates(subset=["date", "region", "product", "rep_name"], keep="last")
print(f"Removed {before - len(df)} business-key duplicates")
```
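Before dropping business-key duplicates, it is worth inspecting them: `keep=False` flags every member of each duplicate group, and duplicates that disagree on the measure columns may signal an upstream bug rather than a simple re-entry. A sketch on hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2026-03-01", "2026-03-01", "2026-03-02"],
    "region": ["North", "North", "South"],
    "revenue": [4500, 4800, 1800],  # same key, different revenue — which is right?
})

# keep=False marks ALL rows in each duplicate group, not just the extras
conflicts = df[df.duplicated(subset=["date", "region"], keep=False)]
```

If `conflicts` disagrees on revenue, `keep="last"` is a guess; flag those rows for the data owner instead of silently picking one.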
# Step 9: Validate the Result
```python
def validate_clean_data(df):
    """Run validation checks on cleaned data."""
    checks = {
        "No null dates": df["date"].notna().all(),
        "No null regions": df["region"].notna().all(),
        "No null products": df["product"].notna().all(),
        "No negative quantities": (df["qty"] >= 0).all(),
        "No duplicate keys": not df.duplicated(subset=["date", "region", "product", "rep_name"]).any(),
        "Dates in valid range": df["date"].between("2020-01-01", "2030-12-31").all(),
        "Revenue non-negative (where present)": df["revenue"].dropna().ge(0).all(),
    }
    print("\nValidation Results:")
    all_passed = True
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}")
        if not passed:
            all_passed = False
    return all_passed

validate_clean_data(df)
```
```
Validation Results:
  ✓ No null dates
  ✓ No null regions
  ✓ No null products
  ✓ No negative quantities
  ✓ No duplicate keys
  ✓ Dates in valid range
  ✓ Revenue non-negative (where present)
```
# The Complete Cleaning Pipeline
"""
Data Cleaning Pipeline
Reads messy Excel data, cleans all common issues, validates, and exports.
"""
import pandas as pd
import numpy as np
from datetime import datetime
def clean_excel_data(input_path, output_path):
"""Full cleaning pipeline: load → clean → validate → export."""
# Load
df = pd.read_excel(input_path)
print(f"Loaded: {len(df)} rows, {len(df.columns)} columns")
# Standardise columns
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
# Strip whitespace from strings
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
df[col] = df[col].astype(str).str.strip()
# Replace placeholder values with NaN
df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})
# Standardise text casing
for col in ["region", "product", "rep_name"]:
if col in df.columns:
df[col] = df[col].str.title()
# Parse dates
if "date" in df.columns:
df["date"] = pd.to_datetime(df["date"], errors="coerce")
# Fix numeric columns
for col in ["qty", "revenue"]:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
df.loc[df[col] < 0, col] = np.nan
# Handle missing values
df = df.dropna(subset=["date", "region", "product"])
df["qty"] = df["qty"].fillna(0).astype(int)
df["rep_name"] = df["rep_name"].fillna("Unknown")
# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(
subset=["date", "region", "product", "rep_name"], keep="last"
)
# Sort
df = df.sort_values(["date", "region"]).reset_index(drop=True)
# Export
df.to_excel(output_path, index=False)
print(f"Cleaned: {len(df)} rows exported to {output_path}")
return df
if __name__ == "__main__":
clean_df = clean_excel_data("messy_sales.xlsx", "clean_sales.xlsx")
# Before and After
| Issue | Before | After |
|---|---|---|
| Total rows | 8 | 5 (duplicates + invalid removed) |
| Date formats | Mixed (2026-03-01, 03/01/2026) | Consistent (datetime64) |
| Region casing | north, NORTH, North | North |
| Missing quantities | Empty cells | 0 |
| Invalid revenue | N/A | NaN (excluded from calculations) |
| Duplicates | 2 exact + 1 business-key | 0 |
# Common Patterns Reference
| Problem | pandas solution |
|---|---|
| Strip whitespace | `df[col].str.strip()` |
| Fix casing | `df[col].str.title()` or `.str.lower()` |
| Parse dates | `pd.to_datetime(col, errors="coerce")` |
| Fix numbers | `pd.to_numeric(col, errors="coerce")` |
| Fill blanks | `df[col].fillna(value)` |
| Remove duplicates | `df.drop_duplicates(subset=[...])` |
| Replace values | `df.replace({"old": "new"})` |
| Drop invalid rows | `df.dropna(subset=[...])` |
| Validate range | `df[col].between(min, max)` |
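These patterns compose naturally with method chaining, which keeps a quick ad-hoc clean readable in a single expression — a minimal sketch on hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"region": [" north", "NORTH "], "qty": ["150", "-5"]})

clean = (
    df
    .assign(
        region=lambda d: d["region"].str.strip().str.title(),  # whitespace + casing
        qty=lambda d: pd.to_numeric(d["qty"], errors="coerce"),  # coerce bad numbers
    )
    .drop_duplicates()
)
```

For anything beyond a quick fix, prefer the explicit pipeline function above the chain: named steps are easier to log, test, and debug.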
# Next Steps
Clean data is the prerequisite for everything — dashboards, reports, analysis, machine learning. Without it, every downstream output is unreliable.
This cleaning pipeline is reusable. Point it at any Excel file, adjust the column names, and the same patterns apply.
For building dashboards from cleaned data, see How to Build a Data Dashboard Without Manual Excel Work. For designing the full pipeline architecture, see How to Design Data Pipelines for Reliable Reporting Systems.
Data & dashboard services include data cleaning as the first step in every pipeline build.
Get in touch to discuss cleaning and automating your data workflows.