How to Clean Messy Excel Data Using Python

· 7 min read · Data & Dashboards

A step-by-step guide to cleaning common Excel data problems — duplicates, missing values, inconsistent formatting, merged cells — using pandas, with a working example on a real messy dataset.

How to Clean Messy Excel Data Using Python

Every data pipeline starts with the same problem: the data is messy.

Duplicated rows, missing values, inconsistent formatting, random whitespace, mixed date formats, merged cells — these are not exceptions. They are the normal state of Excel data in the real world.

This guide walks through cleaning all of these issues with pandas, step by step, using a dataset that has every common problem.

Who This Is For

  • Analysts who receive messy spreadsheets from clients or internal teams and spend hours fixing them
  • Business users who know their Excel files are inconsistent but do not have a repeatable way to clean them
  • Vibe coders building automations that ingest data from non-technical users who format things inconsistently
  • Anyone who has ever spent a morning manually removing duplicates, fixing dates, or standardising column names

This is one of the most beginner-friendly guides in the series. If you can open a terminal and run a Python script, you can follow along. Every step is explained with before/after examples.

The Messy Dataset

Here is what real messy data looks like. Save this as messy_sales.xlsx:

DateRegionProductQtyRevenueRep Name
2026-03-01NorthWidget Pro1504500John Smith
03/01/2026northWidget Pro1504500.00John Smith
2026-03-02Southwidget lite1800Jane Doe
2026-03-02SouthWidget Lite-5jane doe
2026-03-03NORTHWidget Pro1755250John smith
WestWidget Pro2006000Bob Wilson
2026-03-04EastWidget Lite120N/A
2026-03-05NorthWidget Pro1604800.00John Smith

Problems visible:

  1. Duplicate rows (rows 1 and 2)
  2. Inconsistent date formats (2026-03-01 vs 03/01/2026)
  3. Inconsistent casing (north, NORTH, North)
  4. Missing values (empty cells)
  5. Invalid values (-5 quantity, N/A revenue)
  6. Leading/trailing whitespace
  7. Inconsistent product names (widget lite vs Widget Lite)
  8. Inconsistent rep names (John Smith vs John smith)

Step 1: Load and Inspect

import pandas as pd
import numpy as np

# Load
df = pd.read_excel("messy_sales.xlsx")

# Inspect
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst rows:\n{df.head(10)}")

Always inspect before cleaning. You need to know what you are dealing with.

Data Quality Report

def data_quality_report(df):
    """Generate a quick data quality summary."""
    report = pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.values,
        "non_null": df.count().values,
        "null_count": df.isnull().sum().values,
        "null_pct": (df.isnull().sum() / len(df) * 100).round(1).values,
        "unique": df.nunique().values,
        "duplicates": [df[col].duplicated().sum() for col in df.columns],
    })
    return report

print(data_quality_report(df).to_string(index=False))
    column   dtype  non_null  null_count  null_pct  unique  duplicates
      Date  object        7           1      12.5       5           2
    Region  object        8           0       0.0       5           3
   Product  object        8           0       0.0       4           4
       Qty  object        7           1      12.5       6           1
   Revenue  object        6           2      25.0       5           1
  Rep Name  object        7           1      12.5       5           2

Step 2: Standardise Column Names

# Strip whitespace and normalise to snake_case
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace(r"[^\w]", "", regex=True)
)

print(df.columns.tolist())
# ['date', 'region', 'product', 'qty', 'revenue', 'rep_name']

Step 3: Fix Whitespace

Whitespace is invisible and causes silent join failures:

# Strip all string columns
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
    df[col] = df[col].astype(str).str.strip()

# Replace 'nan' strings (from NaN → str conversion) back to NaN
df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})

Step 4: Standardise Text Casing

# Title case for names and products
df["region"] = df["region"].str.title()
df["product"] = df["product"].str.title()
df["rep_name"] = df["rep_name"].str.title()

print(df[["region", "product", "rep_name"]].drop_duplicates())
  region     product     rep_name
   North  Widget Pro   John Smith
   South  Widget Lite    Jane Doe
    West  Widget Pro   Bob Wilson
    East  Widget Lite          NaN

Consistent. No more north vs NORTH vs North.

Step 5: Parse Dates

Mixed date formats are one of the most common issues:

# pandas handles multiple formats automatically when errors='coerce'
df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=False)

# Check what could not be parsed
unparsed = df[df["date"].isna()]
if len(unparsed) > 0:
    print(f"Warning: {len(unparsed)} dates could not be parsed")
    print(unparsed)

Step 6: Fix Numeric Columns

# Convert to numeric, coercing errors (like 'N/A') to NaN
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Handle invalid values
df.loc[df["qty"] < 0, "qty"] = np.nan  # Negative quantities are invalid
df.loc[df["revenue"] < 0, "revenue"] = np.nan  # Negative revenue invalid

Step 7: Handle Missing Values

Different strategies for different columns:

# Strategy 1: Drop rows where critical fields are missing
critical_columns = ["date", "region", "product"]
before = len(df)
df = df.dropna(subset=critical_columns)
print(f"Dropped {before - len(df)} rows with missing critical fields")

# Strategy 2: Fill numeric columns with 0
df["qty"] = df["qty"].fillna(0).astype(int)

# Strategy 3: Fill text with 'Unknown'
df["rep_name"] = df["rep_name"].fillna("Unknown")

# Strategy 4: Leave revenue as NaN (better to have no data than wrong data)
# Will be excluded from calculations automatically

Choosing a Strategy

Column typeStrategyWhen to use
ID / Key fieldsDrop rowCannot aggregate without identity
DatesDrop or impute from contextCritical for time-series
QuantitiesFill with 0 or dropZero is a valid business value
RevenueLeave as NaNBetter to exclude than guess
Text/namesFill with “Unknown”Preserves row for aggregation

Step 8: Remove Duplicates

# Exact duplicates
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} exact duplicates")

# Business-key duplicates (same date + region + product + rep)
before = len(df)
df = df.drop_duplicates(subset=["date", "region", "product", "rep_name"], keep="last")
print(f"Removed {before - len(df)} business-key duplicates")

Step 9: Validate the Result

def validate_clean_data(df):
    """Run validation checks on cleaned data."""
    checks = {
        "No null dates": df["date"].notna().all(),
        "No null regions": df["region"].notna().all(),
        "No null products": df["product"].notna().all(),
        "No negative quantities": (df["qty"] >= 0).all(),
        "No duplicate keys": not df.duplicated(subset=["date", "region", "product", "rep_name"]).any(),
        "Dates in valid range": df["date"].between("2020-01-01", "2030-12-31").all(),
        "Revenue non-negative (where present)": df["revenue"].dropna().ge(0).all(),
    }

    print("\nValidation Results:")
    all_passed = True
    for check, passed in checks.items():
        status = "" if passed else ""
        print(f"  {status} {check}")
        if not passed:
            all_passed = False

    return all_passed

validate_clean_data(df)
Validation Results:
  ✓ No null dates
  ✓ No null regions
  ✓ No null products
  ✓ No negative quantities
  ✓ No duplicate keys
  ✓ Dates in valid range
  ✓ Revenue non-negative (where present)

The Complete Cleaning Pipeline

"""
Data Cleaning Pipeline
Reads messy Excel data, cleans all common issues, validates, and exports.
"""
import pandas as pd
import numpy as np
from datetime import datetime


def clean_excel_data(input_path, output_path):
    """Full cleaning pipeline: load → clean → validate → export."""

    # Load
    df = pd.read_excel(input_path)
    print(f"Loaded: {len(df)} rows, {len(df.columns)} columns")

    # Standardise columns
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Strip whitespace from strings
    str_cols = df.select_dtypes(include="object").columns
    for col in str_cols:
        df[col] = df[col].astype(str).str.strip()

    # Replace placeholder values with NaN
    df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})

    # Standardise text casing
    for col in ["region", "product", "rep_name"]:
        if col in df.columns:
            df[col] = df[col].str.title()

    # Parse dates
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce")

    # Fix numeric columns
    for col in ["qty", "revenue"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
            df.loc[df[col] < 0, col] = np.nan

    # Handle missing values
    df = df.dropna(subset=["date", "region", "product"])
    df["qty"] = df["qty"].fillna(0).astype(int)
    df["rep_name"] = df["rep_name"].fillna("Unknown")

    # Remove duplicates
    df = df.drop_duplicates()
    df = df.drop_duplicates(
        subset=["date", "region", "product", "rep_name"], keep="last"
    )

    # Sort
    df = df.sort_values(["date", "region"]).reset_index(drop=True)

    # Export
    df.to_excel(output_path, index=False)
    print(f"Cleaned: {len(df)} rows exported to {output_path}")

    return df


if __name__ == "__main__":
    clean_df = clean_excel_data("messy_sales.xlsx", "clean_sales.xlsx")

Before and After

IssueBeforeAfter
Total rows85 (duplicates + invalid removed)
Date formatsMixed (2026-03-01, 03/01/2026)Consistent (datetime64)
Region casingnorth, NORTH, NorthNorth
Missing quantitiesEmpty cells0
Invalid revenueN/ANaN (excluded from calculations)
Duplicates2 exact + 1 business-key0

Common Patterns Reference

Problempandas solution
Strip whitespacedf[col].str.strip()
Fix casingdf[col].str.title() or .str.lower()
Parse datespd.to_datetime(col, errors="coerce")
Fix numberspd.to_numeric(col, errors="coerce")
Fill blanksdf[col].fillna(value)
Remove duplicatesdf.drop_duplicates(subset=[...])
Replace valuesdf.replace({"old": "new"})
Drop invalid rowsdf.dropna(subset=[...])
Validate rangedf[col].between(min, max)

Next Steps

Clean data is the prerequisite for everything — dashboards, reports, analysis, machine learning. Without it, every downstream output is unreliable.

This cleaning pipeline is reusable. Point it at any Excel file, adjust the column names, and the same patterns apply.

For building dashboards from cleaned data, see How to Build a Data Dashboard Without Manual Excel Work. For designing the full pipeline architecture, see How to Design Data Pipelines for Reliable Reporting Systems.

Data & dashboard services include data cleaning as the first step in every pipeline build.

Get in touch to discuss cleaning and automating your data workflows.

clean messy excel data python python data cleaning pandas clean excel fix messy spreadsheet python remove duplicates python handle missing values pandas data cleaning pipeline python excel data quality clean csv python pandas data wrangling python

Enjoyed this article?

Get notified when I publish new articles on automation, ecommerce, and data engineering.

Get in touch

Related Articles