How to Clean Messy Excel Data Using Python

7 min read · Data & Dashboards

A step-by-step guide to cleaning common Excel data problems — duplicates, missing values, inconsistent formatting, merged cells — using pandas, with a working example on a real messy dataset.


Every data pipeline starts with the same problem: the data is messy.

Duplicated rows, missing values, inconsistent formatting, random whitespace, mixed date formats, merged cells — these are not exceptions. They are the normal state of Excel data in the real world.

This guide walks through cleaning all of these issues with pandas, step by step, using a dataset that has every common problem.

# The Messy Dataset

Here is what real messy data looks like. Save this as messy_sales.xlsx:

| Date | Region | Product | Qty | Revenue | Rep Name |
|------------|--------|-------------|-----|---------|------------|
| 2026-03-01 | North | Widget Pro | 150 | 4500 | John Smith |
| 03/01/2026 | north | Widget Pro | 150 | 4500.00 | John Smith |
| 2026-03-02 | South | widget lite | | 1800 | Jane Doe |
| 2026-03-02 | South | Widget Lite | -5 | | jane doe |
| 2026-03-03 | NORTH | Widget Pro | 175 | 5250 | John smith |
| | West | Widget Pro | 200 | 6000 | Bob Wilson |
| 2026-03-04 | East | Widget Lite | 120 | N/A | |
| 2026-03-05 | North | Widget Pro | 160 | 4800.00 | John Smith |

Problems visible:

  1. Duplicate rows (rows 1 and 2 record the same sale in different formats)
  2. Inconsistent date formats (2026-03-01 vs 03/01/2026)
  3. Inconsistent casing (north, NORTH, North)
  4. Missing values (empty cells)
  5. Invalid values (-5 quantity, N/A revenue)
  6. Leading/trailing whitespace
  7. Inconsistent product names (widget lite vs Widget Lite)
  8. Inconsistent rep names (John Smith vs John smith)
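If you want to follow along without building the spreadsheet by hand, this sketch recreates the same rows with pandas. The stray whitespace in a couple of cells is assumed for illustration (it cannot be shown in the table above), and writing `.xlsx` requires openpyxl, so the write falls back to CSV if it is missing:

```python
import numpy as np
import pandas as pd

# Recreate the messy dataset, including blanks, bad values, and stray whitespace
rows = [
    ["2026-03-01", "North", "Widget Pro", 150, 4500, "John Smith"],
    ["03/01/2026", "north ", " Widget Pro", 150, 4500.00, "John Smith "],
    ["2026-03-02", "South", "widget lite", np.nan, 1800, "Jane Doe"],
    ["2026-03-02", "South", "Widget Lite", -5, np.nan, "jane doe"],
    ["2026-03-03", "NORTH", "Widget Pro", 175, 5250, "John smith"],
    [np.nan, "West", "Widget Pro", 200, 6000, "Bob Wilson"],
    ["2026-03-04", "East", "Widget Lite", 120, "N/A", np.nan],
    ["2026-03-05", "North", "Widget Pro", 160, 4800.00, "John Smith"],
]
df = pd.DataFrame(rows, columns=["Date", "Region", "Product", "Qty", "Revenue", "Rep Name"])

try:
    df.to_excel("messy_sales.xlsx", index=False)  # requires openpyxl
except ImportError:
    df.to_csv("messy_sales.csv", index=False)  # fallback if openpyxl is not installed
```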

# Step 1: Load and Inspect

```python
import pandas as pd
import numpy as np

# Load
df = pd.read_excel("messy_sales.xlsx")

# Inspect
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst rows:\n{df.head(10)}")
```

Always inspect before cleaning. You need to know what you are dealing with.

# Data Quality Report

```python
def data_quality_report(df):
    """Generate a quick data quality summary."""
    report = pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.values,
        "non_null": df.count().values,
        "null_count": df.isnull().sum().values,
        "null_pct": (df.isnull().sum() / len(df) * 100).round(1).values,
        "unique": df.nunique().values,
        "duplicates": [df[col].duplicated().sum() for col in df.columns],
    })
    return report

print(data_quality_report(df).to_string(index=False))
```

```text
    column   dtype  non_null  null_count  null_pct  unique  duplicates
      Date  object         7           1      12.5       5           2
    Region  object         8           0       0.0       5           3
   Product  object         8           0       0.0       4           4
       Qty  object         7           1      12.5       6           1
   Revenue  object         6           2      25.0       5           1
  Rep Name  object         7           1      12.5       5           2
```

# Step 2: Standardise Column Names

```python
# Strip whitespace and normalise to snake_case
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace(r"[^\w]", "", regex=True)
)

print(df.columns.tolist())
# ['date', 'region', 'product', 'qty', 'revenue', 'rep_name']
```

# Step 3: Fix Whitespace

Whitespace is invisible and causes silent join failures:

```python
# Strip all string columns
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
    df[col] = df[col].astype(str).str.strip()

# Replace 'nan' strings (from NaN → str conversion) back to NaN
df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})
```
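To see why this matters, here is a minimal toy example (not the article's dataset) of an inner join silently losing a row because of one trailing space:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["North ", "South"], "revenue": [100, 200]})
targets = pd.DataFrame({"region": ["North", "South"], "target": [90, 210]})

# "North " != "North", so the inner join silently drops the North row
merged_raw = sales.merge(targets, on="region", how="inner")
print(len(merged_raw))  # 1 row instead of 2

# After stripping the key column, both rows match
merged_clean = sales.assign(region=sales["region"].str.strip()).merge(
    targets, on="region", how="inner"
)
print(len(merged_clean))  # 2
```

No error, no warning: the bad key simply fails to match, which is why whitespace belongs at the top of the cleaning order.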

# Step 4: Standardise Text Casing

```python
# Title case for names and products
df["region"] = df["region"].str.title()
df["product"] = df["product"].str.title()
df["rep_name"] = df["rep_name"].str.title()

print(df[["region", "product", "rep_name"]].drop_duplicates())
```

```text
  region     product     rep_name
   North  Widget Pro   John Smith
   South  Widget Lite    Jane Doe
    West  Widget Pro   Bob Wilson
    East  Widget Lite          NaN
```

Consistent. No more north vs NORTH vs North.
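Note that `.str.title()` only fixes casing. If the same value also appears under entirely different spellings — say `N` or `Nth` for North (hypothetical variants, not in the sample file) — an explicit alias mapping is the usual follow-up:

```python
import pandas as pd

# Hypothetical alias table: map known shorthand to canonical values
region_aliases = {"N": "North", "Nth": "North", "S": "South"}

regions = pd.Series(["north", "N", "Nth", "South"])
cleaned = regions.str.strip().str.title().replace(region_aliases)
print(cleaned.tolist())  # ['North', 'North', 'North', 'South']
```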

# Step 5: Parse Dates

Mixed date formats are one of the most common issues:

```python
# In pandas >= 2.0, format="mixed" infers the format for each value separately.
# Without it, to_datetime guesses one format from the first value, and
# errors="coerce" silently turns every mismatched format into NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce", format="mixed", dayfirst=False)

# Check what could not be parsed
unparsed = df[df["date"].isna()]
if len(unparsed) > 0:
    print(f"Warning: {len(unparsed)} dates could not be parsed")
    print(unparsed)
```
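A quick standalone demonstration of the difference, assuming pandas >= 2.0 (where `format="mixed"` was introduced):

```python
import pandas as pd

mixed = pd.Series(["2026-03-01", "03/01/2026"])

# Default: pandas >= 2.0 infers one format from the first value, so the
# second value does not match and errors="coerce" turns it into NaT
strict = pd.to_datetime(mixed, errors="coerce")
print(strict.isna().sum())

# format="mixed": each value's format is inferred individually
flexible = pd.to_datetime(mixed, format="mixed", errors="coerce")
print(flexible.isna().sum())  # 0
```

Always check the NaT count after parsing — coercion hides failures by design.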

# Step 6: Fix Numeric Columns

```python
# Convert to numeric, coercing errors (like 'N/A') to NaN
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Handle invalid values
df.loc[df["qty"] < 0, "qty"] = np.nan  # Negative quantities are invalid
df.loc[df["revenue"] < 0, "revenue"] = np.nan  # Negative revenue invalid
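Before overwriting invalid values, it is often worth keeping a flag column so the bad rows stay auditable. A small sketch on toy data (the `qty_was_invalid` column name is made up for illustration):

```python
import numpy as np
import pandas as pd

audit = pd.DataFrame({"qty": [150, -5, 120]})

# Record which rows were invalid before nulling them out
audit["qty_was_invalid"] = audit["qty"] < 0
audit.loc[audit["qty"] < 0, "qty"] = np.nan

print(audit["qty_was_invalid"].sum())  # 1 invalid row, still traceable
```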

# Step 7: Handle Missing Values

Different strategies for different columns:

```python
# Strategy 1: Drop rows where critical fields are missing
critical_columns = ["date", "region", "product"]
before = len(df)
df = df.dropna(subset=critical_columns)
print(f"Dropped {before - len(df)} rows with missing critical fields")

# Strategy 2: Fill numeric columns with 0
df["qty"] = df["qty"].fillna(0).astype(int)

# Strategy 3: Fill text with 'Unknown'
df["rep_name"] = df["rep_name"].fillna("Unknown")

# Strategy 4: Leave revenue as NaN (better to have no data than wrong data)
# Will be excluded from calculations automatically
```

# Choosing a Strategy

| Column type | Strategy | When to use |
|---|---|---|
| ID / key fields | Drop row | Cannot aggregate without identity |
| Dates | Drop or impute from context | Critical for time series |
| Quantities | Fill with 0 or drop | Zero is a valid business value |
| Revenue | Leave as NaN | Better to exclude than guess |
| Text / names | Fill with "Unknown" | Preserves row for aggregation |
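One concrete example of "impute from context": if unit price is stable per product, missing revenue can be estimated from quantity. This is a sketch under that assumption on toy data, not part of the pipeline above:

```python
import numpy as np
import pandas as pd

rev = pd.DataFrame({
    "product": ["Widget Pro", "Widget Pro", "Widget Pro"],
    "qty": [150, 175, 160],
    "revenue": [4500.0, 5250.0, np.nan],
})

# Median unit price per product, computed only from rows where revenue is known
unit_price = (rev["revenue"] / rev["qty"]).groupby(rev["product"]).transform("median")
rev["revenue"] = rev["revenue"].fillna(rev["qty"] * unit_price)
print(rev["revenue"].tolist())  # [4500.0, 5250.0, 4800.0]
```

Only do this when the price-stability assumption actually holds; otherwise leaving NaN is safer.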

# Step 8: Remove Duplicates

```python
# Exact duplicates
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} exact duplicates")

# Business-key duplicates (same date + region + product + rep)
before = len(df)
df = df.drop_duplicates(subset=["date", "region", "product", "rep_name"], keep="last")
print(f"Removed {before - len(df)} business-key duplicates")
```
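Note that `keep="last"` assumes the later row is the corrected one — worth verifying before dropping anything. `duplicated(keep=False)` marks every member of each duplicate group so you can review them side by side (toy data below):

```python
import pandas as pd

demo = pd.DataFrame({
    "date": ["2026-03-01", "2026-03-01", "2026-03-02"],
    "region": ["North", "North", "South"],
    "revenue": [4500, 4500.00, 1800],
})

# keep=False flags ALL rows in each duplicate group, not just the extras
dupes = demo[demo.duplicated(subset=["date", "region"], keep=False)]
print(dupes)  # both 2026-03-01 North rows, for manual review
```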

# Step 9: Validate the Result

```python
def validate_clean_data(df):
    """Run validation checks on cleaned data."""
    checks = {
        "No null dates": df["date"].notna().all(),
        "No null regions": df["region"].notna().all(),
        "No null products": df["product"].notna().all(),
        "No negative quantities": (df["qty"] >= 0).all(),
        "No duplicate keys": not df.duplicated(subset=["date", "region", "product", "rep_name"]).any(),
        "Dates in valid range": df["date"].between("2020-01-01", "2030-12-31").all(),
        "Revenue non-negative (where present)": df["revenue"].dropna().ge(0).all(),
    }

    print("\nValidation Results:")
    all_passed = True
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}")
        if not passed:
            all_passed = False

    return all_passed

validate_clean_data(df)
```

```text
Validation Results:
  ✓ No null dates
  ✓ No null regions
  ✓ No null products
  ✓ No negative quantities
  ✓ No duplicate keys
  ✓ Dates in valid range
  ✓ Revenue non-negative (where present)
```

# The Complete Cleaning Pipeline

```python
"""
Data Cleaning Pipeline
Reads messy Excel data, cleans all common issues, validates, and exports.
"""
import pandas as pd
import numpy as np


def clean_excel_data(input_path, output_path):
    """Full cleaning pipeline: load → clean → validate → export."""

    # Load
    df = pd.read_excel(input_path)
    print(f"Loaded: {len(df)} rows, {len(df.columns)} columns")

    # Standardise columns
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Strip whitespace from strings
    str_cols = df.select_dtypes(include="object").columns
    for col in str_cols:
        df[col] = df[col].astype(str).str.strip()

    # Replace placeholder values with NaN
    df = df.replace({"nan": np.nan, "": np.nan, "N/A": np.nan, "n/a": np.nan, "None": np.nan})

    # Standardise text casing
    for col in ["region", "product", "rep_name"]:
        if col in df.columns:
            df[col] = df[col].str.title()

    # Parse dates (format="mixed" needs pandas >= 2.0; it infers each value's format separately)
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce", format="mixed")

    # Fix numeric columns
    for col in ["qty", "revenue"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
            df.loc[df[col] < 0, col] = np.nan

    # Handle missing values
    df = df.dropna(subset=["date", "region", "product"])
    df["qty"] = df["qty"].fillna(0).astype(int)
    df["rep_name"] = df["rep_name"].fillna("Unknown")

    # Remove duplicates
    df = df.drop_duplicates()
    df = df.drop_duplicates(
        subset=["date", "region", "product", "rep_name"], keep="last"
    )

    # Sort
    df = df.sort_values(["date", "region"]).reset_index(drop=True)

    # Export
    df.to_excel(output_path, index=False)
    print(f"Cleaned: {len(df)} rows exported to {output_path}")

    return df


if __name__ == "__main__":
    clean_df = clean_excel_data("messy_sales.xlsx", "clean_sales.xlsx")
```

# Before and After

| Issue | Before | After |
|---|---|---|
| Total rows | 8 | 5 (duplicates and invalid rows removed) |
| Date formats | Mixed (2026-03-01, 03/01/2026) | Consistent (datetime64) |
| Region casing | north, NORTH, North | North |
| Missing quantities | Empty cells | 0 |
| Invalid revenue | N/A | NaN (excluded from calculations) |
| Duplicates | 2 exact + 1 business-key | 0 |

# Common Patterns Reference

| Problem | pandas solution |
|---|---|
| Strip whitespace | `df[col].str.strip()` |
| Fix casing | `df[col].str.title()` or `.str.lower()` |
| Parse dates | `pd.to_datetime(col, errors="coerce")` |
| Fix numbers | `pd.to_numeric(col, errors="coerce")` |
| Fill blanks | `df[col].fillna(value)` |
| Remove duplicates | `df.drop_duplicates(subset=[...])` |
| Replace values | `df.replace({"old": "new"})` |
| Drop invalid rows | `df.dropna(subset=[...])` |
| Validate range | `df[col].between(min, max)` |

# Next Steps

Clean data is the prerequisite for everything — dashboards, reports, analysis, machine learning. Without it, every downstream output is unreliable.

This cleaning pipeline is reusable. Point it at any Excel file, adjust the column names, and the same patterns apply.

For building dashboards from cleaned data, see How to Build a Data Dashboard Without Manual Excel Work. For designing the full pipeline architecture, see How to Design Data Pipelines for Reliable Reporting Systems.

Data & dashboard services include data cleaning as the first step in every pipeline build.

Get in touch to discuss cleaning and automating your data workflows.
