Python CSV cleaner

Build a simple CSV cleaner web app with Streamlit + Pandas

Meta description suggestion: Turn a small Pandas cleaning script into a Streamlit web app: upload any CSV, remove empty rows and duplicates, preview the result, and download a cleaned file in seconds.

Why I built this

CSV files are everywhere. They’re also a little chaotic on a good day. You’ll open one and find blank rows, repeated records, odd spacing, or columns that look fine until your import tool refuses them.

I wanted a tiny “do the obvious cleanup” button that works for any CSV: upload → clean → preview → download. No hunting for the right Excel filter, no redoing the same steps every week.

What this app does

The app gives you a few practical cleaning options that cover the most common CSV headaches:

  • Remove empty rows (rows where every value is blank)
  • Remove duplicate rows (or duplicates based on selected columns)
  • Trim whitespace (so “ABC ” and “ABC” don’t behave like different values)
  • Optional column name normalization (lowercase + underscores)

It also shows a quick before/after preview so you can sanity-check the result before downloading.
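For a feel of what those options do under the hood, here is a minimal Pandas sketch of the same four steps on a toy frame (the app's full version, with its option flags, comes below):

```python
import pandas as pd

# Toy frame with the usual suspects: stray spaces, a duplicate, a blank row.
df = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", None],
    "city": ["NYC", "LA", "LA", None],
})

df = df.apply(lambda col: col.str.strip())   # trim whitespace (string columns)
df = df.replace({"": pd.NA})                 # treat empty strings as missing
df = df.dropna(how="all")                    # drop rows where every value is blank
df = df.drop_duplicates(keep="first")        # drop exact duplicate rows

print(df)  # two rows survive: Alice/NYC and Bob/LA
```

The order matters: trimming first means "Bob " and "Bob" collapse into one value before the duplicate check runs.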

The solution: Streamlit for the UI, Pandas for the cleanup

Streamlit makes it easy to build a small web interface without standing up a whole web framework. Pandas handles the data transformations. The end result is a single Python file you can run locally or deploy somewhere internal.

Annotated code

"""
CSV Cleaner (Streamlit + Pandas)

Upload any CSV, apply common cleaning steps, preview changes,
then download a cleaned CSV.

Run:
  pip install streamlit pandas
  streamlit run app.py
"""

import io
import re
from datetime import datetime

import pandas as pd
import streamlit as st


def safe_filename(name: str) -> str:
    """Clean up a filename so downloads behave nicely across OSes."""
    name = name.strip()
    name = re.sub(r"[^a-zA-Z0-9._-]+", "_", name)
    return name or "cleaned.csv"


@st.cache_data(show_spinner=False)
def load_csv(file_bytes: bytes, encoding: str, sep: str) -> pd.DataFrame:
    """
    Read CSV bytes into a DataFrame.
    Cached so you can toggle options without re-reading the file.
    """
    buffer = io.BytesIO(file_bytes)
    return pd.read_csv(
        buffer,
        encoding=encoding,
        sep=sep,
        on_bad_lines="skip",
        dtype=str,  # keep everything as text to avoid type surprises
    )


def clean_df(df: pd.DataFrame,
             drop_empty_rows: bool,
             drop_duplicate_rows: bool,
             trim_whitespace: bool,
             normalize_colnames: bool,
             subset_for_dupes=None) -> pd.DataFrame:
    """Apply selected cleaning steps in a predictable order."""
    cleaned = df.copy()

    if trim_whitespace:
        # DataFrame.map replaced the deprecated applymap in pandas 2.1;
        # on older versions, use cleaned.applymap(...) instead.
        cleaned = cleaned.map(lambda x: x.strip() if isinstance(x, str) else x)
        cleaned = cleaned.replace({"": pd.NA})

    if drop_empty_rows:
        cleaned = cleaned.dropna(how="all")

    if drop_duplicate_rows:
        if subset_for_dupes:
            cleaned = cleaned.drop_duplicates(subset=subset_for_dupes, keep="first")
        else:
            cleaned = cleaned.drop_duplicates(keep="first")

    # Rename last: the duplicate-subset columns are picked from the original
    # names in the UI, so renaming first would make them unmatchable.
    if normalize_colnames:
        cleaned.columns = [
            re.sub(r"\s+", " ", str(c).strip()).lower().replace(" ", "_")
            for c in cleaned.columns
        ]

    return cleaned


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Convert DataFrame to UTF-8 CSV bytes for download."""
    return df.to_csv(index=False, lineterminator="\n").encode("utf-8")


st.set_page_config(page_title="CSV Cleaner", page_icon="🧼")
st.title("🧼 CSV cleaner")

uploaded_file = st.file_uploader("Upload a CSV file", type=["csv"])
if not uploaded_file:
    st.info("Upload a CSV to get started.")
    st.stop()

encoding = st.selectbox("Encoding", ["utf-8", "utf-8-sig", "latin-1"], index=1)
sep = st.selectbox("Separator", [",", ";", "\t", "|"], index=0)

file_bytes = uploaded_file.getvalue()
df = load_csv(file_bytes, encoding=encoding, sep=sep)

drop_empty_rows = st.checkbox("Remove empty rows", value=True)
drop_duplicate_rows = st.checkbox("Remove duplicate rows", value=True)
trim_whitespace = st.checkbox("Trim whitespace", value=True)
normalize_colnames = st.checkbox("Normalize column names", value=False)

subset_for_dupes = None
if drop_duplicate_rows and len(df.columns) > 1:
    subset_for_dupes = st.multiselect("Columns for duplicate detection", list(df.columns))

df_cleaned = clean_df(
    df,
    drop_empty_rows=drop_empty_rows,
    drop_duplicate_rows=drop_duplicate_rows,
    trim_whitespace=trim_whitespace,
    normalize_colnames=normalize_colnames,
    subset_for_dupes=subset_for_dupes or None,
)

st.write(f"Rows: {len(df):,} → {len(df_cleaned):,}")

st.subheader("Before")
st.dataframe(df.head(20), use_container_width=True)
st.subheader("After")
st.dataframe(df_cleaned.head(20), use_container_width=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
base_name = uploaded_file.name.rsplit(".", 1)[0]
download_name = safe_filename(f"{base_name}_cleaned_{timestamp}.csv")

st.download_button(
    "Download cleaned CSV",
    data=to_csv_bytes(df_cleaned),
    file_name=download_name,
    mime="text/csv",
)

How to run it

  1. Create a file named app.py and paste the script in.
  2. Install the packages:
    pip install streamlit pandas
  3. Start the app:
    streamlit run app.py
  4. Upload a CSV, choose your cleaning options, and download the cleaned file.

What you get out of it (benefits)

  • Repeatable cleanup: same steps every time, fewer weird one-off fixes.
  • Fast feedback: preview before/after so you can catch “oops” moments early.
  • Less tool-hopping: no bouncing between spreadsheets, scripts, and random online converters.
  • Friendly output: clean CSV download with a timestamped filename.

Common ways people use this

  • Cleaning exports before importing into another system
  • Removing accidental duplicates after copy/paste merges
  • Normalizing text fields before doing any analysis
  • Quick “is this file sane?” checks using the preview

Pros and cons

Pros

  • Small, readable code (easy to tweak for your workflow)
  • Works for almost any CSV without special setup
  • Does the boring cleanup fast
  • You can run it locally if you prefer keeping data on your machine

Cons

  • Huge files can be slow: Pandas loads the full CSV into memory.
  • Duplicates are situational: “duplicate row” isn’t always “bad row,” so choose your de-dupe columns carefully.
  • Skipping bad lines is a trade-off: it keeps things moving, but it can hide formatting issues unless you add reporting.
  • Everything loads as text: great for safety, but you might want typed columns later (numbers, dates, etc.).
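If you do want typed columns downstream, a gentle follow-up step is to convert individual columns after cleaning. A small sketch (the `qty` and `signup` column names are just examples, not part of the app):

```python
import pandas as pd

cleaned = pd.DataFrame({
    "qty": ["1", "2", "oops"],
    "signup": ["2024-01-05", "2024-02-10", "not a date"],
})

# errors="coerce" turns unparseable values into NaN/NaT instead of raising,
# so one bad cell doesn't kill the whole conversion.
cleaned["qty"] = pd.to_numeric(cleaned["qty"], errors="coerce")
cleaned["signup"] = pd.to_datetime(cleaned["signup"], errors="coerce")
```

Coercing rather than raising keeps the "works for any CSV" spirit: bad cells become missing values you can inspect, instead of errors that stop the app.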

Easy upgrades if you want more

  • Add a cleaning summary: “X empty rows removed, Y duplicates removed” (people love receipts).
  • Required column checks: show a warning if key columns are missing.
  • Column mapping: rename columns to a standard format for downstream tools.
  • Big file support: switch to chunked processing or use Polars for speed.
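The cleaning-summary upgrade is mostly a matter of counting rows between steps. One possible sketch, outside the app:

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "1", None], "b": ["x", "x", None]})

rows_before = len(df)
no_empties = df.dropna(how="all")
empty_removed = rows_before - len(no_empties)

deduped = no_empties.drop_duplicates()
dupes_removed = len(no_empties) - len(deduped)

summary = f"{empty_removed} empty rows removed, {dupes_removed} duplicates removed"
print(summary)  # in the app you might show this with st.success(summary)
```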

Wrap-up

This app won’t solve every data problem, but it does crush the common “CSV mess” tasks without turning your day into a spreadsheet scavenger hunt. If you clean the same kind of file more than once, it’s probably worth making it a button.
