Python CSV cleaner
Build a simple CSV cleaner web app with Streamlit + Pandas
Why I built this
CSV files are everywhere. They’re also a little chaotic on a good day. You’ll open one and find blank rows, repeated records, odd spacing, or columns that look fine until your import tool refuses them.
I wanted a tiny “do the obvious cleanup” button that works for any CSV: upload → clean → preview → download. No hunting for the right Excel filter, no redoing the same steps every week.
What this app does
The app gives you a few practical cleaning options that cover the most common CSV headaches:
- Remove empty rows (rows where every value is blank)
- Remove duplicate rows (or duplicates based on selected columns)
- Trim whitespace (so “ABC ” and “ABC” don’t behave like different values)
- Optional column name normalization (lowercase + underscores)
It also shows a quick before/after preview so you can sanity-check the result before downloading.
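In pandas terms, the options above boil down to a handful of calls. Here's a minimal sketch on a toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy frame with the usual CSV mess: stray spaces, a blank row, a duplicate.
df = pd.DataFrame({
    "Name ": ["Ada ", "Ada ", None, "Bob"],
    " City": ["Paris", "Paris", None, "Oslo "],
})

df.columns = [c.strip().lower() for c in df.columns]  # normalize headers
df = df.apply(lambda s: s.str.strip())                # trim whitespace
df = df.replace({"": pd.NA}).dropna(how="all")        # drop fully empty rows
df = df.drop_duplicates(keep="first")                 # drop duplicate rows

print(df)
```

The full app wraps these same steps in checkboxes, but each one is an ordinary one-liner.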
The solution: Streamlit for the UI, Pandas for the cleanup
Streamlit makes it easy to build a little web interface without building a whole web interface. Pandas handles the data transformations. The end result is a single Python file you can run locally or deploy somewhere internal.
Annotated code
"""
CSV Cleaner (Streamlit + Pandas)
Upload any CSV, apply common cleaning steps, preview changes,
then download a cleaned CSV.
Run:
pip install streamlit pandas
streamlit run app.py
"""
import io
import re
from datetime import datetime
import pandas as pd
import streamlit as st
def safe_filename(name: str) -> str:
"""Clean up a filename so downloads behave nicely across OSes."""
name = name.strip()
name = re.sub(r"[^a-zA-Z0-9._-]+", "_", name)
return name or "cleaned.csv"
@st.cache_data(show_spinner=False)
def load_csv(file_bytes: bytes, encoding: str, sep: str) -> pd.DataFrame:
"""
Read CSV bytes into a DataFrame.
Cached so you can toggle options without re-reading the file.
"""
buffer = io.BytesIO(file_bytes)
return pd.read_csv(
buffer,
encoding=encoding,
sep=sep,
on_bad_lines="skip",
dtype=str, # keep everything as text to avoid type surprises
)
def clean_df(df: pd.DataFrame,
drop_empty_rows: bool,
drop_duplicate_rows: bool,
trim_whitespace: bool,
normalize_colnames: bool,
subset_for_dupes=None) -> pd.DataFrame:
"""Apply selected cleaning steps in a predictable order."""
cleaned = df.copy()
if normalize_colnames:
cleaned.columns = [
re.sub(r"\s+", " ", str(c).strip()).lower().replace(" ", "_")
for c in cleaned.columns
]
if trim_whitespace:
cleaned = cleaned.applymap(lambda x: x.strip() if isinstance(x, str) else x)
cleaned = cleaned.replace({"": pd.NA})
if drop_empty_rows:
cleaned = cleaned.dropna(how="all")
if drop_duplicate_rows:
if subset_for_dupes:
cleaned = cleaned.drop_duplicates(subset=subset_for_dupes, keep="first")
else:
cleaned = cleaned.drop_duplicates(keep="first")
return cleaned
def to_csv_bytes(df: pd.DataFrame) -> bytes:
"""Convert DataFrame to UTF-8 CSV bytes for download."""
return df.to_csv(index=False, lineterminator="\n").encode("utf-8")
st.set_page_config(page_title="CSV Cleaner", page_icon="🧼")
st.title("🧼 CSV cleaner")
uploaded_file = st.file_uploader("Upload a CSV file", type=["csv"])
if not uploaded_file:
st.info("Upload a CSV to get started.")
st.stop()
encoding = st.selectbox("Encoding", ["utf-8", "utf-8-sig", "latin-1"], index=1)
sep = st.selectbox("Separator", [",", ";", "\t", "|"], index=0)
file_bytes = uploaded_file.getvalue()
df = load_csv(file_bytes, encoding=encoding, sep=sep)
drop_empty_rows = st.checkbox("Remove empty rows", value=True)
drop_duplicate_rows = st.checkbox("Remove duplicate rows", value=True)
trim_whitespace = st.checkbox("Trim whitespace", value=True)
normalize_colnames = st.checkbox("Normalize column names", value=False)
subset_for_dupes = None
if drop_duplicate_rows and len(df.columns) > 1:
subset_for_dupes = st.multiselect("Columns for duplicate detection", list(df.columns))
df_cleaned = clean_df(
df,
drop_empty_rows=drop_empty_rows,
drop_duplicate_rows=drop_duplicate_rows,
trim_whitespace=trim_whitespace,
normalize_colnames=normalize_colnames,
subset_for_dupes=subset_for_dupes or None,
)
st.write(f"Rows: {len(df):,} → {len(df_cleaned):,}")
st.dataframe(df.head(20), use_container_width=True)
st.dataframe(df_cleaned.head(20), use_container_width=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
base_name = uploaded_file.name.rsplit(".", 1)[0]
download_name = safe_filename(f"{base_name}_cleaned_{timestamp}.csv")
st.download_button(
"Download cleaned CSV",
data=to_csv_bytes(df_cleaned),
file_name=download_name,
mime="text/csv",
)
How to run it
- Create a file named app.py and paste the script in.
- Install the packages: pip install streamlit pandas
- Start the app: streamlit run app.py
- Upload a CSV, choose your cleaning options, and download the cleaned file.
What you get out of it (benefits)
- Repeatable cleanup: same steps every time, fewer weird one-off fixes.
- Fast feedback: preview before/after so you can catch “oops” moments early.
- Less tool-hopping: no bouncing between spreadsheets, scripts, and random online converters.
- Friendly output: clean CSV download with a timestamped filename.
Common ways people use this
- Cleaning exports before importing into another system
- Removing accidental duplicates after copy/paste merges
- Normalizing text fields before doing any analysis
- Quick “is this file sane?” checks using the preview
Pros and cons
Pros
- Small, readable code (easy to tweak for your workflow)
- Works for almost any CSV without special setup
- Does the boring cleanup fast
- You can run it locally if you prefer keeping data on your machine
Cons
- Huge files can be slow: Pandas loads the full CSV into memory.
- Duplicates are situational: “duplicate row” isn’t always “bad row,” so choose your de-dupe columns carefully.
- Skipping bad lines is a trade-off: it keeps things moving, but it can hide formatting issues unless you add reporting.
- Everything loads as text: great for safety, but you might want typed columns later (numbers, dates, etc.).
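On that last point: since every column comes back as text, you can bolt type conversion on after the cleanup. A rough sketch (column names are illustrative, not from the app):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["10.5", "oops", "3"],
    "signup_date": ["2024-01-05", "not a date", "2024-02-10"],
})

# errors="coerce" turns unparseable values into NaN/NaT instead of raising,
# so one bad cell doesn't kill the whole conversion.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

print(df.dtypes)
```

The coerced NaN/NaT values also double as a cheap report of which cells failed to parse.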
Easy upgrades if you want more
- Add a cleaning summary: “X empty rows removed, Y duplicates removed” (people love receipts).
- Required column checks: show a warning if key columns are missing.
- Column mapping: rename columns to a standard format for downstream tools.
- Big file support: switch to chunked processing or use Polars for speed.
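For the big-file upgrade, one option is to stream the CSV in chunks so only a slice is ever in memory, tracking row hashes across chunks to catch duplicates. A sketch of the idea, not wired into the app above (the in-memory buffer stands in for a large file on disk):

```python
import io
import pandas as pd

# Stand-in for a large file; in practice you'd pass a path to read_csv.
raw = io.StringIO("a,b\n1,x\n1,x\n,\n2,y\n1,x\n")

seen = set()
cleaned_chunks = []
for chunk in pd.read_csv(raw, dtype=str, chunksize=2):
    chunk = chunk.dropna(how="all")  # drop fully empty rows
    # Hash each row so duplicates are caught even across chunk boundaries.
    hashes = pd.util.hash_pandas_object(chunk, index=False)
    keep = ~hashes.isin(seen) & ~hashes.duplicated()
    seen.update(hashes[keep])
    cleaned_chunks.append(chunk[keep])

result = pd.concat(cleaned_chunks, ignore_index=True)
print(result)
```

The trade-off: the set of hashes still grows with the number of unique rows, but that's far smaller than holding every column of every row in a DataFrame.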