Python CSV cleaner
Build a simple CSV cleaner web app with Streamlit + Pandas
Why I built this
CSV files are everywhere. They’re also a little chaotic on a good day. You’ll open one and find blank rows, repeated records, odd spacing, or columns that look fine until your import tool refuses them.
I wanted a tiny “do the obvious cleanup” button that works for any CSV: upload → clean → preview → download. No hunting for the right Excel filter, no redoing the same steps every week.
What this app does
The app gives you a few practical cleaning options that cover the most common CSV headaches:
- Remove empty rows (rows where every value is blank)
- Remove duplicate rows (or duplicates based on selected columns)
- Trim whitespace (so “ABC ” and “ABC” don’t behave like different values)
- Optional column name normalization (lowercase + underscores)
It also shows a quick before/after preview so you can sanity-check the result before downloading.
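In pandas terms, the options above boil down to a handful of calls. Here's a minimal sketch on a toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy frame with the usual CSV mess: stray spaces, a blank row, a duplicate.
df = pd.DataFrame({
    "Name ": ["Ada ", "Ada ", None, "Bob"],
    " City": ["Paris", "Paris", None, "Oslo "],
})

df.columns = [c.strip().lower() for c in df.columns]  # normalize headers
df = df.apply(lambda s: s.str.strip())                # trim whitespace
df = df.replace({"": pd.NA}).dropna(how="all")        # drop fully empty rows
df = df.drop_duplicates(keep="first")                 # drop duplicate rows

print(df)
```

The full app wraps these same steps in checkboxes, but each one is an ordinary one-liner.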
The solution: Streamlit for the UI, Pandas for the cleanup
Streamlit makes it easy to build a little web interface without building a whole web interface. Pandas handles the data transformations. The end result is a single Python file you can run locally or deploy somewhere internal.
Annotated code
"""
CSV Cleaner (Streamlit + Pandas)
Upload any CSV, apply common cleaning steps, preview changes,
then download a cleaned CSV.
Run:
pip install streamlit pandas
streamlit run app.py
"""
import io
import re
from datetime import datetime
import pandas as pd
import streamlit as st
def safe_filename(name: str) -> str:
"""Clean up a filename so downloads behave nicely across OSes."""
name = name.strip()
name = re.sub(r"[^a-zA-Z0-9._-]+", "_", name)
return name or "cleaned.csv"
@st.cache_data(show_spinner=False)
def load_csv(file_bytes: bytes, encoding: str, sep: str) -> pd.DataFrame:
"""
Read CSV bytes into a DataFrame.
Cached so you can toggle options without re-reading the file.
"""
buffer = io.BytesIO(file_bytes)
return pd.read_csv(
buffer,
encoding=encoding,
sep=sep,
on_bad_lines="skip",
dtype=str, # keep everything as text to avoid type surprises
)
def clean_df(df: pd.DataFrame,
drop_empty_rows: bool,
drop_duplicate_rows: bool,
trim_whitespace: bool,
normalize_colnames: bool,
subset_for_dupes=None) -> pd.DataFrame:
"""Apply selected cleaning steps in a predictable order."""
cleaned = df.copy()
if normalize_colnames:
cleaned.columns = [
re.sub(r"\s+", " ", str(c).strip()).lower().replace(" ", "_")
for c in cleaned.columns
]
if trim_whitespace:
cleaned = cleaned.applymap(lambda x: x.strip() if isinstance(x, str) else x)
cleaned = cleaned.replace({"": pd.NA})
if drop_empty_rows:
cleaned = cleaned.dropna(how="all")
if drop_duplicate_rows:
if subset_for_dupes:
cleaned = cleaned.drop_duplicates(subset=subset_for_dupes, keep="first")
else:
cleaned = cleaned.drop_duplicates(keep="first")
return cleaned
def to_csv_bytes(df: pd.DataFrame) -> bytes:
"""Convert DataFrame to UTF-8 CSV bytes for download."""
return df.to_csv(index=False, lineterminator="\n").encode("utf-8")
st.set_page_config(page_title="CSV Cleaner", page_icon="🧼")
st.title("🧼 CSV cleaner")
uploaded_file = st.file_uploader("Upload a CSV file", type=["csv"])
if not uploaded_file:
st.info("Upload a CSV to get started.")
st.stop()
encoding = st.selectbox("Encoding", ["utf-8", "utf-8-sig", "latin-1"], index=1)
sep = st.selectbox("Separator", [",", ";", "\t", "|"], index=0)
file_bytes = uploaded_file.getvalue()
df = load_csv(file_bytes, encoding=encoding, sep=sep)
drop_empty_rows = st.checkbox("Remove empty rows", value=True)
drop_duplicate_rows = st.checkbox("Remove duplicate rows", value=True)
trim_whitespace = st.checkbox("Trim whitespace", value=True)
normalize_colnames = st.checkbox("Normalize column names", value=False)
subset_for_dupes = None
if drop_duplicate_rows and len(df.columns) > 1:
subset_for_dupes = st.multiselect("Columns for duplicate detection", list(df.columns))
df_cleaned = clean_df(
df,
drop_empty_rows=drop_empty_rows,
drop_duplicate_rows=drop_duplicate_rows,
trim_whitespace=trim_whitespace,
normalize_colnames=normalize_colnames,
subset_for_dupes=subset_for_dupes or None,
)
st.write(f"Rows: {len(df):,} → {len(df_cleaned):,}")
st.dataframe(df.head(20), use_container_width=True)
st.dataframe(df_cleaned.head(20), use_container_width=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
base_name = uploaded_file.name.rsplit(".", 1)[0]
download_name = safe_filename(f"{base_name}_cleaned_{timestamp}.csv")
st.download_button(
"Download cleaned CSV",
data=to_csv_bytes(df_cleaned),
file_name=download_name,
mime="text/csv",
)
How to run it
- Create a file named app.py and paste the script in.
- Install the packages: pip install streamlit pandas
- Start the app: streamlit run app.py
- Upload a CSV, choose your cleaning options, and download the cleaned file.
What you get out of it (benefits)
- Repeatable cleanup: same steps every time, fewer weird one-off fixes.
- Fast feedback: preview before/after so you can catch “oops” moments early.
- Less tool-hopping: no bouncing between spreadsheets, scripts, and random online converters.
- Friendly output: clean CSV download with a timestamped filename.
Common ways people use this
- Cleaning exports before importing into another system
- Removing accidental duplicates after copy/paste merges
- Normalizing text fields before doing any analysis
- Quick “is this file sane?” checks using the preview
Pros and cons
Pros
- Small, readable code (easy to tweak for your workflow)
- Works for almost any CSV without special setup
- Does the boring cleanup fast
- You can run it locally if you prefer keeping data on your machine
Cons
- Huge files can be slow: Pandas loads the full CSV into memory.
- Duplicates are situational: “duplicate row” isn’t always “bad row,” so choose your de-dupe columns carefully.
- Skipping bad lines is a trade-off: it keeps things moving, but it can hide formatting issues unless you add reporting.
- Everything loads as text: great for safety, but you might want typed columns later (numbers, dates, etc.).
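On that last point: since every column comes back as text, you can bolt type conversion on after the cleanup. A rough sketch (column names are illustrative, not from the app):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["10.5", "oops", "3"],
    "signup_date": ["2024-01-05", "not a date", "2024-02-10"],
})

# errors="coerce" turns unparseable values into NaN/NaT instead of raising,
# so one bad cell doesn't kill the whole conversion.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

print(df.dtypes)
```

The coerced NaN/NaT values also double as a cheap report of which cells failed to parse.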
Easy upgrades if you want more
- Add a cleaning summary: “X empty rows removed, Y duplicates removed” (people love receipts).
- Required column checks: show a warning if key columns are missing.
- Column mapping: rename columns to a standard format for downstream tools.
- Big file support: switch to chunked processing or use Polars for speed.
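For the big-file upgrade, one option is to stream the CSV in chunks so only a slice is ever in memory, tracking row hashes across chunks to catch duplicates. A sketch of the idea, not wired into the app above (the in-memory buffer stands in for a large file on disk):

```python
import io
import pandas as pd

# Stand-in for a large file; in practice you'd pass a path to read_csv.
raw = io.StringIO("a,b\n1,x\n1,x\n,\n2,y\n1,x\n")

seen = set()
cleaned_chunks = []
for chunk in pd.read_csv(raw, dtype=str, chunksize=2):
    chunk = chunk.dropna(how="all")  # drop fully empty rows
    # Hash each row so duplicates are caught even across chunk boundaries.
    hashes = pd.util.hash_pandas_object(chunk, index=False)
    keep = ~hashes.isin(seen) & ~hashes.duplicated()
    seen.update(hashes[keep])
    cleaned_chunks.append(chunk[keep])

result = pd.concat(cleaned_chunks, ignore_index=True)
print(result)
```

The trade-off: the set of hashes still grows with the number of unique rows, but that's far smaller than holding every column of every row in a DataFrame.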