Pydantic: your bouncer for messy data (with real examples)
Why validation saves your sanity
Pydantic is a guardrail for input data. You describe the shape using Python type hints,
and Pydantic checks it at runtime. When something’s off, you get a clean ValidationError
that tells you exactly what failed.
You write a model that says “this is what valid data looks like.” Pydantic makes sure the real world behaves (or at least complains loudly and specifically when it doesn’t).
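A minimal sketch of the idea (the Point model is made up for illustration):

from pydantic import BaseModel, ValidationError

class Point(BaseModel):
    x: int
    y: int

print(Point.model_validate({"x": 1, "y": "2"}))  # ok: "2" is coerced to int 2
try:
    Point.model_validate({"x": 1, "y": "two"})
except ValidationError as e:
    print(e)  # names the exact field and the reason it failed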
The problem scenario
You’re dealing with inputs from one of these usual suspects:
- API request bodies (JSON payloads)
- CSV exports from “that system”
- Config files and environment variables
- Events/messages from queues or logs
And the data is… let’s call it “creative.” Missing fields, wrong types, extra keys, dates as strings, numbers that show up as text, and so on.
Bad data rarely fails at the boundary. It sneaks in, then explodes three functions later when you’re least emotionally prepared.
Install
pip install pydantic
pip install pydantic-settings   # optional, for typed settings
pip install "pydantic[email]"   # optional, pulls in email-validator for EmailStr
Pydantic v2 uses model_validate() / model_dump(). Under the hood, the core validation
is implemented in Rust, so it’s both strict and fast.
Examples you can copy/paste
Try each example in a scratch file and run it. The error messages are half the value here—they teach you what broke.
1) Validate API-like inputs with a model
Define a schema using BaseModel. Add constraints with Field().
Decide what to do with extra keys. For APIs, extra="forbid" is a solid default:
unexpected fields don’t quietly sneak into your system.
from datetime import datetime, timezone

from pydantic import BaseModel, ConfigDict, EmailStr, Field, ValidationError, field_validator

class UserCreate(BaseModel):
    model_config = ConfigDict(extra="forbid")

    email: EmailStr
    age: int = Field(ge=13, le=120)
    display_name: str = Field(min_length=2, max_length=40)
    # datetime.utcnow() is deprecated; use an aware UTC timestamp instead
    signup_ts: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("display_name")
    @classmethod
    def no_blank_names(cls, v: str) -> str:
        v2 = v.strip()
        if not v2:
            raise ValueError("display_name cannot be blank")
        return v2
good = {"email": "sam@example.com", "age": 29, "display_name": " Sam "}
user = UserCreate.model_validate(good)
print(user.model_dump())

bad = {"email": "nope", "age": 9, "display_name": " ", "admin": True}
try:
    UserCreate.model_validate(bad)
except ValidationError as e:
    print(e)
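If you want the failures as data rather than text (to return them from an API, say), ValidationError also exposes them in structured form:

try:
    UserCreate.model_validate(bad)
except ValidationError as e:
    print(e.errors())  # list of dicts with loc, msg, type, and the offending input
    print(e.json())    # the same details as a JSON string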
2) Cross-field rules (the “business logic” part)
Some rules need multiple fields: date ranges, min/max pairs, or “if A is set then B must be set.”
That’s what model_validator is for.
from datetime import datetime

from pydantic import BaseModel, ValidationError, model_validator

class DateRange(BaseModel):
    start: datetime
    end: datetime

    @model_validator(mode="after")
    def validate_range(self) -> "DateRange":
        if self.end <= self.start:
            raise ValueError("end must be after start")
        return self
try:
    DateRange.model_validate({"start": "2025-01-02T10:00:00", "end": "2025-01-02T09:00:00"})
except ValidationError as e:
    print(e)
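The "if A is set then B must be set" pattern works the same way. A hypothetical sketch (the Discount model and its fields are invented for illustration):

from typing import Optional

from pydantic import BaseModel, model_validator

class Discount(BaseModel):
    percent_off: Optional[float] = None
    coupon_code: Optional[str] = None

    @model_validator(mode="after")
    def code_required_for_discount(self) -> "Discount":
        # if a discount is requested, a coupon code must accompany it
        if self.percent_off is not None and self.coupon_code is None:
            raise ValueError("coupon_code is required when percent_off is set")
        return self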
3) Data pipeline validation (CSV/JSON rows)
This is the sweet spot for analytics and ETL: validate each row, keep the good ones, and collect structured errors for the bad ones. You can also make fields strict when you want problems to be loud.
from pydantic import BaseModel, ConfigDict, Field, StrictInt, ValidationError

class SalesRow(BaseModel):
    model_config = ConfigDict(extra="ignore")

    order_id: StrictInt  # strict: a string like "1002" is rejected, not coerced
    sku: str = Field(min_length=3)
    quantity: int = Field(gt=0)
    unit_price: float = Field(gt=0)

def validate_rows(rows):
    valid, errors = [], []
    for i, raw in enumerate(rows):
        try:
            valid.append(SalesRow.model_validate(raw))
        except ValidationError as e:
            errors.append({"row_index": i, "issues": e.errors(), "input": raw})
    return valid, errors
extra="ignore" here?
Pipelines often get “bonus columns” over time. Ignoring unknown keys keeps you resilient while you decide what to do with them. For APIs, I’m stricter.
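A quick run with invented sample rows:

rows = [
    {"order_id": 1001, "sku": "ABC-1", "quantity": 2, "unit_price": 9.99, "region": "EU"},  # extra key ignored
    {"order_id": "1002", "sku": "ABC-2", "quantity": 1, "unit_price": 4.50},  # StrictInt rejects the string id
    {"order_id": 1003, "sku": "A", "quantity": 0, "unit_price": 4.50},  # sku too short, quantity not > 0
]

valid, errors = validate_rows(rows)
print(f"{len(valid)} valid rows")  # 1 valid row
for err in errors:
    print(err["row_index"], [issue["msg"] for issue in err["issues"]])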
4) Typed settings from environment variables
Pydantic Settings lets you define config once and load it from environment variables or a .env file.
It’s a great way to stop “config as string soup.”
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    app_name: str = "MyApp"
    database_url: str  # no default: must come from the environment or .env
    timeout_seconds: int = Field(default=10, ge=1, le=120)

settings = AppSettings()
print(settings.model_dump())
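By default, field names map to environment variables case-insensitively, so DATABASE_URL feeds database_url. A sketch (the values are just examples):

import os

os.environ["DATABASE_URL"] = "postgresql://localhost/app"
os.environ["TIMEOUT_SECONDS"] = "30"  # env vars are strings; Pydantic coerces and bounds-checks them

settings = AppSettings()
print(settings.timeout_seconds)  # 30, validated against ge=1, le=120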
Good default for APIs: extra="forbid"
- Reject surprises early
- Cleaner contracts

Good default for pipelines: extra="ignore"
- More resilient to upstream changes
- Collect errors without stopping everything
Benefits
- Fast feedback: clear ValidationError messages and structured details.
- Fewer mystery bugs: stop bad data at the door instead of debugging it later.
- Better contracts: models can produce JSON Schema for docs and interoperability (see the snippet after this list).
- Speed: Pydantic v2 validation is powered by a Rust core.
- Cleaner code: rules live next to the model, not scattered across your project.
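Generating that schema is one method call; reusing UserCreate from the first example:

import json

print(json.dumps(UserCreate.model_json_schema(), indent=2))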
Practical use cases
- API validation: validate request bodies and return friendly errors automatically (FastAPI does this well; see the sketch after this list).
- Analytics pipelines: validate rows, keep good data, log structured errors for the rest.
- Config management: typed settings from environment variables, with bounds checking.
- Data contracts: generate JSON Schema to share “what we expect” with other teams.
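For the API case, a minimal sketch of what FastAPI does with a Pydantic model (assuming FastAPI is installed, reusing UserCreate from the first example; the route path is made up):

from fastapi import FastAPI

app = FastAPI()

@app.post("/users")
def create_user(user: UserCreate):
    # FastAPI validates the JSON body against UserCreate for you;
    # invalid payloads get a 422 response with the structured error details
    return user.model_dump()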