Pydantic: your bouncer for messy data (with real examples)
Why validation saves your sanity
Pydantic is a guardrail for input data. You describe the shape using Python type hints,
and Pydantic checks it at runtime. When something’s off, you get a clean ValidationError
that tells you exactly what failed.
You write a model that says “this is what valid data looks like.” Pydantic makes sure the real world behaves (or at least complains loudly and specifically when it doesn’t).
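A minimal sketch of the idea (the Point model is made up for illustration):

from pydantic import BaseModel, ValidationError

class Point(BaseModel):
    x: int
    y: int

print(Point.model_validate({"x": 1, "y": "2"}))  # ok: "2" is coerced to int 2
try:
    Point.model_validate({"x": 1, "y": "two"})
except ValidationError as e:
    print(e)  # names the exact field and the reason it failed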
The problem scenario
You’re dealing with inputs from one of these usual suspects:
- API request bodies (JSON payloads)
- CSV exports from “that system”
- Config files and environment variables
- Events/messages from queues or logs
And the data is… let’s call it “creative.” Missing fields, wrong types, extra keys, dates as strings, numbers that show up as text, and so on.
Bad data rarely fails at the boundary. It sneaks in, then explodes three functions later when you’re least emotionally prepared.
Install
pip install pydantic
pip install pydantic-settings   # optional, for typed settings
pip install "pydantic[email]"   # optional, pulls in email-validator for EmailStr
Pydantic v2 uses model_validate() / model_dump(). Under the hood, the core validation
is implemented in Rust, so it’s both strict and fast.
Examples you can copy/paste
Try each example in a scratch file and run it. The error messages are half the value here—they teach you what broke.
1) Validate API-like inputs with a model
Define a schema using BaseModel. Add constraints with Field().
Decide what to do with extra keys. For APIs, extra="forbid" is a solid default:
unexpected fields don’t quietly sneak into your system.
from datetime import datetime, timezone

from pydantic import BaseModel, ConfigDict, EmailStr, Field, ValidationError, field_validator

class UserCreate(BaseModel):
    model_config = ConfigDict(extra="forbid")

    email: EmailStr
    age: int = Field(ge=13, le=120)
    display_name: str = Field(min_length=2, max_length=40)
    # datetime.utcnow() is deprecated; use an aware UTC timestamp instead
    signup_ts: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("display_name")
    @classmethod
    def no_blank_names(cls, v: str) -> str:
        v2 = v.strip()
        if not v2:
            raise ValueError("display_name cannot be blank")
        return v2
good = {"email": "sam@example.com", "age": 29, "display_name": " Sam "}
user = UserCreate.model_validate(good)
print(user.model_dump())

bad = {"email": "nope", "age": 9, "display_name": " ", "admin": True}
try:
    UserCreate.model_validate(bad)
except ValidationError as e:
    print(e)
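If you want the failures as data rather than text (to return them from an API, say), ValidationError also exposes them in structured form:

try:
    UserCreate.model_validate(bad)
except ValidationError as e:
    print(e.errors())  # list of dicts with loc, msg, type, and the offending input
    print(e.json())    # the same details as a JSON string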
2) Cross-field rules (the “business logic” part)
Some rules need multiple fields: date ranges, min/max pairs, or “if A is set then B must be set.”
That’s what model_validator is for.
from datetime import datetime

from pydantic import BaseModel, ValidationError, model_validator

class DateRange(BaseModel):
    start: datetime
    end: datetime

    @model_validator(mode="after")
    def validate_range(self) -> "DateRange":
        if self.end <= self.start:
            raise ValueError("end must be after start")
        return self
try:
    DateRange.model_validate({"start": "2025-01-02T10:00:00", "end": "2025-01-02T09:00:00"})
except ValidationError as e:
    print(e)
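The "if A is set then B must be set" pattern works the same way. A hypothetical sketch (the Discount model and its fields are invented for illustration):

from typing import Optional

from pydantic import BaseModel, model_validator

class Discount(BaseModel):
    percent_off: Optional[float] = None
    coupon_code: Optional[str] = None

    @model_validator(mode="after")
    def code_required_for_discount(self) -> "Discount":
        # if a discount is requested, a coupon code must accompany it
        if self.percent_off is not None and self.coupon_code is None:
            raise ValueError("coupon_code is required when percent_off is set")
        return self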
3) Data pipeline validation (CSV/JSON rows)
This is the sweet spot for analytics and ETL: validate each row, keep the good ones, and collect structured errors for the bad ones. You can also make fields strict when you want problems to be loud.
from pydantic import BaseModel, ConfigDict, Field, StrictInt, ValidationError

class SalesRow(BaseModel):
    model_config = ConfigDict(extra="ignore")

    order_id: StrictInt  # strict: a string like "1002" is rejected, not coerced
    sku: str = Field(min_length=3)
    quantity: int = Field(gt=0)
    unit_price: float = Field(gt=0)

def validate_rows(rows):
    valid, errors = [], []
    for i, raw in enumerate(rows):
        try:
            valid.append(SalesRow.model_validate(raw))
        except ValidationError as e:
            errors.append({"row_index": i, "issues": e.errors(), "input": raw})
    return valid, errors
extra="ignore" here?
Pipelines often get “bonus columns” over time. Ignoring unknown keys keeps you resilient while you decide what to do with them. For APIs, I’m stricter.
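A quick run with invented sample rows:

rows = [
    {"order_id": 1001, "sku": "ABC-1", "quantity": 2, "unit_price": 9.99, "region": "EU"},  # extra key ignored
    {"order_id": "1002", "sku": "ABC-2", "quantity": 1, "unit_price": 4.50},  # StrictInt rejects the string id
    {"order_id": 1003, "sku": "A", "quantity": 0, "unit_price": 4.50},  # sku too short, quantity not > 0
]

valid, errors = validate_rows(rows)
print(f"{len(valid)} valid rows")  # 1 valid row
for err in errors:
    print(err["row_index"], [issue["msg"] for issue in err["issues"]])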
4) Typed settings from environment variables
Pydantic Settings lets you define config once and load it from environment variables or a .env file.
It’s a great way to stop “config as string soup.”
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    app_name: str = "MyApp"
    database_url: str  # no default: must come from the environment or .env
    timeout_seconds: int = Field(default=10, ge=1, le=120)

settings = AppSettings()
print(settings.model_dump())
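By default, field names map to environment variables case-insensitively, so DATABASE_URL feeds database_url. A sketch (the values are just examples):

import os

os.environ["DATABASE_URL"] = "postgresql://localhost/app"
os.environ["TIMEOUT_SECONDS"] = "30"  # env vars are strings; Pydantic coerces and bounds-checks them

settings = AppSettings()
print(settings.timeout_seconds)  # 30, validated against ge=1, le=120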
Good default for APIs: extra="forbid"
- Reject surprises early
- Cleaner contracts

Good default for pipelines: extra="ignore"
- More resilient to upstream changes
- Collect errors without stopping everything
Benefits
- Fast feedback: clear ValidationError messages and structured details.
- Fewer mystery bugs: stop bad data at the door instead of debugging it later.
- Better contracts: models can produce JSON Schema for docs and interoperability (see the snippet after this list).
- Speed: Pydantic v2 validation is powered by a Rust core.
- Cleaner code: rules live next to the model, not scattered across your project.
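Generating that schema is one method call; reusing UserCreate from the first example:

import json

print(json.dumps(UserCreate.model_json_schema(), indent=2))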
Practical use cases
- API validation: validate request bodies and return friendly errors automatically (FastAPI does this well; see the sketch after this list).
- Analytics pipelines: validate rows, keep good data, log structured errors for the rest.
- Config management: typed settings from environment variables, with bounds checking.
- Data contracts: generate JSON Schema to share “what we expect” with other teams.
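For the API case, a minimal sketch of what FastAPI does with a Pydantic model (assuming FastAPI is installed, reusing UserCreate from the first example; the route path is made up):

from fastapi import FastAPI

app = FastAPI()

@app.post("/users")
def create_user(user: UserCreate):
    # FastAPI validates the JSON body against UserCreate for you;
    # invalid payloads get a 422 response with the structured error details
    return user.model_dump()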