Practical Web Scraping with Python: A Clean, Safe Pattern for Pulling Company Names & Emails
Turn that quick-and-dirty script into a reliable tool you won’t be afraid to run twice.
The problem
You’ve got a page of companies and you want the name and email for each one. The “first pass” script might work on your machine, but it’s brittle: no retries, no timeouts, no robots.txt check, and it assumes every email is a mailto: link.
The cleaned, safe solution
Below is a production-friendly pattern that:
- Uses a requests.Session with retries, backoff, and a real User-Agent
- Sets sane timeouts and handles common HTTP errors
- Respects robots.txt (and tells you if scraping is disallowed)
- Parses only mailto: links by default to avoid scraping personal data you shouldn’t
- Handles pagination with a “Next” link when present
- Exports to CSV
- Can be run from the command line with arguments
Tip: Treat scraping like you’d treat a database: be polite, predictable, and log what you’re doing.
Copy-paste code
Drop this into scrape_companies.py and run it from your terminal.
#!/usr/bin/env python3
"""
Purpose:
    Scrape company names and emails from a listing page using Requests + BeautifulSoup.
    Built for reliability: retries, timeouts, robots.txt check, pagination, and CSV export.

Parameters:
    --url        (str)   Root listing URL (e.g., https://example.com/companies)
    --out        (str)   Output CSV file path (default: companies.csv)
    --delay      (float) Seconds to sleep between requests (default: 0.8)
    --user-agent (str)   Optional custom User-Agent header
    --max-pages  (int)   Optional page cap for safety (default: 50)

Dependencies:
    pip install requests beautifulsoup4 urllib3

Usage notes:
    • Respects robots.txt. If disallowed, it will stop and tell you why.
    • Extracts only mailto: links for emails by default.
    • If your site uses a different structure, adjust CSS selectors in parse_company().
"""
from __future__ import annotations

import argparse
import csv
import logging
import re
import time
import urllib.robotparser as robotparser
from dataclasses import dataclass
from typing import Iterable, List, Optional
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup, Tag
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/122.0.0.0 Safari/537.36"
)


@dataclass
class Company:
    name: str
    email: Optional[str]
    page_url: Optional[str] = None


def build_session(user_agent: Optional[str] = None) -> requests.Session:
    """Create a requests session with retries and sensible defaults."""
    s = requests.Session()
    s.headers.update({"User-Agent": user_agent or DEFAULT_UA})
    retry = Retry(
        total=5,
        backoff_factor=0.5,  # exponential backoff (0.5, 1, 2, 4…)
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    return s


def is_allowed_by_robots(root_url: str, user_agent: str) -> bool:
    """Check robots.txt for permission to crawl the path of root_url."""
    parsed = urlparse(root_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    try:
        rp.set_url(robots_url)
        rp.read()
    except Exception:
        # If robots.txt is unreachable, warn and proceed; return False here
        # instead if you would rather abort when permission can't be verified.
        logging.warning("Could not fetch robots.txt (%s). Proceeding cautiously.", robots_url)
        return True
    return rp.can_fetch(user_agent, root_url)


def get_soup(session: requests.Session, url: str, timeout: float = 15.0) -> BeautifulSoup:
    """GET a page with timeout, raise for connection errors, return BeautifulSoup."""
    try:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        raise RuntimeError(f"Request failed for {url}: {exc}") from exc
    return BeautifulSoup(resp.text, "html.parser")


def extract_mailto(anchor: Tag) -> Optional[str]:
    """Return the email address from a mailto: link if present, else None."""
    href = anchor.get("href", "")
    if href.startswith("mailto:"):
        # Strip 'mailto:' and any query params (?subject=...)
        email = href[7:].split("?")[0].strip()
        # Basic sanity check
        if re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            return email
    return None


def parse_company(card: Tag, base_url: str) -> Company:
    """
    Parse a single company card.
    Adjust selectors here to fit the target site's HTML.
    """
    # Defensive lookups: avoid AttributeError if elements are missing
    name_tag = card.find(["h2", "h3"], string=True)
    name = name_tag.get_text(strip=True) if name_tag else "Unknown"

    # Prefer explicit mailto links
    email: Optional[str] = None
    for a in card.find_all("a", href=True):
        email = extract_mailto(a)
        if email:
            break

    # Capture a per-company page link if available (nice for debugging)
    detail_link = None
    possible = card.find("a", href=True)
    if possible:
        detail_link = urljoin(base_url, possible["href"])

    return Company(name=name, email=email, page_url=detail_link)


def find_next_url(soup: BeautifulSoup, base_url: str) -> Optional[str]:
    """
    Try a few common patterns for a 'Next' pagination link.
    Adjust as needed for your site.
    """
    # rel=next is ideal
    rel_next = soup.find("a", rel=lambda v: v and "next" in v)
    if rel_next and rel_next.get("href"):
        return urljoin(base_url, rel_next["href"])

    # aria-label or text
    candidates = soup.find_all("a", href=True)
    for a in candidates:
        text = (a.get_text() or "").strip().lower()
        if text in {"next", "next »", "›", "older"}:
            return urljoin(base_url, a["href"])
    return None


def scrape_companies(root_url: str, delay: float, user_agent: str, max_pages: int) -> List[Company]:
    """Scrape all pages starting from root_url, returning a list of Company rows."""
    if not is_allowed_by_robots(root_url, user_agent):
        raise PermissionError(
            f"robots.txt disallows scraping: {root_url}. Aborting to be polite."
        )

    session = build_session(user_agent=user_agent)
    results: List[Company] = []
    current = root_url
    pages_seen = 0

    while current and pages_seen < max_pages:
        pages_seen += 1
        logging.info("Fetching page %d: %s", pages_seen, current)
        soup = get_soup(session, current)

        # Adjust the selector to whatever the site uses for company containers
        cards = soup.find_all("div", class_="company")
        logging.info("Found %d company cards on this page.", len(cards))

        for card in cards:
            company = parse_company(card, current)
            results.append(company)

        # Look for the next page, if any
        nxt = find_next_url(soup, current)
        if nxt == current:
            nxt = None  # avoid infinite loop if a site links to itself
        current = nxt
        time.sleep(delay)  # be polite

    # If a next page was still queued, we stopped because of the page cap
    if current is not None:
        logging.warning("Stopped after max-pages=%d for safety.", max_pages)
    return results


def write_csv(rows: Iterable[Company], out_path: str) -> None:
    """Write results to CSV with stable columns."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "page_url"])
        writer.writeheader()
        for r in rows:
            writer.writerow({"name": r.name, "email": r.email or "", "page_url": r.page_url or ""})


def main() -> None:
    parser = argparse.ArgumentParser(description="Scrape company names and emails.")
    parser.add_argument("--url", required=True, help="Listing URL, e.g. https://example.com/companies")
    parser.add_argument("--out", default="companies.csv", help="Output CSV path")
    parser.add_argument("--delay", type=float, default=0.8, help="Delay between requests (seconds)")
    parser.add_argument("--user-agent", default=None, help="Custom User-Agent header")
    parser.add_argument("--max-pages", type=int, default=50, help="Safety cap on number of pages")
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%H:%M:%S",
    )

    try:
        rows = scrape_companies(
            root_url=args.url,
            delay=args.delay,
            user_agent=args.user_agent or DEFAULT_UA,
            max_pages=args.max_pages,
        )
        write_csv(rows, args.out)
        logging.info("Wrote %d rows to %s", len(rows), args.out)
    except PermissionError as e:
        logging.error(str(e))
    except Exception as e:
        logging.exception("Unexpected error: %s", e)


if __name__ == "__main__":
    main()
Heads-up: This script intentionally extracts only
mailto: emails. Parsing emails from free-form text can cross legal or ethical lines depending on jurisdiction and site terms. If you choose to expand it, talk to your legal team first.
How it works
Reliability
- Retries with backoff: Handles flaky networks and 5xx errors.
- Timeouts: Prevents your script from hanging forever.
- Logging: See what happened without sprinkling print everywhere.
Safety & respect
- robots.txt: Checked up front; if disallowed, we stop.
- Polite pacing: Configurable delay between requests.
- Scoped parsing: Only mailto: links by default.
Where to tweak
- Container selector: Change soup.find_all("div", class_="company") to match the site (see the sketch after this list).
- Name selector: Update card.find(["h2", "h3"], string=True) if names live elsewhere.
- Pagination: Extend find_next_url() for custom next/prev patterns.
- Output: Add fields (phone, address) and write more columns to CSV.
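For instance, if a (hypothetical) target site wrapped each listing in an li with class "listing-item" and put the name in a span with class "biz-name", the changes stay local to the container lookup and parse_company(). A minimal sketch under that assumption:

# Hypothetical markup: <li class="listing-item"> cards with <span class="biz-name"> names.
# Only the selectors change; the rest of the script stays the same.
cards = soup.find_all("li", class_="listing-item")    # was: soup.find_all("div", class_="company")

def parse_company(card: Tag, base_url: str) -> Company:
    name_tag = card.find("span", class_="biz-name")   # was: card.find(["h2", "h3"], string=True)
    name = name_tag.get_text(strip=True) if name_tag else "Unknown"
    email = next((e for e in (extract_mailto(a) for a in card.find_all("a", href=True)) if e), None)
    link = card.find("a", href=True)
    return Company(name=name, email=email, page_url=urljoin(base_url, link["href"]) if link else None)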
How to run it
1) Install dependencies
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4 urllib3
2) Run against your listing page
python scrape_companies.py \
--url "https://example.com/companies" \
--out companies.csv \
--delay 1.0 \
--max-pages 25
Quick test trick: Save an HTML file locally that mimics the target page. Serve it with
python -m http.server and point --url to http://localhost:8000/test.html. Faster feedback, zero risk.
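One way to build that fixture is a short Python snippet. The markup below matches the script's default selectors (div.company cards, an h3 name, a mailto: link); the company names and .example addresses are invented for the test:

# make_fixture.py - write a tiny page that the default selectors can parse.
from pathlib import Path

Path("test.html").write_text(
    """<html><body>
      <div class="company">
        <h3>Acme Rockets</h3>
        <a href="mailto:hello@acme.example">Email us</a>
      </div>
      <div class="company">
        <h3>Globex Corp</h3>
        <a href="mailto:info@globex.example">Contact</a>
      </div>
    </body></html>""",
    encoding="utf-8",
)

Running the scraper against http://localhost:8000/test.html should then produce two rows in companies.csv.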
Practical use cases
- Vendor directory export: Your company lists partners publicly but offers no export. Pull names and public contact emails to reconcile in your CRM.
- Event sponsors list: Conferences often have sponsor grids. Build a CSV to match sponsors to sponsor tiers or internal owners.
- Data quality checks: If you own the site, run the scraper in CI to catch broken links or missing emails after content updates.
- Research projects: For academic or internal analysis, collect public company names into a dataset (no outreach).
Compliance reminder: Outreach rules vary (CAN-SPAM, GDPR, etc.). Public doesn’t always mean “free to harvest for marketing.” Use responsibly.
Pitfalls & guardrails
Common gotchas
- Assuming every email is a mailto: link (it won’t be)
- Hard-coding brittle selectors that break on minor redesigns
- Skipping timeouts and hanging the process
- Ignoring pagination and only scraping the first page
Better patterns
- Wrap all network calls with retries + backoff
- Log page counts and result totals for sanity checks (see the sketch after this list)
- Keep selectors together with comments on what they match
- Limit with --max-pages during testing
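To make the sanity-check guardrail concrete, a post-run check along these lines (a sketch; the helper name and messages are made up) catches the classic silent failure of a selector that suddenly matches nothing:

def sanity_check(rows: List[Company]) -> None:
    # Hypothetical post-run check: warn loudly when the scrape looks empty or
    # when no mailto: links were found, which usually means a selector broke.
    total = len(rows)
    with_email = sum(1 for r in rows if r.email)
    logging.info("Scraped %d companies, %d with emails.", total, with_email)
    if total == 0:
        logging.warning("No company cards found: the container selector probably changed.")
    elif with_email == 0:
        logging.warning("No mailto: links found: the site may not expose emails this way.")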
Variations & alternatives
- Async for big crawls: If you’re crawling many pages you own, consider httpx + asyncio + a parser like selectolax. Keep rate limits and robots.txt respect in place.
- Structured export: Swap CSV for JSONL if you want line-by-line processing downstream (see the sketch after this list).
- Headless browser: If content is JavaScript-rendered, use Playwright. It’s heavier but handles SPAs. Still be polite.
- Richer fields: Extend Company to include phone, address, and a normalized domain. Add validation and dedupe before export.
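For the JSONL variant, a drop-in alternative to write_csv() might look like this (a sketch; write_jsonl is a made-up name, and it reuses the Company dataclass and Iterable import from the script):

import json
from dataclasses import asdict

def write_jsonl(rows: Iterable[Company], out_path: str) -> None:
    # One JSON object per line, so downstream tools can stream the file.
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(asdict(r), ensure_ascii=False) + "\n")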
What to try next
- Point the script at a page you control to validate selectors.
- Add a simple domain column using urlparse on each page_url (see the sketch after this list).
- Pipe the CSV into your analytics stack (Power BI, Python, or SQL) for reporting.
- Wrap it with a weekly job and alert on large diffs to detect site changes.
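The domain column is a small helper plus one extra CSV field; a sketch, assuming it is computed from page_url at write time (domain_of is a made-up name):

from urllib.parse import urlparse

def domain_of(company: Company) -> str:
    # Derive the bare host from the per-company page link, if one was captured.
    return urlparse(company.page_url).netloc if company.page_url else ""

You could then add "domain" to the DictWriter fieldnames in write_csv() and write domain_of(r) for each row.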