Practical Web Scraping with Python: A Clean, Safe Pattern for Pulling Company Names & Emails
Turn that quick-and-dirty script into a reliable tool you won’t be afraid to run twice.
The problem
You’ve got a page of companies and you want the name and email for each one. The “first pass” script might work on your machine, but it’s brittle: no retries, no timeouts, no robots.txt check, and it assumes every email is a mailto: link.
The cleaned, safe solution
Below is a production-friendly pattern that:
- Uses a requests.Session with retries, backoff, and a real User-Agent
- Sets sane timeouts and handles common HTTP errors
- Respects robots.txt (and tells you if scraping is disallowed)
- Parses only mailto: links by default to avoid scraping personal data you shouldn’t
- Handles pagination with a “Next” link when present
- Exports to CSV
- Can be run from the command line with arguments
Tip: Treat scraping like you’d treat a database: be polite, predictable, and log what you’re doing.
Copy-paste code
Drop this into scrape_companies.py and run it from your terminal.
#!/usr/bin/env python3
"""
Purpose:
    Scrape company names and emails from a listing page using Requests + BeautifulSoup.
    Built for reliability: retries, timeouts, robots.txt check, pagination, and CSV export.

Parameters:
    --url        (str)   Root listing URL (e.g., https://example.com/companies)
    --out        (str)   Output CSV file path (default: companies.csv)
    --delay      (float) Seconds to sleep between requests (default: 0.8)
    --user-agent (str)   Optional custom User-Agent header
    --max-pages  (int)   Optional page cap for safety (default: 50)

Dependencies:
    pip install requests beautifulsoup4 urllib3

Usage notes:
    • Respects robots.txt. If disallowed, it will stop and tell you why.
    • Extracts only mailto: links for emails by default.
    • If your site uses a different structure, adjust CSS selectors in parse_company().
"""
from __future__ import annotations

import argparse
import csv
import logging
import re
import time
import urllib.robotparser as robotparser
from dataclasses import dataclass
from typing import Iterable, List, Optional
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup, Tag
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/122.0.0.0 Safari/537.36"
)


@dataclass
class Company:
    name: str
    email: Optional[str]
    page_url: Optional[str] = None


def build_session(user_agent: Optional[str] = None) -> requests.Session:
    """Create a requests session with retries and sensible defaults."""
    s = requests.Session()
    s.headers.update({"User-Agent": user_agent or DEFAULT_UA})
    retry = Retry(
        total=5,
        backoff_factor=0.5,  # exponential backoff (0.5, 1, 2, 4…)
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    return s


def is_allowed_by_robots(root_url: str, user_agent: str) -> bool:
    """Check robots.txt for permission to crawl the path of root_url."""
    parsed = urlparse(root_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    try:
        rp.set_url(robots_url)
        rp.read()
    except Exception:
        # If robots.txt is unreachable, warn and proceed; return False here
        # instead if you would rather abort when permission can't be verified.
        logging.warning("Could not fetch robots.txt (%s). Proceeding cautiously.", robots_url)
        return True
    return rp.can_fetch(user_agent, root_url)


def get_soup(session: requests.Session, url: str, timeout: float = 15.0) -> BeautifulSoup:
    """GET a page with timeout, raise for connection errors, return BeautifulSoup."""
    try:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        raise RuntimeError(f"Request failed for {url}: {exc}") from exc
    return BeautifulSoup(resp.text, "html.parser")


def extract_mailto(anchor: Tag) -> Optional[str]:
    """Return the email address from a mailto: link if present, else None."""
    href = anchor.get("href", "")
    if href.startswith("mailto:"):
        # Strip 'mailto:' and any query params (?subject=...)
        email = href[7:].split("?")[0].strip()
        # Basic sanity check
        if re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            return email
    return None


def parse_company(card: Tag, base_url: str) -> Company:
    """
    Parse a single company card.
    Adjust selectors here to fit the target site's HTML.
    """
    # Defensive lookups: avoid AttributeError if elements are missing
    name_tag = card.find(["h2", "h3"], string=True)
    name = name_tag.get_text(strip=True) if name_tag else "Unknown"

    # Prefer explicit mailto links
    email: Optional[str] = None
    for a in card.find_all("a", href=True):
        email = extract_mailto(a)
        if email:
            break

    # Capture a per-company page link if available (nice for debugging)
    detail_link = None
    possible = card.find("a", href=True)
    if possible:
        detail_link = urljoin(base_url, possible["href"])

    return Company(name=name, email=email, page_url=detail_link)


def find_next_url(soup: BeautifulSoup, base_url: str) -> Optional[str]:
    """
    Try a few common patterns for a 'Next' pagination link.
    Adjust as needed for your site.
    """
    # rel=next is ideal
    rel_next = soup.find("a", rel=lambda v: v and "next" in v)
    if rel_next and rel_next.get("href"):
        return urljoin(base_url, rel_next["href"])

    # aria-label or text
    candidates = soup.find_all("a", href=True)
    for a in candidates:
        text = (a.get_text() or "").strip().lower()
        if text in {"next", "next »", "›", "older"}:
            return urljoin(base_url, a["href"])
    return None


def scrape_companies(root_url: str, delay: float, user_agent: str, max_pages: int) -> List[Company]:
    """Scrape all pages starting from root_url, returning a list of Company rows."""
    if not is_allowed_by_robots(root_url, user_agent):
        raise PermissionError(
            f"robots.txt disallows scraping: {root_url}. Aborting to be polite."
        )

    session = build_session(user_agent=user_agent)
    results: List[Company] = []
    current = root_url
    pages_seen = 0

    while current and pages_seen < max_pages:
        pages_seen += 1
        logging.info("Fetching page %d: %s", pages_seen, current)
        soup = get_soup(session, current)

        # Adjust the selector to whatever the site uses for company containers
        cards = soup.find_all("div", class_="company")
        logging.info("Found %d company cards on this page.", len(cards))

        for card in cards:
            company = parse_company(card, current)
            results.append(company)

        # Look for the next page, if any
        nxt = find_next_url(soup, current)
        if nxt == current:
            nxt = None  # avoid infinite loop if a site links to itself
        current = nxt
        time.sleep(delay)  # be polite

    # If a next page was still queued, we stopped because of the page cap
    if current is not None:
        logging.warning("Stopped after max-pages=%d for safety.", max_pages)
    return results


def write_csv(rows: Iterable[Company], out_path: str) -> None:
    """Write results to CSV with stable columns."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "page_url"])
        writer.writeheader()
        for r in rows:
            writer.writerow({"name": r.name, "email": r.email or "", "page_url": r.page_url or ""})


def main() -> None:
    parser = argparse.ArgumentParser(description="Scrape company names and emails.")
    parser.add_argument("--url", required=True, help="Listing URL, e.g. https://example.com/companies")
    parser.add_argument("--out", default="companies.csv", help="Output CSV path")
    parser.add_argument("--delay", type=float, default=0.8, help="Delay between requests (seconds)")
    parser.add_argument("--user-agent", default=None, help="Custom User-Agent header")
    parser.add_argument("--max-pages", type=int, default=50, help="Safety cap on number of pages")
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%H:%M:%S",
    )

    try:
        rows = scrape_companies(
            root_url=args.url,
            delay=args.delay,
            user_agent=args.user_agent or DEFAULT_UA,
            max_pages=args.max_pages,
        )
        write_csv(rows, args.out)
        logging.info("Wrote %d rows to %s", len(rows), args.out)
    except PermissionError as e:
        logging.error(str(e))
    except Exception as e:
        logging.exception("Unexpected error: %s", e)


if __name__ == "__main__":
    main()
Heads-up: This script intentionally extracts only
mailto: emails. Parsing emails from free-form text can cross legal or ethical lines depending on jurisdiction and site terms. If you choose to expand it, talk to your legal team first.
How it works
Reliability
- Retries with backoff: Handles flaky networks and 5xx errors.
- Timeouts: Prevents your script from hanging forever.
- Logging: See what happened without sprinkling print everywhere.
Safety & respect
- robots.txt: Checked up front; if disallowed, we stop.
- Polite pacing: Configurable delay between requests.
- Scoped parsing: Only mailto: links by default.
Where to tweak
- Container selector: Change soup.find_all("div", class_="company") to match the site (see the sketch after this list).
- Name selector: Update card.find(["h2", "h3"], string=True) if names live elsewhere.
- Pagination: Extend find_next_url() for custom next/prev patterns.
- Output: Add fields (phone, address) and write more columns to CSV.
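For instance, if a (hypothetical) target site wrapped each listing in an li with class "listing-item" and put the name in a span with class "biz-name", the changes stay local to the container lookup and parse_company(). A minimal sketch under that assumption:

# Hypothetical markup: <li class="listing-item"> cards with <span class="biz-name"> names.
# Only the selectors change; the rest of the script stays the same.
cards = soup.find_all("li", class_="listing-item")    # was: soup.find_all("div", class_="company")

def parse_company(card: Tag, base_url: str) -> Company:
    name_tag = card.find("span", class_="biz-name")   # was: card.find(["h2", "h3"], string=True)
    name = name_tag.get_text(strip=True) if name_tag else "Unknown"
    email = next((e for e in (extract_mailto(a) for a in card.find_all("a", href=True)) if e), None)
    link = card.find("a", href=True)
    return Company(name=name, email=email, page_url=urljoin(base_url, link["href"]) if link else None)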
How to run it
1) Install dependencies
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4 urllib3
2) Run against your listing page
python scrape_companies.py \
--url "https://example.com/companies" \
--out companies.csv \
--delay 1.0 \
--max-pages 25
Quick test trick: Save an HTML file locally that mimics the target page. Serve it with
python -m http.server and point --url to http://localhost:8000/test.html. Faster feedback, zero risk.
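One way to build that fixture is a short Python snippet. The markup below matches the script's default selectors (div.company cards, an h3 name, a mailto: link); the company names and .example addresses are invented for the test:

# make_fixture.py - write a tiny page that the default selectors can parse.
from pathlib import Path

Path("test.html").write_text(
    """<html><body>
      <div class="company">
        <h3>Acme Rockets</h3>
        <a href="mailto:hello@acme.example">Email us</a>
      </div>
      <div class="company">
        <h3>Globex Corp</h3>
        <a href="mailto:info@globex.example">Contact</a>
      </div>
    </body></html>""",
    encoding="utf-8",
)

Running the scraper against http://localhost:8000/test.html should then produce two rows in companies.csv.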
Practical use cases
- Vendor directory export: Your company lists partners publicly but offers no export. Pull names and public contact emails to reconcile in your CRM.
- Event sponsors list: Conferences often have sponsor grids. Build a CSV to match sponsors to sponsor tiers or internal owners.
- Data quality checks: If you own the site, run the scraper in CI to catch broken links or missing emails after content updates.
- Research projects: For academic or internal analysis, collect public company names into a dataset (no outreach).
Compliance reminder: Outreach rules vary (CAN-SPAM, GDPR, etc.). Public doesn’t always mean “free to harvest for marketing.” Use responsibly.
Pitfalls & guardrails
Common gotchas
- Assuming every email is a mailto: link (it won’t be)
- Hard-coding brittle selectors that break on minor redesigns
- Skipping timeouts and hanging the process
- Ignoring pagination and only scraping the first page
Better patterns
- Wrap all network calls with retries + backoff
- Log page counts and result totals for sanity checks (see the sketch after this list)
- Keep selectors together with comments on what they match
- Limit with --max-pages during testing
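To make the sanity-check guardrail concrete, a post-run check along these lines (a sketch; the helper name and messages are made up) catches the classic silent failure of a selector that suddenly matches nothing:

def sanity_check(rows: List[Company]) -> None:
    # Hypothetical post-run check: warn loudly when the scrape looks empty or
    # when no mailto: links were found, which usually means a selector broke.
    total = len(rows)
    with_email = sum(1 for r in rows if r.email)
    logging.info("Scraped %d companies, %d with emails.", total, with_email)
    if total == 0:
        logging.warning("No company cards found: the container selector probably changed.")
    elif with_email == 0:
        logging.warning("No mailto: links found: the site may not expose emails this way.")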
Variations & alternatives
- Async for big crawls: If you’re crawling many pages you own, consider httpx + asyncio + a parser like selectolax. Keep rate limits and robots.txt respect in place.
- Structured export: Swap CSV for JSONL if you want line-by-line processing downstream (see the sketch after this list).
- Headless browser: If content is JavaScript-rendered, use Playwright. It’s heavier but handles SPAs. Still be polite.
- Richer fields: Extend Company to include phone, address, and a normalized domain. Add validation and dedupe before export.
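For the JSONL variant, a drop-in alternative to write_csv() might look like this (a sketch; write_jsonl is a made-up name, and it reuses the Company dataclass and Iterable import from the script):

import json
from dataclasses import asdict

def write_jsonl(rows: Iterable[Company], out_path: str) -> None:
    # One JSON object per line, so downstream tools can stream the file.
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(asdict(r), ensure_ascii=False) + "\n")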
What to try next
- Point the script at a page you control to validate selectors.
- Add a simple domain column using urlparse on each page_url (see the sketch after this list).
- Pipe the CSV into your analytics stack (Power BI, Python, or SQL) for reporting.
- Wrap it with a weekly job and alert on large diffs to detect site changes.
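The domain column is a small helper plus one extra CSV field; a sketch, assuming it is computed from page_url at write time (domain_of is a made-up name):

from urllib.parse import urlparse

def domain_of(company: Company) -> str:
    # Derive the bare host from the per-company page link, if one was captured.
    return urlparse(company.page_url).netloc if company.page_url else ""

You could then add "domain" to the DictWriter fieldnames in write_csv() and write domain_of(r) for each row.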