A Quick Python Script to Extract Emails From a Web Page

Sometimes you land on a website and think: I know there’s a contact email here somewhere… but it’s buried in a footer, hidden on an “About” page, or lost in a long block of text. Whether you do research, vendor reviews, or outreach, or you’re just trying to save time, copy-pasting your way around a page gets old fast.

This short Python script automates that step. You give it a URL; it downloads the page, pulls out the readable text, and returns any email addresses it finds.


What the script does

At a high level, it does four things:

  1. Fetches the page using requests
  2. Parses the HTML and extracts visible body text using selectolax
  3. Searches for email patterns with a regular expression (regex)
  4. Removes duplicates so you get a clean list of unique results

The output is a simple list like:

['info@company.com', 'support@company.com']

Why this is useful

This is a small script, but it solves a real problem quickly. Here are a few practical ways to use it:

  • Quick contact discovery: Find emails on a contact/about page without hunting through the layout.
  • Research: Speed up vendor checks, partner research, or general due diligence.
  • Automation: Plug it into a workflow that checks multiple URLs and logs results to a file.
  • Public-data collection: Build a list of publicly available contact emails (responsibly).

Install the dependencies

You’ll need two libraries:

pip install requests selectolax

The script (cleaned up, with comments)

import re
from typing import List, Set

import requests
from selectolax.parser import HTMLParser

# Compile the regex once (faster + clearer than compiling every call).
# This matches common email formats like: name@example.com
EMAIL_REGEX = re.compile(
    r"[a-zA-Z0-9._%+-]+@"          # local part
    r"[a-zA-Z0-9.-]+\."            # domain + dot
    r"[a-zA-Z]{2,}"                # top-level domain (simple form)
)


def extract_emails(url: str, timeout_seconds: int = 10) -> List[str]:
    """
    Fetch a web page and extract unique email-like strings from the visible body text.

    Args:
        url: The page URL to fetch.
        timeout_seconds: Network timeout for the HTTP request.

    Returns:
        A sorted list of unique emails found (empty list if none are found).
    """
    headers = {
        # Helps avoid basic blocks that reject default Python requests
        "User-Agent": "Mozilla/5.0 (email-extractor script)"
    }

    try:
        # Download the page
        response = requests.get(url, headers=headers, timeout=timeout_seconds)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # Covers timeout, DNS failure, invalid URL, HTTP errors, etc.
        print(f"Request failed for {url}: {exc}")
        return []

    # Parse HTML into a tree
    tree = HTMLParser(response.text)

    # If there's no <body> (malformed HTML), there's nothing to scan
    if tree.body is None:
        return []

    # Drop script/style content so we only scan text a visitor would actually see
    tree.strip_tags(["script", "style"])

    # Extract visible text from the body (space-separated so text from
    # adjacent elements doesn't run together) and scan it for emails
    visible_text = tree.body.text(separator=" ")
    emails: Set[str] = set(EMAIL_REGEX.findall(visible_text))

    # Return a stable order
    return sorted(emails)


if __name__ == "__main__":
    # Example usage:
    print(extract_emails("https://example.com"))

How to use it

Replace the URL with a page where you expect contact info:

print(extract_emails("https://somesite.com/contact"))

If you have multiple pages, loop through a list:

urls = [
    "https://example.com",
    "https://example.org/contact",
]

for url in urls:
    emails = extract_emails(url)
    print(url, "->", emails)
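If you want to keep the results (the automation use case mentioned earlier), one minimal option is appending rows to a CSV file. Here’s a sketch that reuses the urls list from the snippet above; the filename found_emails.csv is just a placeholder:

import csv

with open("found_emails.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        for email in extract_emails(url):
            # One row per (url, email) pair
            writer.writerow([url, email])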

Limitations (so you’re not surprised)

This script searches only the visible body text. That keeps it simple and fast, but it also means:

  • It won’t detect emails inside images.
  • It may miss emails that only appear in JavaScript or embedded JSON.
  • Some sites intentionally obfuscate emails (example: name [at] domain [dot] com).

If you want more coverage, common upgrades are: checking mailto: links, handling obfuscation, and crawling internal pages like /contact or /about.
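As a taste of the first two upgrades, here’s a rough sketch (not part of the script above) that pulls addresses out of mailto: links and undoes the common [at]/[dot] obfuscation before scanning. The function names are my own, and it reuses EMAIL_REGEX and the imports from the main script:

from urllib.parse import unquote


def emails_from_mailto_links(tree: HTMLParser) -> Set[str]:
    """Collect addresses from <a href="mailto:..."> links."""
    found: Set[str] = set()
    for node in tree.css("a"):
        href = node.attributes.get("href") or ""
        if not href.lower().startswith("mailto:"):
            continue
        # Drop the scheme and any ?subject=... query, then decode %-escapes
        address = unquote(href[len("mailto:"):].split("?", 1)[0]).strip()
        if EMAIL_REGEX.fullmatch(address):
            found.add(address)
    return found


def deobfuscate(text: str) -> str:
    """Turn 'name [at] domain [dot] com' into 'name@domain.com' before scanning."""
    text = re.sub(r"\s*\[\s*at\s*\]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*\[\s*dot\s*\]\s*", ".", text, flags=re.IGNORECASE)
    return text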


Use responsibly

Only collect emails that are publicly available and relevant to your purpose. Respect site terms and avoid hammering websites with repeated requests.
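In practice, “not hammering” can be as simple as pausing between requests when you loop over many URLs. A minimal sketch; the two-second delay is an arbitrary choice:

import time

for url in urls:
    print(url, "->", extract_emails(url))
    time.sleep(2)  # be polite: small pause between requests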


Wrap-up

This isn’t a full-blown crawler. It’s a small, practical utility that saves time when you need to quickly pull email addresses from a page. If you’re doing research or building a simple automation workflow, it’s a clean starting point—and easy to extend later.
