A Quick Python Script to Extract Emails From a Web Page
Sometimes you land on a website and think: I know there’s a contact email here somewhere… but it’s buried in a footer, hidden on an “About” page, or lost in a long block of text. If you do research, vendor reviews, outreach, or you’re just trying to save time, copying and pasting around a page gets old fast.
This short Python script automates that step. You give it a URL, it downloads the page, pulls out the readable text, then returns any email addresses it finds.
What the script does
At a high level, it does four things:
- Fetches the page using requests
- Parses the HTML and extracts visible body text using selectolax
- Searches for email patterns with a regular expression (regex)
- Removes duplicates so you get a clean list of unique results
The output is a simple list like:
['info@company.com', 'support@company.com']
Why this is useful
This is a small script, but it solves a real problem quickly. Here are a few practical ways to use it:
- Quick contact discovery: Find emails on a contact/about page without hunting through the layout.
- Research: Speed up vendor checks, partner research, or general due diligence.
- Automation: Plug it into a workflow that checks multiple URLs and logs results to a file.
- Public-data collection: Build a list of publicly available contact emails (responsibly).
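As a sketch of the automation use case, here is one way the "log results to a file" step could look. The results dict below is hypothetical sample data standing in for what a function like the one in this article would return per URL:

```python
import csv

# Hypothetical results, e.g. collected by calling extract_emails() on each URL
results = {
    "https://example.com": ["info@example.com"],
    "https://example.org/contact": ["hello@example.org", "sales@example.org"],
}

with open("emails.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "email"])  # header row
    for url, emails in results.items():
        for email in emails:
            writer.writerow([url, email])  # one row per (url, email) pair
```

This produces a flat CSV that's easy to open in a spreadsheet or feed into the next step of a pipeline.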
Install the dependencies
You’ll need two libraries:
```shell
pip install requests selectolax
```
The script (cleaned up, with comments)

```python
import re
from typing import List, Set

import requests
from selectolax.parser import HTMLParser

# Compile the regex once (faster + clearer than compiling every call).
# This matches common email formats like: name@example.com
EMAIL_REGEX = re.compile(
    r"[a-zA-Z0-9._%+-]+@"  # local part
    r"[a-zA-Z0-9.-]+\."    # domain + dot
    r"[a-zA-Z]{2,}"        # top-level domain (simple form)
)


def extract_emails(url: str, timeout_seconds: int = 10) -> List[str]:
    """
    Fetch a web page and extract unique email-like strings from the visible body text.

    Args:
        url: The page URL to fetch.
        timeout_seconds: Network timeout for the HTTP request.

    Returns:
        A sorted list of unique emails found (empty list if none are found).
    """
    headers = {
        # Helps avoid basic blocks that reject default Python requests
        "User-Agent": "Mozilla/5.0 (email-extractor script)"
    }

    try:
        # Download the page
        response = requests.get(url, headers=headers, timeout=timeout_seconds)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # Covers timeout, DNS failure, invalid URL, HTTP errors, etc.
        print(f"Request failed for {url}: {exc}")
        return []

    # Parse HTML into a tree
    tree = HTMLParser(response.text)

    # If there's no <body> (malformed HTML), there's nothing to scan
    if tree.body is None:
        return []

    # Extract visible text from the body and scan it for emails
    visible_text = tree.body.text()
    emails: Set[str] = set(EMAIL_REGEX.findall(visible_text))

    # Return a stable order
    return sorted(emails)


if __name__ == "__main__":
    # Example usage:
    print(extract_emails("https://example.com"))
```
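If you want to sanity-check the regex before pointing the script at a live page, you can run it against a plain string. The sample text below is just an illustration:

```python
import re

# Same pattern as in the script above
EMAIL_REGEX = re.compile(
    r"[a-zA-Z0-9._%+-]+@"  # local part
    r"[a-zA-Z0-9.-]+\."    # domain + dot
    r"[a-zA-Z]{2,}"        # top-level domain (simple form)
)

sample = "Contact info@company.com or support@company.com for help."
matches = sorted(set(EMAIL_REGEX.findall(sample)))
print(matches)
# → ['info@company.com', 'support@company.com']
```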
How to use it
Replace the URL with a page where you expect contact info:
```python
print(extract_emails("https://somesite.com/contact"))
```
If you have multiple pages, loop through a list:
```python
urls = [
    "https://example.com",
    "https://example.org/contact",
]

for url in urls:
    emails = extract_emails(url)
    print(url, "->", emails)
```
Limitations (so you’re not surprised)
This script searches only the visible body text. That keeps it simple and fast, but it also means:
- It won’t detect emails inside images.
- It may miss emails that only appear in JavaScript or embedded JSON.
- Some sites intentionally obfuscate emails (for example: name [at] domain [dot] com).
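That last [at]/[dot] style can often be normalized back into a plain address before scanning. Here is a minimal sketch that handles only that one pattern (real obfuscation comes in many more forms):

```python
import re

def deobfuscate(text: str) -> str:
    """Rewrite the common 'name [at] domain [dot] com' style into a plain address."""
    text = re.sub(r"\s*\[\s*at\s*\]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*\[\s*dot\s*\]\s*", ".", text, flags=re.IGNORECASE)
    return text

print(deobfuscate("name [at] domain [dot] com"))
# → name@domain.com
```

Running deobfuscate() on the page text before the email regex lets the same pattern pick up these disguised addresses.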
If you want more coverage, common upgrades are checking mailto: links, handling obfuscation, and crawling internal pages like /contact or /about.
Use responsibly
Only collect emails that are publicly available and relevant to your purpose. Respect site terms and avoid hammering websites with repeated requests.
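One simple way to avoid hammering a site when looping over many URLs is to pause between requests. A small helper sketch (the function name and delay value are illustrative):

```python
import time

def polite_iter(items, delay_seconds=1.0):
    """Yield items one at a time, sleeping between each to space out requests."""
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every item after the first
        yield item

for url in polite_iter(["https://example.com", "https://example.org/contact"]):
    print(url)  # in the real workflow, call extract_emails(url) here
```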
Wrap-up
This isn’t a full-blown crawler. It’s a small, practical utility that saves time when you need to quickly pull email addresses from a page. If you’re doing research or building a simple automation workflow, it’s a clean starting point—and easy to extend later.