This project is a robust, scalable, and production-ready email extraction tool built to crawl websites, navigate internal contact pages, and extract verified, relevant email addresses.
It was developed as part of a backend engineering and automation learning journey. The tool is ideal for:
- Collecting recruiter or company emails from a list of websites.
- Building lead-generation datasets for job/internship search.
- Crawling contact pages of tech companies to gather hiring contacts.
- Researching and enriching startup databases with verified emails.
Features:
- Multi-threaded (synchronous) and asynchronous scraping modes
- Selenium-based JavaScript rendering
- Proxy rotation and user-agent spoofing
- SQLite checkpointing and deduplication (see the sketch after this list)
- Recursive crawling of contact/about/team pages
- Clean, modular Python class design
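The checkpointing feature above boils down to a local SQLite table that remembers which addresses have already been stored. Below is a minimal sketch of that idea; the table and column names (`emails`, `email`, `source_url`) are illustrative assumptions, not the tool's actual schema.

```python
# Minimal sketch of SQLite-backed deduplication/checkpointing.
# Table and column names are assumptions for illustration only.
import sqlite3

def init_db(path: str = "email_extractor.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS emails (
               email      TEXT PRIMARY KEY,   -- primary key enforces deduplication
               source_url TEXT
           )"""
    )
    return conn

def save_emails(conn: sqlite3.Connection, emails: set[str], source_url: str) -> None:
    # INSERT OR IGNORE silently skips addresses already checkpointed,
    # so resuming a crashed or re-run crawl never writes duplicate rows.
    conn.executemany(
        "INSERT OR IGNORE INTO emails (email, source_url) VALUES (?, ?)",
        [(e, source_url) for e in emails],
    )
    conn.commit()
```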
Tech stack:
- Python 3.11+
- `aiohttp`, `httpx`, `requests` – HTTP fetching (async + sync)
- `Selenium` – Rendering JS-heavy websites
- `BeautifulSoup` – HTML parsing
- `SQLite3` – Local database for caching
- `ThreadPoolExecutor` – Parallelism
- `pandas` – Export to CSV/Excel/JSON
- `dotenv` – Proxy/environment configs
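To show how these pieces typically fit together, here is a minimal fetch-and-extract sketch using `aiohttp` and `BeautifulSoup`; the regex, function name, and timeout are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of the fetch-and-extract step; names are illustrative.
import asyncio
import re

import aiohttp
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

async def extract_emails(url: str) -> set[str]:
    """Fetch one page and return the set of email addresses found in it."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            html = await resp.text()
    # Strip tags first so addresses embedded in markup are matched as plain text.
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return set(EMAIL_RE.findall(text))

if __name__ == "__main__":
    print(asyncio.run(extract_emails("https://example.com")))
```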
Concepts covered:
- Asynchronous programming with `asyncio`, `aiohttp`
- Web automation with Selenium
- Anti-blocking strategies: rotating proxies, user-agents, and delays (see the sketch after this list)
- Scalable, modular scraping architecture
- SQL-based data caching and deduplication
- CLI interface with `argparse`
- Logging and error handling
- Multi-format export (CSV, Excel, JSON)
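For the anti-blocking strategies listed above, a minimal sketch of user-agent/proxy rotation with randomized delays could look like the following; the user-agent strings, proxy list, and delay bounds are placeholder assumptions, not the tool's real configuration.

```python
# Minimal sketch of user-agent/proxy rotation with randomized delays.
# User-agent strings, proxy list, and delay bounds are placeholders.
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES: list[str] = []  # e.g. ["http://user:pass@host:port"], typically loaded from .env

def polite_get(url: str) -> requests.Response:
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = (
        {"http": random.choice(PROXIES), "https": random.choice(PROXIES)}
        if PROXIES else None
    )
    time.sleep(random.uniform(1.0, 3.0))  # random delay between requests
    return requests.get(url, headers=headers, proxies=proxies, timeout=15)
```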
Use cases:
- Job Search Automation: Extract recruiter emails from tech websites or hiring pages.
- Startup Research: Gather contact information for potential partnerships.
- Data Enrichment: Supplement scraped company data with verified emails.
- Lead Generation for B2B: Generate warm leads.
- Academic/Portfolio Projects: Demonstrate backend, scraping, and data pipeline skills.
Project structure:
email_extractor/
│
├── email_extractor.py # Main script
├── email_extractor.db # SQLite database (auto-generated)
├── output.csv # Final output of scraped emails
├── input.csv            # Input CSV containing the target URLs
├── README.md            # Project documentation
└── requirements.txt     # Python dependencies
Setup:

Windows:
python -m venv venv
.\venv\Scripts\activate

macOS/Linux:
python3 -m venv venv
source venv/bin/activate

Install dependencies:
pip install -r requirements.txt
playwright install
playwright install-deps
webdriver-manager install --drivers chrome

Input:
- The input CSV must have a column with URLs (default column name: `url`).
Usage:
python email_extractor.py input.csv
python email_extractor.py input.csv --url_column website
python email_extractor.py input.csv --domain_filter company.com
python email_extractor.py input.csv --recursive
python email_extractor.py input.csv --async_mode
python email_extractor.py input.csv --output excel
python email_extractor.py input.csv --output json

Example input.csv:
url
https://example.com
https://anothercompany.com
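The command-line flags shown above map naturally onto `argparse`. The sketch below mirrors the documented options; the defaults and help strings are assumptions, not the script's actual definitions.

```python
# Minimal argparse sketch mirroring the documented flags; defaults are assumed.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract emails from a CSV of websites.")
    parser.add_argument("input_csv", help="CSV file containing the target URLs")
    parser.add_argument("--url_column", default="url", help="Name of the URL column")
    parser.add_argument("--domain_filter", default=None, help="Only keep emails from this domain")
    parser.add_argument("--recursive", action="store_true", help="Follow contact/about/team pages")
    parser.add_argument("--async_mode", action="store_true", help="Use asyncio/aiohttp fetching")
    parser.add_argument("--output", default="csv", choices=["csv", "excel", "json"],
                        help="Export format")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```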
Requirements:
- Python 3.11+
- ChromeDriver (for Selenium, if scraping JS-heavy sites)
- See `requirements.txt` for all dependencies
Open an issue or reach out!