This tool crawls a web page and downloads linked documents (PDF, DOCX, XLSX, etc.).
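The crawl step can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the code in crawler.py: the helper name `find_document_links`, its parameters, and the choice of `html.parser` are assumptions made for illustration. It collects anchor links from a page, resolves them against the page URL, and keeps those whose path ends in a wanted extension.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_document_links(html, base_url,
                        extensions=("pdf", "docx", "xlsx", "pptx", "zip"),
                        all_domains=False):
    """Hypothetical helper: return absolute URLs of linked documents."""
    collector = _LinkCollector()
    collector.feed(html)

    base_host = urlparse(base_url).netloc.lower()
    wanted = {ext.lower().lstrip(".") for ext in extensions}

    found = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        ext = posixpath.splitext(parts.path)[1].lstrip(".").lower()
        if ext not in wanted:
            continue
        if not all_domains and parts.netloc.lower() != base_host:
            continue  # same-domain only unless explicitly allowed
        if absolute not in found:
            found.append(absolute)  # keep first occurrence, preserve order
    return found


# Example with a placeholder page and placeholder hosts.
page = (
    '<a href="a.pdf">A</a>'
    '<a href="https://other.example/b.docx">B</a>'
    '<a href="page.html">C</a>'
)
links = find_document_links(page, "https://example.com/docs/index.html")
```

Matching on the URL path rather than the whole URL means a query string such as `?v=2` after the file name does not hide the extension.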
Usage (Windows PowerShell):
- Create and activate a virtual environment (optional but recommended).
- Install requirements.
- Run the crawler.
Example:
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python crawler.py --url https://pngrb.gov.in/eng-web/regulation-t4s.html --out downloads
Options:
- --url: Page URL to crawl (required)
- --out: Output folder (default: downloads)
- --ext: Space-separated list of extensions to download (default includes pdf, docx, xlsx, pptx, zip)
- --all-domains: Allow downloads from any domain (default is same domain only)
- --delay: Delay in seconds between downloads
- --max: Maximum number of files to download
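For readers who want to see how these options might map onto a parser, here is a sketch using `argparse`. It is an illustration, not the parser in crawler.py: the function name `build_parser` is hypothetical, the default extension list contains only the five extensions named above, and the defaults for --delay and --max are assumptions, since this README does not state them.

```python
import argparse

# Only the extensions this README names; the real default list may be longer.
DEFAULT_EXTENSIONS = ["pdf", "docx", "xlsx", "pptx", "zip"]


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(
        description="Download documents linked from a web page."
    )
    parser.add_argument("--url", required=True, help="Page URL to crawl")
    parser.add_argument("--out", default="downloads", help="Output folder")
    parser.add_argument(
        "--ext",
        nargs="+",  # space-separated values: --ext pdf zip
        default=DEFAULT_EXTENSIONS,
        help="Extensions to download",
    )
    parser.add_argument(
        "--all-domains",
        action="store_true",  # off by default: same domain only
        help="Allow downloads from any domain",
    )
    parser.add_argument(
        "--delay",
        type=float,
        default=0.0,  # assumed default, not stated in the README
        help="Delay in seconds between downloads",
    )
    parser.add_argument(
        "--max",
        type=int,
        default=None,  # assumed default: no limit
        help="Maximum number of files to download",
    )
    return parser


# Parsing an explicit argument list instead of sys.argv, with a placeholder URL.
args = build_parser().parse_args(
    ["--url", "https://example.com/page.html", "--ext", "pdf", "zip"]
)
```

Because --ext takes space-separated values, `nargs="+"` lets a command such as `--ext pdf zip` arrive as a list without any manual splitting.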
Run tests:
pytest -q