editorialss/Crawler-test


PNGRB Regulations Document Crawler

This tool crawls a web page and downloads linked documents (PDF, DOCX, XLSX, etc.).
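As a rough illustration of the idea, the sketch below collects links from a page's HTML and keeps only those whose path ends in a wanted extension. It is not taken from crawler.py: the names `LinkCollector` and `find_document_links` are hypothetical, and it uses only the Python standard library so it can run without the project's requirements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_document_links(html, base_url,
                        extensions=("pdf", "docx", "xlsx", "pptx", "zip")):
    """Return absolute, de-duplicated URLs whose path ends in a wanted extension."""
    parser = LinkCollector()
    parser.feed(html)
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL.
        absolute = urljoin(base_url, href)
        # Look at the path only, so query strings do not hide the extension.
        suffix = posixpath.splitext(urlparse(absolute).path)[1]
        if suffix.lower().lstrip(".") in wanted and absolute not in links:
            links.append(absolute)
    return links
```

Matching on the URL path is a simplification: documents served from URLs without a file extension would be missed, and the real crawler may handle that case differently.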

Usage (Windows PowerShell):

  1. Create and activate a virtual environment (optional but recommended).
  2. Install requirements.
  3. Run the crawler.

Example:

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python crawler.py --url https://pngrb.gov.in/eng-web/regulation-t4s.html --out downloads

Options:

  • --url: Page URL to crawl (required)
  • --out: Output folder (default: downloads)
  • --ext: Space-separated list of extensions to download (default includes pdf, docx, xlsx, pptx, zip)
  • --all-domains: Allow downloads from any domain (default is same domain only)
  • --delay: Delay in seconds between downloads
  • --max: Maximum number of files to download
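
The options above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the code in crawler.py: the function names `build_parser` and `select_links` are hypothetical, the defaults for --delay and --max are guesses (the list above does not state them), and the default extension list may be longer in the real tool.

```python
import argparse
from urllib.parse import urlparse

# Assumption: the README says the default "includes" these; there may be more.
DEFAULT_EXTENSIONS = ["pdf", "docx", "xlsx", "pptx", "zip"]


def build_parser():
    """Build a command-line parser mirroring the documented options."""
    parser = argparse.ArgumentParser(
        description="Download documents linked from a web page.")
    parser.add_argument("--url", required=True, help="Page URL to crawl")
    parser.add_argument("--out", default="downloads", help="Output folder")
    parser.add_argument("--ext", nargs="+", default=DEFAULT_EXTENSIONS,
                        help="Space-separated list of extensions to download")
    parser.add_argument("--all-domains", action="store_true",
                        help="Allow downloads from any domain")
    # Assumed defaults: no delay and no limit.
    parser.add_argument("--delay", type=float, default=0.0,
                        help="Delay in seconds between downloads")
    parser.add_argument("--max", type=int, default=None, dest="max_files",
                        help="Maximum number of files to download")
    return parser


def select_links(links, page_url, all_domains=False, max_files=None):
    """Apply the same-domain rule and the file limit to candidate links."""
    page_host = urlparse(page_url).netloc.lower()
    kept = [link for link in links
            if all_domains or urlparse(link).netloc.lower() == page_host]
    return kept if max_files is None else kept[:max_files]
```

With this layout, `--ext pdf zip` yields the list `["pdf", "zip"]`, and omitting --all-domains keeps only links whose host matches the host of --url.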

Run tests:

pytest -q

About

It is a test crawler for Amman and others
