Import from internal repository

Tom Hudson · Tom Hudson · commit a7457025ceab · 2021-06-03T15:54:20.000+01:00
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+.*sw*
+out
+page-fetch
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,9 @@
+MIT License
+
+Copyright 2021 Detectify
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,180 @@
+# page-fetch
+
+page-fetch is a tool for researchers that lets you:
+
+* Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files
+* Run arbitrary JavaScript on many web pages and see the returned values
+
+
+## Installation
+
+page-fetch is written in Go and can be installed with `go get`:
+
+```
+▶ go get github.com/detectify/page-fetch
+```
+
+Or you can clone the repository and build it manually:
+
+```
+▶ git clone https://github.com/detectify/page-fetch.git
+▶ cd page-fetch
+▶ go install
+```
+
+### Dependencies
+
+page-fetch uses [chromedp](https://github.com/chromedp/chromedp), which requires
+that a Chrome or Chromium browser be installed. It tries the following
+executable names when attempting to launch a browser:
+
+* `headless_shell`
+* `headless-shell`
+* `chromium`
+* `chromium-browser`
+* `google-chrome`
+* `google-chrome-stable`
+* `google-chrome-beta`
+* `google-chrome-unstable`
+* `/usr/bin/google-chrome`
+
+
+## Basic Usage
+
+page-fetch takes a list of URLs as its input on `stdin`. You can provide the input list using IO redirection:
+
+```
+▶ page-fetch < urls.txt
+```
+
+Or using the output of another command:
+
+```
+▶ grep admin urls.txt | page-fetch
+```
+
+By default, responses are stored in a directory called 'out', which is created if it does not exist:
+
+```
+▶ echo https://detectify.com | page-fetch
+GET https://detectify.com/ 200 text/html; charset=utf-8
+GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
+GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
+GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
+...
+▶ tree out
+out
+├── detectify.com
+│   ├── index
+│   ├── index.meta
+│   └── site
+│       └── themes
+│           └── detectify
+│               ├── css
+│               │   ├── detectify.css
+│               │   └── detectify.css.meta
+...
+```
+
+The directory structure used in the output directory mirrors the directory structure used on the target websites.
+A ".meta" file is stored for each request; it contains the originally requested URL (including the query string),
+the request and response headers, etc.
+
+
+## Options
+
+You can get the page-fetch help output by running `page-fetch -h`:
+
+```
+▶ page-fetch -h
+Request URLs using headless Chrome, storing the results
+
+Usage:
+  page-fetch [options] < urls.txt
+
+Options:
+  -c, --concurrency <int>   Concurrency Level (default 2)
+  -e, --exclude <string>    Do not save responses matching the provided string (can be specified multiple times)
+  -i, --include <string>    Only save requests matching the provided string (can be specified multiple times)
+  -j, --javascript <string> JavaScript to run on each page
+  -o, --output <string>     Output directory name (default 'out')
+  -w, --overwrite           Overwrite output files when they already exist
+      --no-third-party      Do not save responses to requests on third-party domains
+      --third-party         Only save responses to requests on third-party domains
+```
+
+### Concurrency
+
+You can change how many headless Chrome processes are used with the `-c` / `--concurrency` option.
+The default value is 2.
+
+### Excluding responses based on content-type
+
+You can choose to not save responses that match particular content types with the `-e` / `--exclude` option.
+Any response with a content-type that partially matches the provided value will not be stored; so you can,
+for example, avoid storing image files by specifying:
+
+```
+▶ page-fetch --exclude image/
+```
+
+The option can be specified multiple times to exclude multiple different content-types.
+
+### Including responses based on content-type
+
+Rather than excluding specific content-types, you can opt to only save certain content-types with the
+`-i` / `--include` option:
+
+```
+▶ page-fetch --include text/html
+```
+
+The option can be specified multiple times to include multiple different content-types.
+
+### Running JavaScript on each page
+
+You can run arbitrary JavaScript on each page with the `-j` / `--javascript` option. The return value
+of the JavaScript is converted to a string and printed on a line prefixed with "JS":
+
+```
+▶ echo https://example.com | page-fetch --javascript document.domain
+GET https://example.com/ 200 text/html; charset=utf-8
+JS (https://example.com): example.com
+```
+
+This option can be used for a very wide variety of purposes. As an example, you could extract the `href`
+attribute from all links on a webpage:
+
+```
+▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)' | grep ^JS
+JS (https://example.com): [https://www.iana.org/domains/example]
+```
+
+### Setting the output directory name
+
+By default, files are stored in a directory called `out`. This can be changed with the `-o` / `--output` option:
+
+```
+▶ echo https://example.com | page-fetch --output example
+GET https://example.com/ 200 text/html; charset=utf-8
+▶ find example/ -type f
+example/example.com/index
+example/example.com/index.meta
+```
+
+The directory is created if it does not already exist.
+
+### Overwriting files
+
+By default, when a file already exists, a new file is created with a numeric suffix, e.g. if `index` already exists,
+`index.1` will be created. This behaviour can be overridden with the `-w` / `--overwrite` option. When the option is
+used, matching files will be overwritten instead.
+
+### Excluding third-party responses
+
+You may sometimes wish to exclude responses from third-party domains. This can be done with the `--no-third-party` option.
+Any responses to requests for domains that do not match the input URL, or one of its subdomains, will not be saved.
+
+### Including only third-party responses
+
+On rare occasions you may wish to *only* store responses to requests on third-party domains. This can be done with the `--third-party` option.
diff --git a/go.mod b/go.mod
@@ -0,0 +1,9 @@
+module git.detectify.net/tomhudson/page-fetch
+
+go 1.15
+
+require (
+	github.com/chromedp/cdproto v0.0.0-20210526005521-9e51b9051fd0
+	github.com/chromedp/chromedp v0.7.3
+	golang.org/x/net v0.0.0-20210525063256-abc453219eb5
+)
diff --git a/go.sum b/go.sum
@@ -0,0 +1,26 @@
+github.com/chromedp/cdproto v0.0.0-20210526005521-9e51b9051fd0 h1:aIcgRshD5I1MfJfB92KBDKpaXrYqj3fkqI8bHdtP3zA=
+github.com/chromedp/cdproto v0.0.0-20210526005521-9e51b9051fd0/go.mod h1:At5TxYYdxkbQL0TSefRjhLE3Q0lgvqKKMSFUglJ7i1U=
+github.com/chromedp/chromedp v0.7.3 h1:FvgJICfjvXtDX+miuMUY0NHuY8zQvjS/TcEQEG6Ldzs=
+github.com/chromedp/chromedp v0.7.3/go.mod h1:9gC521Yzgrk078Ulv6KIgG7hJ2x9aWrxMBBobTFk30A=
+github.com/chromedp/sysutil v1.0.0 h1:+ZxhTpfpZlmchB58ih/LBHX52ky7w2VhQVKQMucy3Ic=
+github.com/chromedp/sysutil v1.0.0/go.mod h1:kgWmDdq8fTzXYcKIBqIYvRRTnYb9aNS9moAV0xufSww=
+github.com/gobwas/httphead v0.1.0 h1:exrUm0f4YX0L7EBwZHuCF4GDp8aJfVeBrlLQrs6NqWU=
+github.com/gobwas/httphead v0.1.0/go.mod h1:O/RXo79gxV8G+RqlR/otEwx4Q36zl9rqC5u12GKvMCM=
+github.com/gobwas/pool v0.2.1 h1:xfeeEhW7pwmX8nuLVlqbzVc7udMDrwetjEv+TZIz1og=
+github.com/gobwas/pool v0.2.1/go.mod h1:q8bcK0KcYlCgd9e7WYLm9LpyS+YeLd8JVDW6WezmKEw=
+github.com/gobwas/ws v1.1.0-rc.5 h1:QOAag7FoBaBYYHRqzqkhhd8fq5RTubvI4v3Ft/gDVVQ=
+github.com/gobwas/ws v1.1.0-rc.5/go.mod h1:nzvNcVha5eUziGrbxFCo6qFIojQHjJV5cLYIbezhfL0=
+github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY=
+github.com/josharian/intern v1.0.0/go.mod h1:5DoeVV0s6jJacbCEi61lwdGj/aVlrQvzHFFd8Hwg//Y=
+github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0=
+github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc=
+golang.org/x/net v0.0.0-20210525063256-abc453219eb5 h1:wjuX4b5yYQnEQHzd+CBcrcC6OVR2J1CN6mUy0oSxIPo=
+golang.org/x/net v0.0.0-20210525063256-abc453219eb5/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y=
+golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20201207223542-d4d67f95c62d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20210525143221-35b2ab0089ea h1:+WiDlPBBaO+h9vPNZi8uJ3k4BkKQB7Iow3aqwHVA5hI=
+golang.org/x/sys v0.0.0-20210525143221-35b2ab0089ea/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
+golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
+golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
+golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
diff --git a/main.go b/main.go