Commit a745702

Author: Tom Hudson
Import from internal repository
0 parents  commit a745702

6 files changed: +604 -0 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.*sw*
out
page-fetch

LICENSE

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
MIT License

Copyright 2021 Detectify

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# page-fetch

page-fetch is a tool for researchers that lets you:

* Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files
* Run arbitrary JavaScript on many web pages and see the returned values


## Installation

page-fetch is written in Go and can be installed with `go get`:

```
▶ go get github.com/detectify/page-fetch
```

Or you can clone the repository and build it manually:

```
▶ git clone https://github.com/detectify/page-fetch.git
▶ cd page-fetch
▶ go install
```

### Dependencies

page-fetch uses [chromedp](https://github.com/chromedp/chromedp), which requires
that a Chrome or Chromium browser be installed. It tries the following
executable names when looking for a browser:

* `headless_shell`
* `headless-shell`
* `chromium`
* `chromium-browser`
* `google-chrome`
* `google-chrome-stable`
* `google-chrome-beta`
* `google-chrome-unstable`
* `/usr/bin/google-chrome`
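
If you want to check in advance which of these names resolves on your machine, a lookup along the following lines may help. This is a minimal sketch, not chromedp's actual code: the `findBrowser` helper and its injected `lookup` function are hypothetical, and trying the names in the listed order is an assumption.

```go
package main

import (
	"fmt"
	"os/exec"
)

// candidates mirrors the executable names listed in the README.
var candidates = []string{
	"headless_shell",
	"headless-shell",
	"chromium",
	"chromium-browser",
	"google-chrome",
	"google-chrome-stable",
	"google-chrome-beta",
	"google-chrome-unstable",
	"/usr/bin/google-chrome",
}

// findBrowser returns the resolved path of the first name that lookup can
// find, or "" if none resolves. lookup is injected (hypothetical design) so
// the function can be exercised without a browser installed.
func findBrowser(names []string, lookup func(string) (string, bool)) string {
	for _, name := range names {
		if path, ok := lookup(name); ok {
			return path
		}
	}
	return ""
}

func main() {
	// onPath adapts exec.LookPath to the (path, ok) shape used above.
	onPath := func(name string) (string, bool) {
		path, err := exec.LookPath(name)
		return path, err == nil
	}
	if path := findBrowser(candidates, onPath); path != "" {
		fmt.Println("found browser:", path)
	} else {
		fmt.Println("no Chrome or Chromium executable found")
	}
}
```

If nothing is found, installing any one of the listed browsers should be enough.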


## Basic Usage

page-fetch takes a list of URLs as its input on `stdin`. You can provide the input list using IO redirection:

```
▶ page-fetch < urls.txt
```

Or using the output of another command:

```
▶ grep admin urls.txt | page-fetch
```

By default, responses are stored in a directory called 'out', which is created if it does not exist:

```
▶ echo https://detectify.com | page-fetch
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
...
▶ tree out
out
├── detectify.com
│   ├── index
│   ├── index.meta
│   └── site
│       └── themes
│           └── detectify
│               ├── css
│               │   ├── detectify.css
│               │   └── detectify.css.meta
...
```

The directory structure used in the output directory mirrors the directory structure used on the target websites.
A ".meta" file is stored for each request; it contains the originally requested URL (including the query string),
the request and response headers, etc.
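
The mapping from URL to file path can be sketched roughly as follows. This is an illustration inferred from the example output above, not the tool's actual code: the `outputPath` helper is hypothetical, and the choices to drop the query string from the file name and to store directory-style URLs as `index` are assumptions based on the `tree` listing.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// outputPath maps a URL to a slash-separated path under outDir:
// the host comes first, followed by the URL path. An empty path or one
// ending in "/" is stored as "index". The query string is not part of
// the file name (assumption based on the example listing).
func outputPath(outDir, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	p := u.Path
	if p == "" || strings.HasSuffix(p, "/") {
		p += "index"
	}
	// Cleaning as a rooted path stops ".." segments escaping the host directory.
	p = path.Clean("/" + p)
	return path.Join(outDir, u.Hostname(), p), nil
}

func main() {
	urls := []string{
		"https://detectify.com/",
		"https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751",
	}
	for _, raw := range urls {
		p, err := outputPath("out", raw)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		// The metadata file sits next to the response body.
		fmt.Println(p, p+".meta")
	}
}
```

Because the query string is not in the file name, the `.meta` file is the place to look for the full original URL.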


## Options

You can get the page-fetch help output by running `page-fetch -h`:

```
▶ page-fetch -h
Request URLs using headless Chrome, storing the results

Usage:
  page-fetch [options] < urls.txt

Options:
  -c, --concurrency <int>     Concurrency Level (default 2)
  -e, --exclude <string>      Do not save responses matching the provided string (can be specified multiple times)
  -i, --include <string>      Only save requests matching the provided string (can be specified multiple times)
  -j, --javascript <string>   JavaScript to run on each page
  -o, --output <string>       Output directory name (default 'out')
  -w, --overwrite             Overwrite output files when they already exist
      --no-third-party        Do not save responses to requests on third-party domains
      --third-party           Only save responses to requests on third-party domains
```

### Concurrency

You can change how many headless Chrome processes are used with the `-c` / `--concurrency` option.
The default value is 2.
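
A fixed number of workers pulling URLs from a shared queue is one common way to get this kind of bounded concurrency. The sketch below is an assumption about the general pattern, not a description of page-fetch's internals: `processAll` is a hypothetical helper, and `fetch` stands in for whatever each browser process would do with a URL.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll fans urls out to a fixed number of workers and collects one
// result per URL. Results arrive in completion order, not input order.
func processAll(urls []string, concurrency int, fetch func(string) string) []string {
	if concurrency < 1 {
		concurrency = 1
	}
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker would own one browser in a real tool.
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the queue, then signal that no more work is coming.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	urls := []string{"https://example.com", "https://example.org", "https://example.net"}
	done := processAll(urls, 2, func(u string) string { return "fetched " + u })
	fmt.Println(len(done), "URLs processed")
}
```

Each additional worker means another browser running at the same time, so higher values trade memory and CPU for throughput.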

### Excluding responses based on content-type

You can choose to not save responses that match particular content types with the `-e` / `--exclude` option.
Any response with a content-type that partially matches the provided value will not be stored; so you can,
for example, avoid storing image files by specifying:

```
▶ page-fetch --exclude image/
```

The option can be specified multiple times to exclude multiple different content-types.

### Including responses based on content-type

Rather than excluding specific content-types, you can opt to only save certain content-types with the
`-i` / `--include` option:

```
▶ page-fetch --include text/html
```

The option can be specified multiple times to include multiple different content-types.
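
The partial matching used by `--exclude` and `--include` can be pictured as a substring test on the content-type. The sketch below is one plausible reading, not the tool's actual logic: `shouldSave` is a hypothetical helper, and the case-insensitive comparison and the rule that an exclude match wins over an include match are both assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSave reports whether a response with the given content-type would be
// stored. Matching is by substring ("partial match"). Assumptions: comparison
// is case-insensitive, and exclude takes precedence over include.
func shouldSave(contentType string, include, exclude []string) bool {
	ct := strings.ToLower(contentType)
	for _, e := range exclude {
		if strings.Contains(ct, strings.ToLower(e)) {
			return false
		}
	}
	// With no include filters, everything not excluded is saved.
	if len(include) == 0 {
		return true
	}
	for _, inc := range include {
		if strings.Contains(ct, strings.ToLower(inc)) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldSave("image/svg+xml", nil, []string{"image/"}))
	fmt.Println(shouldSave("text/html; charset=utf-8", []string{"text/html"}, nil))
	fmt.Println(shouldSave("text/css", []string{"text/html"}, nil))
}
```

Because the match is partial, a value such as `text/html` also matches headers that carry parameters, for example `text/html; charset=utf-8`.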

### Running JavaScript on each page

You can run arbitrary JavaScript on each page with the `-j` / `--javascript` option. The return value
of the JavaScript is converted to a string and printed on a line prefixed with "JS":

```
▶ echo https://example.com | page-fetch --javascript document.domain
GET https://example.com/ 200 text/html; charset=utf-8
JS (https://example.com): example.com
```

This option can be used for a very wide variety of purposes. As an example, you could extract the `href`
attribute from all links on a webpage:

```
▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)' | grep ^JS
JS (https://example.com): [https://www.iana.org/domains/example]
```
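
If you want to post-process the `JS` lines in a program rather than with `grep`, they can be split into the page URL and the returned value. The sketch below assumes the line format is exactly as shown in the examples above (`JS (<url>): <value>`); `parseJSLine` is a hypothetical helper and is not part of page-fetch.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseJSLine splits a line of the form "JS (<url>): <value>" into its URL
// and value. ok is false for any other line, such as the "GET ..." lines.
func parseJSLine(line string) (pageURL, value string, ok bool) {
	const prefix = "JS ("
	const sep = "): "
	if !strings.HasPrefix(line, prefix) {
		return "", "", false
	}
	rest := line[len(prefix):]
	i := strings.Index(rest, sep)
	if i < 0 {
		return "", "", false
	}
	return rest[:i], rest[i+len(sep):], true
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Allow long lines, since a JavaScript return value can be large.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		if u, v, ok := parseJSLine(scanner.Text()); ok {
			fmt.Printf("%s\t%s\n", u, v)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
	}
}
```

Piping page-fetch output into such a program would print one tab-separated URL and value pair per `JS` line.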

### Setting the output directory name

By default, files are stored in a directory called `out`. This can be changed with the `-o` / `--output` option:

```
▶ echo https://example.com | page-fetch --output example
GET https://example.com/ 200 text/html; charset=utf-8
▶ find example/ -type f
example/example.com/index
example/example.com/index.meta
```

The directory is created if it does not already exist.

### Overwriting files

By default, when a file already exists, a new file is created with a numeric suffix, e.g. if `index` already exists,
`index.1` will be created. This behaviour can be overridden with the `-w` / `--overwrite` option. When the option is
used, matching files will be overwritten instead.
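
The suffixing behaviour can be sketched as a search for the first free name. This is an illustration only: `availablePath` is a hypothetical helper, and the assumption that the suffix keeps counting upwards (`index.2`, `index.3`, and so on) goes beyond what is stated above, which only mentions `index.1`.

```go
package main

import (
	"fmt"
	"os"
)

// availablePath returns the path a response would be written to.
// With overwrite set, the original path is always reused. Otherwise the
// first free name among p, p.1, p.2, ... is returned (assumed behaviour
// beyond ".1"). exists is injected so the logic can be tested without
// touching the filesystem.
func availablePath(p string, overwrite bool, exists func(string) bool) string {
	if overwrite || !exists(p) {
		return p
	}
	for i := 1; ; i++ {
		candidate := fmt.Sprintf("%s.%d", p, i)
		if !exists(candidate) {
			return candidate
		}
	}
}

func main() {
	onDisk := func(p string) bool {
		_, err := os.Stat(p)
		return err == nil
	}
	fmt.Println(availablePath("out/example.com/index", false, onDisk))
}
```

Checking for a file and then creating it is not atomic, so a concurrent writer would more likely rely on an exclusive create (for example `os.O_EXCL`) than on a separate existence check.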

### Excluding third-party responses

You may sometimes wish to exclude responses from third-party domains. This can be done with the `--no-third-party` option.
Responses to requests for any domain other than the input URL's domain, or one of its subdomains, will not be saved.
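
The first-party test described here can be sketched as a host comparison. This is a simplified illustration, not the tool's actual code: `isThirdParty` is a hypothetical helper, and comparing plain hostnames case-insensitively, with no public-suffix handling, is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isThirdParty reports whether requestHost is neither the input host nor one
// of its subdomains. Both arguments are bare hostnames without a port.
func isThirdParty(inputHost, requestHost string) bool {
	in := strings.ToLower(inputHost)
	req := strings.ToLower(requestHost)
	if req == in {
		return false
	}
	// The leading dot stops "notdetectify.com" matching "detectify.com".
	return !strings.HasSuffix(req, "."+in)
}

func main() {
	for _, host := range []string{"detectify.com", "blog.detectify.com", "fonts.googleapis.com"} {
		fmt.Printf("%s third-party: %v\n", host, isThirdParty("detectify.com", host))
	}
}
```

Under this reading, `--no-third-party` would keep responses for which the check is false, and `--third-party` would keep those for which it is true.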

### Including only third-party responses

On rare occasions you may wish to store *only* responses from third-party domains. This can be done with the `--third-party` option.

go.mod

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
module git.detectify.net/tomhudson/page-fetch

go 1.15

require (
	github.com/chromedp/cdproto v0.0.0-20210526005521-9e51b9051fd0
	github.com/chromedp/chromedp v0.7.3
	golang.org/x/net v0.0.0-20210525063256-abc453219eb5
)

go.sum

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
github.com/chromedp/cdproto v0.0.0-20210526005521-9e51b9051fd0 h1:aIcgRshD5I1MfJfB92KBDKpaXrYqj3fkqI8bHdtP3zA=
github.com/chromedp/cdproto v0.0.0-20210526005521-9e51b9051fd0/go.mod h1:At5TxYYdxkbQL0TSefRjhLE3Q0lgvqKKMSFUglJ7i1U=
github.com/chromedp/chromedp v0.7.3 h1:FvgJICfjvXtDX+miuMUY0NHuY8zQvjS/TcEQEG6Ldzs=
github.com/chromedp/chromedp v0.7.3/go.mod h1:9gC521Yzgrk078Ulv6KIgG7hJ2x9aWrxMBBobTFk30A=
github.com/chromedp/sysutil v1.0.0 h1:+ZxhTpfpZlmchB58ih/LBHX52ky7w2VhQVKQMucy3Ic=
github.com/chromedp/sysutil v1.0.0/go.mod h1:kgWmDdq8fTzXYcKIBqIYvRRTnYb9aNS9moAV0xufSww=
github.com/gobwas/httphead v0.1.0 h1:exrUm0f4YX0L7EBwZHuCF4GDp8aJfVeBrlLQrs6NqWU=
github.com/gobwas/httphead v0.1.0/go.mod h1:O/RXo79gxV8G+RqlR/otEwx4Q36zl9rqC5u12GKvMCM=
github.com/gobwas/pool v0.2.1 h1:xfeeEhW7pwmX8nuLVlqbzVc7udMDrwetjEv+TZIz1og=
github.com/gobwas/pool v0.2.1/go.mod h1:q8bcK0KcYlCgd9e7WYLm9LpyS+YeLd8JVDW6WezmKEw=
github.com/gobwas/ws v1.1.0-rc.5 h1:QOAag7FoBaBYYHRqzqkhhd8fq5RTubvI4v3Ft/gDVVQ=
github.com/gobwas/ws v1.1.0-rc.5/go.mod h1:nzvNcVha5eUziGrbxFCo6qFIojQHjJV5cLYIbezhfL0=
github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY=
github.com/josharian/intern v1.0.0/go.mod h1:5DoeVV0s6jJacbCEi61lwdGj/aVlrQvzHFFd8Hwg//Y=
github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0=
github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc=
golang.org/x/net v0.0.0-20210525063256-abc453219eb5 h1:wjuX4b5yYQnEQHzd+CBcrcC6OVR2J1CN6mUy0oSxIPo=
golang.org/x/net v0.0.0-20210525063256-abc453219eb5/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y=
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201207223542-d4d67f95c62d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210525143221-35b2ab0089ea h1:+WiDlPBBaO+h9vPNZi8uJ3k4BkKQB7Iow3aqwHVA5hI=
golang.org/x/sys v0.0.0-20210525143221-35b2ab0089ea/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
