This project documentation is moving to the ArchiveTeam wiki, rather than just my little corner of GitHub. See it here. With that, here are some notes on what I've done so far.
While all CVEs are special, KEV references (references for CVEs on CISA's Known Exploited Vulnerabilities catalog) are extra special.
Thanks to our pals at NVD, this is super easy:
curl --location 'https://services.nvd.nist.gov/rest/json/cves/2.0?hasKev' \
  | jq '.vulnerabilities[].cve | {"id": .id, "refs": [.references[].url]}' \
  > kev-refs.json
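That leaves kev-refs.json as a stream of JSON objects, one per KEV CVE, each with an id and its reference URLs. A quick sanity check on the shape, peeking at just the first record:
jq -s '.[0]' kev-refs.json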
There are currently 4611 unique KEV references (as of today), according to
cat kev-refs.json | jq '.refs[]' | sort | uniq | wc -l
This seems tractable.
ArchiveBox is pretty easy to use, and gettable at https://archivebox.io/#quickstart. Of course, just running my own ArchiveBox instance doesn't really solve the archiving problem. What I'd like to do is collect a simple, mid- to low-fidelity snapshot of each reference, tag it with its CVE ID, and then start releasing those as torrents once a month or so. In the meantime, we /could/ run our own ArchiveBox instance and use ArchiveBox's inherent tagging functionality to tag each archive with the relevant CVE ID.
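As a sketch of what that per-CVE tagging could look like from the command line (hypothetical loop; assumes a recent ArchiveBox where archivebox add accepts a --tag flag):
# walk kev-refs.json and add each reference URL, tagged with its CVE ID
jq -r '.id as $id | .refs[] | "\($id) \(.)"' kev-refs.json |
while read -r cve url; do
  archivebox add --tag="$cve" "$url"
done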
If you want access to Tod's install of ArchiveBox, ask him for a user/pass to Armitage. (Note the http-ness; we should help them get an out-of-the-box HTTPS listener going for security's sake.)
I tested out the "Get it all" functionality with five references, mostly to see how much disk space these references are going to take, and also to test how reliable the archiving is.
Of the five (Oracle, Microsoft, Twitter, Apple, and ZDI), only the ZDI reference seemed to run into trouble with archiving (and we'll want to look into why that is).
The sizes are a little big when you get all the different output formats, but the Chromium SingleFile snapshot seems reasonable, with the largest coming in at 1.4 MB.
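For the record, this is roughly how I eyeballed those sizes (assumes you're in the ArchiveBox data directory, where each snapshot lands under archive/<timestamp>/):
du -sh archive/*/ | sort -h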
There's a group of mad archivists out there, ArchiveTeam, who work with, but are distinct from, Archive.org. They hang out on IRC, are opinionated, and probably could help with this whole thing. I'll talk to them.
Talked to ArchiveTeam. They are very helpful! I was advised to stash all the CVE URLs on their site, so they're there now, as of May 17, 2023:
- https://transfer.archivete.am/P6uNh/cve-refs.txt
- https://transfer.archivete.am/EZdNi/cve-twitter-refs.txt
I was also invited to move this thing over to https://wiki.archiveteam.org/, so I registered as a new user and I'll do that at some point soon after I figure out how to not embarrass myself.
So let's get all CVE references. The NVD API isn't great for this since it's paginated at 2,000 records per request, but we don't really need the API; let's just get the whole dang thing and work it locally:
https://www.cve.org/Downloads#current-format
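Something along these lines does the grab-and-unzip (this assumes the bulk download is the cvelistV5 repository zip from the CVE Program's GitHub; the exact URL and layout on the downloads page may differ):
curl -LO https://github.com/CVEProject/cvelistV5/archive/refs/heads/main.zip
unzip main.zip
cd cvelistV5-main/cves   # the per-year record directories live here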
ArchiveTeam is a little spooked by Twitter references, specifically, since they're known to be so JavaScript-heavy. I'm sure there are other references in there that are also JS-ish (Facebook posts and such), but no matter.
Since the downloaded zip is broken out by year, combine the per-year files into one, after getting rid of the metadata JSON file recent-activity.json:
find . -name '*.json' -exec cat {} \; | jq -s '.' > all-cves.json
then
cat all-cves.json | jq '.[] | select(.cveMetadata.state | contains("PUBLISHED")) | {"id": .cveMetadata.cveId, "refs": [.containers.cna.references[].url]}' > all-refs.json
Next, create plain text files of these references:
cat all-refs.json | jq -r '.refs[]' | sort | uniq > all-urls.txt
Twist! Some of those references are on FTP sites. Let's kick those out for simplicity's sake; we'll come back to them later (and I would be amazed if any of them still survive).
cat all-refs.json | jq -r '.refs[] | select(.|startswith("http"))' | sort | uniq > cve-refs.txt
Note! jq's startswith is case-sensitive, but a quick grep shows there are, amazingly, no uppercase HTTP:// URLs, so we're good. If you would like your CVE references to avoid being archived with this scheme, use HTTP and HTTPS and not http or https, I guess. (I considered just ascii_downcase-ing the input, but URI paths are case-sensitive (while URL schemes are not), so I'd kinda rather not.)
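The quick grep was something along these lines (reconstructed, not the exact invocation):
# URLs whose scheme matches http/https/ftp ignoring case but isn't already lowercase;
# empty output means there are no uppercase schemes to worry about
grep -iE '^(https?|ftp)://' all-urls.txt | grep -vE '^(https?|ftp)://'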
Also, kicking out the Twitter references for now, and I think I got all that's going to be got anyway, documented over here:
cat cve-refs.txt | grep -v 'twitter.com/' > cve-refs-minus-twitter.txt
and
cat cve-refs.txt | grep 'twitter.com/' > cve-refs-just-twitter.txt
Couldn't be easier:
curl --upload-file cve-refs-minus-twitter.txt https://transfer.archivete.am/cve-refs.txt
and
curl --upload-file cve-refs-just-twitter.txt https://transfer.archivete.am/cve-twitter-refs.txt
So let's see what works and what doesn't
When adding all those 4611 URLs to ArchiveBox and ticking just the "archive.org" archive method, it finishes basically immediately. Nearly all of these links were already archived by someone, so that's pretty handy and good to know.
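(For reference, that same "only submit to archive.org" setup can be done without the UI by flipping ArchiveBox's SAVE_* config flags; these names come from the ArchiveBox docs, so double-check them against your version:)
archivebox config --set SAVE_ARCHIVE_DOT_ORG=True
archivebox config --set SAVE_WGET=False
archivebox config --set SAVE_SINGLEFILE=False
# ...and likewise False for the other SAVE_* extractors you don't want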
Churning through this now (on May 8, 2023) so I'll report back when that's done and see what broke.

So it looks like I have to do fewer at a time, and be more mindful of I/O limitations.
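One way to do fewer at a time is to split the list into chunks and feed them in one batch at a time; this is just a sketch (the batch size is a guess, and it leans on archivebox add reading URLs from stdin):
# break the big URL list into 500-line chunks, then add each chunk separately
split -l 500 cve-refs-minus-twitter.txt batch-
for f in batch-*; do
  archivebox add < "$f"
done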
Coming soon, probably. Seems like it should be easy, but I do need to figure out how the command line / API interface actually works.