This project documentation is moving to the ArchiveTeam wiki, rather than just my little corner of GitHub. See it here. With that, here are some notes on what I've done so far.
While all CVEs are special, KEV references (references for CVEs on CISA's Known Exploited Vulnerabilities catalog) are extra special.
Thanks to our pals at NVD, this is super easy:
curl --location 'https://services.nvd.nist.gov/rest/json/cves/2.0?hasKev' \
  | jq '.vulnerabilities[].cve | {"id": .id, "refs": [.references[].url]}' \
  > kev-refs.json
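That leaves kev-refs.json as a stream of JSON objects, one per KEV CVE, each with an id and its reference URLs. A quick sanity check on the shape, peeking at just the first record:
jq -s '.[0]' kev-refs.json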
There are currently 4611 unique KEV references (as of today), according to
cat kev-refs.json | jq '.refs[]' | sort | uniq | wc -l
This seems tractable.
ArchiveBox is pretty easy to use, and gettable at https://archivebox.io/#quickstart. Of course, just running my own ArchiveBox instance doesn't really solve the archiving problem. What I'd like to do is collect a simple, mid- to low-fidelity snapshot of each reference, tag it with its CVE ID, and then start releasing those as torrents once a month or so. In the meantime, we /could/ run our own ArchiveBox instance and use ArchiveBox's inherent tagging functionality to tag each archive with the relevant CVE ID.
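As a sketch of what that per-CVE tagging could look like from the command line (hypothetical loop; assumes a recent ArchiveBox where archivebox add accepts a --tag flag):
# walk kev-refs.json and add each reference URL, tagged with its CVE ID
jq -r '.id as $id | .refs[] | "\($id) \(.)"' kev-refs.json |
while read -r cve url; do
  archivebox add --tag="$cve" "$url"
done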
If you want access to Tod's install of ArchiveBox, ask him for a user/pass to Armitage. (Note the http-ness; we should help them get an out-of-the-box HTTPS listener going for security's sake.)
I tested out the "Get it all" functionality with five references, mostly to see how much disk space these references are going to take, and also to test how reliable the archiving is.
Of the five (Oracle, Microsoft, Twitter, Apple, and ZDI), only the ZDI reference seemed to run into trouble with archiving (and we'll want to look into why that is).
The sizes are a little big when you get all the different output formats, but the Chromium SingleFile snapshot seems reasonable, with the largest coming in at 1.4 MB.
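For the record, this is roughly how I eyeballed those sizes (assumes you're in the ArchiveBox data directory, where each snapshot lands under archive/<timestamp>/):
du -sh archive/*/ | sort -h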
There's a group of mad archivists out there, ArchiveTeam, who work with, but are distinct from, Archive.org. They hang out on IRC, are opinionated, and probably could help with this whole thing. I'll talk to them.
Talked to ArchiveTeam. They are very helpful! I was advised to stash all the CVE URLs on their site, so they're there now, as of May 17, 2023:
- https://transfer.archivete.am/P6uNh/cve-refs.txt
- https://transfer.archivete.am/EZdNi/cve-twitter-refs.txt
I was also invited to move this thing over to https://wiki.archiveteam.org/, so I registered as a new user and I'll do that at some point soon after I figure out how to not embarrass myself.
So let's get all CVE references. The NVD API isn't great for this since it's paginated at 2,000 records per request, but we don't really need the API; let's just get the whole dang thing and work it locally:
https://www.cve.org/Downloads#current-format
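Something along these lines does the grab-and-unzip (this assumes the bulk download is the cvelistV5 repository zip from the CVE Program's GitHub; the exact URL and layout on the downloads page may differ):
curl -LO https://github.com/CVEProject/cvelistV5/archive/refs/heads/main.zip
unzip main.zip
cd cvelistV5-main/cves   # the per-year record directories live here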
ArchiveTeam is a little spooked by Twitter references, specifically, since they're known to be so JavaScript-heavy. I'm sure there are other references in there that are also JS-ish (Facebook posts and such), but no matter.
Since the downloaded zip is broken out by year, combine the per-year files into one, after getting rid of the metadata JSON file recent-activity.json:
find . -name '*.json' -exec cat {} \; | jq -s '.' > all-cves.json
then
cat all-cves.json | jq '.[] | select(.cveMetadata.state | contains("PUBLISHED")) | {"id": .cveMetadata.cveId, "refs": [.containers.cna.references[].url]}' > all-refs.json
Next, create plain text files of these references:
cat all-refs.json | jq -r '.refs[]' | sort | uniq > all-urls.txt
Twist! Some of those references are on FTP sites. Let's kick those out for simplicity's sake; we'll come back to them later (and I would be amazed if any of them still survive).
cat all-refs.json | jq -r '.refs[] | select(.|startswith("http"))' | sort | uniq > cve-refs.txt
Note! jq's startswith is case-sensitive, but a quick grep shows there are, amazingly, no uppercase HTTP:// URLs, so we're good. If you would like your CVE references to avoid being archived with this scheme, use HTTP and HTTPS and not http or https, I guess. (I considered just ascii_downcase-ing the input, but URI paths are case-sensitive (while URL schemes are not), so I'd kinda rather not.)
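The quick grep was something along these lines (reconstructed, not the exact invocation):
# URLs whose scheme matches http/https/ftp ignoring case but isn't already lowercase;
# empty output means there are no uppercase schemes to worry about
grep -iE '^(https?|ftp)://' all-urls.txt | grep -vE '^(https?|ftp)://'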
Also, kicking out the Twitter references for now, and I think I got all that's going to be got anyway, documented over here:
cat cve-refs.txt | grep -v 'twitter.com/' > cve-refs-minus-twitter.txt
and
cat cve-refs.txt | grep 'twitter.com/' > cve-refs-just-twitter.txt
Couldn't be easier:
curl --upload-file cve-refs-minus-twitter.txt https://transfer.archivete.am/cve-refs.txt
and
curl --upload-file cve-refs-just-twitter.txt https://transfer.archivete.am/cve-twitter-refs.txt
So let's see what works and what doesn't
When adding all those 4611 URLs to ArchiveBox and ticking just the "archive.org" archive method, it finishes basically immediately. Nearly all of these links were already archived by someone, so that's pretty handy and good to know.
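(For reference, that same "only submit to archive.org" setup can be done without the UI by flipping ArchiveBox's SAVE_* config flags; these names come from the ArchiveBox docs, so double-check them against your version:)
archivebox config --set SAVE_ARCHIVE_DOT_ORG=True
archivebox config --set SAVE_WGET=False
archivebox config --set SAVE_SINGLEFILE=False
# ...and likewise False for the other SAVE_* extractors you don't want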
Churning through this now (on May 8, 2023) so I'll report back when that's done and see what broke.

So it looks like I have to do fewer at a time, and be more mindful of I/O limitations.
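One way to do fewer at a time is to split the list into chunks and feed them in one batch at a time; this is just a sketch (the batch size is a guess, and it leans on archivebox add reading URLs from stdin):
# break the big URL list into 500-line chunks, then add each chunk separately
split -l 500 cve-refs-minus-twitter.txt batch-
for f in batch-*; do
  archivebox add < "$f"
done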
Coming soon, probably. Seems like it should be easy, but I do need to figure out how the command line / API interface actually works.