WIP: CAS: Refactoring CASDB to split out ActionCache #3679

dexonsmith · 2021-12-13T21:55:07Z

Currently a WIP. Goal of this branch is:

Design new CAS APIs that separate a few concepts out of CASDB: object storage vs. action cache vs. in-memory lifetime guarantees.
Split out the action cache.

Have a decent sketch for the first part, and I've started working on the second part.

Just now I split out two prep PRs:
#3678
#3677

I may end up splitting out other prep patches, not sure yet.

dexonsmith · 2021-12-13T21:56:43Z

@cachemeifyoucan @akyrtzi, FYI. Haven't actually split out ActionCache yet so it's not ready for review, but there's a long header comment on class CASDB that documents the intended design; let me know if you have comments there.

... and document a bunch of plans.

Note: surely does not compile at this point.

Note: surely, still nothing compiles

Add sys::fs::readNativeFileToLimit(). Whereas readNativeFileToEOF() reads an entire file, this just reads up to the provided limit.

Still need to thread through an ActionCache somewhere...

A few notes based on using ActionCache somewhere: - If ActionDescription owned its parameters it'd be easier to build. Seems a bit error-prone right now. - Needed to find a place to store the command-line for "extra data". If we wanted more stuff in ExtraData, this'd get pretty tough. - Needed to create a SmallVector to hold the InputIDs. - But, it's a nice property that ActionDescription is immutable. - If the content (not just the cache) is mutable then it's hard to reason about. - Potential fix: - Make ActionDescription own what it references. - Could allow the errors to be lazy about printing the action as well (store a copy of the description, and just print in `ErrorInfo::log()`). - Add ActionDescriptionBuilder, with simple mutating APIs, and a constructor `ActionDescription(ActionDescriptionBuilder&&)`. - ActionDescriptionRef is lightweight and can be used without building (... but maybe this one isn't useful in practice?) - Conversions between CASID and UniqueIDRef are annoying... but that's just a transition problem.

dexonsmith · 2021-12-21T03:15:05Z

@akyrtzi @cachemeifyoucan , I've started adopting the new ActionCache APIs in 6df65af -- that's just for TableGen. Let me know if you see any red flags.

(FYI, I'll clean up commit history before pushing...)

Avoid adding temporary bloat

akyrtzi · 2021-12-22T21:43:40Z

Full disclosure, I haven't looked at everything in the PR so most likely I'm missing context 😅. But I wanted to comment on the ActionCache interface.

Could you clarify a bit the functionality of cas::Namespace? Is this essentially a convenient way for the client to avoid key collisions by automatically adding the namespace to all keys? Or does it have more purposes than that?
I'm in favor of having an ActionDescriptionBuilder (as you mention in one of the commits) as convenient way to create an ActionDescription. That would provide convenient and type-safe APIs to include arbitrary serializable types (e.g. integer types); ActionDescription's StringRef ExtraData is hard to work with.
Assuming ActionDescriptionBuilder existed:
- Is there a particular benefit for ActionDescription's structure, beyond having a readable dump output?
- Would it make sense to essentially have a reference to the underlying hashed ID for querying the ActionCache? In a commit you mention "Make ActionDescription own what it references" but it's not clear to me how beneficial this is in practice, is ActionDescription intended to be something long-living?
  
  I have this view of ActionDescription as something transient that is created "on-the-fly" only for the purpose of interfacing with the ActionCache; the client should have a domain-specific way to model actions and ActionDescription is the way to compose the action key for them. Within this context the ActionDescriptionBuilder could play this role for creating what is semantically the ActionKey.
Could you clarify what is the difference between "Get a result from the cache" and "Get a result from the cache with extended lifetime"?

Previously we had a nice hack that the CASDB could be accessed from the CASFileSystem, but now we actually need to pass teh ActionCache through somehow.

dexonsmith · 2021-12-23T14:57:59Z

Could you clarify a bit the functionality of cas::Namespace? Is this essentially a convenient way for the client to avoid key collisions by automatically adding the namespace to all keys? Or does it have more purposes than that?

If you haven't read it, the long comment in CASDB might help. But maybe it's not clear.

It's more.

Previously, the high-level design was that the ActionCache and ObjectStore were part of the same thing (a CASDB). Any subclass of CASDB had to provide both. Any plug-in provided both.

Now, they are independently subclassed (note that CASDB hasn't changed names yet; that'll wait for ObjectStore/InMemoryView to be split apart). You can have a plug-in for one and a built-in for the other. E.g., you might use a plug-in for some 3rd party CAS but not hook up a custom ActionCache; clang will want to spin up an in-memory cache.

Three things can vary between different ObjectStore implementations:

Hash function (SHA256 vs BLAKE3)
Number of bits used from hash (24 bits from BLAKE3 vs all 32)
Schema for objects (what is the layout of a tree, and what precise bits are hashed? Is Blob always a single underlying object or are big ones split up and stored as a DAG under the hood? Are Nodes supported natively or are they emulated using either Blob or Tree?)

This variance means that the IDs (previously CASID, now UniqueIDRef/UniqueID) are not universal identifiers for objects. They are identifiers that belong to a specific namespace.

Comparing IDs in the namespace, assuming the hash is strong enough, they are equal if and only if they represent the same semantic object. (Same name, different object.)
Between different namespaces, they may be equal even if the objects are unrelated. They will usually be unequal even if they refer to the equivalent object. (Different name, same object.)

However, two different ObjectStore instances may be in the same namespace. E.g.:

Create two in-memory CAS instances in a row. They have different storage, but UniqueIDs mean the same thing.
Create each of an in-memory and an on-disk object store. If they're both built-in (and assuming they use the same hash/schema; currently it's not configurable but it may be in the future) they have the same namespace.

It's important to know when ObjectStores are in the same namespace. Consider using one as a local cache for another. If they are in the same namespace, you can pass through IDs. If they are not, you need to create/maintain a mapping between them.

Going back to ActionCache.

Before this PR, an ActionCache was a priori tied to the same namespace as the ObjectStore, since they were in the same CASDB. No need to validate.

In this PR, ActionCache is split out as a separate concept. But it is still required to be tied to a specific namespace. That means:

If you get a cache hit with get(), you know the cached result ID refers to an object that can retrieved from any ObjectStore in the same namespace.
If you get a cache collision with put(), you know it's because the action is missing inputs or is not reproducible, not because the IDs are from different namespaces.

Let me know if it's still not clear and/or what could/should be added to header docs.

dexonsmith · 2021-12-23T15:27:59Z

[...] Within this context the ActionDescriptionBuilder could play this role for creating what is semantically the ActionKey.

That would be simpler in a lot of ways. Maybe it'd be better.

Main reason I avoided this was because I didn't want to enforce anything on the ActionCache implementation. Maybe some impls want the key to be an action description serialization? Maybe some want to store a checksum of the description to detect hash collisions?

You also talk about ActionDescription as ephemeral, and with its current API that probably makes sense. But two useful directions this could make challenging:

Add first-class support for actions to refer to each other in a DAG.
Add built-in (de)serialization support for actions so an action graph can be stored.

Another problem is that each ActionKey effectively has a namespace of some sort tied to the hash function (or serialization) used by its ActionCache. Probably not too hard to solve though.

(I think what what makes sense to me is for an Action to be a second-class object type (built on Node) that can be hashed by an ObjectHasher (parent class of ObjectStore), and the ActionKey is a wrapper around a UniqueID from that namespace. But: you don't need to store an action to get its key, and the ObjectHasher used for the actions in an actiongraph doesn't need to be in the same namespace as the one that the cached results are in; and if actions are stored, the lifetime of the actiongraph object store can be distinct from the one for the cached results (e.g., each build directory might want a CAS to store action graphs for communicating actions, allowing processes to refer to actions by their ActionKey).)

dexonsmith · 2021-12-23T15:47:28Z

Could you clarify what is the difference between "Get a result from the cache" and "Get a result from the cache with extended lifetime"?

One returns UniqueID. This owns storage to the hash, like a std::string. Any key-value store can implement this trivially, but the type is bulkier... and it's hard to use with the existing CASDB because CASIDs are expected to reference memory owned by the CAS.

The other returns UniqueIDRef. This references unowned memory, like a StringRef. This API must guarantee reference validity as long as the ActionCache lives.

The second API may come with extra cost for some ActionCache implementations. It's possible we'd want to prefer using the former API. But pretty hard to reason about with CASDB's current design.

Note the difference in APIs between ObjectStore (no assumption of lifetime guarantees, using UniqueID) and its subclass InMemoryView (lifetime extended, using UniqueIDRef in things like InMemoryTree).

(Note that CASID aligns with UniqueIDRef.)

cachemeifyoucan

Just 3 general questions after reading the implementation. The main concern is this is very complicated to use. I don't know you have some design in your mind specifically that needed the complexity.

cachemeifyoucan · 2022-01-03T23:13:41Z

llvm/include/llvm/CAS/ActionCache.h

+  /// \post Subsequent call to \a get() with \p Action returns \p Result.
+  /// Returns an error if there is a \a get() will not return
+  Error put(const ActionDescription &Action, UniqueIDRef Result) {
+    assert(&Result.getNamespace() == &getNamespace());


I thought we are decouple ActionCache from CAS that store the artifacts, but this assertion means they are still in the same namespace?

The cache is decoupled from storage, but you need some way of knowing whether the UniqueIDs that they both refer to mean the same thing. Otherwise, the action cache might store an ID from an LLVM builtin CAS and then you try to look the object up in a custom CAS of some sort where the ID has a different meaning.

Sharing a "namespace" tells you that a UniqueID has the same meaning.

I guess I wasn't clear what is covered by a Namespace. Is that Key, or Value, or the mapping between Key and Value?

The Value/Result from ActionCache should be available as Key in ObjectStore but is very different from Value/Result in ObjectStore. I guess your definition of namespace is that everything within the namespace has uniqueID references the same CAS/ObjectStore?

If that is the case, does this address the original concern that ActionCache key can just be a hash that doesn't anything in the object store? If that is possible, then what is the namespace of ActionCache (whose key and value are in different namespace)?

The object store and action cache are fundamentally different. One is a set and the other is a map.

An ObjectStore is a set of objects. It has a schema for creating a content-derived unique identifier/handle for arbitrary objects, which can be used as a handle for retrieving on object later. This identifier must be stable, not dependent on state of the store itself, and uniquely identifying, not having collisions. So, a strong hash. While it is somewhat convenient to think of it as a map from a hash to an object, that's not really what's going on... it's really a set of objects, and objects have an inherent UID defined by their content.

This UID is universal (it's a UUID), assuming the universe is a particular CAS schema.

For example, the SHA1 identifiers used by Git are UUIDs in the context of Git. You can have multiple Git repos, each that have never talked to each other, and if you ingest the same filesystem directory you'll get the same UID for the tree object. But ingest that same directory into, say, a Bazel instance, and you'll get a different UID. This is again a UUID, but only in the context of Bazel.

You cannot take UIDs from one namespace and use them directly in another namespace.

An ActionCache is a map from an action (key) to a CAS object (value).

At a high-level, an action (key) is just an arbitrary string. A particular action cache need not even store the string; maybe it'll use a strong hash algorithm and just store the hash. Different action cache implementations may make different choices here; could be a key-value store from the serialized action, or maybe from the hash of the serialized action. The ActionDescription provides a framework for serializing a complex action into a string (or directly into a hash).

The CAS object (value) is not stored in the cache. Instead, only its UID is stored there.

The ActionCache needs to know the namespace of the CAS objects it stores. Is this mapping to UIDs for a Git CAS, a Bazel CAS, or an LLVM builtin CAS?

I guess I wasn't clear what is covered by a Namespace. Is that Key, or Value, or the mapping between Key and Value?

A Namespace defines the "universe" for UIDs. It answers the question for a UID: in what context are these universal / UUIDs? At a high-level, there is one namespace for each schema that converts from object content to its handle/UID. There would be one namespace referenced by all ObjectStores wrapping a Git CAS, another for all ObjectStores wrapping a Bazel CAS, and another for all ObjectStores wrapping a builtin CAS. If we end up supporting alternative hash functions or UID lengths for the builtin CAS, then there will need to be a different namespace for each hash function.

The Value/Result from ActionCache should be available as Key in ObjectStore but is very different from Value/Result in ObjectStore. I guess your definition of namespace is that everything within the namespace has uniqueID references the same CAS/ObjectStore?

No. You can have lots of ObjectStores (instances of a CAS) that all use the same namespace.

If that is the case, does this address the original concern that ActionCache key can just be a hash that doesn't anything in the object store?

Yes, that's part of it.

If that is possible, then what is the namespace of ActionCache (whose key and value are in different namespace)?

An ActionCache maps from arbitrary string to CAS objects. But it stores UIDs from the CAS objects' namespace. The ActionCache's namespace is the namespace for those CAS objects.

In theory, we could allow an ActionCache to store CAS objects from multiple namespaces. E.g., map "action1" to a Git object, and "action2" to a Bazel object. But I think this would be hard to use. Every time a client looked up a cached result, they'd need to check if the result was in the same namespace as the ObjectStore they have available before looking up the object.

Instead, the current design requires that an action cache only operates on CAS objects in a single namespace. Most parts of the compiler can reasonably assume that if they're handed an ObjectStore and an ActionCache, they're in the same namespace. They can verify this by asserting that getNamespace() returns the same thing... but there's no need to check namespaces after each get().

cachemeifyoucan · 2022-01-04T00:55:03Z

llvm/include/llvm/CAS/CASDB.h

+///       until the daemon shuts down but sharing work as long as the daemon
+///       lives
+///
+/// "Capture" out-parameters:


I wonder if this is too complicated. Sounds much simpler if we can just offload the decision to the CAS implementation and the user of CASDB just uses whatever is provided. I feel like the CAS implementation need to know about the life time if that is a concern (for example, don't invalidate the tree while it is being visited). Maybe ref count the return buffer and notify the CAS on release?

Yes, it is a bit complicated.

Note that CASDB dies and is replaced by InMemoryView (which has a high-level API) and ActionCache (for caching).

What you're commenting on here are low-level ObjectStore APIs (which InMemoryView will layer on top of). But you're referring to a CASDB. What do you mean by CASDB?

Assuming you mean ObjectStore: the eventual goal is to move the underlying ObjectStore out of process (into a daemon), with object content passed via shared memory. Since the protocol has not been designed, it's a bit hard to reason about. I'm trying to leave room for something fairly low-level that doesn't require per-object allocations and reference counting. For example, perhaps small objects can most efficiently be sent across the process boundary by using a large-ish, sharded, shared memory region that's used as a scratch buffer (or, more simply, over the socket), whereas large objects would have a dedicated mmap'ed memory region.

for example, don't invalidate the tree while it is being visited

This sounds like a concern for a user of InMemoryView; WDYT?

Didn't not aware of all the terminology changes. But yes, I mean ObjectStore.

The main concern I have is that if the compiler need to constantly aware of which low level protocol it needs to use for different object vs. the decision is entirely made by the CAS/plugin. I would highly prefer the later since that makes the life of people who do not want to think about CAS easier, have a consistent interface which is easier to wrapper under something like a file system, and much easier to scale with different plugins.

I expect there to be almost no direct users of ObjectStore: instead, it'll be adapted by an InMemoryView to provide lifetime. Unit tests may use the object store directly for testing the low-level interface, and maybe there will be other special cases.

It sounds to me like the use cases you're describing would be interacting with InMemoryView. WDYT?

That means InMemoryView need to be tuned for different plugins? That sounds better but I am not sure if those decisions should be made within LLVM.

The client (usually InMemoryView) doesn't have to tune. The design of the low-level ObjectStore APIs is that the client passes a Capture object (usually InMemoryView) which includes the capabilities the client has (for blobs, this could include raw_ostream& (copy bytes), StringRef & (permanent lifetime), mapped_file_region (ownership), etc.) and an indication of its preferences (e.g., only use mapped_file_region for blobs between 16K and 2GB) and then the object store implementation/plugin sends the object over using one of the mechanisms.

For something a bit more concrete, see the incomplete sketch of BlobCapture/TreeCapture/NodeCapture deleted at dexonsmith@6dffd64#diff-5d292c70f2336e2afef84ce13f5df8b6ad948290677fa37618de819a6c9d3d70, coming from:
https://github.com/dexonsmith/llvm-project/commits/experimental/cas/refactor-casdb-before-deleting-history/llvm/include/llvm/CAS/CASDB.h

cachemeifyoucan · 2022-01-04T01:04:30Z

llvm/include/llvm/CAS/Namespace.h

+///
+/// Instances should be declared \a ManagedStatic to compare by pointer
+/// identity.
+class Namespace {


I wonder why we need Namespace instead of just put everything in CASDB and reference CASDB instead?

Two CAS instances can be in the same namespace. For example, consider creating the following five builtin CAS instances:

std::unique_ptr<ObjectStore> CASes[] = { createInMemoryCAS(), createInMemoryCAS(), // independent in-memory CAS instance createOnDiskCAS("/path/1"), // on-disk storage createOnDiskCAS("/path/1"), // same on-disk storage, but different in-memory instance createOnDiskCAS("/path/2"), // different on-disk storage };

These assertions should succeed:

const Namespace &NS = CASes[0].getNamespace(); for (auto &CAS : llvm::make_range(CASes)) assert(&NS == &CAS.getNamespace());

This property seems critical to me for decoupling ActionCache from ObjectStore. Otherwise, how does ActionCache return a UniqueID, if the UniqueID needs to reference an ObjectStore?

cachemeifyoucan

I fix the build here: https://github.com/cachemeifyoucan/llvm-project/tree/CASDB-refactoring-build-fix

I can send a PR to your PR if needed.

I agree that we should probably have a better interface for ActionDecription. In my branch, I still use a tree builder to build a CASID in CAS storage then use that to build ActionDescription then index ActionCache. That feels a bit weird but it can probably easily be improved.

cachemeifyoucan · 2022-01-06T19:18:55Z

llvm/lib/CAS/CASDB.cpp


+
+Expected<CASID> CASDB::parseCASID(StringRef Reference) {
+  UniqueID ID;


This function has a lifetime issue as UniqueID is free'd as function returned.

Nice catch; this should be pushing the IDs onto a bumpptrallocator like BuiltinCAS::parseCASID was before; I'll fix.

dexonsmith · 2022-01-11T03:56:57Z

I fix the build here: https://github.com/cachemeifyoucan/llvm-project/tree/CASDB-refactoring-build-fix

Awesome, thanks, I'll take a look.

I can send a PR to your PR if needed.

No need; I'll probably cherry-pick and then hack on it from there.

I agree that we should probably have a better interface for ActionDecription. In my branch, I still use a tree builder to build a CASID in CAS storage then use that to build ActionDescription then index ActionCache. That feels a bit weird but it can probably easily be improved.

Thanks!

This reverts commit a5e35d6.

Cribbed from #3679

Cribbed from swiftlang#3679

dexonsmith self-assigned this Dec 13, 2021

dexonsmith added 3 commits December 14, 2021 16:01

WIP: CAS: Add Namespace and start threading through

e400230

... and document a bunch of plans.

WIP: Start adding tests for Namespace and UniqueID

d71e6f9

WIP: Finish testing UniqueID / Namespace / FlatUniqueIDArrayRef

87d172b

dexonsmith force-pushed the experimental/cas/refactor-casdb branch from 000aa0f to 87d172b Compare December 15, 2021 00:07

dexonsmith added 15 commits December 14, 2021 16:28

Test the DenseMap info

9055169

CAS: Start sketching out ActionCache

3efc3a5

WIP: start implementing in-memory action cache

abc7871

WIP: Mostly finish in-memory, and start on-disk action cache

1cb2ec1

Note: surely does not compile at this point.

WIP: Mostly finish on-disk action cache

3dd4122

Note: surely, still nothing compiles

Update header docs

47727b1

WIP: Get most of LLVMCAS compiling

f8a7f42

WIP: LLVMCAS and CASTests compile

5900137

fixup: temporarily revert readNativeFileToLimit to split out and test it

c7f4be1

Support: Add sys::fs::readNativeFileToLimit()

1a231ad

Add sys::fs::readNativeFileToLimit(). Whereas readNativeFileToEOF() reads an entire file, this just reads up to the provided limit.

Test ActionDescription

245f8d8

Test ActionCache

1970a6e

add missing final

9aaf46c

WIP: Remove result cache support from CASDB

4859af2

Still need to thread through an ActionCache somewhere...

dexonsmith added 3 commits December 20, 2021 19:33

CAS: Factor out unnecessary calls to getCASID()

1cecc05

Avoid adding temporary bloat

BuiltinCAS: Push UniqueIDRef through a bit further

c830d92

WIP: Maybe add conversions to UniqueIDRef?

34c2a68

Handle most of the complications for CompilerInvocation

4b1f6ee

Previously we had a nice hack that the CASDB could be accessed from the CASFileSystem, but now we actually need to pass teh ActionCache through somehow.

cachemeifyoucan reviewed Jan 4, 2022

View reviewed changes

cachemeifyoucan reviewed Jan 6, 2022

View reviewed changes

cachemeifyoucan added 2 commits January 6, 2022 14:35

Fix clang build

5618305

[CASDB] Fix a lifetime issue of parseCASID

a5e35d6

dexonsmith added 2 commits January 21, 2022 15:56

Revert "[CASDB] Fix a lifetime issue of parseCASID"

3c34d2a

This reverts commit a5e35d6.

CAS: Fix lifetime issue in CASDB::parseCASID

3d94e5c

dexonsmith added a commit that referenced this pull request Jan 26, 2022

CAS: Improve docs for trie data structures

dfe3ad1

Cribbed from #3679

dexonsmith added a commit to dexonsmith/llvm-project that referenced this pull request Jan 26, 2022

CAS: Ownership simplification for internals of OnDiskHashMappedTrie

f5520a2

Cribbed from swiftlang#3679

dexonsmith mentioned this pull request Jan 26, 2022

CAS: Ownership simplification for internals of OnDiskHashMappedTrie #3844

Merged



		Expected<CASID> CASDB::parseCASID(StringRef Reference) {
		UniqueID ID;

WIP: CAS: Refactoring CASDB to split out ActionCache #3679

Are you sure you want to change the base?

WIP: CAS: Refactoring CASDB to split out ActionCache #3679

Uh oh!

Conversation

dexonsmith commented Dec 13, 2021

Uh oh!

dexonsmith commented Dec 13, 2021

Uh oh!

dexonsmith commented Dec 21, 2021

Uh oh!

akyrtzi commented Dec 22, 2021

Uh oh!

dexonsmith commented Dec 23, 2021

Uh oh!

dexonsmith commented Dec 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dexonsmith commented Dec 23, 2021

Uh oh!

cachemeifyoucan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cachemeifyoucan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dexonsmith commented Jan 11, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dexonsmith commented Dec 23, 2021 •

edited

Loading