-
Notifications
You must be signed in to change notification settings - Fork 351
WIP: CAS: Refactoring CASDB to split out ActionCache #3679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: experimental/cas/main
Are you sure you want to change the base?
WIP: CAS: Refactoring CASDB to split out ActionCache #3679
Conversation
|
@cachemeifyoucan @akyrtzi, FYI. Haven't actually split out ActionCache yet so it's not ready for review, but there's a long header comment on |
000aa0f to
87d172b
Compare
Note: surely does not compile at this point.
Note: surely, still nothing compiles
Add sys::fs::readNativeFileToLimit(). Whereas readNativeFileToEOF() reads an entire file, this just reads up to the provided limit.
Still need to thread through an ActionCache somewhere...
A few notes based on using ActionCache somewhere:
- If ActionDescription owned its parameters it'd be easier to build.
Seems a bit error-prone right now.
- Needed to find a place to store the command-line for "extra data".
If we wanted more stuff in ExtraData, this'd get pretty tough.
- Needed to create a SmallVector to hold the InputIDs.
- But, it's a nice property that ActionDescription is immutable.
- If the content (not just the cache) is mutable then it's hard to
reason about.
- Potential fix:
- Make ActionDescription own what it references.
- Could allow the errors to be lazy about printing the action as
well (store a copy of the description, and just print in
`ErrorInfo::log()`).
- Add ActionDescriptionBuilder, with simple mutating APIs, and a
constructor `ActionDescription(ActionDescriptionBuilder&&)`.
- ActionDescriptionRef is lightweight and can be used without
building (... but maybe this one isn't useful in practice?)
- Conversions between CASID and UniqueIDRef are annoying... but that's
just a transition problem.
|
@akyrtzi @cachemeifyoucan , I've started adopting the new (FYI, I'll clean up commit history before pushing...) |
|
Full disclosure, I haven't looked at everything in the PR so most likely I'm missing context 😅. But I wanted to comment on the
|
Previously we had a nice hack that the CASDB could be accessed from the CASFileSystem, but now we actually need to pass teh ActionCache through somehow.
If you haven't read it, the long comment in CASDB might help. But maybe it's not clear. It's more. Previously, the high-level design was that the ActionCache and ObjectStore were part of the same thing (a CASDB). Any subclass of CASDB had to provide both. Any plug-in provided both. Now, they are independently subclassed (note that CASDB hasn't changed names yet; that'll wait for ObjectStore/InMemoryView to be split apart). You can have a plug-in for one and a built-in for the other. E.g., you might use a plug-in for some 3rd party CAS but not hook up a custom ActionCache; clang will want to spin up an in-memory cache. Three things can vary between different ObjectStore implementations:
This variance means that the IDs (previously CASID, now UniqueIDRef/UniqueID) are not universal identifiers for objects. They are identifiers that belong to a specific namespace.
However, two different ObjectStore instances may be in the same namespace. E.g.:
It's important to know when ObjectStores are in the same namespace. Consider using one as a local cache for another. If they are in the same namespace, you can pass through IDs. If they are not, you need to create/maintain a mapping between them. Going back to ActionCache. Before this PR, an ActionCache was a priori tied to the same namespace as the ObjectStore, since they were in the same CASDB. No need to validate. In this PR, ActionCache is split out as a separate concept. But it is still required to be tied to a specific namespace. That means:
Let me know if it's still not clear and/or what could/should be added to header docs. |
That would be simpler in a lot of ways. Maybe it'd be better. Main reason I avoided this was because I didn't want to enforce anything on the ActionCache implementation. Maybe some impls want the key to be an action description serialization? Maybe some want to store a checksum of the description to detect hash collisions? You also talk about ActionDescription as ephemeral, and with its current API that probably makes sense. But two useful directions this could make challenging:
Another problem is that each ActionKey effectively has a namespace of some sort tied to the hash function (or serialization) used by its ActionCache. Probably not too hard to solve though. (I think what what makes sense to me is for an Action to be a second-class object type (built on Node) that can be hashed by an ObjectHasher (parent class of ObjectStore), and the ActionKey is a wrapper around a UniqueID from that namespace. But: you don't need to store an action to get its key, and the ObjectHasher used for the actions in an actiongraph doesn't need to be in the same namespace as the one that the cached results are in; and if actions are stored, the lifetime of the actiongraph object store can be distinct from the one for the cached results (e.g., each build directory might want a CAS to store action graphs for communicating actions, allowing processes to refer to actions by their ActionKey).) |
One returns UniqueID. This owns storage to the hash, like a std::string. Any key-value store can implement this trivially, but the type is bulkier... and it's hard to use with the existing CASDB because CASIDs are expected to reference memory owned by the CAS. The other returns UniqueIDRef. This references unowned memory, like a StringRef. This API must guarantee reference validity as long as the ActionCache lives. The second API may come with extra cost for some ActionCache implementations. It's possible we'd want to prefer using the former API. But pretty hard to reason about with CASDB's current design. Note the difference in APIs between ObjectStore (no assumption of lifetime guarantees, using UniqueID) and its subclass InMemoryView (lifetime extended, using UniqueIDRef in things like InMemoryTree). (Note that CASID aligns with UniqueIDRef.) |
cachemeifyoucan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just 3 general questions after reading the implementation. The main concern is this is very complicated to use. I don't know you have some design in your mind specifically that needed the complexity.
| /// \post Subsequent call to \a get() with \p Action returns \p Result. | ||
| /// Returns an error if there is a \a get() will not return | ||
| Error put(const ActionDescription &Action, UniqueIDRef Result) { | ||
| assert(&Result.getNamespace() == &getNamespace()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we are decouple ActionCache from CAS that store the artifacts, but this assertion means they are still in the same namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache is decoupled from storage, but you need some way of knowing whether the UniqueIDs that they both refer to mean the same thing. Otherwise, the action cache might store an ID from an LLVM builtin CAS and then you try to look the object up in a custom CAS of some sort where the ID has a different meaning.
Sharing a "namespace" tells you that a UniqueID has the same meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I wasn't clear what is covered by a Namespace. Is that Key, or Value, or the mapping between Key and Value?
The Value/Result from ActionCache should be available as Key in ObjectStore but is very different from Value/Result in ObjectStore. I guess your definition of namespace is that everything within the namespace has uniqueID references the same CAS/ObjectStore?
If that is the case, does this address the original concern that ActionCache key can just be a hash that doesn't anything in the object store? If that is possible, then what is the namespace of ActionCache (whose key and value are in different namespace)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The object store and action cache are fundamentally different. One is a set and the other is a map.
An ObjectStore is a set of objects. It has a schema for creating a content-derived unique identifier/handle for arbitrary objects, which can be used as a handle for retrieving on object later. This identifier must be stable, not dependent on state of the store itself, and uniquely identifying, not having collisions. So, a strong hash. While it is somewhat convenient to think of it as a map from a hash to an object, that's not really what's going on... it's really a set of objects, and objects have an inherent UID defined by their content.
This UID is universal (it's a UUID), assuming the universe is a particular CAS schema.
For example, the SHA1 identifiers used by Git are UUIDs in the context of Git. You can have multiple Git repos, each that have never talked to each other, and if you ingest the same filesystem directory you'll get the same UID for the tree object. But ingest that same directory into, say, a Bazel instance, and you'll get a different UID. This is again a UUID, but only in the context of Bazel.
You cannot take UIDs from one namespace and use them directly in another namespace.
An ActionCache is a map from an action (key) to a CAS object (value).
- At a high-level, an action (key) is just an arbitrary string. A particular action cache need not even store the string; maybe it'll use a strong hash algorithm and just store the hash. Different action cache implementations may make different choices here; could be a key-value store from the serialized action, or maybe from the hash of the serialized action. The ActionDescription provides a framework for serializing a complex action into a string (or directly into a hash).
- The CAS object (value) is not stored in the cache. Instead, only its UID is stored there.
- The ActionCache needs to know the namespace of the CAS objects it stores. Is this mapping to UIDs for a Git CAS, a Bazel CAS, or an LLVM builtin CAS?
I guess I wasn't clear what is covered by a Namespace. Is that Key, or Value, or the mapping between Key and Value?
A Namespace defines the "universe" for UIDs. It answers the question for a UID: in what context are these universal / UUIDs? At a high-level, there is one namespace for each schema that converts from object content to its handle/UID. There would be one namespace referenced by all ObjectStores wrapping a Git CAS, another for all ObjectStores wrapping a Bazel CAS, and another for all ObjectStores wrapping a builtin CAS. If we end up supporting alternative hash functions or UID lengths for the builtin CAS, then there will need to be a different namespace for each hash function.
The Value/Result from ActionCache should be available as Key in ObjectStore but is very different from Value/Result in ObjectStore. I guess your definition of namespace is that everything within the namespace has uniqueID references the same CAS/ObjectStore?
No. You can have lots of ObjectStores (instances of a CAS) that all use the same namespace.
If that is the case, does this address the original concern that ActionCache key can just be a hash that doesn't anything in the object store?
Yes, that's part of it.
If that is possible, then what is the namespace of ActionCache (whose key and value are in different namespace)?
An ActionCache maps from arbitrary string to CAS objects. But it stores UIDs from the CAS objects' namespace. The ActionCache's namespace is the namespace for those CAS objects.
In theory, we could allow an ActionCache to store CAS objects from multiple namespaces. E.g., map "action1" to a Git object, and "action2" to a Bazel object. But I think this would be hard to use. Every time a client looked up a cached result, they'd need to check if the result was in the same namespace as the ObjectStore they have available before looking up the object.
Instead, the current design requires that an action cache only operates on CAS objects in a single namespace. Most parts of the compiler can reasonably assume that if they're handed an ObjectStore and an ActionCache, they're in the same namespace. They can verify this by asserting that getNamespace() returns the same thing... but there's no need to check namespaces after each get().
| /// until the daemon shuts down but sharing work as long as the daemon | ||
| /// lives | ||
| /// | ||
| /// "Capture" out-parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is too complicated. Sounds much simpler if we can just offload the decision to the CAS implementation and the user of CASDB just uses whatever is provided. I feel like the CAS implementation need to know about the life time if that is a concern (for example, don't invalidate the tree while it is being visited). Maybe ref count the return buffer and notify the CAS on release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is a bit complicated.
Note that CASDB dies and is replaced by InMemoryView (which has a high-level API) and ActionCache (for caching).
What you're commenting on here are low-level ObjectStore APIs (which InMemoryView will layer on top of). But you're referring to a CASDB. What do you mean by CASDB?
Assuming you mean ObjectStore: the eventual goal is to move the underlying ObjectStore out of process (into a daemon), with object content passed via shared memory. Since the protocol has not been designed, it's a bit hard to reason about. I'm trying to leave room for something fairly low-level that doesn't require per-object allocations and reference counting. For example, perhaps small objects can most efficiently be sent across the process boundary by using a large-ish, sharded, shared memory region that's used as a scratch buffer (or, more simply, over the socket), whereas large objects would have a dedicated mmap'ed memory region.
for example, don't invalidate the tree while it is being visited
This sounds like a concern for a user of InMemoryView; WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't not aware of all the terminology changes. But yes, I mean ObjectStore.
The main concern I have is that if the compiler need to constantly aware of which low level protocol it needs to use for different object vs. the decision is entirely made by the CAS/plugin. I would highly prefer the later since that makes the life of people who do not want to think about CAS easier, have a consistent interface which is easier to wrapper under something like a file system, and much easier to scale with different plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect there to be almost no direct users of ObjectStore: instead, it'll be adapted by an InMemoryView to provide lifetime. Unit tests may use the object store directly for testing the low-level interface, and maybe there will be other special cases.
It sounds to me like the use cases you're describing would be interacting with InMemoryView. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means InMemoryView need to be tuned for different plugins? That sounds better but I am not sure if those decisions should be made within LLVM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client (usually InMemoryView) doesn't have to tune. The design of the low-level ObjectStore APIs is that the client passes a Capture object (usually InMemoryView) which includes the capabilities the client has (for blobs, this could include raw_ostream& (copy bytes), StringRef & (permanent lifetime), mapped_file_region (ownership), etc.) and an indication of its preferences (e.g., only use mapped_file_region for blobs between 16K and 2GB) and then the object store implementation/plugin sends the object over using one of the mechanisms.
For something a bit more concrete, see the incomplete sketch of BlobCapture/TreeCapture/NodeCapture deleted at dexonsmith@6dffd64#diff-5d292c70f2336e2afef84ce13f5df8b6ad948290677fa37618de819a6c9d3d70, coming from:
https://github.com/dexonsmith/llvm-project/commits/experimental/cas/refactor-casdb-before-deleting-history/llvm/include/llvm/CAS/CASDB.h
| /// | ||
| /// Instances should be declared \a ManagedStatic to compare by pointer | ||
| /// identity. | ||
| class Namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why we need Namespace instead of just put everything in CASDB and reference CASDB instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two CAS instances can be in the same namespace. For example, consider creating the following five builtin CAS instances:
std::unique_ptr<ObjectStore> CASes[] = {
createInMemoryCAS(),
createInMemoryCAS(), // independent in-memory CAS instance
createOnDiskCAS("/path/1"), // on-disk storage
createOnDiskCAS("/path/1"), // same on-disk storage, but different in-memory instance
createOnDiskCAS("/path/2"), // different on-disk storage
};
These assertions should succeed:
const Namespace &NS = CASes[0].getNamespace();
for (auto &CAS : llvm::make_range(CASes))
assert(&NS == &CAS.getNamespace());
This property seems critical to me for decoupling ActionCache from ObjectStore. Otherwise, how does ActionCache return a UniqueID, if the UniqueID needs to reference an ObjectStore?
cachemeifyoucan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fix the build here: https://github.com/cachemeifyoucan/llvm-project/tree/CASDB-refactoring-build-fix
I can send a PR to your PR if needed.
I agree that we should probably have a better interface for ActionDecription. In my branch, I still use a tree builder to build a CASID in CAS storage then use that to build ActionDescription then index ActionCache. That feels a bit weird but it can probably easily be improved.
|
|
||
|
|
||
| Expected<CASID> CASDB::parseCASID(StringRef Reference) { | ||
| UniqueID ID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function has a lifetime issue as UniqueID is free'd as function returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch; this should be pushing the IDs onto a bumpptrallocator like BuiltinCAS::parseCASID was before; I'll fix.
Awesome, thanks, I'll take a look.
No need; I'll probably cherry-pick and then hack on it from there.
Thanks! |
Currently a WIP. Goal of this branch is:
Have a decent sketch for the first part, and I've started working on the second part.
Just now I split out two prep PRs:
#3678
#3677
I may end up splitting out other prep patches, not sure yet.