Skip to content

Conversation

@dexonsmith
Copy link

Currently a WIP. Goal of this branch is:

  • Design new CAS APIs that separate a few concepts out of CASDB: object storage vs. action cache vs. in-memory lifetime guarantees.
  • Split out the action cache.

Have a decent sketch for the first part, and I've started working on the second part.

Just now I split out two prep PRs:
#3678
#3677

I may end up splitting out other prep patches, not sure yet.

@dexonsmith dexonsmith self-assigned this Dec 13, 2021
@dexonsmith
Copy link
Author

@cachemeifyoucan @akyrtzi, FYI. Haven't actually split out ActionCache yet so it's not ready for review, but there's a long header comment on class CASDB that documents the intended design; let me know if you have comments there.

@dexonsmith dexonsmith force-pushed the experimental/cas/refactor-casdb branch from 000aa0f to 87d172b Compare December 15, 2021 00:07
Note: surely does not compile at this point.
Note: surely, still nothing compiles
Add sys::fs::readNativeFileToLimit(). Whereas readNativeFileToEOF()
reads an entire file, this just reads up to the provided limit.
Still need to thread through an ActionCache somewhere...
A few notes based on using ActionCache somewhere:

- If ActionDescription owned its parameters it'd be easier to build.
  Seems a bit error-prone right now.
    - Needed to find a place to store the command-line for "extra data".
      If we wanted more stuff in ExtraData, this'd get pretty tough.
    - Needed to create a SmallVector to hold the InputIDs.
- But, it's a nice property that ActionDescription is immutable.
    - If the content (not just the cache) is mutable then it's hard to
      reason about.
- Potential fix:
    - Make ActionDescription own what it references.
        - Could allow the errors to be lazy about printing the action as
          well (store a copy of the description, and just print in
          `ErrorInfo::log()`).
    - Add ActionDescriptionBuilder, with simple mutating APIs, and a
      constructor `ActionDescription(ActionDescriptionBuilder&&)`.
    - ActionDescriptionRef is lightweight and can be used without
      building (... but maybe this one isn't useful in practice?)
- Conversions between CASID and UniqueIDRef are annoying... but that's
  just a transition problem.
@dexonsmith
Copy link
Author

@akyrtzi @cachemeifyoucan , I've started adopting the new ActionCache APIs in 6df65af -- that's just for TableGen. Let me know if you see any red flags.

(FYI, I'll clean up commit history before pushing...)

@akyrtzi
Copy link

akyrtzi commented Dec 22, 2021

Full disclosure, I haven't looked at everything in the PR so most likely I'm missing context 😅. But I wanted to comment on the ActionCache interface.

  • Could you clarify a bit the functionality of cas::Namespace? Is this essentially a convenient way for the client to avoid key collisions by automatically adding the namespace to all keys? Or does it have more purposes than that?

  • I'm in favor of having an ActionDescriptionBuilder (as you mention in one of the commits) as convenient way to create an ActionDescription. That would provide convenient and type-safe APIs to include arbitrary serializable types (e.g. integer types); ActionDescription's StringRef ExtraData is hard to work with.

  • Assuming ActionDescriptionBuilder existed:

    • Is there a particular benefit for ActionDescription's structure, beyond having a readable dump output?

    • Would it make sense to essentially have a reference to the underlying hashed ID for querying the ActionCache? In a commit you mention "Make ActionDescription own what it references" but it's not clear to me how beneficial this is in practice, is ActionDescription intended to be something long-living?

      I have this view of ActionDescription as something transient that is created "on-the-fly" only for the purpose of interfacing with the ActionCache; the client should have a domain-specific way to model actions and ActionDescription is the way to compose the action key for them. Within this context the ActionDescriptionBuilder could play this role for creating what is semantically the ActionKey.

  • Could you clarify what is the difference between "Get a result from the cache" and "Get a result from the cache with extended lifetime"?

Previously we had a nice hack that the CASDB could be
accessed from the CASFileSystem, but now we actually need
to pass teh ActionCache through somehow.
@dexonsmith
Copy link
Author

Could you clarify a bit the functionality of cas::Namespace? Is this essentially a convenient way for the client to avoid key collisions by automatically adding the namespace to all keys? Or does it have more purposes than that?

If you haven't read it, the long comment in CASDB might help. But maybe it's not clear.

It's more.

Previously, the high-level design was that the ActionCache and ObjectStore were part of the same thing (a CASDB). Any subclass of CASDB had to provide both. Any plug-in provided both.

Now, they are independently subclassed (note that CASDB hasn't changed names yet; that'll wait for ObjectStore/InMemoryView to be split apart). You can have a plug-in for one and a built-in for the other. E.g., you might use a plug-in for some 3rd party CAS but not hook up a custom ActionCache; clang will want to spin up an in-memory cache.

Three things can vary between different ObjectStore implementations:

  • Hash function (SHA256 vs BLAKE3)
  • Number of bits used from hash (24 bits from BLAKE3 vs all 32)
  • Schema for objects (what is the layout of a tree, and what precise bits are hashed? Is Blob always a single underlying object or are big ones split up and stored as a DAG under the hood? Are Nodes supported natively or are they emulated using either Blob or Tree?)

This variance means that the IDs (previously CASID, now UniqueIDRef/UniqueID) are not universal identifiers for objects. They are identifiers that belong to a specific namespace.

  • Comparing IDs in the namespace, assuming the hash is strong enough, they are equal if and only if they represent the same semantic object. (Same name, different object.)
  • Between different namespaces, they may be equal even if the objects are unrelated. They will usually be unequal even if they refer to the equivalent object. (Different name, same object.)

However, two different ObjectStore instances may be in the same namespace. E.g.:

  • Create two in-memory CAS instances in a row. They have different storage, but UniqueIDs mean the same thing.
  • Create each of an in-memory and an on-disk object store. If they're both built-in (and assuming they use the same hash/schema; currently it's not configurable but it may be in the future) they have the same namespace.

It's important to know when ObjectStores are in the same namespace. Consider using one as a local cache for another. If they are in the same namespace, you can pass through IDs. If they are not, you need to create/maintain a mapping between them.

Going back to ActionCache.

Before this PR, an ActionCache was a priori tied to the same namespace as the ObjectStore, since they were in the same CASDB. No need to validate.

In this PR, ActionCache is split out as a separate concept. But it is still required to be tied to a specific namespace. That means:

  • If you get a cache hit with get(), you know the cached result ID refers to an object that can retrieved from any ObjectStore in the same namespace.
  • If you get a cache collision with put(), you know it's because the action is missing inputs or is not reproducible, not because the IDs are from different namespaces.

Let me know if it's still not clear and/or what could/should be added to header docs.

@dexonsmith
Copy link
Author

dexonsmith commented Dec 23, 2021

[...] Within this context the ActionDescriptionBuilder could play this role for creating what is semantically the ActionKey.

That would be simpler in a lot of ways. Maybe it'd be better.

Main reason I avoided this was because I didn't want to enforce anything on the ActionCache implementation. Maybe some impls want the key to be an action description serialization? Maybe some want to store a checksum of the description to detect hash collisions?

You also talk about ActionDescription as ephemeral, and with its current API that probably makes sense. But two useful directions this could make challenging:

  • Add first-class support for actions to refer to each other in a DAG.
  • Add built-in (de)serialization support for actions so an action graph can be stored.

Another problem is that each ActionKey effectively has a namespace of some sort tied to the hash function (or serialization) used by its ActionCache. Probably not too hard to solve though.

(I think what what makes sense to me is for an Action to be a second-class object type (built on Node) that can be hashed by an ObjectHasher (parent class of ObjectStore), and the ActionKey is a wrapper around a UniqueID from that namespace. But: you don't need to store an action to get its key, and the ObjectHasher used for the actions in an actiongraph doesn't need to be in the same namespace as the one that the cached results are in; and if actions are stored, the lifetime of the actiongraph object store can be distinct from the one for the cached results (e.g., each build directory might want a CAS to store action graphs for communicating actions, allowing processes to refer to actions by their ActionKey).)

@dexonsmith
Copy link
Author

Could you clarify what is the difference between "Get a result from the cache" and "Get a result from the cache with extended lifetime"?

One returns UniqueID. This owns storage to the hash, like a std::string. Any key-value store can implement this trivially, but the type is bulkier... and it's hard to use with the existing CASDB because CASIDs are expected to reference memory owned by the CAS.

The other returns UniqueIDRef. This references unowned memory, like a StringRef. This API must guarantee reference validity as long as the ActionCache lives.

The second API may come with extra cost for some ActionCache implementations. It's possible we'd want to prefer using the former API. But pretty hard to reason about with CASDB's current design.

Note the difference in APIs between ObjectStore (no assumption of lifetime guarantees, using UniqueID) and its subclass InMemoryView (lifetime extended, using UniqueIDRef in things like InMemoryTree).

(Note that CASID aligns with UniqueIDRef.)

Copy link

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just 3 general questions after reading the implementation. The main concern is this is very complicated to use. I don't know you have some design in your mind specifically that needed the complexity.

/// \post Subsequent call to \a get() with \p Action returns \p Result.
/// Returns an error if there is a \a get() will not return
Error put(const ActionDescription &Action, UniqueIDRef Result) {
assert(&Result.getNamespace() == &getNamespace());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we are decouple ActionCache from CAS that store the artifacts, but this assertion means they are still in the same namespace?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache is decoupled from storage, but you need some way of knowing whether the UniqueIDs that they both refer to mean the same thing. Otherwise, the action cache might store an ID from an LLVM builtin CAS and then you try to look the object up in a custom CAS of some sort where the ID has a different meaning.

Sharing a "namespace" tells you that a UniqueID has the same meaning.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I wasn't clear what is covered by a Namespace. Is that Key, or Value, or the mapping between Key and Value?

The Value/Result from ActionCache should be available as Key in ObjectStore but is very different from Value/Result in ObjectStore. I guess your definition of namespace is that everything within the namespace has uniqueID references the same CAS/ObjectStore?

If that is the case, does this address the original concern that ActionCache key can just be a hash that doesn't anything in the object store? If that is possible, then what is the namespace of ActionCache (whose key and value are in different namespace)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The object store and action cache are fundamentally different. One is a set and the other is a map.

An ObjectStore is a set of objects. It has a schema for creating a content-derived unique identifier/handle for arbitrary objects, which can be used as a handle for retrieving on object later. This identifier must be stable, not dependent on state of the store itself, and uniquely identifying, not having collisions. So, a strong hash. While it is somewhat convenient to think of it as a map from a hash to an object, that's not really what's going on... it's really a set of objects, and objects have an inherent UID defined by their content.

This UID is universal (it's a UUID), assuming the universe is a particular CAS schema.

For example, the SHA1 identifiers used by Git are UUIDs in the context of Git. You can have multiple Git repos, each that have never talked to each other, and if you ingest the same filesystem directory you'll get the same UID for the tree object. But ingest that same directory into, say, a Bazel instance, and you'll get a different UID. This is again a UUID, but only in the context of Bazel.

You cannot take UIDs from one namespace and use them directly in another namespace.

An ActionCache is a map from an action (key) to a CAS object (value).

  • At a high-level, an action (key) is just an arbitrary string. A particular action cache need not even store the string; maybe it'll use a strong hash algorithm and just store the hash. Different action cache implementations may make different choices here; could be a key-value store from the serialized action, or maybe from the hash of the serialized action. The ActionDescription provides a framework for serializing a complex action into a string (or directly into a hash).
  • The CAS object (value) is not stored in the cache. Instead, only its UID is stored there.
  • The ActionCache needs to know the namespace of the CAS objects it stores. Is this mapping to UIDs for a Git CAS, a Bazel CAS, or an LLVM builtin CAS?

I guess I wasn't clear what is covered by a Namespace. Is that Key, or Value, or the mapping between Key and Value?

A Namespace defines the "universe" for UIDs. It answers the question for a UID: in what context are these universal / UUIDs? At a high-level, there is one namespace for each schema that converts from object content to its handle/UID. There would be one namespace referenced by all ObjectStores wrapping a Git CAS, another for all ObjectStores wrapping a Bazel CAS, and another for all ObjectStores wrapping a builtin CAS. If we end up supporting alternative hash functions or UID lengths for the builtin CAS, then there will need to be a different namespace for each hash function.

The Value/Result from ActionCache should be available as Key in ObjectStore but is very different from Value/Result in ObjectStore. I guess your definition of namespace is that everything within the namespace has uniqueID references the same CAS/ObjectStore?

No. You can have lots of ObjectStores (instances of a CAS) that all use the same namespace.

If that is the case, does this address the original concern that ActionCache key can just be a hash that doesn't anything in the object store?

Yes, that's part of it.

If that is possible, then what is the namespace of ActionCache (whose key and value are in different namespace)?

An ActionCache maps from arbitrary string to CAS objects. But it stores UIDs from the CAS objects' namespace. The ActionCache's namespace is the namespace for those CAS objects.

In theory, we could allow an ActionCache to store CAS objects from multiple namespaces. E.g., map "action1" to a Git object, and "action2" to a Bazel object. But I think this would be hard to use. Every time a client looked up a cached result, they'd need to check if the result was in the same namespace as the ObjectStore they have available before looking up the object.

Instead, the current design requires that an action cache only operates on CAS objects in a single namespace. Most parts of the compiler can reasonably assume that if they're handed an ObjectStore and an ActionCache, they're in the same namespace. They can verify this by asserting that getNamespace() returns the same thing... but there's no need to check namespaces after each get().

/// until the daemon shuts down but sharing work as long as the daemon
/// lives
///
/// "Capture" out-parameters:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is too complicated. Sounds much simpler if we can just offload the decision to the CAS implementation and the user of CASDB just uses whatever is provided. I feel like the CAS implementation need to know about the life time if that is a concern (for example, don't invalidate the tree while it is being visited). Maybe ref count the return buffer and notify the CAS on release?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a bit complicated.

Note that CASDB dies and is replaced by InMemoryView (which has a high-level API) and ActionCache (for caching).

What you're commenting on here are low-level ObjectStore APIs (which InMemoryView will layer on top of). But you're referring to a CASDB. What do you mean by CASDB?

Assuming you mean ObjectStore: the eventual goal is to move the underlying ObjectStore out of process (into a daemon), with object content passed via shared memory. Since the protocol has not been designed, it's a bit hard to reason about. I'm trying to leave room for something fairly low-level that doesn't require per-object allocations and reference counting. For example, perhaps small objects can most efficiently be sent across the process boundary by using a large-ish, sharded, shared memory region that's used as a scratch buffer (or, more simply, over the socket), whereas large objects would have a dedicated mmap'ed memory region.

for example, don't invalidate the tree while it is being visited

This sounds like a concern for a user of InMemoryView; WDYT?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't not aware of all the terminology changes. But yes, I mean ObjectStore.

The main concern I have is that if the compiler need to constantly aware of which low level protocol it needs to use for different object vs. the decision is entirely made by the CAS/plugin. I would highly prefer the later since that makes the life of people who do not want to think about CAS easier, have a consistent interface which is easier to wrapper under something like a file system, and much easier to scale with different plugins.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect there to be almost no direct users of ObjectStore: instead, it'll be adapted by an InMemoryView to provide lifetime. Unit tests may use the object store directly for testing the low-level interface, and maybe there will be other special cases.

It sounds to me like the use cases you're describing would be interacting with InMemoryView. WDYT?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means InMemoryView need to be tuned for different plugins? That sounds better but I am not sure if those decisions should be made within LLVM.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client (usually InMemoryView) doesn't have to tune. The design of the low-level ObjectStore APIs is that the client passes a Capture object (usually InMemoryView) which includes the capabilities the client has (for blobs, this could include raw_ostream& (copy bytes), StringRef & (permanent lifetime), mapped_file_region (ownership), etc.) and an indication of its preferences (e.g., only use mapped_file_region for blobs between 16K and 2GB) and then the object store implementation/plugin sends the object over using one of the mechanisms.

For something a bit more concrete, see the incomplete sketch of BlobCapture/TreeCapture/NodeCapture deleted at dexonsmith@6dffd64#diff-5d292c70f2336e2afef84ce13f5df8b6ad948290677fa37618de819a6c9d3d70, coming from:
https://github.com/dexonsmith/llvm-project/commits/experimental/cas/refactor-casdb-before-deleting-history/llvm/include/llvm/CAS/CASDB.h

///
/// Instances should be declared \a ManagedStatic to compare by pointer
/// identity.
class Namespace {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we need Namespace instead of just put everything in CASDB and reference CASDB instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two CAS instances can be in the same namespace. For example, consider creating the following five builtin CAS instances:

std::unique_ptr<ObjectStore> CASes[] = {
    createInMemoryCAS(),
    createInMemoryCAS(), // independent in-memory CAS instance
    createOnDiskCAS("/path/1"), // on-disk storage
    createOnDiskCAS("/path/1"), // same on-disk storage,  but different in-memory instance
    createOnDiskCAS("/path/2"), // different on-disk storage
};

These assertions should succeed:

const Namespace &NS = CASes[0].getNamespace(); 
for (auto &CAS : llvm::make_range(CASes))
  assert(&NS == &CAS.getNamespace());

This property seems critical to me for decoupling ActionCache from ObjectStore. Otherwise, how does ActionCache return a UniqueID, if the UniqueID needs to reference an ObjectStore?

Copy link

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fix the build here: https://github.com/cachemeifyoucan/llvm-project/tree/CASDB-refactoring-build-fix

I can send a PR to your PR if needed.

I agree that we should probably have a better interface for ActionDecription. In my branch, I still use a tree builder to build a CASID in CAS storage then use that to build ActionDescription then index ActionCache. That feels a bit weird but it can probably easily be improved.



Expected<CASID> CASDB::parseCASID(StringRef Reference) {
UniqueID ID;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has a lifetime issue as UniqueID is free'd as function returned.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch; this should be pushing the IDs onto a bumpptrallocator like BuiltinCAS::parseCASID was before; I'll fix.

@dexonsmith
Copy link
Author

I fix the build here: https://github.com/cachemeifyoucan/llvm-project/tree/CASDB-refactoring-build-fix

Awesome, thanks, I'll take a look.

I can send a PR to your PR if needed.

No need; I'll probably cherry-pick and then hack on it from there.

I agree that we should probably have a better interface for ActionDecription. In my branch, I still use a tree builder to build a CASID in CAS storage then use that to build ActionDescription then index ActionCache. That feels a bit weird but it can probably easily be improved.

Thanks!

dexonsmith added a commit that referenced this pull request Jan 26, 2022
dexonsmith added a commit to dexonsmith/llvm-project that referenced this pull request Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants