
Add API for device support #96

Merged (7 commits) on Jan 12, 2021
Conversation

@rgommers (Member) commented Dec 3, 2020

Based on the discussion in gh-39.

@rgommers added the "RFC: Request for comments. Feature requests and proposed changes." label on Dec 3, 2020
@rgommers (Member, Author)
The summary of feedback received on this PR so far:

  • Given how differently current libraries handle devices, let's keep it minimal and focus only on the use case of dealing with devices in library code.
  • That use case does not require a device object in the API standard, nor a way to identify specific physical/logical devices.
  • So remove the device object, and the string representation that allows identifying a specific device in a portable way across libraries.
  • Keep only (see the sketch below):
    • the .device attribute, which returns a device object that only needs to be able to compare itself (__eq__/__neq__) with other device objects from the same library;
    • the device= keyword for creation functions;
    • a to_device method on the array object, to move arrays between devices.
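A minimal sketch of how array-consuming library code could rely on just those three pieces; the helper names ones_on_same_device and move_to are hypothetical, not part of the proposal:

```python
def ones_on_same_device(xp, x):
    # `xp` is the array-API namespace that produced `x`; only the minimal
    # surface listed above is used: the .device attribute and the device=
    # keyword of a creation function.
    return xp.ones(x.shape, dtype=x.dtype, device=x.device)

def move_to(x, device):
    # Device objects only need to support equality comparison with devices
    # from the same library; to_device() is what actually moves the array.
    if x.device == device:
        return x
    return x.to_device(device)
```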

@rgommers (Member, Author)

PR updated. I also asked about the device_type:device_id notation on data-apis/consortium-feedback#1. I suspect the device_id happens to always work there because libraries typically use the same device ordering that CUDA gives (as shown by, e.g., nvidia-smi), but that ordering is technically not guaranteed.

@rgommers (Member, Author)

I will note that it's kind of odd to have a .device attribute that returns a device object which is itself not part of the API. It will work, but it makes the device harder to instantiate (it will either live in a different namespace, or - for numpy - be missing entirely) and harder to use in type annotations.
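For illustration, a sketch of that downstream awkwardness, assuming only the .device attribute, the device= keyword, and to_device() from this PR; since the device class has no standard name to import, the annotation can only be Any:

```python
from typing import Any

def same_device(a, b) -> bool:
    # Comparing .device attributes works: both objects come from the same
    # library and support __eq__.
    return a.device == b.device

def zeros_like_placed(xp, shape, reference):
    # There is no standard class for the device object itself, so the best
    # an annotation can do here is `Any` (or a library-specific type).
    device: Any = reference.device
    return xp.zeros(shape, device=device)
```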

@agarwal-ashish
What is the behavior in a symbolic execution model, where devices may be set post hoc and may even change across multiple calls to the same function? Also, is there some notion of symbolic devices?

As a concrete example, does y = zeros([2], device=x.device), where x has not been placed on any device yet, imply a colocation directive for x and y?

@rgommers (Member, Author)

What is the behavior in a symbolic execution model, where devices may be set post hoc and may even change across multiple calls to the same function?

If the default of device=None is implementation-defined, then that will be completely compatible with symbolic execution, right?

Also, is there some notion of symbolic devices?

It may be good to mention explicitly that symbolic execution is a thing and that the current design must allow for it. Other than that, I think all the current version of this PR says is that a library will have a device object and that it must implement __eq__ and __neq__. That should be fine for symbolic devices too.

The text says "to a specific physical or logical device". That may mean different things to different people; "logical" and "symbolic" sound similar to me. Happy to add "symbolic" explicitly, though.

@rgommers (Member, Author)

As a concrete example, does y = zeros([2], device=x.device), where x has not been placed on any device yet, imply a colocation directive for x and y?

That co-location is currently a recommendation, not a hard requirement, in the "Semantics" section of this PR. I'd expect a symbolic execution model to also do co-location and raise otherwise, but I'm not sure whether that's guaranteed (TF's placement policies are a little tricky to wrap my head around just from the docs).

#### Returns

- **out**: _<array>_

- an array containing uninitialized data.

(function-empty_like)=
-### empty_like(x, /, *, dtype=None)
+### empty_like(x, /, *, dtype=None, device=None)
@leofang (Contributor) commented Dec 19, 2020

This reminds me of an interesting inquiry we had a while ago: should the *_like() functions honor the device that the input array x is on? (cupy/cupy#3457)

Looking back now, it also seems plausible for the output array to be on the same device as x, but my objection still holds: it is incompatible with any sane device management approach (a context manager, an explicit function call such as use_device(N), etc.). I suppose the newly added device argument will encounter the same challenge. The most radical example: x is on device 1, the device argument is set to 2, but the default/current device is 0.

Thoughts?

@rgommers (Member, Author)

but my objection still holds: it is incompatible with any sane device management approach

I agree with the argument you made on that issue.

I suppose the newly added device argument will encounter the same challenge. The most radical example: x is on device 1, the device argument is set to 2, but the default/current device is 0.

I don't see the conflict here. If a library has multiple ways of controlling device placement, the most explicit method should have the highest priority. So (sketched below):

  1. If the device= keyword is specified, that always takes precedence.
  2. If device=None, then use the setting from a context manager, if one is active.
  3. If no context manager was used, then use the global default device/strategy.

Your example seems very similar to the first example in https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics, which I think explains the desired behaviour here.
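A rough sketch of that resolution order; use_device and the module-level defaults below are hypothetical names standing in for a library's context manager and global setting:

```python
import contextlib

_context_device = None   # set by the hypothetical use_device() context manager
_default_device = "cpu"  # hypothetical library-wide default device/strategy

@contextlib.contextmanager
def use_device(device):
    global _context_device
    previous, _context_device = _context_device, device
    try:
        yield
    finally:
        _context_device = previous

def resolve_device(device=None):
    if device is not None:       # 1. explicit device= keyword always wins
        return device
    if _context_device is not None:
        return _context_device   # 2. device set by an active context manager
    return _default_device       # 3. global default device/strategy
```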

@rgommers (Member, Author)

The empty_like description seems clear enough: the _like is about shape and dtype only.

@leofang (Contributor) commented Dec 28, 2020

Thanks @rgommers, I fully agree with everything you said above, but I noticed that in the "Semantics" section you're adding in this PR the wording is a bit different and (arguably) less clear. Should we incorporate your reply above there? It'd also be nice to add a variant of the "_like is about shape and dtype only" emphasis to all the docstrings where device is an optional argument.

@rgommers (Member, Author)

Good ideas, done.

@rgommers (Member, Author)

The text says "to a specific physical or logical device". That may mean different things to different people; "logical" and "symbolic" sound similar to me. Happy to add "symbolic" explicitly, though.

I searched the TensorFlow docs more thoroughly, and there's no such thing as a "symbolic device" there. I'm pretty sure it's LogicalDevice instead, so the current text in this PR seems clear to me.

Assumes a bit less about implementation. A string like `'cpu'`
should meet the requirements, and it doesn't have `__neq__`.
@rgommers (Member, Author)

No more comments in the last couple of weeks, so I'll go ahead and merge this PR so it's available in the published html docs. If there are any more comments, please add them here or open a new issue.

@rgommers merged commit eb276c1 into data-apis:main on Jan 12, 2021
@leofang (Contributor) commented Jan 13, 2021

LGTM. Agreed if there's any concern we can revisit 🙂

@agarwal-ashish
As a concrete example, does y = zeros([2], device=x.device), where x has not been placed on any device yet, imply a colocation directive for x and y?

That co-location is currently a recommendation, not a hard requirement, in the "Semantics" section of this PR. I'd expect a symbolic execution model to also do co-location and raise otherwise, but I'm not sure whether that's guaranteed (TF's placement policies are a little tricky to wrap my head around just from the docs).

Note that with symbolic execution, there is a "tracing" phase and an execution phase. If x.device is None at the time the graph is traced, then device=x.device is effectively ignored and colocation will not be done by default. If we want function execution to provide colocation, we need to put extra directives inside the traced graph. Then function runtime can respect those. But this will not be done by default.

A side effect is that there will be different behavior in eager vs symbolic execution. In eager mode, x.device will be set by the time y is computed, hence colocation will be done; but function execution will ignore it if x.device is not specified at function tracing time. This will generally be the case, since code is mostly written to be agnostic of actual placement, and placement is often done at tf.function call time rather than tf.function tracing time.
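A sketch of that divergence, assuming TensorFlow-style tracing (zeros_near is a made-up helper): in eager mode x.device is a concrete string so the placement hint takes effect, while during tracing it may still be empty and the hint is effectively dropped unless extra colocation directives are recorded in the graph.

```python
import tensorflow as tf

def zeros_near(x):
    # tf.device("") is a no-op, so if x has not been placed yet (typical
    # during tracing), this scope does not constrain placement at all.
    with tf.device(x.device):
        return tf.zeros([2])

x = tf.constant([1.0, 2.0])

y_eager = zeros_near(x)                # eager: x.device is concrete, y is co-located with x
y_traced = tf.function(zeros_near)(x)  # traced: placement decided later by the runtime
```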

@rgommers (Member, Author)

If we want function execution to provide colocation, we need to put extra directives inside the traced graph. Then function runtime can respect those. But this will not be done by default.

Just checking, are you saying that that won't be done now, if you simply add the device= keyword in the most naive way? Or you do not want to do this in TensorFlow by default?

It seems desirable to just always put those extra directives in, I'd think, and not let eager/symbolic behavior diverge?

@agarwal-ashish
If we want function execution to provide colocation, we need to put extra directives inside the traced graph. Then function runtime can respect those. But this will not be done by default.

Just checking, are you saying that that won't be done now, if you simply add the device= keyword in the most naive way? Or you do not want to do this in TensorFlow by default?

It seems desirable to just always put those extra directives in, I'd think, and not let eager/symbolic behavior diverge?

We will likely make this work best-effort in TensorFlow when we try to adhere to these new APIs. Given that this device placement seems like a recommendation instead of a requirement, it should be ok I think. This is a subtle implementation issue for frameworks implementing symbolic execution.
