
Add API for device support #96

Merged (7 commits) on Jan 12, 2021
Conversation

@rgommers (Member) commented Dec 3, 2020

Based on the discussion in gh-39.

@rgommers added the "RFC: Request for comments. Feature requests and proposed changes." label on Dec 3, 2020
@rgommers (Member, Author)
The summary of feedback received on this PR so far:

  • Given how differently current libraries handle devices, let's keep it minimal and focus only on the use case of dealing with devices in library code.
  • That use case does not require a device object in the API standard, nor a way to identify specific physical/logical devices.
  • So remove the device object, and the string representation that allows identifying a specific device in a portable way across libraries.
  • Keep only (see the sketch below):
    • the .device attribute, which returns a device object that only needs to be able to compare itself (__eq__/__neq__) with other device objects from the same library;
    • the device= keyword for creation functions;
    • a to_device method on the array object, to move arrays between devices.
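A minimal sketch of how array-consuming library code could rely on just those three pieces; the helper names ones_on_same_device and move_to are hypothetical, not part of the proposal:

```python
def ones_on_same_device(xp, x):
    # `xp` is the array-API namespace that produced `x`; only the minimal
    # surface listed above is used: the .device attribute and the device=
    # keyword of a creation function.
    return xp.ones(x.shape, dtype=x.dtype, device=x.device)

def move_to(x, device):
    # Device objects only need to support equality comparison with devices
    # from the same library; to_device() is what actually moves the array.
    if x.device == device:
        return x
    return x.to_device(device)
```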

@rgommers (Member, Author)

PR updated. I also asked about the device_type:device_id notation on data-apis/consortium-feedback#1. I suspect the device_id happens to always work there because libraries typically use the same device ordering that CUDA gives (as shown by, e.g., nvidia-smi), but that ordering is technically not guaranteed.

@rgommers (Member, Author)

I will note that it's kind of odd to have a .device attribute that returns a device object which is itself not part of the API. It will work, but it makes the device harder to instantiate (it will either live in a different namespace, or - for numpy - be missing entirely) and harder to use in type annotations.
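For illustration, a sketch of that downstream awkwardness, assuming only the .device attribute, the device= keyword, and to_device() from this PR; since the device class has no standard name to import, the annotation can only be Any:

```python
from typing import Any

def same_device(a, b) -> bool:
    # Comparing .device attributes works: both objects come from the same
    # library and support __eq__.
    return a.device == b.device

def zeros_like_placed(xp, shape, reference):
    # There is no standard class for the device object itself, so the best
    # an annotation can do here is `Any` (or a library-specific type).
    device: Any = reference.device
    return xp.zeros(shape, device=device)
```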

@agarwal-ashish
What is the behavior in a symbolic execution model, where devices may be set post hoc and may even change across multiple calls to the same function? Also, is there some notion of symbolic devices?

As a concrete example, does y = zeros([2], device=x.device), where x has not been placed on any device yet, imply a colocation directive for x and y?

@rgommers (Member, Author)

What is the behavior in a symbolic execution model, where devices may be set post hoc and may even change across multiple calls to the same function?

If the default of device=None is implementation-defined, then that will be completely compatible with symbolic execution, right?

Also, is there some notion of symbolic devices?

It may be good to mention explicitly that symbolic execution is a thing and that the current design must allow for it. Other than that, I think all the current version of this PR says is that a library will have a device object and that it must implement __eq__ and __neq__. That should be fine for symbolic devices too.

The text says "to a specific physical or logical device". That may mean different things to different people; "logical" and "symbolic" sound similar to me. Happy to add "symbolic" explicitly, though.

@rgommers (Member, Author)

As a concrete example, does y = zeros([2], device=x.device), where x has not been placed on any device yet, imply a colocation directive for x and y?

That co-location is currently a recommendation, not a hard requirement, in the "Semantics" section of this PR. I'd expect a symbolic execution model to also do co-location and raise otherwise, but I'm not sure whether that's guaranteed (TF's placement policies are a little tricky to wrap my head around just from the docs).

#### Returns

- **out**: _<array>_

- an array containing uninitialized data.

(function-empty_like)=
-### empty_like(x, /, *, dtype=None)
+### empty_like(x, /, *, dtype=None, device=None)
@leofang (Contributor) commented Dec 19, 2020

This reminds me of an interesting inquiry we had a while ago: should the *_like() functions honor the device that the input array x is on? (cupy/cupy#3457)

Looking back now, it also seems plausible for the output array to be on the same device as x, but my objection still holds: it is incompatible with any sane device management approach (a context manager, an explicit function call such as use_device(N), etc.). I suppose the newly added device argument will encounter the same challenge. The most radical example: x is on device 1, the device argument is set to 2, but the default/current device is 0.

Thoughts?

@rgommers (Member, Author)

but my objection still holds: it is incompatible with any sane device management approach

I agree with the argument you made on that issue.

I suppose the newly added device argument will encounter the same challenge. The most radical example: x is on device 1, the device argument is set to 2, but the default/current device is 0.

I don't see the conflict here. If a library has multiple ways of controlling device placement, the most explicit method should have the highest priority. So (sketched below):

  1. If the device= keyword is specified, that always takes precedence.
  2. If device=None, then use the setting from a context manager, if one is active.
  3. If no context manager was used, then use the global default device/strategy.

Your example seems very similar to the first example in https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics, which I think explains the desired behaviour here.
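A rough sketch of that resolution order; use_device and the module-level defaults below are hypothetical names standing in for a library's context manager and global setting:

```python
import contextlib

_context_device = None   # set by the hypothetical use_device() context manager
_default_device = "cpu"  # hypothetical library-wide default device/strategy

@contextlib.contextmanager
def use_device(device):
    global _context_device
    previous, _context_device = _context_device, device
    try:
        yield
    finally:
        _context_device = previous

def resolve_device(device=None):
    if device is not None:       # 1. explicit device= keyword always wins
        return device
    if _context_device is not None:
        return _context_device   # 2. device set by an active context manager
    return _default_device       # 3. global default device/strategy
```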

@rgommers (Member, Author)

The empty_like description seems clear enough: the _like is about shape and dtype only.

@leofang (Contributor) commented Dec 28, 2020

Thanks @rgommers, I fully agree with everything you said above, but I noticed that in the "Semantics" section you're adding in this PR the wording is a bit different and (arguably) less clear. Should we incorporate your reply above there? It'd also be nice to add a variant of the "_like is about shape and dtype only" emphasis to all the docstrings where device is an optional argument.

@rgommers (Member, Author)

Good ideas, done.

@rgommers (Member, Author)

The text says "to a specific physical or logical device". That may mean different things to different people; "logical" and "symbolic" sound similar to me. Happy to add "symbolic" explicitly, though.

I searched the TensorFlow docs more thoroughly, and there's no such thing as a "symbolic device" there. I'm pretty sure it's LogicalDevice instead, so the current text in this PR seems clear to me.

Assumes a bit less about implementation. A string like `'cpu'`
should meet the requirements, and it doesn't have `__neq__`.
@rgommers (Member, Author)

No more comments in the last couple of weeks, so I'll go ahead and merge this PR so it's available in the published html docs. If there are any more comments, please add them here or open a new issue.

@rgommers merged commit eb276c1 into data-apis:main on Jan 12, 2021
@leofang (Contributor) commented Jan 13, 2021

LGTM. Agreed if there's any concern we can revisit 🙂

@agarwal-ashish
As a concrete example, does y = zeros([2], device=x.device), where x has not been placed on any device yet, imply a colocation directive for x and y?

That co-location is currently a recommendation, not a hard requirement, in the "Semantics" section of this PR. I'd expect a symbolic execution model to also do co-location and raise otherwise, but I'm not sure whether that's guaranteed (TF's placement policies are a little tricky to wrap my head around just from the docs).

Note that with symbolic execution, there is a "tracing" phase and an execution phase. If x.device is None at the time the graph is traced, then device=x.device is effectively ignored and colocation will not be done by default. If we want function execution to provide colocation, we need to put extra directives inside the traced graph. Then function runtime can respect those. But this will not be done by default.

A side effect is that there will be different behavior in eager vs symbolic execution. In eager mode, x.device will be set by the time y is computed, hence colocation will be done; but function execution will ignore it if x.device is not specified at function tracing time. This will generally be the case, since code is mostly written to be agnostic of actual placement, and placement is often done at tf.function call time rather than tf.function tracing time.
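A sketch of that divergence, assuming TensorFlow-style tracing (zeros_near is a made-up helper): in eager mode x.device is a concrete string so the placement hint takes effect, while during tracing it may still be empty and the hint is effectively dropped unless extra colocation directives are recorded in the graph.

```python
import tensorflow as tf

def zeros_near(x):
    # tf.device("") is a no-op, so if x has not been placed yet (typical
    # during tracing), this scope does not constrain placement at all.
    with tf.device(x.device):
        return tf.zeros([2])

x = tf.constant([1.0, 2.0])

y_eager = zeros_near(x)                # eager: x.device is concrete, y is co-located with x
y_traced = tf.function(zeros_near)(x)  # traced: placement decided later by the runtime
```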

@rgommers (Member, Author)

If we want function execution to provide colocation, we need to put extra directives inside the traced graph. Then function runtime can respect those. But this will not be done by default.

Just checking, are you saying that that won't be done now, if you simply add the device= keyword in the most naive way? Or you do not want to do this in TensorFlow by default?

It seems desirable to just always put those extra directives in, I'd think, and not let eager/symbolic behavior diverge?

@agarwal-ashish
If we want function execution to provide colocation, we need to put extra directives inside the traced graph. Then function runtime can respect those. But this will not be done by default.

Just checking, are you saying that that won't be done now, if you simply add the device= keyword in the most naive way? Or you do not want to do this in TensorFlow by default?

It seems desirable to just always put those extra directives in, I'd think, and not let eager/symbolic behavior diverge?

We will likely make this work best-effort in TensorFlow when we try to adhere to these new APIs. Given that this device placement seems like a recommendation instead of a requirement, it should be ok I think. This is a subtle implementation issue for frameworks implementing symbolic execution.
