#pragma once
#include <torch/csrc/Export.h>
#include <torch/csrc/autograd/function.h>
#include <torch/csrc/autograd/variable.h>
#include <ATen/TensorGeometry.h>
#include <ATen/core/DeprecatedTypeProperties.h>
#include <c10/util/Optional.h>
#include <cstdint>
#include <memory>
namespace torch {
namespace autograd {
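// Backward node recorded for copies such as `dst.copy_(src)`: apply() routes
// the incoming gradient back to the source (converted to `src_options`),
// while the overwritten destination receives a zero gradient. See
// functions/tensor.cpp for the implementation.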
struct TORCH_API CopyBackwards : public Node {
variable_list apply(variable_list&& grads) override;
void compiled_args(CompiledNodeArgs& args) override;
variable_list apply_with_saved(
const variable_list& inputs,
SwapSavedVariables& saved) override;
at::TensorOptions src_options;
};
// Note [View + Inplace update for base tensor]
//
// This note covers a few important topics related to view + inplace handling.
// - It explains what the CopySlices Node is and why we need it.
// - It explains the considerations on what is saved for backward in
//   CopySlices.
// - It explains why we sometimes need to change the exec_info of the current
//   backward.
//
// What is CopySlices?
// ~~~~~~~~~~~~~~~~~~~
//
// We support autograd with inplace mutation; e.g., if you write x.mul_(2)
// the autograd will work as if you now had multiple Tensors under the hood and
// you did
// x = t.clone()
// x0 = x
// x1 = x0 * 2
// x = x1
// As you can see here, after this operation, x.grad_fn now points to x1.grad_fn
// (the MulBackward node) and this node points to x's original grad_fn (which is
// also x0.grad_fn). It is important to keep in mind that after the inplace,
// there is no Tensor object that represents the x0 state anymore. But the graph
// for it is still around in autograd (in case x was used before being modified
// inplace). See Example 1 in
// https://docs.google.com/drawings/d/1-T5DyYfChMX1ONQkY-zU-hj_ayQ2zmA5CBOKDWqvEhE
// We call this rebasing the history of the Tensor.
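//
// An illustrative frontend sketch of this rebasing:
//   import torch
//   t = torch.randn(3, requires_grad=True)
//   x = t.clone()       # x.grad_fn is the clone's backward node
//   x.mul_(2)           # x.grad_fn is now the mul's backward node, whose
//                       # next edge is the old clone backward node
//   x.sum().backward()  # t.grad is 2 everywhere: the gradient flows through
//                       # both nodes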
//
// Now, a difficult situation is what happens if x is a differentiable view
// of a base b.
// b = t.clone()
// x = b.select(0, 0)
// x *= 2
// With the same approach as above, this will become
// b = t.clone()
// x = b.select(0, 0)
// b0 = b
// x0 = x
// x1 = x0 * 2
// b1 = b0.select_scatter(x1, 0, 0)
// x2 = b1.select(0, 0)
// x = x2
// b = b1
// As you can see here, not only do we need to modify x's grad_fn, we also need
// to modify b's. We also need to ensure that the new grad_fn on x is
// linked to b's new grad_fn. The chain of select_scatter, multiplication and
// select is what CopySlices does, all wrapped into a single Node.
//
// See Example 1 in
// https://docs.google.com/drawings/d/1-T5DyYfChMX1ONQkY-zU-hj_ayQ2zmA5CBOKDWqvEhE
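//
// An illustrative frontend sketch of the double rebase (behavior on the
// strided CPU/CUDA backends):
//   import torch
//   t = torch.randn(2, 2, requires_grad=True)
//   b = t.clone()
//   x = b.select(0, 0)
//   x.mul_(2)
//   # b.grad_fn is now a CopySlices node wrapping the mul's backward, and
//   # x.grad_fn is recomputed lazily from b's new history (via as_strided),
//   # exactly as described above.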
//
// What do we need to save in CopySlices to run backward?
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//
// We need to perform grad_view = fn(grad_view), but out-of-place.
// view_fn_ is an optional function saved in DifferentiableViewMeta
// from the forward pass, so that we can recover the view when as_strided is
// not supported. It preserves the invariants:
//   view = view_fn_(base)
//   grad_view = view_fn_(grad_base)
//
// When as_strided is supported (e.g. strided CPU/CUDA Tensors), view_fn_
// is empty and we save TensorGeometry(view) instead.
// With the TensorGeometry information we can use an `as_strided` call, which
// is more efficient, to recover views in backward.
//
// For example:
// view_1 = view_op_1(base)
// view_2 = view_op_2(view_1)
// ...
// view_n = view_op_n(view_n-1)
// view_n = inplace_op(view_n)
//
// In the CPU/CUDA case, where we have an efficient as_strided implementation,
// grad_view_n can be calculated in a single step:
//
// grad_view_n = grad_base.as_strided(view_sizes, view_strides, view_offset);
//
// But on the XLA backend, where we don't have full support for as_strided,
// we have to save a chained lambda function view_fn_ to exactly replay how
// the view was created in the forward pass:
//
// view_fn_ = view_op_n(...(view_op_2(view_op_1())))
// grad_view_n = view_fn_(grad_base)
//
// This chained view_fn_ works as long as the forward view ops are implemented
// (e.g. XLA simulates views without a real Storage behind the Tensor), but it
// is less efficient than the as_strided path, so we should be careful to only
// use it when necessary.
//
// - For CPU/CUDA we save the TensorGeometry of both the base and view tensors;
//   that's all we need to pass into as_strided, i.e. int[] sizes,
//   int[] strides, and int storage_offset (see the sketch below).
// - For XLA we use view_fn_, which captures all forward view op arguments
//   by **value**.
//   E.g. for at::narrow, int dim, int start, int length are saved.
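//
// A minimal sketch of recovering a view of the gradient from the saved
// geometry (the sizes/strides/offset below are made-up example values that
// select column 0 of a contiguous 4x4 tensor):
//   import torch
//   grad_base = torch.randn(4, 4)
//   view_sizes, view_strides, view_offset = (4,), (4,), 0
//   grad_view = grad_base.as_strided(view_sizes, view_strides, view_offset)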
//
// Theoretically we could also save Tensor `view` in CopySlices Node, but
// it's far more expensive than what we currently save.
// 1. We cannot afford keeping large tensors alive to recover views only.
// 2. There are inplace checks when Tensors are loaded back to make sure
// they haven't been changed (including size metadata).
// So saving metadata like TensorGeometry/view arguments is much better:
// it is the minimal information needed to recover views, and it also allows
// the user to modify the original Tensor without preventing the backward
// pass from running.
//
// Why do we manually change exec_info in the apply?
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//
// Using the same example as before,
// b = t.clone()
// x = b.select(0, 0)
// x *= y
//
// You can see the visualization at
// https://docs.google.com/drawings/d/1Bx-Hcz-zlIv7PabQqnPhUIVIs9F8WWi48svqMsAUMFs
// which contains the wrapped MulBackward Node and shows what it links to.
// Since a backward can happen between any subset of the inputs (t and y) and
// outputs (o, x, b), it is possible to get into a state where CopySlices's 0th
// next function (CloneBackward) needs gradient but MulBackward's 0th next
// function (SelectBackward) does not. This happens if you do autograd.grad
// between x and t, for example.
// In such a case, we do need to mark SelectBackward as requiring gradient so
// that, during the execution of MulBackward, we actually compute the gradient
// for the 0th input.
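//
// An illustrative frontend sketch of this situation:
//   import torch
//   t = torch.randn(2, 2, requires_grad=True)
//   y = torch.randn(2, requires_grad=True)
//   b = t.clone()
//   x = b.select(0, 0)
//   x *= y
//   # Only the gradient w.r.t. t is requested, so SelectBackward is not
//   # needed for its own sake, but CopySlices marks it as needed so that
//   # MulBackward computes the gradient for its 0th input.
//   (grad_t,) = torch.autograd.grad(x.sum(), t)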
//
// All the other next functions are always shared (this is asserted in the apply
// code) and so nothing needs to be done for them.
// See Note [View + Inplace update for view tensor] for what we do to the view
// tensor when an in-place operation happens.
struct TORCH_API CopySlices : public Node {
CopySlices(
const Variable& base_var,
at::TensorGeometry view_,
std::unique_ptr<ViewFunc> view_fn_,
std::shared_ptr<Node> fn_);
// common code between apply/apply_with_saved
template <typename T>
variable_list apply_impl(variable_list&& inputs, const T& call_fn);
variable_list apply(variable_list&& inputs) override;
void release_variables() override;
void compiled_args(CompiledNodeArgs& args) override;
variable_list apply_with_saved(
const variable_list& inputs,
SwapSavedVariables& saved) override;
at::TensorGeometry base;
// view and view_fn are redundant and view_fn will be used if available.
// See Note [View + Inplace update for base tensor] for details.
at::TensorGeometry view;
std::unique_ptr<ViewFunc> view_fn;
std::shared_ptr<Node> fn;
};
} // namespace autograd
} // namespace torch