docs/stable/rpc/rref.html
Lines changed: 21 additions & 21 deletions
@@ -334,27 +334,27 @@
334
334
<span id="id1"></span><h1>Remote Reference Protocol<a class="headerlink" href="#remote-reference-protocol" title="Permalink to this headline">¶</a></h1>
335
335
<p>This note describes the design details of the Remote Reference protocol and walks
336
336
through message flows in different scenarios. Make sure you’re familiar with the
337
-
<a class="reference internal" href="rpc.html#distributed-rpc-framework"><span class="std std-ref">Distributed RPC Framework</span></a> before proceeding.</p>
337
+
<a class="reference internal" href="../rpc.html#distributed-rpc-framework"><span class="std std-ref">Distributed RPC Framework</span></a> before proceeding.</p>
338
338
<div class="section" id="background">
339
339
<h2>Background<a class="headerlink" href="#background" title="Permalink to this headline">¶</a></h2>
340
340
<p>RRef stands for Remote REFerence. It is a reference to an object which is
341
341
located on the local or a remote worker, and transparently handles reference
342
342
counting under the hood. Conceptually, it can be considered as a distributed
343
343
shared pointer. Applications can create an RRef by calling
344
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>. Each RRef is owned by the callee worker
345
-
of the <a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call (i.e., owner) and can be used
344
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>. Each RRef is owned by the callee worker
345
+
of the <a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call (i.e., owner) and can be used
346
346
by multiple users. The owner stores the real data and keeps track of the global
347
347
reference count. Every RRef can be uniquely identified by a global <code class="docutils literal notranslate"><span class="pre">RRefId</span></code>,
348
348
which is assigned at the time of creation on the caller of the
<p>On the owner worker, there is only one <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> instance, which contains
351
351
the real data, while on user workers, there can be as many <code class="docutils literal notranslate"><span class="pre">UserRRefs</span></code> as
352
352
necessary, and <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> does not hold the data. All usage on the owner will
353
353
retrieve the unique <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> instance using the globally unique <code class="docutils literal notranslate"><span class="pre">RRefId</span></code>.
354
354
A <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> will be created when it is used as an argument or return value in
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
357
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> invocation, and the owner will be notified
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
357
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> invocation, and the owner will be notified
358
358
accordingly to update the reference count. An <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> and its data will be
359
359
deleted when there are no <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> instances globally and there are no
360
360
references to the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> on the owner either.</p>
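<p>The deletion rule above can be sketched as a toy, single-process model. This is only an illustration under stated assumptions: <code class="docutils literal notranslate"><span class="pre">ToyOwner</span></code> and all of its methods are hypothetical names, not part of <code class="docutils literal notranslate"><span class="pre">torch.distributed.rpc</span></code>, and tracking forks in a set (so a repeated notification is harmless) is a modeling choice rather than the actual implementation.</p>

```python
# Toy, single-process model of the reference counting described above.
# ToyOwner and every method name here are hypothetical; this is NOT the
# torch.distributed.rpc implementation.


class ToyOwner:
    """Holds the real data per RRefId and tracks which user forks are alive."""

    def __init__(self):
        self.data = {}        # rref_id -> real value (what the OwnerRRef holds)
        self.user_forks = {}  # rref_id -> set of fork ids held by UserRRefs
        self.local_refs = {}  # rref_id -> number of references on the owner

    def create(self, rref_id, value):
        self.data[rref_id] = value
        self.user_forks[rref_id] = set()
        self.local_refs[rref_id] = 1

    def add_user(self, rref_id, fork_id):
        # A set makes a repeated notification for the same fork a no-op.
        self.user_forks[rref_id].add(fork_id)

    def delete_user(self, rref_id, fork_id):
        if rref_id in self.data:
            self.user_forks[rref_id].discard(fork_id)
            self._maybe_delete(rref_id)

    def release_local(self, rref_id):
        if rref_id in self.data:
            self.local_refs[rref_id] -= 1
            self._maybe_delete(rref_id)

    def _maybe_delete(self, rref_id):
        # Delete only when no UserRRef exists anywhere AND the owner itself
        # holds no reference to the OwnerRRef.
        if not self.user_forks[rref_id] and self.local_refs[rref_id] <= 0:
            del self.data[rref_id]
            del self.user_forks[rref_id]
            del self.local_refs[rref_id]


owner = ToyOwner()
owner.create(100, "real data")
owner.add_user(100, fork_id=1)
owner.add_user(100, fork_id=2)

owner.release_local(100)       # the owner drops its own reference
owner.delete_user(100, 1)      # one UserRRef goes away
alive_with_one_user = 100 in owner.data

owner.delete_user(100, 2)      # the last UserRRef goes away
alive_after_all = 100 in owner.data
```

<p>In this sketch the data survives while any fork is still registered, and is dropped only once both the user forks and the owner-side references are gone.</p>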
@@ -369,9 +369,9 @@ <h2>Assumptions<a class="headerlink" href="#assumptions" title="Permalink to thi
369
369
may take down all workers, revert to the previous checkpoint, and resume
370
370
training.</p></li>
371
371
<li><p><strong>Non-idempotent UDFs</strong>: We assume the user functions (UDF) provided to
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
374
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> are not idempotent and therefore
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
374
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> are not idempotent and therefore
375
375
cannot be retried. However, internal RRef control messages will be made
376
376
idempotent and retryable.</p></li>
377
377
<li><p><strong>Out of Order Message Delivery</strong>: We do not assume message delivery order
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a>, or
400
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and uses its RRef as an argument. In this
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a>, or
400
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and uses its RRef as an argument. In this
401
401
case a new <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> will be created on the user. As the owner is the caller,
402
402
it can easily update its local reference count on the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>.</p>
<p>If Z calls <a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> on the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code>, the
454
+
<p>If Z calls <a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> on the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code>, the
455
455
owner at least knows A when Z is deleted, because otherwise,
456
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> wouldn’t finish. If Z does not call
457
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>, it is possible that the owner
456
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> wouldn’t finish. If Z does not call
457
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>, it is possible that the owner
458
458
receives all messages from Z before any message from A and Y. In this case, as
459
459
the real data of the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> has not been created yet, there is nothing to
460
460
be deleted either. It is the same as if Z did not exist at all. Hence, it’s still
@@ -487,10 +487,10 @@ <h3>User Share RRef with Owner as Return Value<a class="headerlink" href="#user-
487
487
</div>
488
488
<p>In this case, the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> is created on the user worker A, then it is
489
489
passed to the owner worker B together with the remote message, and then B
490
-
creates the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>. The method <a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>
490
+
creates the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>. The method <a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>
491
491
returns immediately, meaning that the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> can be forked/used before
492
492
the owner knows about it.</p>
493
-
<p>On the owner, when receiving the <a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call, it
493
+
<p>On the owner, when receiving the <a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call, it
494
494
will create the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> and return an ACK to acknowledge <code class="docutils literal notranslate"><span class="pre">{100,</span><span class="pre">1}</span></code>
495
495
(<code class="docutils literal notranslate"><span class="pre">RRefId</span></code>, <code class="docutils literal notranslate"><span class="pre">ForkId</span></code>). Only after receiving this ACK can A delete its
496
496
<code class="docutils literal notranslate"><span class="pre">UserRRef</span></code>. This involves both <strong>G1</strong> and <strong>G2</strong>. <strong>G1</strong> is obvious. For
@@ -499,8 +499,8 @@ <h3>User Share RRef with Owner as Return Value<a class="headerlink" href="#user-
<p>The diagram above shows the message flow, where solid arrows contain user
501
501
functions and dashed arrows are builtin messages. Note that the first two messages
502
-
from A to B (<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and
503
-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>) may
502
+
from A to B (<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and
503
+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>) may
504
504
arrive at B in any order, but the final delete message will only be sent out