Commit eeeb43f

author
Jessica Lin
authored
Update references to rpc.html
1 parent 46e5e67 commit eeeb43f

1 file changed

+21
-21
lines changed

docs/stable/rpc/rref.html

Lines changed: 21 additions & 21 deletions
@@ -334,27 +334,27 @@
334334
<span id="id1"></span><h1>Remote Reference Protocol<a class="headerlink" href="#remote-reference-protocol" title="Permalink to this headline"></a></h1>
335335
<p>This note describes the design details of the Remote Reference protocol and walks
336336
through message flows in different scenarios. Make sure you’re familiar with the
337-
<a class="reference internal" href="rpc.html#distributed-rpc-framework"><span class="std std-ref">Distributed RPC Framework</span></a> before proceeding.</p>
337+
<a class="reference internal" href="../rpc.html#distributed-rpc-framework"><span class="std std-ref">Distributed RPC Framework</span></a> before proceeding.</p>
338338
<div class="section" id="background">
339339
<h2>Background<a class="headerlink" href="#background" title="Permalink to this headline"></a></h2>
340340
<p>RRef stands for Remote REFerence. It is a reference to an object which is
341341
located on the local or a remote worker, and transparently handles reference
342342
counting under the hood. Conceptually, it can be considered as a distributed
343343
shared pointer. Applications can create an RRef by calling
344-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>. Each RRef is owned by the callee worker
345-
of the <a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call (i.e., owner) and can be used
344+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>. Each RRef is owned by the callee worker
345+
of the <a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call (i.e., owner) and can be used
346346
by multiple users. The owner stores the real data and keeps track of the global
347347
reference count. Every RRef can be uniquely identified by a global <code class="docutils literal notranslate"><span class="pre">RRefId</span></code>,
348348
which is assigned at the time of creation on the caller of the
349-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call.</p>
349+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call.</p>
350350
<p>On the owner worker, there is only one <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> instance, which contains
351351
the real data, while on user workers, there can be as many <code class="docutils literal notranslate"><span class="pre">UserRRefs</span></code> as
352352
necessary, and <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> does not hold the data. All usage on the owner will
353353
retrieve the unique <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> instance using the globally unique <code class="docutils literal notranslate"><span class="pre">RRefId</span></code>.
354354
A <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> will be created when it is used as an argument or return value in
355-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_sync" title="torch.distributed.rpc.rpc_sync"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_sync()</span></code></a>,
356-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
357-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> invocation, and the owner will be notified
355+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_sync" title="torch.distributed.rpc.rpc_sync"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_sync()</span></code></a>,
356+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
357+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> invocation, and the owner will be notified
358358
accordingly to update the reference count. An <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> and its data will be
359359
deleted when there are no <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> instances globally and there are no
360360
references to the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> on the owner as well.</p>
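
The deletion rule in the Background paragraph above can be sketched as a toy model. This is a minimal illustration in plain Python, not the actual RRef implementation: `ToyOwnerRRef`, `add_fork`, `del_fork`, and `drop_local_ref` are hypothetical names that do not exist in `torch.distributed.rpc`, and the sketch assumes the owner simply keeps a set of fork ids for the `UserRRef` instances it has been told about.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOwnerRRef:
    """Toy stand-in for an OwnerRRef (hypothetical, not the PyTorch class)."""

    rref_id: int
    data: object
    forks: set = field(default_factory=set)  # fork ids of known UserRRefs
    local_refs: int = 1                      # references held on the owner
    deleted: bool = False

    def add_fork(self, fork_id: int) -> None:
        # The owner is notified that a UserRRef was created somewhere.
        self.forks.add(fork_id)

    def del_fork(self, fork_id: int) -> None:
        # A user reports that its UserRRef is gone. discard() makes a
        # repeated notification for the same fork id harmless.
        self.forks.discard(fork_id)
        self._maybe_delete()

    def drop_local_ref(self) -> None:
        # The owner releases one of its own references.
        self.local_refs -= 1
        self._maybe_delete()

    def _maybe_delete(self) -> None:
        # Free the data only when no user forks and no local references remain.
        if not self.forks and self.local_refs == 0:
            self.data = None
            self.deleted = True


owner = ToyOwnerRRef(rref_id=100, data=[1, 2, 3])
owner.add_fork(1)
owner.add_fork(2)
owner.drop_local_ref()   # forks 1 and 2 still keep the data alive
owner.del_fork(1)
owner.del_fork(2)        # last reference gone, data is released
```

In this sketch the data survives as long as either a user fork or a local reference exists, which mirrors the condition stated in the paragraph; the real protocol additionally has to cope with out-of-order delivery, which the toy model ignores.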
@@ -369,9 +369,9 @@ <h2>Assumptions<a class="headerlink" href="#assumptions" title="Permalink to thi
369369
may take down all workers, revert to the previous checkpoint, and resume
370370
training.</p></li>
371371
<li><p><strong>Non-idempotent UDFs</strong>: We assume the user functions (UDF) provided to
372-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_sync" title="torch.distributed.rpc.rpc_sync"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_sync()</span></code></a>,
373-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
374-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> are not idempotent and therefore
372+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_sync" title="torch.distributed.rpc.rpc_sync"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_sync()</span></code></a>,
373+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a> or
374+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> are not idempotent and therefore
375375
cannot be retried. However, internal RRef control messages will be made
376376
idempotent and retryable.</p></li>
377377
<li><p><strong>Out of Order Message Delivery</strong>: We do not assume message delivery order
@@ -395,9 +395,9 @@ <h3>Design Reasoning<a class="headerlink" href="#design-reasoning" title="Permal
395395
<li><p>Creating a new <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> owned by another worker.</p></li>
396396
</ol>
397397
<p>Case 1 is the simplest: the owner passes its RRef to a user, i.e., the
398-
owner calls <a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_sync" title="torch.distributed.rpc.rpc_sync"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_sync()</span></code></a>,
399-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a>, or
400-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and uses its RRef as an argument. In this
398+
owner calls <a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_sync" title="torch.distributed.rpc.rpc_sync"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_sync()</span></code></a>,
399+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.rpc_async" title="torch.distributed.rpc.rpc_async"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rpc_async()</span></code></a>, or
400+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and uses its RRef as an argument. In this
401401
case a new <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> will be created on the user. As the owner is the caller,
402402
it can easily update its local reference count on the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>.</p>
403403
<p>The only requirement is that any
@@ -451,10 +451,10 @@ <h3>Design Reasoning<a class="headerlink" href="#design-reasoning" title="Permal
451451
<span class="n">A</span> <span class="o">-&gt;</span> <span class="n">Y</span> <span class="o">-&gt;</span> <span class="n">Z</span>
452452
</pre></div>
453453
</div>
454-
<p>If Z calls <a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> on the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code>, the
454+
<p>If Z calls <a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> on the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code>, the
455455
owner at least knows A when Z is deleted, because otherwise,
456-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> wouldn’t finish. If Z does not call
457-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>, it is possible that the owner
456+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a> wouldn’t finish. If Z does not call
457+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>, it is possible that the owner
458458
receives all messages from Z before any message from A and Y. In this case, as
459459
the real data of the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code> has not been created yet, there is nothing to
460460
be deleted either. It is the same as if Z did not exist at all. Hence, it’s still
@@ -487,10 +487,10 @@ <h3>User Share RRef with Owner as Return Value<a class="headerlink" href="#user-
487487
</div>
488488
<p>In this case, the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> is created on the user worker A, then it is
489489
passed to the owner worker B together with the remote message, and then B
490-
creates the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>. The method <a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>
490+
creates the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>. The method <a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a>
491491
returns immediately, meaning that the <code class="docutils literal notranslate"><span class="pre">UserRRef</span></code> can be forked/used before
492492
the owner knows about it.</p>
493-
<p>On the owner, when receiving the <a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call, it
493+
<p>On the owner, when receiving the <a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> call, it
494494
will create the <code class="docutils literal notranslate"><span class="pre">OwnerRRef</span></code>, and return an ACK to acknowledge <code class="docutils literal notranslate"><span class="pre">{100,</span> <span class="pre">1}</span></code>
495495
(<code class="docutils literal notranslate"><span class="pre">RRefId</span></code>, <code class="docutils literal notranslate"><span class="pre">ForkId</span></code>). Only after receiving this ACK can A delete its
496496
<code class="docutils literal notranslate"><span class="pre">UserRRef</span></code>. This involves both <strong>G1</strong> and <strong>G2</strong>. <strong>G1</strong> is obvious. For
@@ -499,8 +499,8 @@ <h3>User Share RRef with Owner as Return Value<a class="headerlink" href="#user-
499499
<a class="reference internal image-reference" href="https://user-images.githubusercontent.com/16999635/69164772-98181300-0abe-11ea-93a7-9ad9f757cd94.png"><img alt="user_to_owner_ret.png" src="https://user-images.githubusercontent.com/16999635/69164772-98181300-0abe-11ea-93a7-9ad9f757cd94.png" style="width: 500px;" /></a>
500500
<p>The diagram above shows the message flow, where solid arrows contain user
501501
functions and dashed arrows are builtin messages. Note that the first two messages
502-
from A to B (<a class="reference internal" href="rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and
503-
<a class="reference internal" href="rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>) may
502+
from A to B (<a class="reference internal" href="../rpc.html#torch.distributed.rpc.remote" title="torch.distributed.rpc.remote"><code class="xref py py-meth docutils literal notranslate"><span class="pre">remote()</span></code></a> and
503+
<a class="reference internal" href="../rpc.html#torch.distributed.rpc.RRef.to_here" title="torch.distributed.rpc.RRef.to_here"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_here()</span></code></a>) may
504504
arrive at B in any order, but the final delete message will only be sent out
505505
when:</p>
506506
<ul class="simple">
@@ -894,4 +894,4 @@ <h2>Resources</h2>
894894
})
895895
</script>
896896
</body>
897-
</html>
897+
</html>
