Skip to content

Avoid creating improper regions after loop versioner #13243

@0xdaryl

Description

@0xdaryl

The following design was implemented in loop versioner many years ago, but never enabled. There are situations where the presence of improper regions is inhibiting some other optimizations (such as global value propagation) from finding opportunities. This design should prevent that.

The goal of this issue is to find, modernize, test, and enable this logic to ensure loop versioner does not create improper regions.

@a7ehuo @vijaysun-omr


Sometimes, loop versioner performs 'loop transfer', where control flow is changed such that if a virtual guard fails in the 'fast' version of the loop, the virtual call in the 'slow' version will be reached. This this has the benefit that the 'fast' version of the loop does not contain the virtual call anymore, meaning more aggressive optimization is possible (more expressions are loop invariant for example). However this transformation almost always leads to an improper region being created, since control is transferred to a point in the 'slow' loop that is not the entry of the region (the virtual call block is a successor of the virtual guard block, which is also in the 'fast' loop).

One solution for this problem would be to create an 'artificial' entry point in the 'slow' version of the loop, where all the failing virtual guards branch to. All the failing virtual guards store a different value into a temp on the failing path, and there is a switch in the 'artificial' entry that checks the value of the temp to decide where we came from, and then it branches to the appropriate virtual call block within

High-Level Design

Previously, loop-transfer used to redirect control (on the failing path) from the virtual guard in the loop directly to the virtual call block inside the cold version of the loop. This essentially created two entry points to the cold loop leading to the improper region. The failing paths of each virtual guard now store a unique value into a temporary and instead jump to a selector at the entry of the cold version of the loop. The selector is basically a switch statement which redirects control to the appropriate virtual call block in the cold version of the loop.

For example,

Before loop transfer:

             hot loop                                  cold loop (cloned of hot)
           -----------                                -------------------------
          .....                                           ..... 
         if (virtual-guard)                              if (virtual-guard)
(success)   /            \  (fail)                  (fail)  /             \  (success)
           /              \                                /               \
     inlined code      virtual call                  virtual call       inlined code

After loop transfer:

             hot loop                                  cold loop (cloned of hot)
           -----------                                -------------------------
          .....                                           ..... 
         if (virtual-guard)                              if (virtual-guard)
(success)   /            \                          (fail)  /             \  (success)
           /              \                                /               \
     inlined code          ------------------->       virtual call       inlined code

This has now created two entry points to the cold loop leading to an improper region.

New loop-transfer:

                                                     L1: (new entry)
                                                         switch(t)
                                                        default --> goto start 
                                                              1 --> goto L2

             hot loop                                  cold loop (cloned of hot)
           -----------                                -------------------------

          .....                                           ..... 
         if (virtual-guard)                              if (virtual-guard)
(success)   /            \  (fail)                  (fail)  /             \  (success)
           /              \                                /               \
     inlined code      t = 1                    L2:  virtual call       inlined code
                       goto L1  

Multiply nested loops pose a certain difficulty for this implementation. Since loop versioner processes loops outer-first, this requires some extra book-keeping to get the flow of control right:

  • when an inner loop contains a vguard that is going to undergo loop transfer, the target of the hot vguard is changed to branch to the switch created at the head of the outer loop. but the switch now jumps directly into the middle of the inner loop. to prevent this, the switch has to jump to the entry of the inner loop. The entry has another selector to transfer control appropriately. some book-keeping is first required to determine which inner loop the control has to be transferred to. once that is done, the right vguard target to jump to in the inner loop has to be determined. as a result, two temporaries are required - one to hold the value of the loop to jump to & the other to figure out the vguard target within the loop.
                                                     L1: (new cold outer entry)
                                                         switch(t1)
                                                        default --> goto start 
                                                              1 --> goto L2
                                       
        hot outer loop                                  cold outer loop
       ---------------                                 ----------------  
                                                     L2: (new cold inner entry)
                                                         switch(t2)
                                                        default --> goto start 
                                                              1 --> goto L3
                                                      
         hot-inner                                        cold-inner
         ---------                                        ----------

       if (virtual-guard)                              if (virtual-guard)
(success) /             \ (fail)                   (fail) /             \  (success)
         /               \                               /               \
      inlined code       t1 = 1                  L3:  virtual call        inlined code
                         t2 = 1 
                         goto L1

Internals Description

Here is a pseudocode of the algorithm.

VirtualGuardInfo is used to record info about a loop; VirtualGuardPair records the info about virtual guards in the loop.

struct VirtualGuardInfo : public TR_Link<VirtualGuardInfo>
   {
   VirtualGuardInfo(TR_Compilation *c)
      :_virtualGuardPairs(c->trMemory())
      {

      _virtualGuardPairs.deleteAll();
      _coldLoopEntryBlock = NULL;
      _coldLoopInvariantBlock = NULL;
      _loopEntry = NULL;
      }
   List<VirtualGuardPair> _virtualGuardPairs;
   TR_Block *_coldLoopEntryBlock;
   TR_Block *_coldLoopInvariantBlock;
   TR_Block *_loopEntry;
   };

versionNaturalLoop() performs the cloning of the loop. The implementation uses this routine to perform the book-keeping of the virtual guards so that fixup can be done at the end of the opt.

if (lastTree->getOpCode().isIf() &&
    blocksInWhileLoop.find(lastTree->getBranchDestination()->getNode()->getBlock()) &&
          lastTree->isTheVirtualGuardForAGuardedInlinedCall())
         {
         // create the virtual guard info for this loop
         //
         if (!vgInfo)
            {
            vgInfo = new (trStackMemory()) VirtualGuardInfo(comp());
            vgInfo->_loopEntry = blockAtHeadOfLoop;                     // record entry of the hot loop
            _virtualGuardInfo.add(vgInfo);
            }

         VirtualGuardPair *virtualGuardPair = (VirtualGuardPair *) trMemory()->allocateStackMemory(sizeof(VirtualGuardPair));
         virtualGuardPair->_hotGuardBlock = nextBlock;
         virtualGuardPair->_coldGuardBlock = nextClonedBlock;
         virtualGuardPair->_isGuarded = false;
         
         // check if the virtual guard is in an inner loop
         //
         TR_RegionStructure *parent = nextBlock->getStructureOf()->getParent()->asRegion();
         if ((parent != whileLoop) &&
               (parent && parent->isNaturalLoop()))
            {
            TR_Block *entry = parent->getEntryBlock();
            virtualGuardPair->_coldGuardLoopEntryBlock = entry;         //record entry of the inner loop
            virtualGuardPair->_isInsideInnerLoop = true;
            }
         else
            virtualGuardPair->_isInsideInnerLoop = false;

         vgInfo->_virtualGuardPairs.add(virtualGuardPair);
         ///_virtualGuardPair.add(virtualGuardPair);
         }

The routine performLoopTransfer() performs the actual loop-transfer; in turn calls fixupVirtualGuardTargets() to perform the fixup (creation of switch blocks etc.)

fixupVirtualGuardTargets(): 

for every vgInfo 
   {
   for (each vgPair) 
      { 
      if (vpPair->_isGuarded) 
         {
         if (vgPair->_isInnerLoop)
            record distinctTargets[_coldGuardLoopEntry->getNumber]++;  // used to build the inner switch block
         else
            caseKonst++;
         } 
      }

   create loopTemp symref (for the outer loop)

   numDistinct = 0;
   for (i = 0 to numNodesInCFG)
      if distinctTargets[i] != 0
         invariantBlock = locate the invariant block for _coldGuardLoopEntry
         create vgNumTemp = inner temp for the inner switch block
         store  vgNumTemp = distinctTargets[i] - vgNumTemp  // default value of the inner loop 
         invariantBlock->prepend(vgNumTemp)
         create the inner switchblock at _coldGuardLoopEntry
         numDistinct++

   create the outer switchblock at _coldLoopEntry with caseKonst+numDistinct children
   store the default value of loopTemp in the outer cold loop invariant block

   int32_t childNum = 0;
   for (each vgPair)
      {
      if (vgPair->_isGuarded)
         {
         gotoBlock = create the gotoblock to store values into the temps
         
         // fixup the targets of the inner switch block         
         if (vgPair->_isInnerLoop)
            {
            loopid = vg->_coldGuardLoopEntry->getNumber
            store loopid in loopTemp
            vgnum = distinctTargets[loopid]--
            fixup case target in the inner switch block child: (switch->getNumChildren-vgnum)
            setcaseconstant->distinctTargets[loopid] 
            setbranchdest->vgPair->_coldGuardBlock->getDest
            store vgnum in vgNumTemp
            gotoBlock->prepend(vgNumTemp)
            // reset the flag in the fail path of the virtual
            // guard in the inner cold loop
            }
         else
            {
            store childNum++ into loopTemp
            }
         gotoBlock->prepend(loopTemp)
         fixup case target in _coldLoopEntry switchblock childNum+1
         }
      }
   }

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions