-
Notifications
You must be signed in to change notification settings - Fork 775
Description
The following design was implemented in loop versioner many years ago, but never enabled. There are situations where the presence of improper regions is inhibiting some other optimizations (such as global value propagation) from finding opportunities. This design should prevent that.
The goal of this issue is to find, modernize, test, and enable this logic to ensure loop versioner does not create improper regions.
Sometimes, loop versioner performs 'loop transfer', where control flow is changed such that if a virtual guard fails in the 'fast' version of the loop, the virtual call in the 'slow' version will be reached. This this has the benefit that the 'fast' version of the loop does not contain the virtual call anymore, meaning more aggressive optimization is possible (more expressions are loop invariant for example). However this transformation almost always leads to an improper region being created, since control is transferred to a point in the 'slow' loop that is not the entry of the region (the virtual call block is a successor of the virtual guard block, which is also in the 'fast' loop).
One solution for this problem would be to create an 'artificial' entry point in the 'slow' version of the loop, where all the failing virtual guards branch to. All the failing virtual guards store a different value into a temp on the failing path, and there is a switch in the 'artificial' entry that checks the value of the temp to decide where we came from, and then it branches to the appropriate virtual call block within
High-Level Design
Previously, loop-transfer used to redirect control (on the failing path) from the virtual guard in the loop directly to the virtual call block inside the cold version of the loop. This essentially created two entry points to the cold loop leading to the improper region. The failing paths of each virtual guard now store a unique value into a temporary and instead jump to a selector at the entry of the cold version of the loop. The selector is basically a switch statement which redirects control to the appropriate virtual call block in the cold version of the loop.
For example,
Before loop transfer:
hot loop cold loop (cloned of hot)
----------- -------------------------
..... .....
if (virtual-guard) if (virtual-guard)
(success) / \ (fail) (fail) / \ (success)
/ \ / \
inlined code virtual call virtual call inlined code
After loop transfer:
hot loop cold loop (cloned of hot)
----------- -------------------------
..... .....
if (virtual-guard) if (virtual-guard)
(success) / \ (fail) / \ (success)
/ \ / \
inlined code -------------------> virtual call inlined code
This has now created two entry points to the cold loop leading to an improper region.
New loop-transfer:
L1: (new entry)
switch(t)
default --> goto start
1 --> goto L2
hot loop cold loop (cloned of hot)
----------- -------------------------
..... .....
if (virtual-guard) if (virtual-guard)
(success) / \ (fail) (fail) / \ (success)
/ \ / \
inlined code t = 1 L2: virtual call inlined code
goto L1
Multiply nested loops pose a certain difficulty for this implementation. Since loop versioner processes loops outer-first, this requires some extra book-keeping to get the flow of control right:
- when an inner loop contains a vguard that is going to undergo loop transfer, the target of the hot vguard is changed to branch to the switch created at the head of the outer loop. but the switch now jumps directly into the middle of the inner loop. to prevent this, the switch has to jump to the entry of the inner loop. The entry has another selector to transfer control appropriately. some book-keeping is first required to determine which inner loop the control has to be transferred to. once that is done, the right vguard target to jump to in the inner loop has to be determined. as a result, two temporaries are required - one to hold the value of the loop to jump to & the other to figure out the vguard target within the loop.
L1: (new cold outer entry)
switch(t1)
default --> goto start
1 --> goto L2
hot outer loop cold outer loop
--------------- ----------------
L2: (new cold inner entry)
switch(t2)
default --> goto start
1 --> goto L3
hot-inner cold-inner
--------- ----------
if (virtual-guard) if (virtual-guard)
(success) / \ (fail) (fail) / \ (success)
/ \ / \
inlined code t1 = 1 L3: virtual call inlined code
t2 = 1
goto L1
Internals Description
Here is a pseudocode of the algorithm.
VirtualGuardInfo is used to record info about a loop; VirtualGuardPair records the info about virtual guards in the loop.
struct VirtualGuardInfo : public TR_Link<VirtualGuardInfo>
{
VirtualGuardInfo(TR_Compilation *c)
:_virtualGuardPairs(c->trMemory())
{
_virtualGuardPairs.deleteAll();
_coldLoopEntryBlock = NULL;
_coldLoopInvariantBlock = NULL;
_loopEntry = NULL;
}
List<VirtualGuardPair> _virtualGuardPairs;
TR_Block *_coldLoopEntryBlock;
TR_Block *_coldLoopInvariantBlock;
TR_Block *_loopEntry;
};
versionNaturalLoop() performs the cloning of the loop. The implementation uses this routine to perform the book-keeping of the virtual guards so that fixup can be done at the end of the opt.
if (lastTree->getOpCode().isIf() &&
blocksInWhileLoop.find(lastTree->getBranchDestination()->getNode()->getBlock()) &&
lastTree->isTheVirtualGuardForAGuardedInlinedCall())
{
// create the virtual guard info for this loop
//
if (!vgInfo)
{
vgInfo = new (trStackMemory()) VirtualGuardInfo(comp());
vgInfo->_loopEntry = blockAtHeadOfLoop; // record entry of the hot loop
_virtualGuardInfo.add(vgInfo);
}
VirtualGuardPair *virtualGuardPair = (VirtualGuardPair *) trMemory()->allocateStackMemory(sizeof(VirtualGuardPair));
virtualGuardPair->_hotGuardBlock = nextBlock;
virtualGuardPair->_coldGuardBlock = nextClonedBlock;
virtualGuardPair->_isGuarded = false;
// check if the virtual guard is in an inner loop
//
TR_RegionStructure *parent = nextBlock->getStructureOf()->getParent()->asRegion();
if ((parent != whileLoop) &&
(parent && parent->isNaturalLoop()))
{
TR_Block *entry = parent->getEntryBlock();
virtualGuardPair->_coldGuardLoopEntryBlock = entry; //record entry of the inner loop
virtualGuardPair->_isInsideInnerLoop = true;
}
else
virtualGuardPair->_isInsideInnerLoop = false;
vgInfo->_virtualGuardPairs.add(virtualGuardPair);
///_virtualGuardPair.add(virtualGuardPair);
}
The routine performLoopTransfer() performs the actual loop-transfer; in turn calls fixupVirtualGuardTargets() to perform the fixup (creation of switch blocks etc.)
fixupVirtualGuardTargets():
for every vgInfo
{
for (each vgPair)
{
if (vpPair->_isGuarded)
{
if (vgPair->_isInnerLoop)
record distinctTargets[_coldGuardLoopEntry->getNumber]++; // used to build the inner switch block
else
caseKonst++;
}
}
create loopTemp symref (for the outer loop)
numDistinct = 0;
for (i = 0 to numNodesInCFG)
if distinctTargets[i] != 0
invariantBlock = locate the invariant block for _coldGuardLoopEntry
create vgNumTemp = inner temp for the inner switch block
store vgNumTemp = distinctTargets[i] - vgNumTemp // default value of the inner loop
invariantBlock->prepend(vgNumTemp)
create the inner switchblock at _coldGuardLoopEntry
numDistinct++
create the outer switchblock at _coldLoopEntry with caseKonst+numDistinct children
store the default value of loopTemp in the outer cold loop invariant block
int32_t childNum = 0;
for (each vgPair)
{
if (vgPair->_isGuarded)
{
gotoBlock = create the gotoblock to store values into the temps
// fixup the targets of the inner switch block
if (vgPair->_isInnerLoop)
{
loopid = vg->_coldGuardLoopEntry->getNumber
store loopid in loopTemp
vgnum = distinctTargets[loopid]--
fixup case target in the inner switch block child: (switch->getNumChildren-vgnum)
setcaseconstant->distinctTargets[loopid]
setbranchdest->vgPair->_coldGuardBlock->getDest
store vgnum in vgNumTemp
gotoBlock->prepend(vgNumTemp)
// reset the flag in the fail path of the virtual
// guard in the inner cold loop
}
else
{
store childNum++ into loopTemp
}
gotoBlock->prepend(loopTemp)
fixup case target in _coldLoopEntry switchblock childNum+1
}
}
}