- How To Generate A Log File
- Bytecodes And Trees
- Optimization: globalValuePropagation
- Optimization: localCSE
- Optimization: basicBlockExtension
- Optimization: tacticalGlobalRegisterAllocator
- Code Generation
To collect a compilation log file, specify the -Xjit
option with {methodSignature}(traceFull,log=logFileName)
.
The log file is suffixed with a process ID to avoid the previous log files being overwritten by the new ones, for
example JITlog.16996.30451.20210824.114051.16996
. The log file name can also include a full path, and if omitted the
trace log will be generated in the present working directory.
-Xjit:{java/util/HashMap.putVal(ILjava/lang/Object;Ljava/lang/Object;ZZ)Ljava/lang/Object;}(traceFull,log=JITlog)
Wildcard expressions can be used to log multiple methods. For example, {java/util/hashMap.*}
will log every single
method that is compiled in java/util/hashMap
.
traceFull
turns on all important trace types (i.e. traceBC
, traceTrees
, traceCG
, traceOptTrees
, and optDetails
).
Tracing options are defined in OMROptions.cpp.
<!--
MULTIPLE LOG FILES MAY EXIST
Please check for ADDITIONAL log files named: /root/bumblebench/JITlog.1 /root/bumblebench/JITlog.2 /root/bumblebench/JITlog.3 /root/bumblebench/JITlog.4 /root/bumblebench/JITlog.5 /root/bumblebench/JITlog.6
-->
The above message at the beginning of a log file shows the possible logs that could be generated. The example names in the log file are not suffixed with process IDs. A method could be compiled multiple times. There are multiple compilation threads. Each compilation could be done by different compilation threads. There could be multiple log files generated for one method. Each compilation thread writes to its own log file. The max number of logs files is the max number of compilation threads.
-->
<compile
method="java/util/HashMap.putVal(ILjava/lang/Object;Ljava/lang/Object;ZZ)Ljava/lang/Object;"
hotness="cold"
isProfilingCompile=0>
</compile>
=======>java/util/HashMap.putVal(ILjava/lang/Object;Ljava/lang/Object;ZZ)Ljava/lang/Object;
...
This method is cold
...
There could be multiple methods in one log file. Each compilation starts with =======>
. =======>
can be used as an
eye catcher to search other compilations in the same log file. Or <compile
XML tag can be used as an eye catcher.
Another eye catcher is “This method is”
that indicates the hotness level.
+------------- Byte Code Index
| +-------------------- OpCode
| | +------------- First Field
| | | +------------- Branch Target
| | | | +------- Const Pool Index
| | | | | +------------- Constant
| | | | | |
V V V V V V
0, JBaload0getfield
1, JBgetfield 38
4, JBdup
5, JBastore 6
7, JBifnull 12, 19,
10, JBaload 6
12, JBarraylength
13, JBdup
14, JBistore 8
16, JBifne 13, 29,
JIT takes bytecodes as an input and generates IL trees. The first section is the bytecodes that are received from VM.
- Bytecode
#1
getfield
shows the constant pool index38
of the field. - Bytecode
#5
astore
stores the address to the stack slot number6
. - Bytecode
#7
ifnull
tests if the value isNULL
. If it is, it branches to bytecode#19
.
The IL generator generated the following IL which showed up in the section "Pre IlGenOpt Trees"
.
The IL consists of a doubly linked list of trees of nodes (TR::Nodes).
n284n BBStart <block_32> [0x7eff0a557880] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n288n ificmpne --> block_2 BBStart at n29n () [0x7eff0a5579c0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x20
n286n iload <count-for-recompile>[#320 Static] [flags 0x2000303 0x40 ] [0x7eff0a557920] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n287n iconst 1 [0x7eff0a557970] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n285n BBEnd </block_32> ===== [0x7eff0a5578d0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n289n BBStart <block_33> (freq 0) (cold) [0x7eff0a557a10] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
...
n284n
: nodeID, or the GlobalIndex of an TR::Node that uniquely identifies a node. The node's global index can sometimes be used as an index into a bit vector or an array to get some piece of information associated with that specific node. i.e. it serves a purpose that cannot be served using just the node's address.0x7eff0a557880
: The address of the C++ TR::Node object.BBStart
,ificmpne
,BBEnd
, and other treeTops inblock_32
are linked in order.n284n BBStart
has no children in this example.n288n ificmpne
has two children. It comparesn286n
withn287n
. If they are not equal, it branches toblock_2
, otherwise it will fall through toblock_33
.- We can find the IL opcode properties in OMROpcodes.enum to figure out the number of children an IL opcode has and the types of the children.
index: node global index
bci=[x,y,z]: byte-code-info [callee-index, bytecode-index, line-number]
rc: reference count
vc: visit count
vn: value number
li: local index
udi: use/def index
nc: number of children
addr: address size in bytes
flg: node flags
eg: bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=1
At the end of "Pre IlGenOpt Trees"
, there is a description on the abbreviation format. bci
shows which method the
node belongs to, what the bytecode it is, and which line number it is in the original program. For the method being compiled,
the callee index is always -1
. For inlining methods, the callee-index is a non-negative number.
n284n BBStart <block_32> [0x7eff0a557880] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n288n ificmpne --> block_2 BBStart at n29n () [0x7eff0a5579c0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x20
n286n iload <count-for-recompile>[#320 Static] [flags 0x2000303 0x40 ] [0x7eff0a557920] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n287n iconst 1 [0x7eff0a557970] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n285n BBEnd </block_32> ===== [0x7eff0a5578d0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n289n BBStart <block_33> (freq 0) (cold) [0x7eff0a557a10] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n294n istore <recompilation-counter>[#304 Static] [flags 0x1000303 0x40 ] [0x7eff0a557ba0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n293n iadd [0x7eff0a557b50] bci=[-1,0,628] rc=2 vc=0 vn=- li=- udi=- nc=2
n292n iload <recompilation-counter>[#304 Static] [flags 0x1000303 0x40 ] [0x7eff0a557b00] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n291n iconst -1 [0x7eff0a557ab0] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n296n ificmpgt --> block_2 BBStart at n29n () [0x7eff0a557c40] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x20
n293n ==>iadd
n295n iconst 0 [0x7eff0a557bf0] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n290n BBEnd </block_33> (cold) ===== [0x7eff0a557a60] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n297n BBStart <block_34> (freq 0) (cold) [0x7eff0a557c90] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n300n istore <recompilation-counter>[#304 Static] [flags 0x1000303 0x40 ] [0x7eff0a557d80] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n299n iconst 0x7fffffff [0x7eff0a557d30] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n302n bstore <gcr-patch-point>[#321 Static] [flags 0x400301 0x40 ] [0x7eff0a557e20] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n301n bconst 2 [0x7eff0a557dd0] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n307n treetop [0x7eff0a557fb0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n303n icall jitRetranslateCallerWithPreparation[#71 helper Method] [flags 0x400 0x0 ] () [0x7eff0a557e70] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=3 flg=0x20
n304n iconst 0x80000 [0x7eff0a557ec0] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n305n loadaddr <start-PC>[#323 Static] [flags 0x4000303 0x40 ] [0x7eff0a557f10] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n306n loadaddr <J9Method>[#324 Static] [flags 0x8000307 0x40 ] [0x7eff0a557f60] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n298n BBEnd </block_34> (cold) ===== [0x7eff0a557ce0] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n29n BBStart <block_2> [0x7eff0a4c7740] bci=[-1,0,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n34n compressedRefs [0x7eff0a4c78d0] bci=[-1,1,628] rc=0 vc=0 vn=- li=- udi=- nc=2
n32n aloadi java/util/HashMap.table [Ljava/util/HashMap$Node;[#372 Shadow +16] [flags 0x607 0x0 ] [0x7eff0a4c7830] bci=[-1,1,628] rc=3 vc=0 vn=- li=- udi=- nc=1
n31n aload <'this' parm Ljava/util/HashMap;>[#366 Parm] [flags 0x40000107 0x0 ] (X!=0 sharedMemory ) [0x7eff0a4c77e0] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4
n33n lconst 0 [0x7eff0a4c7880] bci=[-1,1,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n35n astore <auto slot 6>[#373 Auto] [flags 0x7 0x0 ] [0x7eff0a4c7920] bci=[-1,5,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n32n ==>aloadi
n39n ifacmpeq --> block_4 BBStart at n1n () [0x7eff0a4c7a60] bci=[-1,7,628] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x20
n32n ==>aloadi
n36n aconst NULL [0x7eff0a4c7970] bci=[-1,7,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n30n BBEnd </block_2> =====
The above example is at the beginning of "Pre IlGenOpt Trees"
in a cold compilation. It shows the trees JIT
instrumented to facilitate the recompilation. Trees in block_32
, block_33
, and block_34
are prefixed by the
Control infrastructure to the method to allow us to recompile the method once JVM startup is done. Because they are
fabricated, we put bytecode index 0
on these nodes.
The static count-for-recompile
variable is created by JIT at cold for each compiled method body. It keeps track
of how many times we are executing the method. After the method is executed multiple times after the startup time
is done, we will decide whether or not to recompile it.
block_32
checks if the count-for-recompile
is not equal to 1
. If it is not equal to 1
, it branches to block_2
.
block 2
is where the normal bytecode for the method starts. Otherwise, if it equals to 1
, it decrements recompilation-counter
in block_33
. If the recompilation-counter
is greater than 0
, it goes to execute the normal bytecode in block_2
.
If it is 0
, it goes to call the helper call jitRetranslateCallerWithPreparation
to recompile the method.
+------------- Byte Code Index
| +-------------------- OpCode
| | +------------- First Field
| | | +------------- Branch Target
| | | | +------- Const Pool Index
| | | | | +------------- Constant
| | | | | |
V V V V V V
1, JBgetfield 38
...
...
n34n compressedRefs [0x7eff0a4c78d0] bci=[-1,1,628] rc=0 vc=0 vn=- li=- udi=- nc=2
n32n aloadi java/util/HashMap.table [Ljava/util/HashMap$Node;[#372 Shadow +16] [flags 0x607 0x0 ] [0x7eff0a4c7830] bci=[-1,1,628] rc=3 vc=0 vn=- li=- udi=- nc=1
n31n aload <'this' parm Ljava/util/HashMap;>[#366 Parm] [flags 0x40000107 0x0 ] (X!=0 sharedMemory ) [0x7eff0a4c77e0] bci=[-1,0,628] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4
n33n lconst 0 [0x7eff0a4c7880] bci=[-1,1,628] rc=1 vc=0 vn=- li=- udi=- nc=0
Bytecode #1
getfield
is translated into IL n32n aloadi
to load an address. "i"
is a suffix meaning it is an
indirect load. Indirect load means we don’t have an address to map to the field directly. We need another object to
load off from. The address load loads something pointing to an object or a reference field.
n32n aloadi
is an indirect load of java/util/HashMap.table
. Its type is an array: [Ljava/util/HashMap$Node;
. The
second child of n32n aloadi
is the base object (n31n aload
). It indirectly loads field java/util/HashMap.table
off an object of Ljava/util/HashMap;
.
The symbol reference in n32n aloadi
tell us what we are loading. The symbol reference number is #372
. Shadow
is
an indirect access. It is not static, or local variable. It indicates that another pointer is required to complete the
memory access. The field offset is 16
within Ljava/util/HashMap object.
The child node (n31n aload
) is evaluated first, and it is put into a register. Then offset 16
is added to it to get
the location of the field to do a memory access at the address.
n34n compressedRefs
node indicates a field access or an array element access here requires compression/decompression
to be done. The second child (n33n lconst 0
) of the compressedRefs
is the start of the heap. It supports the case
where the heap base is non-zero value. Field access is stored in a compressed 4 bytes form. For a load, it requires
decompression. For a store, it requires compression before storing into the field.
Nodes in a tree are indented according to their depth in the tree. Indentation level indicates the parent-child relationship.
n34n compressedRefs
has two children. The first child n32n aloadi
has its own child n31n aload
which is indented again.
n33n lconst
is the second child of n34n compressedRefs
.
+------------- Byte Code Index
| +-------------------- OpCode
| | +------------- First Field
| | | +------------- Branch Target
| | | | +------- Const Pool Index
| | | | | +------------- Constant
| | | | | |
V V V V V V
5, JBastore 6
7, JBifnull 12, 19,
...
...
n35n astore <auto slot 6>[#373 Auto] [flags 0x7 0x0 ] [0x7eff0a4c7920] bci=[-1,5,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n32n ==>aloadi
n39n ifacmpeq --> block_4 BBStart at n1n () [0x7eff0a4c7a60] bci=[-1,7,628] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x20
n32n ==>aloadi
n36n aconst NULL
n35n astore
stores the value into auto
slot 6
corresponding to the symbol table reference #373
.
Some symbol reference numbers are reserved for known methods such as various helpers: #71 jitRetranslateCallerWithPreparation
,
#323 start-PC
, etc. For internal known symbols, we generate these symbol references in advance.
auto
can be substitute as local variable. They are variables allocated on the stack most of the time. Sometimes they
correspond to local variables that exist in the program, or they could be temporary variables that JIT introduces.
flags
are properties of a node, for example nodeIsNonNull
.
n32n ==>aloadi
is the child of n35n astore
. It is the rhs value that is being stored. ==>
represents commoning,
i.e. the value of the expression has already been computed. It is a back reference to a note whose value has already
been computed at an earlier point in the (extended) basic block. The node global index (n32n
in this case) can be
used to search for the first reference of this node, which is n32n aloadi java/util/HashMap.table
.
This store consumes the value that is evaluated before. We are not doing load again
here for n35n astore
. We already did the load before. We are going to reuse the register value from the previous
computation. n32n
is also reused by n39n ifacmpeq
for NULL
comparison. If it is equal to NULL
, it branches to
block_4
. If it is not equal, it falls through to the next block.
+------------- Byte Code Index
| +-------------------- OpCode
| | +------------- First Field
| | | +------------- Branch Target
| | | | +------- Const Pool Index
| | | | | +------------- Constant
| | | | | |
V V V V V V
10, JBaload 6
12, JBarraylength
13, JBdup
14, JBistore 8
16, JBifne 13, 29,
...
...
n37n BBStart <block_3> [0x7eff0a4c79c0] bci=[-1,10,628] rc=0 vc=0 vn=- li=- udi=- nc=0
n42n NULLCHK on n40n [#32] [0x7eff0a4c7b50] bci=[-1,12,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n41n arraylength [0x7eff0a4c7b00] bci=[-1,12,628] rc=3 vc=0 vn=- li=- udi=- nc=1
n40n aload <auto slot 6>[#373 Auto] [flags 0x7 0x0 ] [0x7eff0a4c7ab0] bci=[-1,10,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n43n istore <auto slot 8>[#374 Auto] [flags 0x3 0x0 ] [0x7eff0a4c7ba0] bci=[-1,14,628] rc=0 vc=0 vn=- li=- udi=- nc=1
n41n ==>arraylength
n45n ificmpne --> block_5 BBStart at n3n () [0x7eff0a4c7c40] bci=[-1,16,628] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x20
n41n ==>arraylength
n44n iconst 0 [0x7eff0a4c7bf0] bci=[-1,16,628] rc=1 vc=0 vn=- li=- udi=- nc=0
n38n BBEnd </block_3> =====
n42n NULLCHK
: There is no NULL
check in bytecodes, but we must raise a NullPointerException
if the n40n aload
is NULL
. When accessing n31n aload
in the previous example, we don’t need to do a NULL
check because we assume
this pointer is non-NULL. If this pointer is NULL
, this method will not even have been entered.
n43n istore
: Stores the arrayLength
value to stack auto
slot 8
.
n45n ificmpne
: Tests if arrayLength
equals to 0
. If yes, it branches to block_5
. If no, it falls through to the
next block.
The reason that we have separate trees or treetops is because we want to order side effects in a program. The inexplicit order of treetop indicates the order of side effects. They happen one after another.
Side effects mean that the result of an operation will have an effect even after the trees are executed. Throwing an
exception, changing memory, and branching are known side effect operations. If a NULL
check fails, it goes to a
completely different block. Doing a store is going to change a memory location. The memory is changed when it gets to
evaluate the next treetop. When branching happens, the effect of that is felt by whatever comes next.
Only one side effect exits in each treetop. We do not want one tree to have a NULL
check, and a store, and branching.
Global Value Propagation is important because it is one place where it has the most knowledge of what the semantics are for different opcodes.
<optimization id=42 name=globalValuePropagation method=net/adoptopenjdk/bumblebench/collections/ArrayListForEachLambdaBench.doBatch(J)J>
Performing 42: globalValuePropagation
(Building use/def info)
PREPARTITION VN (Building value number info)
[ 4830] O^O VALUE PROPAGATION: Constant folding lload [00007F00A30046D0] to lconst 0
[ 4831] O^O VALUE PROPAGATION: Add known-object constraint to 00007F00A3004AE0 based on known object obj1 of class Ljava/lang/invoke/ConstructorHandle;
Abort transformIndirectLoadChain - cannot dereference at compile time!
Abort transformIndirectLoadChain - cannot dereference at compile time!
[ 4832] O^O VALUE PROPAGATION: Removing redundant call to jitCheckIfFinalize [00007F00A30A35C0]
[ 4833] O^O VALUE PROPAGATION: Replace PassThrough node [00007F00A30A35C0] with its child in its parent [00007F00A30A3570]
[ 4834] O^O VALUE PROPAGATION: Changing wrtbar to store because the rhs is a new object [00007F00A30A3070]
[ 4835] O^O VALUE PROPAGATION: Removing redundant null check node [00007F00A30AC0D0]
globalValuePropagation
performs constant propagation and type propagation. Global optimization looks at all the blocks
including the inlined code. It runs several times in scorching
and hot
compilations. It does not run for cold
compilation. It can be expensive for cold
compilations.
[ 4830] O^O VALUE PROPAGATION: Constant folding lload [00007F00A30046D0] to lconst 0
Before VP:
n13n iflcmpge --> block_68 BBStart at n755n () [0x7f00a3004810] bci=[-1,5,25] rc=0 vc=351 vn=- li=-1 udi=- nc=2 flg=0x20
n9n lload i<auto slot 3>[#352 Auto] [flags 0x4 0x0 ] (cannotOverflow ) [0x7f00a30046d0] bci=[-1,2,25] rc=1 vc=351 vn=- li=- udi=- nc=0 flg=0x1000
n10n lload numIterations<parm 1 J>[#351 Parm] [flags 0x40000104 0x0 ] (cannotOverflow ) [0x7f00a3004720] bci=[-1,3,25] rc=1 vc=351 vn=- li=- udi=- nc=0 flg=0x1000
n4n BBEnd </block_3> =====
After VP:
n13n iflcmpge --> block_68 BBStart at n755n () [0x7f00a3004810] bci=[-1,5,25] rc=0 vc=380 vn=- li=- udi=- nc=2 flg=0x20
n9n lconst 0 (X==0 X>=0 X<=0 cannotOverflow ) [0x7f00a30046d0] bci=[-1,2,25] rc=1 vc=380 vn=- li=12 udi=- nc=0 flg=0x1302
n10n lload numIterations<parm 1 J>[#351 Parm] [flags 0x40000104 0x0 ] (cannotOverflow ) [0x7f00a3004720] bci=[-1,3,25] rc=1 vc=380 vn=- li=- udi=39 nc=0 flg=0x1000
n4n BBEnd </block_3> =====
If there is a value and it is a constant, the value is folded with the constant.
We can search the node address 0x7f00a30046d0
to trace its transformations. Before VP, node 0x7f00a30046d0
(n9n
)
is lload i<auto slot 3>
. It loads from stack auto slot 3
. The variable name is i
in the source code.
After VP, node 0x7f00a30046d0
(n9n
) is transformed into lconst 0
. It is still the same node with the same address
but it has a different opcode now.
[ 4830] O^O VALUE PROPAGATION: Constant folding lload [00007F00A30046D0] to lconst 0
performTransformation prints out the message on the above and guards whether or not the transformation performs.
[ 4830]
indicates the number of transformations that have occurred so far. To narrow down the transformation that
causes a problem, we can do a binary search on these numbers by using lastOptIndex
. If lastOptIndex=4830
, none of
the transformations after 4830
will take place.
[ 5584] O^O VALUE PROPAGATION: Removing redundant checkcast node [00007FEF2577CB70]
Before VP:
n191n checkcast [#86] [0x7fef2577cb70] bci=[1,7,-] rc=0 vc=312 vn=- li=-1 udi=- nc=2
n189n aload <auto slot 2>[#391 Auto] [flags 0x7 0x0 ] [0x7fef2577cad0] bci=[1,6,-] rc=1 vc=312 vn=- li=- udi=- nc=0
n190n loadaddr java/lang/invoke/MemberName[#392 Static] [flags 0x18307 0x0 ] [0x7fef2577cb20] bci=[1,7,-] rc=1 vc=312 vn=- li=- udi=- nc=0
After VP:
n191n treetop [0x7fef2577cb70] bci=[1,7,-] rc=0 vc=328 vn=- li=- udi=- nc=1
n189n aload <auto slot 2>[#391 Auto] [flags 0x7 0x0 ] (X!=0 sharedMemory ) [0x7fef2577cad0] bci=[1,6,-] rc=1 vc=328 vn=- li=- udi=39 nc=0 flg=0x4
Before VP, 0x7fef2577cb70
was loading from auto slot 2
and checkast
against java/lang/invoke/MemberName
.
After VP, 0x7fef2577cb70
is converted into a treetop
from a checkcast
node.
[ 5585] O^O VALUE PROPAGATION: Removing redundant null check node [00007FEF2577FE10]
Before VP:
n353n NULLCHK on n355n [#32] [0x7fef2577fe10] bci=[3,0,-] rc=0 vc=312 vn=- li=-1 udi=- nc=1
n352n PassThrough [0x7fef2577fdc0] bci=[3,0,-] rc=1 vc=312 vn=- li=- udi=- nc=1
n355n aload <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] [0x7fef2577feb0] bci=[3,5,-] rc=1 vc=312 vn=- li=- udi=- nc=0
After VP:
n353n treetop [0x7fef2577fe10] bci=[3,0,-] rc=0 vc=328 vn=- li=- udi=- nc=1
n355n aload <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] (X!=0 sharedMemory ) [0x7fef2577feb0] bci=[3,5,-] rc=1 vc=328 vn=- li=- udi=40 nc=0 flg=0x4
Originally, we had a NULLCHK
on temp slot 3
before VP. After VP, NULLCHK
node is converted a noop treetop
node
to hang on the n355n aload
.
[ 5588] O^O VALUE PROPAGATION: Removing conditional branch [00007FEF2577EF60] ificmpne
Before VP:
n331n astore <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] (X!=0 privatizedInlinerArg sharedMemory ) [0x7fef2577f730] bci=[3,0,-] rc=0 vc=312 vn=- li=-1 udi=- nc=1 flg=0x2004
n238n ==>new
…
n353n NULLCHK on n355n [#32] [0x7fef2577fe10] bci=[3,0,-] rc=0 vc=312 vn=- li=-1 udi=- nc=1
n352n PassThrough [0x7fef2577fdc0] bci=[3,0,-] rc=1 vc=312 vn=- li=- udi=- nc=1
n355n aload <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] [0x7fef2577feb0] bci=[3,5,-] rc=1 vc=312 vn=- li=- udi=- nc=0
n306n ificmpne --> block_21 BBStart at n312n () [0x7fef2577ef60] bci=[5,0,39] rc=0 vc=312 vn=- li=-1 udi=- nc=2 flg=0x20
n605n iand (X>=0 cannotOverflow ) [0x7fef25784cd0] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=2 flg=0x1100
n604n l2i [0x7fef25784c80] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=1
n302n lloadi <isClassAndDepthFlags>[#279 Shadow +24] [flags 0x603 0x0 ] (cannotOverflow ) [0x7fef2577ee20] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=1 flg=0x1000
n301n aloadi <vft-symbol>[#282 Shadow] [flags 0x18607 0x0 ] (X!=0 X>=0 sharedMemory ) [0x7fef2577edd0] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=1 flg=0x104
n297n aload <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] (X!=0 X>=0 sharedMemory ) [0x7fef2577ec90] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=0 flg=0x104
n602n iconst 0x40200000 (X!=0 X>=0 ) [0x7fef25784be0] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=0 flg=0x104
n603n iconst 0 (X==0 X>=0 X<=0 ) [0x7fef25784c30] bci=[5,0,39] rc=1 vc=312 vn=- li=- udi=- nc=0 flg=0x302
n311n BBEnd </block_4> ===== [0x7fef2577f0f0] bci=[5,0,39] rc=0 vc=294 vn=- li=-1 udi=- nc=0
After VP:
n331n astore <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] (X!=0 privatizedInlinerArg sharedMemory ) [0x7fef2577f730] bci=[3,0,-] rc=0 vc=312 vn=- li=-1 udi=- nc=1 flg=0x2004
n238n ==>new
…
n353n treetop [0x7fef2577fe10] bci=[3,0,-] rc=0 vc=328 vn=- li=- udi=- nc=1
n355n aload <temp slot 3>[#413 Auto] [flags 0x20000007 0x0 ] (X!=0 sharedMemory ) [0x7fef2577feb0] bci=[3,5,-] rc=1 vc=328 vn=- li=- udi=40 nc=0 flg=0x4
n311n BBEnd </block_4> ===== [0x7fef2577f0f0] bci=[5,0,39] rc=0 vc=328 vn=- li=- udi=- nc=0
This is an example of branch folding. n306n ificmpne
checks the bits in classAndDepthFlags
of J9Class
from the
object on temp slot 3
. We know temp slot 3
stores a newly created object (n331n astore
). We might know the type
of the object and what its classAndDepthFlags
are during the compilation time. Based on the value of classAndDepthFlags
,
it knows it will fall through to the next block. The ificmpne
is completed removed.
[ 5589] O^O VALUE PROPAGATION: Changing wrtbar to store because the rhs is a new object [00007FEF2577E600]
Before VP:
n276n awrtbari com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#410 final Shadow +8] [flags 0xa0607 0x0 ] (sharedMemory ) [0x7fef2577e600] bci=[4,6,-] rc=1 vc=312 vn=- li=- udi=- nc=3 flg=0x20
After VP:
n276n awrtbari com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#410 final Shadow +8] [flags 0xa0607 0x0 ] (X!=0 skipWrtBar sharedMemory ) [0x7fef2577e600] bci=[4,6,-] rc=1 vc=328 vn=- li=- udi=- nc=3 flg=0x824
The rhs of n276n awrtbari
node is a new object. We don’t need to do the bookkeeping. We put skipWrtBar
on the node.
When it gets to the codegen, the write barrier sequence will not be generated, instead only the store will be generated.
[ 5591] O^O VALUE PROPAGATION: Changing an indirect call com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.accept(Ljava/lang/Object;)V (calli) to a direct call [00007FEF25782200]
Before VP:
n468n calli java/util/function/Consumer.accept(Ljava/lang/Object;)V[#436 unresolved interface Method] (Interface class) [flags 0x400 0x0 ] () [0x7fef25782200] bci=[6,48,1541] rc=1 vc=312 vn=- li=- udi=- nc=3 flg=0x20
After VP:
n468n call com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.accept(Ljava/lang/Object;)V[#445 final virtual Method -64] [flags 0x20500 0x0 ] (virtualCallNodeForAGuardedInlinedCall ) [0x7fef25782200] bci=[6,48,1541] rc=1 vc=328 vn=- li=- udi=- nc=2 flg=0x820
With type information, the target of the call n468n calli
is known at the compilation time. The indirect call calli
is changed into a direct call call
, de-virtualized. When we de-virtualize a call in VP, we also try to inline it.
localCSE (Local Common Subexpression Elimination) operates in a single extended basic block. It commons nodes to avoid repeating common computations, and it increases the reference count on the nodes. Commoning allows us to reuse the same register as much as we can. localCSE also does local copy propagation: Checking the store nodes and propagating the value to other nodes that use the stored value.
<optimization id=109 name=localCSE method=com/ibm/bumblebench/collections/ArrayListForEachLambdaBench.doBatch(J)J>
[ 5833] O^O LOCAL COMMON SUBEXPRESSION ELIMINATION: Local Common Subexpression Elimination commoning node : 00007FEF25A9F140 by available node : 00007FEF25A9F000
Before localCSE:
n623n BBStart <block_55> (freq 6) (loop pre-header) [0x7fef25785270] bci=[-1,1,12] rc=0 vc=1318 vn=- li=-1 udi=- nc=0
n1128n astore <temp slot 8>[#465 Auto] [flags 0x7 0x0 ] [0x7fef25a9f050] bci=[-1,18,14] rc=0 vc=1318 vn=- li=- udi=- nc=1
n1127n aload <callSite entry @0 0x3455e0>[#359 Static] (obj1) [flags 0x307 0x8 ] [0x7fef25a9f000] bci=[-1,18,14] rc=1 vc=1318 vn=- li=5 udi=- nc=0
n1135n compressedRefs [0x7fef25a9f280] bci=[-1,18,14] rc=0 vc=1318 vn=- li=- udi=- nc=2
n1129n aloadi unknown field[#360 Shadow] (obj2) [flags 0x80000607 0x0 ] [0x7fef25a9f0a0] bci=[-1,18,14] rc=2 vc=1318 vn=- li=6 udi=- nc=1
n1130n aladd (internalPtr sharedMemory ) [0x7fef25a9f0f0] bci=[-1,18,14] rc=1 vc=1318 vn=- li=-1 udi=- nc=2 flg=0x8000
n1131n aload <callSite entry @0 0x3455e0>[#359 Static] (obj1) [flags 0x307 0x8 ] [0x7fef25a9f140] bci=[-1,18,14] rc=1 vc=1318 vn=- li=5 udi=- nc=0
n1132n lconst 20 [0x7fef25a9f190] bci=[-1,18,14] rc=1 vc=1318 vn=- li=-1 udi=- nc=0
n1134n lconst 0 [0x7fef25a9f230] bci=[-1,18,14] rc=1 vc=1318 vn=- li=- udi=- nc=0
n1133n astore <temp slot 9>[#466 Auto] [flags 0x7 0x0 ] [0x7fef25a9f1e0] bci=[-1,18,14] rc=0 vc=1318 vn=- li=- udi=- nc=1
n1129n ==>aloadi
n624n BBEnd </block_55> =====
After localCSE:
n623n BBStart <block_55> (freq 6) (loop pre-header) [0x7fef25785270] bci=[-1,1,12] rc=0 vc=1318 vn=- li=-1 udi=- nc=0
n1128n astore <temp slot 8>[#465 Auto] [flags 0x7 0x0 ] [0x7fef25a9f050] bci=[-1,18,14] rc=0 vc=1409 vn=- li=- udi=- nc=1
n1127n aload <callSite entry @0 0x3455e0>[#359 Static] (obj1) [flags 0x307 0x8 ] [0x7fef25a9f000] bci=[-1,18,14] rc=2 vc=1409 vn=- li=- udi=- nc=0
n1135n compressedRefs [0x7fef25a9f280] bci=[-1,18,14] rc=0 vc=1409 vn=- li=- udi=- nc=2
n1129n aloadi unknown field[#360 Shadow] (obj2) [flags 0x80000607 0x0 ] [0x7fef25a9f0a0] bci=[-1,18,14] rc=2 vc=1409 vn=- li=- udi=- nc=1
n1130n aladd (internalPtr sharedMemory ) [0x7fef25a9f0f0] bci=[-1,18,14] rc=1 vc=1409 vn=- li=- udi=- nc=2 flg=0x8000
n1127n ==>aload
n1132n lconst 20 [0x7fef25a9f190] bci=[-1,18,14] rc=1 vc=1409 vn=- li=- udi=- nc=0
n1134n lconst 0 [0x7fef25a9f230] bci=[-1,18,14] rc=1 vc=1409 vn=- li=- udi=- nc=0
n1133n astore <temp slot 9>[#466 Auto] [flags 0x7 0x0 ] [0x7fef25a9f1e0] bci=[-1,18,14] rc=0 vc=1409 vn=- li=- udi=- nc=1
n1129n ==>aloadi
n624n BBEnd </block_55> =====
Both n1127n aload
and n1131n aload
load the same call site with symRef #359
. After localCSE, The child of
n1130n aladd
back references the earlier n1127n aload
. Before localCSE, the same computation repeats twice.
After the commoning transformation is done, when it gets to the codegen, n1127n aload
is evaluated first and the
value is put into a register. The next time, it will not be evaluated. We will just pick up the register value
for n1127n
when it is referenced under n1130n
.
[ 5837] O^O LOCAL COMMON SUBEXPRESSION ELIMINATION: Local Common Subexpression Elimination propagating local #415 in node : 00007FEF2577E560 PARENT : 00007FEF2577E600 from node 00007FEF2577FF50
Before localCSE:
357n astore <temp slot 2>[#415 Auto] [flags 0x7 0x0 ] (X!=0 sharedMemory ) [0x7fef2577ff50] bci=[3,0,-] rc=0 vc=1320 vn=- li=-1 udi=5 nc=1 flg=0xc
n238n ==>new
n278n compressedRefs [0x7fef2577e6a0] bci=[4,6,-] rc=0 vc=1320 vn=- li=-1 udi=- nc=2
n276n awrtbari com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#410 final Shadow +8] [flags 0xa0607 0x0 ] (X!=0 skipWrtBar sharedMemory ) [0x7fef2577e600] bci=[4,6,-] rc=1 vc=1320 vn=- li=9 udi=- nc=3 flg=0x824
n274n aload <temp slot 2>[#415 Auto] [flags 0x7 0x0 ] (X!=0 X>=0 sharedMemory ) [0x7fef2577e560] bci=[4,4,-] rc=2 vc=1320 vn=- li=8 udi=56 nc=0 flg=0x104
n275n aload this<'this' parm Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;>[#351 Parm] [flags 0x40000107 0x0 ] (X!=0 sharedMemory ) [0x7fef2577e5b0] bci=[4,5,-] rc=1 vc=1320 vn=- li=2 udi=57 nc=0 flg=0x4
n274n ==>aload
n277n lconst 0 (highWordZero X==0 X>=0 X<=0 )
After localCSE:
n357n astore <temp slot 2>[#415 Auto] [flags 0x7 0x0 ] (X!=0 sharedMemory ) [0x7fef2577ff50] bci=[3,0,-] rc=0 vc=1413 vn=- li=- udi=5 nc=1 flg=0xc
n238n ==>new
n278n compressedRefs [0x7fef2577e6a0] bci=[4,6,-] rc=0 vc=1413 vn=- li=- udi=- nc=2
n276n awrtbari com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#410 final Shadow +8] [flags 0xa0607 0x0 ] (X!=0 skipWrtBar sharedMemory ) [0x7fef2577e600] bci=[4,6,-] rc=1 vc=1413 vn=- li=- udi=- nc=3 flg=0x824
n238n ==>new
n14n ==>aload
n238n ==>new
n15n ==>lconst 0
n274n aload
loads an object with symRef #415
from the temp slot 2
. Prior to it, n357n astore
stores a new object
(n238n
) with the same symRef #415
to the temp slot 2
. The new object (n238n
) would already be in a register when
it gets to evaluate n274n
. We could just pick up the value directly rather than doing a store first and then loading
from the memory (n274n
). After the localCSE transformation, the first child (n274n aload
) of n276n awrtbari
is
converted into a direct or back reference to n238n new
, which is the first child of n357n astore
.
n357n astore
is still there after localCSE. It is not the job of localCSE to clean it up. It will be the job of
deadTreeElimination
to do it later if no load is driven by this n357n astore
.
Before basicBlockExtension:
n1279n BBStart <block_126> (freq 6) [0x7fef25951f80] bci=[16,6,15] rc=0 vc=6377 vn=- li=- udi=- nc=0
n1296n NULLCHK on n1298n [#32] [0x7fef259524d0] bci=[15,1,-] rc=0 vc=6377 vn=- li=- udi=- nc=1
n1237n lloadi com/ibm/bumblebench/collections/ArrayListForEachLambdaBench._sum J[#354 Shadow +64] [flags 0x80604 0x0 ] (cannotOverflow ) [0x7fef25951260] bci=[16,2,15] rc=2 vc=6377 vn=- li=- udi=- nc=1 flg=0x1000
n1298n aload <temp slot 6>[#511 Auto] [flags 0x20000007 0x0 ] (X>=0 sharedMemory ) [0x7fef25952570] bci=[15,8,-] rc=1 vc=6377 vn=- li=- udi=235 nc=0 flg=0x100
n1271n lstore <temp slot 5>[#510 Auto] [flags 0x4 0x0 ] [0x7fef25951d00] bci=[16,2,15] rc=0 vc=6377 vn=- li=- udi=67 nc=1
n1237n ==>lloadi
n4264n BBEnd </block_126> =====
After basicBlockExtension:
n1279n BBStart <block_126> (freq 6) (extension of previous block) (in loop 82) [0x7fef25951f80] bci=[16,6,15] rc=0 vc=6377 vn=- li=- udi=- nc=0
n1296n NULLCHK on n1298n [#32] [0x7fef259524d0] bci=[15,1,-] rc=0 vc=6377 vn=- li=- udi=- nc=1
n1237n lloadi com/ibm/bumblebench/collections/ArrayListForEachLambdaBench._sum J[#354 Shadow +64] [flags 0x80604 0x0 ] (cannotOverflow ) [0x7fef25951260] bci=[16,2,15] rc=2 vc=6377 vn=- li=- udi=- nc=1 flg=0x1000
n1298n aload <temp slot 6>[#511 Auto] [flags 0x20000007 0x0 ] (X>=0 sharedMemory ) [0x7fef25952570] bci=[15,8,-] rc=1 vc=6377 vn=- li=- udi=235 nc=0 flg=0x100
n1271n lstore <temp slot 5>[#510 Auto] [flags 0x4 0x0 ] [0x7fef25951d00] bci=[16,2,15] rc=0 vc=6377 vn=- li=- udi=67 nc=1
n1237n ==>lloadi
n4264n BBEnd </block_126>
A basic block has a single entry and multiple exits. Up until the path of basicBlockExtension
, the blocks are not extended.
There is only one way to enter to block_126
, which is the predecessor falling through to it. After basicBlockExtension
,
block_126
becomes an extension of previous block.
Marking blocks as extensions of previous blocks doesn’t result in any better code. Once they are marked as extensions
of previous block, optimizations like localCSE could run again. It can common up values in block_126
with the values
in its predecessors from the same extended block. It allows the expansion of other local optimizations beyond just one
block. Other local optimizations can now operate on the extended blocks.
n1274n astore <temp slot 6>[#511 Auto] [flags 0x20000007 0x0 ] (privatizedInlinerArg sharedMemory ) [0x7fef25951df0] bci=[15,1,-] rc=0 vc=6377 vn=- li=- udi=66 nc=1 flg=0x2000
n1225n ==>aloadi
n4262n BBEnd </block_131> [0x7fef2641c3f0] bci=[16,0,15] rc=0 vc=0 vn=- li=- udi=- nc=0
n1279n BBStart <block_126> (freq 6) (extension of previous block) (in loop 82) [0x7fef25951f80] bci=[16,6,15] rc=0 vc=6377 vn=- li=- udi=- nc=0
n1296n NULLCHK on n1298n [#32] [0x7fef259524d0] bci=[15,1,-] rc=0 vc=6377 vn=- li=- udi=- nc=1
n1237n lloadi com/ibm/bumblebench/collections/ArrayListForEachLambdaBench._sum J[#354 Shadow +64] [flags 0x80604 0x0 ] (cannotOverflow ) [0x7fef25951260] bci=[16,2,15] rc=2 vc=6377 vn=- li=- udi=- nc=1 flg=0x1000
n1298n aload <temp slot 6>[#511 Auto] [flags 0x20000007 0x0 ] (X>=0 sharedMemory ) [0x7fef25952570] bci=[15,8,-] rc=1 vc=6377 vn=- li=- udi=235 nc=0 flg=0x100
localCSE runs again after basicBlockExtension
n1296n NULLCHK on n1225n [#32] [0x7fef259524d0] bci=[15,1,-] rc=0 vc=6450 vn=- li=- udi=- nc=1
n1237n lloadi com/ibm/bumblebench/collections/ArrayListForEachLambdaBench._sum J[#354 Shadow +64] [flags 0x80604 0x0 ] (cannotOverflow ) [0x7fef25951260] bci=[16,2,15] rc=3 vc=6450 vn=- li=- udi=- nc=1 flg=0x1000
n1225n ==>aloadi
n1298n aload
loads symbol #511
from temp slot 6
in block_126
. #511
is loaded by n1225n aloadi
and stored to
temp slot 6
by n1274n astore
in block_131
. After basicBlockExtension
and localCSE
runs again which does copy
propagation. The first child of n1237n lloadi
is no longer a load of temp n1298n aload
but now it is commoned to
the rhs (n1225n aloadi
) of the n1274n astore
.
tacticalGlobalRegisterAllocator
(GRA) extends the register setup between multiple different extended blocks. Memory
loads/stores become register loads/stores. Instead of referencing stack slots, nodes access registers for values.
Before GRA:
n638n BBStart <block_53> (freq 509) [0x7fef25795720] bci=[-1,0,12] rc=0 vc=55 vn=- li=- udi=- nc=0
n642n ificmpeq --> block_2 BBStart at n5n (profilingCode ) [0x7fef25795860] bci=[-1,0,12] rc=0 vc=55 vn=- li=- udi=- nc=2 flg=0xa0
n640n iload 0x7fef247330cc[#479 Static] [flags 0x10303 0x8040 ] (cannotOverflow ) [0x7fef257957c0] bci=[-1,0,12] rc=1 vc=55 vn=- li=- udi=- nc=0 flg=0x1000
n641n iconst -1 (X!=0 X<=0 ) [0x7fef25795810] bci=[-1,0,12] rc=1 vc=55 vn=- li=- udi=- nc=0 flg=0x204
n639n BBEnd </block_53>
…
n1097n BBStart <block_109> (freq 509) (extension of previous block) [0x7fef2594e6a0] bci=[-1,1,12] rc=0 vc=55 vn=- li=- udi=- nc=0
n1102n iflcmple --> block_5 BBStart at n1n () [0x7fef2594e830] bci=[-1,5,12] rc=0 vc=55 vn=- li=- udi=- nc=2 flg=0x20020
n1104n lload numIterations<parm 1 J>[#352 Parm] [flags 0x40000104 0x0 ] (cannotOverflow ) [0x7fef2594e8d0] bci=[-1,3,12] rc=1 vc=55 vn=- li=- udi=65 nc=0 flg=0x1000
n1099n lconst 0 (highWordZero X==0 X>=0 X<=0 ) [0x7fef2594e740] bci=[-1,0,12] rc=2 vc=55 vn=- li=- udi=- nc=0 flg=0x4302
n1105n BBEnd </block_109>
…
n5n BBStart <block_2> (freq 6) [0x7fef256de520] bci=[-1,1,12] rc=0 vc=2 vn=- li=-2 udi=- nc=0
n13n iflcmple --> block_5 BBStart at n1n () [0x7fef256de7a0] bci=[-1,5,12] rc=0 vc=2 vn=- li=-2 udi=- nc=2 flg=0x20020
n10n lload numIterations<parm 1 J>[#352 Parm] [flags 0x40000104 0x0 ] (cannotOverflow ) [0x7fef256de6b0] bci=[-1,3,12] rc=1 vc=2 vn=- li=-1 udi=85 nc=0 flg=0x1000
n7n lconst 0 (highWordZero X==0 X>=0 X<=0 ) [0x7fef256de5c0] bci=[-1,0,12] rc=2 vc=2 vn=- li=-2 udi=- nc=0 flg=0x4302
n4n BBEnd </block_2>
After GRA:
n638n BBStart <block_53> (freq 509) [0x7fef25795720] bci=[-1,0,12] rc=0 vc=2718 vn=- li=- udi=- nc=1
n4810n GlRegDeps [0x7fef26426f30] bci=[-1,0,12] rc=1 vc=2718 vn=- li=- udi=- nc=2
n4811n lRegLoad esi numIterations<parm 1 J>[#352 Parm] [flags 0x40000104 0x0 ] (cannotOverflow SeenRealReference ) [0x7fef26426f80] bci=[-1,3,12] rc=6 vc=2718 vn=- li=- udi=- nc=0 flg=0x9000
n4812n aRegLoad eax this<'this' parm Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;>[#351 Parm] [flags 0x40000107 0x0 ] [0x7fef26426fd0] bci=[-1,0,12] rc=5 vc=2718 vn=- li=- udi=- nc=0
n642n ificmpeq --> block_2 BBStart at n5n (profilingCode ) [0x7fef25795860] bci=[-1,0,12] rc=0 vc=2718 vn=- li=- udi=- nc=3 flg=0xa0
n640n iload 0x7fef247330cc[#479 Static] [flags 0x10303 0x8040 ] (cannotOverflow ) [0x7fef257957c0] bci=[-1,0,12] rc=1 vc=2718 vn=- li=- udi=- nc=0 flg=0x1000
n641n iconst -1 (X!=0 X<=0 ) [0x7fef25795810] bci=[-1,0,12] rc=1 vc=2718 vn=- li=- udi=- nc=0 flg=0x204
n4813n GlRegDeps [0x7fef26427020] bci=[-1,0,12] rc=1 vc=2718 vn=- li=- udi=- nc=2
n4811n ==>lRegLoad
n4812n ==>aRegLoad
n639n BBEnd </block_53>
…
n1097n BBStart <block_109> (freq 509) (extension of previous block) [0x7fef2594e6a0] bci=[-1,1,12] rc=0 vc=2718 vn=- li=- udi=- nc=0
n1102n iflcmple --> block_495 BBStart at n4815n () [0x7fef2594e830] bci=[-1,5,12] rc=0 vc=2718 vn=- li=- udi=- nc=3 flg=0x20020
n4811n ==>lRegLoad
n1099n lconst 0 (highWordZero X==0 X>=0 X<=0 ) [0x7fef2594e740] bci=[-1,0,12] rc=3 vc=2718 vn=- li=- udi=- nc=0 flg=0x4302
n4818n GlRegDeps [0x7fef264271b0] bci=[-1,5,12] rc=1 vc=2718 vn=- li=- udi=- nc=2
n4811n ==>lRegLoad
n4812n ==>aRegLoad
n1105n BBEnd </block_109>
…
n5n BBStart <block_2> (freq 6) [0x7fef256de520] bci=[-1,1,12] rc=0 vc=2718 vn=- li=- udi=- nc=1
n5265n GlRegDeps [0x7fef12707d70] bci=[-1,1,12] rc=1 vc=2718 vn=- li=- udi=- nc=2
n5266n lRegLoad esi numIterations<parm 1 J>[#352 Parm] [flags 0x40000104 0x0 ] (cannotOverflow SeenRealReference ) [0x7fef12707dc0] bci=[-1,3,12] rc=4 vc=2718 vn=- li=- udi=- nc=0 flg=0x9000
n5267n aRegLoad eax this<'this' parm Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;>[#351 Parm] [flags 0x40000107 0x0 ] [0x7fef12707e10] bci=[-1,1,12] rc=3 vc=2718 vn=- li=- udi=- nc=0
n13n iflcmple --> block_504 BBStart at n5268n () [0x7fef256de7a0] bci=[-1,5,12] rc=0 vc=2718 vn=- li=- udi=- nc=3 flg=0x20020
n5266n ==>lRegLoad
n7n lconst 0 (highWordZero X==0 X>=0 X<=0 ) [0x7fef256de5c0] bci=[-1,0,12] rc=3 vc=2718 vn=- li=- udi=- nc=0 flg=0x4302
n5271n GlRegDeps [0x7fef12707f50] bci=[-1,5,12] rc=1 vc=2718 vn=- li=- udi=- nc=2
n5266n ==>lRegLoad
n5267n ==>aRegLoad
n4n BBEnd </block_2>
n1104n lload
in block_109
is replaced with n4811n ==>lRegLoad
. lRegLoad
is long register load. The value is
now picked up from the register that is associated with n4811n lRegLoad
. Looking at the earlier references of n4811n
in block_53
, n4811n
takes the value parm 1
and associates it with the real register esi
. n4811n
is also the
first child of GlRegDeps which is the first child of BBStart
in block_53
. GlRegDeps
is a request to the code
generator that the value must be in the register esi
when coming into the block_53
.
The third child of n642n ificmpeq
is n4813n GlRegDeps
. When ificmpeq
branches to block_2
, the values that are
associated with n4811n lRegLoad
and n4812n aRegLoad
have to be in the register esi
and eax
. GlRegDeps
for the
BBStart
in block_2
loads from esi
and eax
. block_2
used to load parm 1
and compares it with 0
. After GRA,
it picks up the value from register esi
that is set up in block_53
.
Before GRA:
n4314n BBStart <block_448> (freq 5050) (extension of previous block) [0x7fef2641d430] bci=[-1,1,12] rc=0 vc=55 vn=- li=- udi=- nc=0
n4316n lstore i<auto slot 3>[#353 Auto] [flags 0x4 0x0 ] [0x7fef2641d4d0] bci=[-1,1,12] rc=0 vc=55 vn=- li=- udi=1 nc=1
n1099n ==>lconst 0
n4315n BBEnd </block_448> =====
After GRA:
n4314n BBStart <block_448> (freq 5050) (extension of previous block) [0x7fef2641d430] bci=[-1,1,12] rc=0 vc=2718 vn=- li=- udi=- nc=0
n4316n lRegStore edi [0x7fef2641d4d0] bci=[-1,1,12] rc=0 vc=2718 vn=- li=- udi=1 nc=1
n1099n ==>lconst 0
n4315n BBEnd </block_448> ===== [0x7fef2641d480] bci=[-1,1,12] rc=0 vc=2718 vn=- li=- udi=- nc=1
n4821n GlRegDeps () [0x7fef264272a0] bci=[-1,1,12] rc=1 vc=2718 vn=- li=- udi=- nc=3 flg=0x20
n4820n PassThrough rdi [0x7fef26427250] bci=[-1,0,12] rc=1 vc=2718 vn=- li=- udi=- nc=1
n1099n ==>lconst 0
n4811n ==>lRegLoad
n4812n ==>aRegLoad
n4576n BBStart <block_485> (freq 15050) (loop pre-header) [0x7fef26422610] bci=[-1,1,12] rc=0 vc=2718 vn=- li=- udi=- nc=1
n4822n GlRegDeps () [0x7fef264272f0] bci=[-1,1,12] rc=1 vc=2718 vn=- li=- udi=- nc=3 flg=0x20
n4823n lRegLoad edi i<auto slot 3>[#353 Auto] [flags 0x4 0x0 ] (SeenRealReference ) [0x7fef26427340] bci=[-1,29,12] rc=4 vc=2718 vn=- li=- udi=- nc=0 flg=0x8000
n4824n lRegLoad esi numIterations<parm 1 J>[#352 Parm] [flags 0x40000104 0x0 ] (SeenRealReference ) [0x7fef26427390] bci=[-1,3,12] rc=4 vc=2718 vn=- li=- udi=- nc=0 flg=0x8000
n4825n aRegLoad eax this<'this' parm Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;>[#351 Parm] [flags 0x40000107 0x0 ] (SeenRealReference sharedMemory ) [0x7fef264273e0] bci=[-1,8,13] rc=3 vc=2718 vn=- li=- udi=- nc=0 flg=0x8000
n4316n lstore
in block_448
used to take 0
and store i
to stack slot 3
. After GRA, it takes 0
and store it
into the register edi
. The value will be sent out of the block in rdi
(n4820n
). The next block that the block_485
falls into have three values in the registers. One of them associates value i
with edi
(n4823n
).
The IL trees after optimization phase are listed under title="Post Optimization Trees"
section. A Lower Trees pass,
detailed below, may transform certain IL tree patterns to facilitate more efficient instruction selection.
The final set of IL trees evaluated by Instruction Selection are listed under title="Pre Instruction Selection Trees"
section.
[ 496] O^O CODE GENERATION: Pwr of 2 mult opt node 00007FEF257618A0
Before Lower Trees:
n440n lsub (highWordZero X>=0 cannotOverflow ) [0x7fef25761940] bci=[8,2,448] rc=1 vc=1394 vn=- li=-1 udi=- nc=2 flg=0x5100
n438n lmul (X>=0 cannotOverflow ) [0x7fef257618a0] bci=[8,2,448] rc=1 vc=1394 vn=- li=-1 udi=- nc=2 flg=0x1100
n437n i2l (highWordZero X>=0 ) [0x7fef25761850] bci=[8,2,448] rc=1 vc=1394 vn=- li=-1 udi=- nc=1 flg=0x4100
n1204n ==>iRegLoad
n436n lconst 4 (highWordZero X!=0 X>=0 ) [0x7fef25761800] bci=[8,2,448] rc=1 vc=1394 vn=- li=-1 udi=- nc=0 flg=0x4104
n439n lconst -16 (X!=0 X<=0 ) [0x7fef257618f0] bci=[8,2,448] rc=1 vc=1394 vn=- li=-1 udi=- nc=0 flg=0x204
After Lower Trees:
n438n lshl (X>=0 cannotOverflow ) [0x7fef257618a0] bci=[8,2,448] rc=1 vc=1397 vn=- li=40 udi=- nc=2 flg=0x1100
n437n i2l (highWordZero X>=0 ) [0x7fef25761850] bci=[8,2,448] rc=1 vc=1397 vn=- li=40 udi=- nc=1 flg=0x4100
n1204n ==>iRegLoad
n436n iconst 2 (Unsigned X!=0 X>=0 ) [0x7fef25761800] bci=[8,2,448] rc=1 vc=1397 vn=- li=40 udi=- nc=0 flg=0x4104
n439n lconst -16 (X!=0 X<=0 ) [0x7fef257618f0] bci=[8,2,448] rc=1 vc=1397 vn=- li=40 udi=- nc=0 flg=0x204
The above is one example of what lowering trees does. There are all kinds of other tree lowering transformations
that make the task of generating code easier. In the above example, a constant multiplication represented by n438n lmul
is strength reduced to a n438n lshl
, as shifts generally execute in fewer cycles than multiply instructions by most CPUs.
Every shift can be expressed by a multiply, but not every multiply can be expressed into a shift. In Optimization,
we canonicalize all the shifts to multiplies to increase the chance of commoning nodes. If there are a shift by 2
and a multiply by 4
in the original program, we can convert the shift to multiply to share the common expression.
In Lowering Trees, multiply expressions are transformed into shifts as shown in the above example.
The first step of generating code is “Performing Instruction Selection”
. Architecture-specific instructions are
generated to evaluate the logic represented by the corresponding IL Tree.
There are three things essential to generate code for a specific platform.
- Instructions for the opcode
- Registers
- Encoding (the way the instructions encoded in the generated code cache)
============================================================
; Live regs: GPR=2 FPR=0 VRF=0 {&GPR_0033, GPR_0032}
------------------------------
n642n ( 0) ificmpeq --> block_2 BBStart at n5n () [0x7fef25795860] bci=[-1,0,12] rc=0 vc=4181 vn=- li=53 udi=- nc=3 flg=0x20
n640n ( 1) iload 0x7fef247330cc[#479 Static] [flags 0x10303 0x8040 ] (cannotOverflow ) [0x7fef257957c0] bci=[-1,0,12] rc=1 vc=4181 vn=- li=53 udi=- nc=0 flg=0x1000
n641n ( 1) iconst -1 (X!=0 X<=0 ) [0x7fef25795810] bci=[-1,0,12] rc=1 vc=4181 vn=- li=53 udi=- nc=0 flg=0x204
n4813n ( 1) GlRegDeps [0x7fef26427020] bci=[-1,0,12] rc=1 vc=4181 vn=- li=53 udi=- nc=2
n4811n ( 5) ==>lRegLoad (in GPR_0032) (cannotOverflow SeenRealReference )
n4812n ( 3) ==>aRegLoad (in &GPR_0033)
------------------------------
------------------------------
n642n ( 0) ificmpeq --> block_2 BBStart at n5n () [0x7fef25795860] bci=[-1,0,12] rc=0 vc=4181 vn=- li=53 udi=- nc=3 flg=0x20
n640n ( 0) iload 0x7fef247330cc[#479 Static] [flags 0x10303 0x8040 ] (cannotOverflow ) [0x7fef257957c0] bci=[-1,0,12] rc=0 vc=4181 vn=- li=53 udi=- nc=0 flg=0x1000
n641n ( 0) iconst -1 (X!=0 X<=0 ) [0x7fef25795810] bci=[-1,0,12] rc=0 vc=4181 vn=- li=53 udi=- nc=0 flg=0x204
n4813n ( 1) GlRegDeps [0x7fef26427020] bci=[-1,0,12] rc=1 vc=4181 vn=- li=53 udi=- nc=2
n4811n ( 4) ==>lRegLoad (in GPR_0032) (cannotOverflow SeenRealReference )
n4812n ( 2) ==>aRegLoad (in &GPR_0033)
------------------------------
[0x7fef12ad29f0] cmp dword ptr [$0x00007fef247330cc], 0xffffffffffffffff # CMP4MemImms, SymRef 0x7fef247330cc[#627 Static] [flags 0x10303 0x8040 ]
[0x7fef12ad2db0] assocreg # assocreg
POST: [&GPR_0033 : eax] [GPR_0032 : esi]
[0x7fef12ad2b40] je Label L0016 # JE4
PRE: [&GPR_0033 : eax] [GPR_0032 : esi]
POST: [&GPR_0033 : eax] [GPR_0032 : esi]
In the above example, the first IL trees are pre-evaluation, while the second IL trees are post-evaluation.
In the latter set, the IL nodes will be tagged with the corresponding virtual registers, if any.
The reference counts (rc
) of the various IL nodes will also be decremented as appropriate in the second IL trees.
The generated instruction at 0x7fef12ad29f0
puts a constant value into a memory reference and compares it with -1
.
$0x00007fef247330cc
is the address as encoded in the memory. Depending on the comparison answer, it will jump to
Label L0016
if it is equal. L0016
is associated with block_2
.
n635n ( 0) iadd (in GPR_0049) [0x7fef25795630] bci=[-1,0,12] rc=0 vc=4181 vn=- li=54 udi=13952 nc=2
n633n ( 0) iload 0x7fef247330f4[#476 Static] [flags 0x10303 0x4040 ] (in GPR_0049) (cannotOverflow ) [0x7fef25795590] bci=[-1,0,12] rc=0 vc=4181 vn=- li=54 udi=13952 nc=0 flg=0x1000
n634n ( 0) iload 0x7fef24733180[#477 Static] [flags 0x10303 0x4040 ] (cannotOverflow ) [0x7fef257955e0] bci=[-1,0,12] rc=0 vc=4181 vn=- li=54 udi=1 nc=0 flg=0x1000
[0x7fef12ad3700] mov GPR_0049, dword ptr [$0x00007fef247330f4] # L4RegMem, SymRef 0x7fef247330f4[#628 Static] [flags 0x10303 0x4040 ]
[0x7fef12ad38e0] add GPR_0049, dword ptr [$0x00007fef24733180] # ADD4RegMem, SymRef 0x7fef24733180[#629 Static] [flags 0x10303 0x4040 ]
The instruction at 0x7fef12ad3700
loads 4
byte memory (L4RegMem
) into the virtual register GPR_0049
. The instruction
at 0x7fef12ad38e0
loads from another address 0x00007fef24733180
and adds it to GPR_0049
.
GPR_0049
is a virtual register created by the compiler. The step of "Pre Instruction Selection Trees"
associates
nodes with virtual registers, which allows commoning to work. It pretends there are infinite number of virtual registers.
When a node is evaluated, it is checked where or not it is in a virtual register already. If it is, the value is picked
up from the virtual register. If it is not, the value is put into a virtual register.
============================================================
; Live regs: GPR=2 FPR=0 VRF=0 {&GPR_0033, GPR_0032}
------------------------------
n4316n ( 0) lRegStore edi [0x7fef2641d4d0] bci=[-1,1,12] rc=0 vc=4181 vn=- li=448 udi=- nc=1
n1099n ( 2) ==>lconst 0 (highWordZero X==0 X>=0 X<=0 )
------------------------------
------------------------------
n4316n ( 0) lRegStore edi [0x7fef2641d4d0] bci=[-1,1,12] rc=0 vc=4181 vn=- li=448 udi=- nc=1
n1099n ( 1) ==>lconst 0 (in GPR_0064) (highWordZero X==0 X>=0 X<=0 )
------------------------------
[0x7fef12ad5140] xor GPR_0064, GPR_0064 # XOR4RegReg
n1099n ( 2)
has two references. It is evaluated and put in GPR_0064
. We do XOR
to produce a zero in GPR_0064
.
============================================================
; Live regs: GPR=6 FPR=0 VRF=0 {&GPR_0561, &GPR_0560, GPR_0545, &GPR_0544, GPR_0528, GPR_0512}
------------------------------
n6563n ( 0) compressedRefs [0x7fef12731320] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=- nc=2
n1858n ( 4) l2a [0x7fef25c7d480] bci=[9,1,-] rc=4 vc=4181 vn=- li=479 udi=- nc=1
n6603n ( 1) iu2l [0x7fef12731fa0] bci=[9,1,-] rc=1 vc=4181 vn=- li=479 udi=- nc=1
n6602n ( 1) iloadi com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#410 final Shadow +8] [flags 0xa0607 0x0 ] [0x7fef12731f50] bci=[9,1,-] rc=1 vc=4181 vn=- li=479 udi=- nc=1
n4933n ( 3) ==>aRegLoad (in &GPR_0560) (X!=0 X>=0 SeenRealReference sharedMemory )
n6562n ( 1) lconst 0 [0x7fef127312d0] bci=[9,1,-] rc=1 vc=4181 vn=- li=479 udi=- nc=0
------------------------------
------------------------------
n6563n ( 0) compressedRefs [0x7fef12731320] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=- nc=2
n1858n ( 3) l2a (in &GPR_0562) [0x7fef25c7d480] bci=[9,1,-] rc=3 vc=4181 vn=- li=479 udi=42816 nc=1
n6603n ( 0) iu2l (in &GPR_0562) [0x7fef12731fa0] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=42816 nc=1
n6602n ( 0) iloadi com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#410 final Shadow +8] [flags 0xa0607 0x0 ] (in &GPR_0562) [0x7fef12731f50] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=42816 nc=1
n4933n ( 2) ==>aRegLoad (in &GPR_0560) (X!=0 X>=0 SeenRealReference sharedMemory )
n6562n ( 0) lconst 0 [0x7fef127312d0] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=- nc=0
------------------------------
[0x7fef12b0a7c0] mov &GPR_0562, dword ptr [&GPR_0560+0x8] # L4RegMem, SymRef com/ibm/bumblebench/collections/ArrayListForEachLambdaBench$$Lambda$44/0x0000000000000000.arg$1 Lcom/ibm/bumblebench/collections/ArrayListForEachLambdaBench;[#676 final Shadow +8] [flags 0xa0607 0x0 ]
============================================================
; Live regs: GPR=7 FPR=0 VRF=0 {&GPR_0562, &GPR_0561, &GPR_0560, GPR_0545, &GPR_0544, GPR_0528, GPR_0512}
------------------------------
n1862n ( 0) ifacmpeq --> block_193 BBStart at n1836n () [0x7fef25c7d5c0] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=- nc=3 flg=0x20
n1858n ( 3) ==>l2a (in &GPR_0562)
n1861n ( 1) aconst NULL (X==0 sharedMemory ) [0x7fef25c7d570] bci=[9,1,-] rc=1 vc=4181 vn=- li=479 udi=- nc=0 flg=0x2
n4935n ( 1) GlRegDeps () [0x7fef26429640] bci=[9,1,-] rc=1 vc=4181 vn=- li=479 udi=- nc=6 flg=0x20
n4929n ( 2) ==>iRegLoad (in GPR_0512) (SeenRealReference )
n4930n ( 4) ==>iRegLoad (in GPR_0528) (cannotOverflow SeenRealReference )
n4931n ( 2) ==>aRegLoad (in &GPR_0544) (SeenRealReference sharedMemory )
n4932n ( 2) ==>iRegLoad (in GPR_0545) (SeenRealReference )
n4933n ( 2) ==>aRegLoad (in &GPR_0560) (X!=0 X>=0 SeenRealReference sharedMemory )
n4934n ( 1) ==>aRegLoad (in &GPR_0561)
------------------------------
------------------------------
n1862n ( 0) ifacmpeq --> block_193 BBStart at n1836n () [0x7fef25c7d5c0] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=- nc=3 flg=0x20
n1858n ( 2) ==>l2a (in &GPR_0562) (X!=0 sharedMemory )
n1861n ( 0) aconst NULL (X==0 sharedMemory ) [0x7fef25c7d570] bci=[9,1,-] rc=0 vc=4181 vn=- li=479 udi=- nc=0 flg=0x2
n4935n ( 1) GlRegDeps () [0x7fef26429640] bci=[9,1,-] rc=1 vc=4181 vn=- li=479 udi=- nc=6 flg=0x20
n4929n ( 1) ==>iRegLoad (in GPR_0512) (SeenRealReference )
n4930n ( 3) ==>iRegLoad (in GPR_0528) (cannotOverflow SeenRealReference )
n4931n ( 1) ==>aRegLoad (in &GPR_0544) (SeenRealReference sharedMemory )
n4932n ( 1) ==>iRegLoad (in GPR_0545) (SeenRealReference )
n4933n ( 1) ==>aRegLoad (in &GPR_0560) (X!=0 X>=0 SeenRealReference sharedMemory )
n4934n ( 0) ==>aRegLoad (in &GPR_0561)
------------------------------
[0x7fef12b0abc0] test &GPR_0562, &GPR_0562 # TEST8RegReg
[0x7fef12b0b080] assocreg # assocreg
POST: [&GPR_0544 : ebx] [GPR_0512 : edi] [GPR_0528 : r8d] [GPR_0545 : r9d] [&GPR_0560 : r12d] [&GPR_0561 : r13d]
[0x7fef12b0ae10] je Label L0025 # JE4
PRE: [&GPR_0561 : r13d] [&GPR_0560 : r12d] [GPR_0545 : r9d] [&GPR_0544 : ebx] [GPR_0528 : r8d] [GPR_0512 : edi]
POST: [&GPR_0561 : r13d] [&GPR_0560 : r12d] [GPR_0545 : r9d] [&GPR_0544 : ebx] [GPR_0528 : r8d] [GPR_0512 : edi]
…
============================================================
; Live regs: GPR=5 FPR=0 VRF=0 {&GPR_0562, GPR_0545, &GPR_0544, GPR_0528, GPR_0512}
------------------------------
n4736n ( 0) astore <temp slot 8>[#522 Auto] [flags 0x4007 0x0 ] [0x7fef26425810] bci=[9,1,-] rc=0 vc=4181 vn=- li=356 udi=- nc=1
n1858n ( 2) ==>l2a (in &GPR_0562) (X!=0 sharedMemory )
------------------------------
------------------------------
n4736n ( 0) astore <temp slot 8>[#522 Auto] [flags 0x4007 0x0 ] [0x7fef26425810] bci=[9,1,-] rc=0 vc=4181 vn=- li=356 udi=- nc=1
n1858n ( 1) ==>l2a (in &GPR_0562) (X!=0 sharedMemory )
------------------------------
[0x7fef12b0bdd0] mov qword ptr [vfp], &GPR_0562 # S8MemReg, SymRef <temp slot 8>[#678 Auto] [flags 0x4007 0x0 ]
…
In the above example,
n1858n l2a
is evaluated and the value is stored in&GPR_0562
in the instruction at0x7fef12b0a7c0
.GPR_0562
is used directly for theNULL
test inn1862n ifacmpeq
in the instruction at0x7fef12b0abc0
.GPR_0562
is used again inn4736n astore
to store the value to stack temp slot8
in the instruction at0x7fef12b0bdd0
.vfp
stands for Virtual Frame Pointer. It is used because we have not mapped stack slots. We don’t have offset associated with each auto slot yet.
&
means the virtual register is a collected reference. A reference typed node in the IL must be marked as
"collected" if it could result in a value that is on the Java heap. It implies that the GC needs to be notified
about it if the value is live across a GC point. Note that collected reference may be held live within a single
BB and yet be live across a GC point (since a single block can contain any number of GC points).
When the "Collected References" in the IL are associated with a virtual register or stack slot in the codegen, they are in turn considered collected. This information eventually gets used to emit GC stack and register maps at the GC points notifying the GC about live objects as well as reference slots/registers needing to be updated if the object is moved.
The following are some examples of transformations that happen after Instruction Selection.
For example for the instruction at 0x7fef12b0bdd0
from the above example:
[0x7fef12b0bdd0] mov qword ptr [vfp], rcx # S8MemReg, SymRef <temp slot 8>[#678 Auto] [flags 0x4007 0x0 ]
"Post Register Assignment Instructions"
: GPR_0562
is assigned with the real register rcx
in
the instruction at 0x7fef12b0bdd0
.
[0x7fef12b0bdd0] mov qword ptr [vfp-0x80], rcx # S8MemReg, SymRef <temp slot 8>[#678 Auto] [flags 0x4007 0x0 ]
"Post Stack Map"
: Stack slot temp 8
is mapped to offset -0x80
in the instruction at 0x7fef12b0bdd0
.
[0x7fef12b0bdd0] mov qword ptr [rsp+0xd8], rcx # S8MemReg, SymRef <temp slot 8>[#678 Auto +216] [flags 0x4007 0x0 ]
"VFP Substitution"
: The instruction at 0x7fef12b0bdd0
populates the stack pointer rsp
. This is the instruction
that looks more or less what is going into the code cache.
0x7fef13a74929 00000229 [0x7fef12b0bdd0] 48 89 8c 24 d8 00 00 00 mov qword ptr [rsp+0xd8], rcx # S8MemReg, SymRef <temp slot 8>[#678 Auto +216] [flags 0x4007 0x0 ]
"Post Binary Instructions"
: The final binary encoding at 0x7fef13a74929
.
; Live regs: GPR=3 FPR=0 VRF=0 {&GPR_1441, &GPR_1440, GPR_1424}
------------------------------
n430n ( 0) treetop () [0x7fef25791620] bci=[6,43,1541] rc=0 vc=4181 vn=- li=34 udi=- nc=1 flg=0x8
n349n ( 12) acall java/util/ArrayList.elementAt([Ljava/lang/Object;I)Ljava/lang/Object;[#430 static Method] [flags 0x500 0x0 ] (X>=0 virtualCallNodeForAGuardedInlinedCall sharedMemory ) [0x7fef2578fcd0] bci=[6,43,1541] rc=12 vc=4181 vn=- li=34 udi=- nc=2 flg=0x908
n5121n ( 8) ==>aRegLoad (in &GPR_1441) (X!=0 SeenRealReference sharedMemory )
n5119n ( 8) ==>iRegLoad (in GPR_1424) (cannotOverflow SeenRealReference )
------------------------------
------------------------------
n430n ( 0) treetop () [0x7fef25791620] bci=[6,43,1541] rc=0 vc=4181 vn=- li=34 udi=- nc=1 flg=0x8
n349n ( 11) acall java/util/ArrayList.elementAt([Ljava/lang/Object;I)Ljava/lang/Object;[#430 static Method] [flags 0x500 0x0 ] (in &GPR_1469) (X>=0 virtualCallNodeForAGuardedInlinedCall sharedMemory ) [0x7fef2578fcd0] bci=[6,43,1541] rc=11 vc=4181 vn=- li=34 udi=64144 nc=2 flg=0x908
n5121n ( 7) ==>aRegLoad (in &GPR_1441) (X!=0 SeenRealReference sharedMemory )
n5119n ( 7) ==>iRegLoad (in GPR_1424) (cannotOverflow SeenRealReference )
------------------------------
[0x7fef12b4f170] mov &GPR_1456, &GPR_1441 # MOV8RegReg
[0x7fef12b4f280] mov GPR_1457, GPR_1424 # MOV4RegReg
[0x7fef12b4fb10] Label L0768: # label # (Start of internal control flow)
PRE: [&GPR_1456 : eax] [GPR_1457 : esi]
[0x7fef12b4f470] nop # Align patchable code @32 [0x0:5]
[0x7fef12b4f3d0] call java/util/ArrayList.elementAt([Ljava/lang/Object;I)Ljava/lang/Object;# CALLImm4 (00007FEF13A6CF89)# CALLImm4
[0x7fef12b4fe20] assocreg # assocreg
POST: [GPR_1424 : r8d] [&GPR_1440 : r12d] [&GPR_1441 : r13d]
[0x7fef12b4fbb0] Label L0769: # label # (End of internal control flow)
PRE: [&GPR_1456 : eax] [GPR_1457 : esi]
POST: [D_GPR_1458 : ecx] [D_GPR_1459 : edx] [D_GPR_1460 : edi] [D_GPR_1461 : esi] [D_GPR_1462 : r8d] [D_GPR_1463 : r10d] [D_GPR_1464 : r11d] [D_GPR_1465 : r12d] [D_GPR_1466 : r13d] [D_GPR_1467 : r14d] [D_GPR_1468 : r15d] [GPR_0000 : ebp] [&GPR_1469 : eax]
The two parameters for call java/util/ArrayList.elementAt([Ljava/lang/Object;I)Ljava/lang/Object;
are passed from
GPR_1441
to GPR_1456
, and GPR_1424
to GPR_1457
.
PRE: [&GPR_1456 : eax] [GPR_1457 : esi]
Precondition: GPR_1456
is associated with real register eax
; GPR_1457
is associated with real register esi
because of the linkage convention. The linkage convention on the platform requires the first argument should be in eax
and the second argument should be esi
. It tells register allocator to make sure the values are in the right real registers.
POST: [GPR_1424 : r8d] [&GPR_1440 : r12d] [&GPR_1441 : r13d]
Postcondition: Associates the virtual registers with the right real registers for return values or other dependencies that need to be obeyed.
============================================================
; Live regs: GPR=9 FPR=0 VRF=0 {&GPR_5073, &GPR_5072, GPR_5056, GPR_5043, &GPR_5042, &GPR_5041, GPR_5040, GPR_5025, &GPR_5024}
------------------------------
n2032n ( 0) checkcast [#86] [0x7fef25c80ae0] bci=[15,5,-] rc=0 vc=4181 vn=- li=217 udi=- nc=2
n5851n ( 2) ==>aRegLoad (in &GPR_5024) (SeenRealReference sharedMemory )
n2034n ( 1) loadaddr java/lang/Integer[#501 Static] [flags 0x18307 0x0 ] [0x7fef25c80b80] bci=[15,5,-] rc=1 vc=4181 vn=- li=217 udi=- nc=0
------------------------------
------------------------------
n2032n ( 0) checkcast [#86] [0x7fef25c80ae0] bci=[15,5,-] rc=0 vc=4181 vn=- li=217 udi=- nc=2
n5851n ( 1) ==>aRegLoad (in &GPR_5024) (SeenRealReference sharedMemory )
n2034n ( 0) loadaddr java/lang/Integer[#501 Static] [flags 0x18307 0x0 ] [0x7fef25c80b80] bci=[15,5,-] rc=0 vc=4181 vn=- li=217 udi=- nc=0
------------------------------
[0x7fef12c4e440] mov GPR_5088, &GPR_5024 # MOV8RegReg
[0x7fef12c4e4d0] Label L2064: # label # (Start of internal control flow)
[0x7fef12c4e570] test GPR_5088, GPR_5088 # TEST8RegReg
[0x7fef12c4e600] je Label L2065 # JE4
[0x7fef12c4e730] mov GPR_5088, dword ptr [GPR_5088] # L4RegMem
[0x7fef12c4e7c0] and GPR_5088, 0xffffffffffffff00 # AND4RegImm4
[0x7fef12c4e850] cmp GPR_5088, 0x000a4100 # CMP4RegImm4
[0x7fef12c4e8e0] jne Outlined Label L2066 # JNE4
[0x7fef12c4efc0] assocreg # assocreg
POST: [&GPR_5041 : eax] [&GPR_5042 : ebx] [&GPR_5024 : edi] [GPR_5040 : esi] [GPR_5025 : r8d] [GPR_5043 : r9d] [GPR_5056 : r10d] [&GPR_5072 : r11d] [&GPR_5073 : r13d]
[0x7fef12c4ed50] Label L2065: # label # (End of internal control flow)
PRE: [GPR_5089 : NoReg] [GPR_5088 : NoReg]
POST: [GPR_5089 : NoReg] [GPR_5088 : NoReg]
The above is an example of the generated code for n2032n checkcast
node. JIT optimizes the common case when the
checkcast exception will not be thrown.
Instructions at 0x7fef12c4e570
and 0x7fef12c4e600
check if the value is NULL
. If it is not NULL
, it will get the
class pointer and mask out the lower 8
bits and does a comparison to the class (0x000a4100
) that is casting to. If
it is equal, the checkcast succeeds. If it is not equal, it fails and throw the exception. Throwing the exception is
not handled here. We jump out of mainline to the out-of-line Label L2066
. We do not pollute the code in the
mainline which is executed more often.
------------ start out-of-line instructions
[0x7fef12c4e9f0] Outlined Label L2066: # label
[0x7fef12c4ea90] push GPR_5088 # PUSHReg
[0x7fef12c4eb20] push 0x000a4100 # PUSHImm4
[0x7fef12c4ebb0] call jitThrowClassCastException# CALLImm4 (00007FEF27D989D0)# CALLImm4
[0x7fef12c4ecb0] Label L2067: # label
------------ end out-of-line instructions
The out-of-line code is at the bottom of the code. It pushes the arguments that are required for the VM helper call
jitThrowClassCastException
. The VM helper call takes care of throwing the exception.
[0x7fef12bbd320] Label L0052: # label
[0x7fef12ced790] mov qword ptr [vfp], rsi # S8MemReg, #SPILL8, SymRef <#SPILL8_1133 0x7fef12cac3f0>[#1133 Auto] [flags 0x80000000 0x0 ]
[0x7fef12ced630] mov qword ptr [vfp], rax # S8MemReg, #SPILL8, SymRef <#SPILL8_1132 0x7fef12cabf70>[#1132 Auto] [flags 0x80000000 0x0 ]
[0x7fef12ced4d0] mov qword ptr [vfp], r10 # S8MemReg, #SPILL8, SymRef <#SPILL8_1131 0x7fef12cac870>[#1131 Auto] [flags 0x80000000 0x0 ]
[0x7fef12ced370] mov qword ptr [vfp], r12 # S8MemReg, #SPILL8, SymRef <#SPILL8_1130 0x7fef12ca8980>[#1130 Auto] [flags 0x80000000 0x0 ]
========================================
[0x7fef12bbd610] fence Relative [ 00007FEF25C302C0 ] # fence BBStart <block_166> (frequency 1) (cold) (in loop 176)
[0x7fef12bbe060] inc dword ptr [$0x00007fef24733168] # INC4Mem, SymRef 0x7fef24733168[#866 Static] [flags 0x10303 0x4040 ]
[0x7fef12bbeaf0] mov rax, r13 # MOV8RegReg
[0x7fef12ced1d0] mov qword ptr [vfp], r13 # S8MemReg, #SPILL8, SymRef <#SPILL8_1129 0x7fef12cac630>[#1129 Auto] [flags 0x80000000 0x0 ]
[0x7fef12bbec00] mov esi, r8d # MOV4RegReg
[0x7fef12ced070] mov qword ptr [vfp], r8 # S8MemReg, #SPILL8, SymRef <#SPILL8_1128 0x7fef12cac1b0>[#1128 Auto] [flags 0x80000000 0x0 ]
[0x7fef12bbf490] Label L1440: # label # (Start of internal control flow)
[0x7fef12bbedf0] nop # Align patchable code @32 [0x0:5]
[0x7fef12bbed50] call java/util/ArrayList.elementAt([Ljava/lang/Object;I)Ljava/lang/Object;# CALLImm4 (00007FEF13A6CF89)# CALLImm4
[0x7fef12bbf7a0] assocreg # assocreg
[0x7fef12bbf530] Label L1441: # label # (End of internal control flow)
[0x7fef12cece70] mov r13, qword ptr [vfp] # L8RegMem, #SPILL8, SymRef <#SPILL8_1127 0x7fef12cac630>[#1127 Auto] [flags 0x80000000 0x0 ]#, spilled for acall
[0x7fef12cecd10] mov r12, qword ptr [vfp] # L8RegMem, #SPILL8, SymRef <#SPILL8_1126 0x7fef12ca8980>[#1126 Auto] [flags 0x80000000 0x0 ]#, spilled for acall
[0x7fef12cecbb0] mov r10, qword ptr [vfp] # L8RegMem, #SPILL8, SymRef <#SPILL8_1125 0x7fef12cac870>[#1125 Auto] [flags 0x80000000 0x0 ]#, spilled for acall
[0x7fef12ceca50] mov r8, qword ptr [vfp] # L8RegMem, #SPILL8, SymRef <#SPILL8_1124 0x7fef12cac1b0>[#1124 Auto] [flags 0x80000000 0x0 ]#, spilled for acall
[0x7fef12cec8f0] mov rsi, qword ptr [vfp] # L8RegMem, #SPILL8, SymRef <#SPILL8_1123 0x7fef12cac3f0>[#1123 Auto] [flags 0x80000000 0x0 ]#, spilled for acall
[0x7fef12cec7d0] mov rdi, rax # MOV8RegReg
[0x7fef12cec700] mov rax, qword ptr [vfp] # L8RegMem, #SPILL8, SymRef <#SPILL8_1122 0x7fef12cabf70>[#1122 Auto] [flags 0x80000000 0x0 ]#, spilled for acall
Register spill around a call at 0x7fef12bbed50
. The spills are added by the register assigner. Sometimes it cannot
satisfy all the values that need to be kept alive at a particular program point given the constraints of an architecture.
Sometimes there are more virtual registers than the number of real registers that are available. It will spill some of
the values into the memory stack slots. Once in a better state, after the call is done, the values are restored from
the memory into the register. Certain registers are set aside as being preserved before the call.
One of the things on debugging register allocator problem is to look at if there is a lot of spilling. If there is, exam closely and see if there is an issue of save and restore registers. One possibility could be that the restore is not done correctly on one of the paths, for example missing a restore, or restoring a wrong register, or a wrong value is used.
+--------------------------------------- instruction address
| +----------------------------------------- instruction offset from start of method
| | +------------------------------------------ corresponding TR::Instruction instance
| | | +-------------------------------------------------- code bytes
| | | | +-------------------------------------- opcode and operands
| | | | | +----------- additional information
| | | | | |
V V V V V V
0x7fef13a746d0 ffffffd0 [0x7fef12ad0bf0] Label L0084: # label
0x7fef13a746d0 ffffffd0 [0x7fef12ad0d80] 48 bf 78 2c 34 00 00 00 00 00 mov rdi, 0x0000000000342c78 # MOV8RegImm64
0x7fef13a74700 00000000 [0x7fef12d881e0] 48 8b 44 24 18 mov rax, qword ptr [rsp+0x18] # L8RegMem, SymRef [#1164 +24]
"Post Binary Instructions"
is the last thing in the JIT compilation.
0x7fef13a74700
: the code cache address or instruction address where the bytecode instruction is.
00000000
: instruction offset from start of method.
0x7fef12ad1d50
: TR instruction address.
48 8b 44 24 18
: bytes that represents the instruction.
The codegen also generates prologue and epilogue. Linkage conventions may require managements of incoming parameters and return values, adjustments of stack pointer to allocate/deallocate a call frame, saving/restoring preserved register values, checks for stack overflow, etc.
We keep track of every point that could result in GC. Every local variable is searchable. All local variables have GC map index. The ones that are collected references (collected autos, collected register, and collected parms) have positive or non-negative GC map index.
Internal stack atlas:
numberOfMaps=92
numberOfSlotsMapped=23
numberOfParmSlots=1
parmBaseOffset=24
localBaseOffset=-176
Locals information :
Local [0x7fef256df590] (GC map index : 0, Offset : 24, Size : 8) is an uninitialized collected parm
Local [0x7fef256df7c0] (GC map index : -1, Offset : 8, Size : 8) is an uninitialized uncollected parm
Local [0x7fef12cac870] (GC map index : 17, Offset : -48, Size : 8) is an uninitialized uncollected auto
Local [0x7fef12cac630] (GC map index : 16, Offset : -56, Size : 8) is an uninitialized uncollected auto
Local [0x7fef12cac3f0] (GC map index : 20, Offset : -24, Size : 8) is an uninitialized uncollected auto
…
Local [0x7fef262476b0] (GC map index : -1, Offset : -288, Size : 4) is an uninitialized uncollected auto
GC maps describe all live autos/parms/registers that contain potential heap object references at the given point in the program. GC maps are associated with instructions, or more correctly, ranges of instructions. When a GC occurs, the address of the current instruction (i.e. the address in the program counter register) is used to search for the appropriate map in the GC atlas of the method. The addresses of the heap object references are fixed up one by one by calling a GC routine that computes the displacement to apply for each address.
Map number : 2
Code offset range starts at [00001885]
GC stack map information :
number of stack slots mapped = 21
live stack slots containing addresses --> {0,5}
GC register map information :
slot pushes: 0 registers: {eax }
Different GC maps are generated for different GC points. In the above example, for GC Map 2
, there are 21
collected autos.
Out of the 21
autos, stack slot index 0
and 5
are live at this point. Register eax
holds collected references.
At the runtime, the collected references are referred as O-Slots in Stack Walker output. During a stack walk, the stack slot
indices are used to map to the specific stack slots that will be checked for collected references.