This document explains and documents the concepts and implementation of symbols, symbol references, and aliasing in the OMR compiler.
The original, conceptual distinction between symbols and symbol references (symrefs) was actually quite straightforward. However, over time, implementation details have led to the definition of the two to become very blurry and somewhat unspecified. This document therefore offers the original definitions, as well as some implementation details and subtleties.
A symbol represents a memory location. It holds the minimum information required to directly translate a method's source code or IL representation, without consideration for optimization. It stores basic information (e.g. data type) needed to determine what memory access operations can/must be done.
Different kinds of symbols exist to represent the different kinds of memory access requirements. Each kind of symbol stores information that is specific to its kind (type). The different symbol kinds are:
- static
- represents a static memory address
- contains the address of the memory location
- auto
- represents a variable mapped on the stack
- may come from local method variables or compiler allocated stack objects
- contains the stack offset of the variable
- represents a variable mapped on the stack
- method
- represents a method (static, virtual)
- resolved method
- parameter
- register mapped
- label
- shadow
- represents indirection off a base address
- e.g. accessing a class field using the class base address plus an offset
- contains offset from some base address
- represents indirection off a base address
An int shadow represents the common case of an integer type. In the case of Java, for example, an int shadow is created for a field or array access.
A generic int shadow is created by the optimizer in some (relatively rare) cases to refer to a field that wasn't referred to in the bytecodes/source. If the optimizer wants to synthesize a load or store, but doesn't have a symbol for it in the original program, it can create one and call it a generic int shadow.
For example, an escape analysis optimization might want to generate a store to a location within an object that hasn't been read in this method in order to initialize the location to zero.
A Generic int shadow has different aliasing depending on the context where it is created (i.e. it could alias different kinds of other shadows).
A symbol reference is (or is supposed to be, rather) a wrapper around the different kinds of symbols. It encapsulates the additional information required by an optimizing compiler, including:
- aliasing
- owning method (method who's IL created the symbol, which may be different from the current method because of inlining)
All symrefs for a given compilation are stored in the "symbol reference table" (symreftab).
Symrefs are numbered, and sets of symrefs are almost always represented as bit vectors where each bit specifies whether a corresponding symref is a member of the set or not.
Aliasing information is associated with symrefs. However, the structural definition of aliasing used in the OMR compiler differs from the classical definition of aliasing. Specifically, it is also used as an indicator of side-effects, rather than to mean that multiple names are used to access the same memory. For example, a method symref may alias a static symref, signifying that a call to the method kills the value of the static.
At a high level, aliasing information is created by calling
TR::AliasBuilder::createAliasInfo
. The optimizer infrastructure invokes createAliasInfo
if alias sets have not been built yet or alias sets have been invalidated by a previous
optimization pass. An optimization pass will invalidate alias sets if it is creating a
symbol reference, i.e. Inliner, Partial Redundancy Elimination (PRE),
Escape Analysis (EA), etc. Generally, alias information is not maintained
by each optimization, and is meant to be rebuilt before a subsequent optimization that
needs it. An optimization will make queries on the alias sets and decide whether a specific
transformation should be performed during its pass.
This alternative definition of aliasing permits aliasing relationships to be asymmetric. To illustrate using the above example, although a method may alias a static, the static does not alias the method because killing the value of the static does not affect the method itself (though it may affect the result returned by a call to the method).
At the lowest level, aliasing relations are represented using bit vectors. The bit vector represents a set of symrefs that alias a particular symref.
The two main APIs for getting a set of aliased symrefs are
TR::SymbolReference::getUseDefAliasesBV()
and TR::SymbolReference::getUseonlyAliasesBV
.
When called on a particular symref, they return a bit vector specifying the
aliased symrefs.
The bit vector returned is constructed from various other, more specific bit vectors.
Each of these specifies the symrefs that alias a symref matching certain queries.
Mostly, these queries have to do with the kind of symbol wrapped by the symref. The
associated bit vectors themselves are stored in the TR::AliasBuilder
class. For a
given symref, each query is successively performed. Every time a query returns a
positive result, the associated bit vector is "ORed" into the final alias bit vector.
In particular, in the case of Java, if TR::SymbolReference::getUseDefAliasesBV()
is
invoked on a shadow or static symref, it is likely that the program is doing a resolved
access store so aliasing information is required. In other words, if it is storing into
a shadow or static symef, the question is what other symbol references are potentially
overwritten by the store. Please note that any unresolved accesses (loads or stores)
can have non-zero alias sets because the resolution process may load a class and call
clinit, which means the load or store is treated as a method call.
The TR::AliasBuilder
class is responsible for storing the bit vectors of all symrefs
that alias a symref matching a given query. When a new symref is created, its corresponding
bit must be set in all the bit vectors associated with queries that match the new symref.
These bit vectors are only present for primitive queries. Bit vectors for complex queries
are constructed by TR:AliasBuilder
by "ORing" together the various bit vectors
associated with queries that compose the larger query.
The TR_AliasSetInterface
class is a wrapper around a bit vector representing aliased
symrefs. It provides various methods for performing common operations on these bit vectors.
The non-type template parameter is one of the two values of
the AliasSetInterface
enum: UseDefAliasSet
or UseOnlyAliasSet
.
One key interface method is getTRAliases()
. This template specialized method is responsible for
returning the bit vector of aliases.
The class can also be initialized from a TR::Node
instance through Node::mayKill()
or
Node::mayUse()
method. It provides an API for dealing with aliases of the symref contained in
the node. If theTR::Node
instance calls mayKill()
or mayUse()
method with NULL symref
, it
will initialize an empty AliasSetInterface
and getTRAliases()
will return a NULL pointer.
It is worth noting that some of these operations are templated. So, theoretically, they
can be used on bit vector types different from TR_BitVector
. However, the type is
required to satisfy the (unspecified) interface of TR_BitVector
.
Another noteworthy point is that the class has 2 constructors: one with a bool shares_symbol
argument and one without it. The latter one checks whether the _symbolReference
can have a valid alias
through method sharesSymbol()
and then put the result into _shares_symbol
while the former one simply
sets sharesSymbol()
= true
.
Because most of the bit vectors present in the TR::AliasBuilder
class were originally
created for Java, discussion in this section will mostly be in the context of Java.
In Java, a callee cannot overwrite an auto or parameter, so aliasing is fairly trivial for autos and parameters. Since Java provides type guarantees, for instance integers cannot be aliased to floats, we group alias bit vectors by their types. In addition, we group them by what kind of symbol references they point to, i.e. statics vs. shadows. Therefore, the alias bit vectors can be grouped by the type categories and by the storage categories. Within each grouping, the alias bit vectors will never intersect. We will discuss a few specific kinds of alias bit vectors below.
Generic int shadow bit vector is a relatively simple concept. It is for accessing a field of an object but no symbol reference has been made for it. Some optimizations such as escape analysis and new initialization operate under this scenario. This alias bit vector calculation is very conservative, i.e. aliasing to everything. Because generic int shadow is not desirable but needed, we created generic int array and nonarray shadow bit vectors to distinguish between array and nonarray if we know enough about the base class. These two bit vectors are mutually exclusive, i.e. if a symbol reference is in the generic int array shadow bit vector, it will not be in the generic int nonarray shadow bit vector, but will be included in the generic int shadow bit vector. Generic int shadow bit vector will include both generic int array and nonarray shadow bit vectors.
Unsafe bit vector is created for sun.misc.unsafe APIs to fast pass random memory accesses and accesses to a field off an object given the base address of the object. The unsafe bit vector calculation is fairly conservative, it aliases everything that is a shadow or static.
gcSafePoint bit vector is created to handle balanced GC mode or real-time mode. During these modes, we have pointers to arraylets (with no headers). Marking these pointers as collected or not collected won't work, i.e. collected will cause a crash in GC with no header information and not collected means we might get garbage after GC. The gcSafePoint bit vector is conservative and holds symbol references that need to be killed at GC points, for instance at async checks, or allocations, or calls. This results in less commoning across GC points.
Java has immutable types, i.e. integer, short, byte, long, float, double and string. To change their values, you will have to create new ones except in their constructor methods. We don't want to include these symbol references across calls such as StringBuilder.append(), allowing commoning across the call in this case. We can look ahead in other classes and see if they are immutable under higher opt level.
catchLocalUse bit vector contains use-only aliases related to exceptions in general. It does a reachability analysis from all catch blocks and mark all local variable symbol references that can be used. If there is a store before a load for the same symbol reference, it will not mark the symbol reference for the load. Shadows and statics are not considered in the catchLocalUse bit vector calculation.