Skip to content

Latest commit

 

History

History
230 lines (172 loc) · 7.48 KB

CompactImageMapFormat.md

File metadata and controls

230 lines (172 loc) · 7.48 KB

Compact ImageMap Format

A process' address space contains (among other things) the set of dynamically loaded images that have been mapped into that address space. When generating crash logs or symbolicating backtraces, we need to be able to capture and potentially store the list of images that has been loaded, as well as some of the attributes of those images, including each image's

  • Path
  • Build ID (aka UUID)
  • Base address
  • End-of-text address

Compact ImageMap Format (CIF) is a binary format for holding this information.

General Format

Compact ImageMap Format data is byte aligned and starts with an information byte:

   7   6   5   4   3   2   1   0
 ┌───────────────────────┬───────┐
 │ version               │ size  │
 └───────────────────────┴───────┘

The version field identifies the version of CIF that is in use; this document describes version 0. The size field is encoded as follows:

size Machine word size
00 16-bit
01 32-bit
10 64-bit
11 Reserved

This is followed immediately by a field containing the name of the platform that generated this image map. This field consists of a single byte length followed by a UTF-8 string of that length.

After that is a field encoding the number of images in the image map; this field is encoded as a sequence of bytes, each holding seven bits of data, with the top bit clear for the final byte. The most significant byte is the first. e.g.

count Encoding
0 00
1 01
127 7f
128 81 00
129 81 01
700 85 3c
1234 89 52
16384 81 80 00
65535 83 ff 7f
2097152 81 80 80 00

This in turn is followed by the list of images, stored in order of increasing base address. For each image, we start with a header byte:

   7   6   5   4   3   2   1   0
 ┌───┬───┬───────────┬───────────┐
 │ r │ 0 │ acount    │ ecount    │
 └───┴───┴───────────┴───────────┘

If r is set, then the base address is understood to be relative to the previously computed base address.

This byte is followed by acount + 1 bytes of base address, then ecount + 1 bytes of offset to the end of text.

Following this is an encoded count of bytes in the build ID, encoded using the 7-bit scheme we used to encode the image count, and then after that come the build ID bytes themselves.

Finally, we encode the path string using the scheme below.

String Encoding

Image paths contain a good deal of redundancy; paths are therefore encoded using a prefix compression scheme. The basic idea here is that while generating or reading the data, we maintain a mapping from small integers to path prefix segments.

The mapping is initialised with the following fixed list that never need to be stored in CIF data:

code Path prefix
0 /lib
1 /usr/lib
2 /usr/local/lib
3 /opt/lib
4 /System/Library/Frameworks
5 /System/Library/PrivateFrameworks
6 /System/iOSSupport
7 /Library/Frameworks
8 /System/Applications
9 /Applications
10 C:\Windows\System32
11 C:\Program Files

Codes below 32 are reserved for future expansion of the fixed list.

Strings are encoded as a sequence of bytes, as follows:

opcode Mnemonic Meaning
00000000 end Marks the end of the string
00xxxxxx str Raw string data
01xxxxxx framewk Names a framework
1exxxxxx expand Identifies a prefix in the table

end

Encoding
   7   6   5   4   3   2   1   0
 ┌───────────────────────────────┐
 │ 0   0   0   0   0   0   0   0 │  end
 └───────────────────────────────┘

Meaning

Marks the end of the string

str

Encoding
   7   6   5   4   3   2   1   0
 ┌───────┬───────────────────────┐
 │ 0   0 │ count                 │  str
 └───────┴───────────────────────┘
Meaning

The next count bytes are included in the string verbatim. Additionally, all path prefixes of this string data will be added to the current prefix table. For instance, if the string data is /swift/linux/x86_64/libfoo.so, then the prefix /swift will be assigned the next available code, /swift/linux the code after that, and /swift/linux/x86_64 the code following that one.

framewk

Encoding
   7   6   5   4   3   2   1   0
 ┌───────┬───────────────────────┐
 │ 0   1 │ count                 │  framewk
 └───────┴───────────────────────┘
Meaning

The next byte is a version character (normally A, but some frameworks use higher characters), after which there are count + 1 bytes of name.

This is expanded using the pattern /<name>.framework/Versions/<version>/<name>. This also marks the end of the string.

expand

Encoding
   7   6   5   4   3   2   1   0
 ┌───┬───┬───────────────────────┐
 │ 1 │ e │ code                  │  expand
 └───┴───┴───────────────────────┘
Meaning

If e is 0, code is the index into the prefix table for the prefix that should be appended to the string at this point.

If e is 1, this opcode is followed by code + 1 bytes that give a value v such that v + 64 is the index into the prefix table for the prefix that should be appended to the string at this point.

Example

Let's say we wish to encode the following strings:

/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit
/System/Library/Frameworks/Photos.framework/Versions/A/Photos
/usr/lib/libobjc.A.dylib
/usr/lib/libz.1.dylib
/usr/lib/swift/libswiftCore.dylib
/usr/lib/libSystem.B.dylib
/usr/lib/libc++.1.dylib

We would encode

<84> <45> CAppKit <00>

We then follow with

<84> <45> APhotos <00>

Next we have

<81> <10> /libobjc.A.dylib <00>
<81> <0d> /libz.1.dylib <00>
<81> <19> /swift/libswiftCore.dylib <00>

assigning code 32 to /swift, then

<81> <12> /libSystem.B.dylib <00>
<81> <0f> /libc++.1.dylib <00>

In total the original data would have taken up 256 bytes. Instead, we have used 122 bytes, a saving of over 50%.