TEAM BLOG
It’s the little things.
A consistent and efficient mechanism to encode arbitrary 64-bit integers in a byte stream
xtclang
language-design
design principles
XVM Integer Packing
XIP
Authored by: Cameron Purdy
Last edited: @August 15, 2021
Ecstasy compiles to an Intermediate Representation (IR), which is a binary format intended to carry a rich set of information that can be used by the Ecstasy adaptive runtime compiler back-end to generate the actual native code that will run on a machine's CPU.
In the language and compiler field, Intermediate Representation is also know as: Intermediate Language (IL), byte code, bit code, p-code, and several other names.
Ecstasy IR was designed to be extremely tight (small), but it was also designed to support extremely large projects. To that end, all of the IR nodes and offsets (a.k.a. addresses) use 64-bit values; for example, the body of function can be 2^63 bytes long, and can reference up to 2^63 predefined constant values, and so on. All of these values, sizes, offsets, indexes, addresses, etc. are encoded as 64-bit values
These two requirements are in natural conflict. The first says "keep it small", while the latter says "make it large". And this problem applies to many things besides addresses: register numbers, parameter counts, references into the constant pool, and so on. We recognized that solving this conflict in a uniform and efficient manner, and doing so up front, would be critical in keeping the design simple and consistent.
Despite needing to support 64-bit integer values, it turns out that the most common integer value — by far! — is 0
. Other small numbers, like 1
and -1
are also quite common. And the further away from 0
a number is, the less frequently that number is likely to ever appear in an Ecstasy module file. We leveraged this knowledge when deciding how to compactly store integer values.
Using a byte-by-byte format, such as UTF-8, Portable Object Format (POF) encoded integers, or LEB-128 was deemed to be too computationally inefficient, in that the loop condition for the variable length encoding is re-evaluated within a tight loop (once per byte), completely trashing the CPU's branch predictor, and almost certainly killing instruction pipelining and speculative execution. These integer values are used quite literally everywhere, and for almost everything, so a more efficient encoding scheme was required.
Since a variable length encoding would be used, the resulting format was carefully designed to make the exact length of the entire value apparent from examining only the first byte, allowing all of the branch prediction penalties and possibilities for pipeline stalls to be lumped together at the front edge of the decoding process. Additionally, most CPUs have the ability to load even unaligned power-of-two-length integers from memory into a register and sign extend (either as part of the load, or with inexpensive register-based instructions), so the design of the format leverages this possible optimization.
The XVM Integer Packing (XIP, pronounced "zip") format employs four encoding modes:
- Small: For a value in the range
64..127
, the value is encoded as the least significant byte of the integer, as is. When reading in a packed integer, if the two leftmost bits of the first byte are not10
, then the XIP'd integer is in small format, and simply requires sign extension of the first byte to the desired integer size in order to form the integer value. - Medium: For a value in the range
-4096..4095
(13 bits), the value is encoded in two bytes. The first byte contains the binary value100
in the most significant 3 bits, indicating the medium format. The 2's complement integer value is formed from the least significant 5 bits of the first byte, sign extended to the desired integer size and then left-shifted 8 bits, or'd with the bits of the second byte. - Large: For any 2's complement integer value from
2..32
bytes (in the range-(2^255)..2^255-1
), letb
be that number of bytes, with the XIP'd value encoded in1+b
bytes. The first byte contains the binary value101
in the most significant 3 bits, and the remaining 5 bits are the unsigned valueb-1
, indicating the large format. (Note: Sinceb
is at least2
, the valueb-1
is always non-zero.) The followingb
bytes hold the 2's complement integer value. - Huge: For any integer value
n
larger than 32 bytes, letb
be that number of bytes. The first byte of the XIP'd integer is0xA0
, which indicates the huge format. The following bytes contain the valueb
, encoded as a XIP'd integer. The followingb
bytes hold the 2's complement integer valuen
.
The Ecstasy implementation for writing XIP'd integers is the DataOutput class, and the implementation for reading is the DataInput class.
The Java implementations for reading and writing XIP'd integers can be found in the PackedInteger class.
In Summary
Having a consistent and efficient mechanism to encode arbitrary 64-bit integers in a byte stream is a fundamental boost for an IR designer, because they no longer worry about the ugly trade-off between "will this support big enough structures in the real world?" and "will this waste space?".
Note: The Ecstasy language and the xqizit platform are both rapidly evolving, so the information in this blog post may be outdated. If you are interested in the latest, please check out our most recent blog posts and guides below!
Interesting? Like to learn more?
If the ideas in this blog post resonate with you, get started with Ecstasy and xqizit cloud today, or explore our blog posts, tutorials, and guides to learn more!
GET STARTED
Get started with xqizit cloud
If these ideas resonate with you, you can get started with xqizit cloud today. It’s free.
EXPLORE
Stay up to date with the latest!
If you like to explore the latest ecstasy and xqizit blog posts and guides, you can find them here!