It's the little things …

TEAM BLOG

It’s the little things.

A consistent and efficient mechanism to encode arbitrary 64-bit integers in a byte stream

`xtclang` `language-design` `design principles` `XVM Integer Packing` `XIP`

Last edited: @August 15, 2021

Ecstasy compiles to an Intermediate Representation (IR), which is a binary format intended to carry a rich set of information that can be used by the Ecstasy adaptive runtime compiler back-end to generate the actual native code that will run on a machine's CPU.

In the language and compiler field, Intermediate Representation is also know as: Intermediate Language (IL), byte code, bit code, p-code, and several other names.

Ecstasy IR was designed to be extremely tight (small), but it was also designed to support extremely large projects. To that end, all of the IR nodes and offsets (a.k.a. addresses) use 64-bit values; for example, the body of function can be 2^63 bytes long, and can reference up to 2^63 predefined constant values, and so on. All of these values, sizes, offsets, indexes, addresses, etc. are encoded as 64-bit values

These two requirements are in natural conflict. The first says "keep it small", while the latter says "make it large". And this problem applies to many things besides addresses: register numbers, parameter counts, references into the constant pool, and so on. We recognized that solving this conflict in a uniform and efficient manner, and doing so up front, would be critical in keeping the design simple and consistent.

Despite needing to support 64-bit integer values, it turns out that the most common integer value — by far! — is 0. Other small numbers, like 1 and -1 are also quite common. And the further away from 0 a number is, the less frequently that number is likely to ever appear in an Ecstasy module file. We leveraged this knowledge when deciding how to compactly store integer values.

Using a byte-by-byte format, such as UTF-8, Portable Object Format (POF) encoded integers, or LEB-128 was deemed to be too computationally inefficient, in that the loop condition for the variable length encoding is re-evaluated within a tight loop (once per byte), completely trashing the CPU's branch predictor, and almost certainly killing instruction pipelining and speculative execution. These integer values are used quite literally everywhere, and for almost everything, so a more efficient encoding scheme was required.

Since a variable length encoding would be used, the resulting format was carefully designed to make the exact length of the entire value apparent from examining only the first byte, allowing all of the branch prediction penalties and possibilities for pipeline stalls to be lumped together at the front edge of the decoding process. Additionally, most CPUs have the ability to load even unaligned power-of-two-length integers from memory into a register and sign extend (either as part of the load, or with inexpensive register-based instructions), so the design of the format leverages this possible optimization.

The XVM Integer Packing (XIP, pronounced "zip") format employs four encoding modes:

Small: For a value in the range 64..127, the value is encoded as the least significant byte of the integer, as is. When reading in a packed integer, if the two leftmost bits of the first byte are not 10, then the XIP'd integer is in small format, and simply requires sign extension of the first byte to the desired integer size in order to form the integer value.
Medium: For a value in the range -4096..4095 (13 bits), the value is encoded in two bytes. The first byte contains the binary value 100 in the most significant 3 bits, indicating the medium format. The 2's complement integer value is formed from the least significant 5 bits of the first byte, sign extended to the desired integer size and then left-shifted 8 bits, or'd with the bits of the second byte.
Large: For any 2's complement integer value from 2..32 bytes (in the range -(2^255)..2^255-1), let b be that number of bytes, with the XIP'd value encoded in 1+b bytes. The first byte contains the binary value 101 in the most significant 3 bits, and the remaining 5 bits are the unsigned value b-1, indicating the large format. (Note: Since b is at least 2, the value b-1 is always non-zero.) The following b bytes hold the 2's complement integer value.
Huge: For any integer value n larger than 32 bytes, let b be that number of bytes. The first byte of the XIP'd integer is 0xA0, which indicates the huge format. The following bytes contain the value b, encoded as a XIP'd integer. The following b bytes hold the 2's complement integer value n.

The Ecstasy implementation for writing XIP'd integers is the DataOutput class, and the implementation for reading is the DataInput class.

The Java implementations for reading and writing XIP'd integers can be found in the PackedInteger class.

In Summary

Having a consistent and efficient mechanism to encode arbitrary 64-bit integers in a byte stream is a fundamental boost for an IR designer, because they no longer worry about the ugly trade-off between "will this support big enough structures in the real world?" and "will this waste space?".

Note: The Ecstasy language and the xqizit platform are both rapidly evolving, so the information in this blog post may be outdated. If you are interested in the latest, please check out our most recent blog posts and guides below!

💡

Interesting? Like to learn more?

If the ideas in this blog post resonate with you, get started with Ecstasy and xqizit cloud today, or explore our blog posts, tutorials, and guides to learn more!

GET STARTED

Get started with xqizit cloud

If these ideas resonate with you, you can get started with xqizit cloud today. It’s free.

Get started free

EXPLORE

Stay up to date with the latest!

If you like to explore the latest ecstasy and xqizit blog posts and guides, you can find them here!

Explore the latest