Cross-Platform Encoding of Integers

Working on code for microcontrollers has had me thinking a lot about "portability". What does it mean for code to be portable? This is a complex question, so I thought I'd start by looking at a simple task: moving integers between machines, when those machines might be running different operating systems and even different machine architectures.

An integer is an abstract mathematical concept, but in computer science we have a variety of different working definitions, which vary by programming language, operating system, and machine architecture. An integer should be capable of representing a number such as 0, 1, -1, 2, 256, 1000, -1,000,000, etc. In assembly, we need to put our integers into registers in order to do computations with them, and registers have a fixed size, measured in bits. This gives us a range of integers that we can work with directly in assembly; integers outside this range require special attention. The size of a register is often called a "machine word" (although some architectures have more than one type of register).

While C is considered to be more portable than assembly, it does not free us from the machine word size restriction on integers. Instead, it introduces a whole bunch of different integer types, such as char, int, long, uint8_t, etc. Now we can choose a fixed size that doesn't match the size of a machine word (at a potential cost of efficiency), but we're still stuck with a fixed size. Worse yet, the venerable "int" type is fixed width, but the specific fixed width is not specified and is platform-specific, with popular sizes being 2, 4, or 8 bytes. This makes ints inherently difficult to move between platforms, as one machine might send an 8-byte int while the other machine expects it to be 4 bytes.
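
As a rough way to see this variability on your own machine (assuming CPython with the standard ctypes module available), you can print the native sizes of a few C integer types; the numbers you get are platform-specific:

import ctypes

# Sizes of the platform's native C integer types, in bytes.
# A typical 64-bit Linux build prints 4 and 8; other platforms may differ.
print(ctypes.sizeof(ctypes.c_int))
print(ctypes.sizeof(ctypes.c_long))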

The machine word size is not the only portability issue. There is also endianness, which is the order of the bytes. Each byte is a base-256 (0-255) value, and bytes are arrayed in a series to form numbers greater than 255, just like we write base-10 digits on paper to form larger decimal numbers. While the paper decimal convention has the ones place on the right, computer architectures support it being on either the right or the left. When sending integers between machines, it might be tempting to just copy the integer from a register into memory and then copy those bytes to the other machine, but this won't give the correct result if the other machine interprets the bytes in reverse order from what was intended. Accessing memory directly also has other issues, such as alignment, but since we won't be accessing memory directly at all in our approach, we avoid all of these issues.
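
As a small illustration of endianness (using Python's built-in int.to_bytes, which is separate from the encoder developed below), here is the same 32-bit value laid out in both byte orders:

value = 0x01020304
print(value.to_bytes(4, 'big'))     # b'\x01\x02\x03\x04'
print(value.to_bytes(4, 'little'))  # b'\x04\x03\x02\x01'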

The key to a cross-platform representation of integers is to treat them like numbers instead of bytes in memory. This may seem counterintuitive, since the process of moving an integer between machines involves expressing it in terms of bytes, for instance to write to storage or send over the network. However, using this approach, we can devise an encoder and decoder that are independent of all machine-specific differences, such as machine word size and endianness.

To start, let's consider small, one- or two-byte numbers, such as 0, 1, -1, 256, and -256. These numbers will fit in the machine word size of any platform of interest, so we don't need to worry about trying to send a number that's just too big for the receiver to handle. That scenario is certainly one to cover, but let's address the simpler case first. Since we don't know what the machine word size of the receiver is, there is no advantage to using a specific number of bytes. Therefore, we might as well use the smallest number of bytes that we can. However, since this is a variable that is unknown to the receiver, we need to include some information about how many bytes we're using. This is particularly important in a situation like using a network connection between the two machines, where we might be sending multiple values in a row. Personally, I prefer length prefixes to delimiters. If we use a one-byte length prefix, that will clearly cover all of our cases for fixed sizes of 2-8 bytes. Unless you want to get into manipulating individual bits, the byte is the logical unit to work in, since it's the unit that networks and filesystems use. Therefore, the minimum overhead our encoding can have is one byte. So our format will be one length byte + some number of encoded integer bytes.
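
To make the format concrete, here is a sketch of what a few small positive values would look like under this scheme (the handling of negative values is described next):

# value -> [length byte, data bytes...]
#    0  -> [0x00]
#    1  -> [0x01, 0x01]
#  256  -> [0x02, 0x01, 0x00]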

Personally, I don't like the way that computers handle the in-memory encoding of negative integers. While this used to be platform-specific, most systems have standardized on two's complement representation, so you could probably use that representation as is. However, it breaks the idea of treating integers as numbers. Therefore, in this format we will only encode positive integers. For negative integers, we'll negate them (to make them positive) and then store the sign bit in the high bit of the length. When decoding, we'll check and clear the high bit of the length, and if it was found to be set, we'll negate the decoded result to yield a negative integer.
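
Continuing the sketch above, the sign bit in the high bit of the length gives negative values encodings like these:

# value -> [length byte (with sign bit), data bytes...]
#    1  -> [0x01, 0x01]
#   -1  -> [0x81, 0x01]
# -256  -> [0x82, 0x01, 0x00]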

Once we have a positive integer, encoding is pretty easy. If the integer is zero, then we're done. The length will be 0, so our result will just be the array of bytes [0]. If it's not zero, then we mask off the lowest byte of the integer and then bitshift the integer right by 8 bits (one byte). We add the low byte to the front of our array, so the most significant byte ends up first. Then we repeat. Once our integer is 0, we're done. We take the length of the encoded bytes and set the high bit of the length if our original integer was negative; otherwise we leave it clear for a positive integer (or zero).
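
As a worked example of the encoding loop (using the arbitrary value 1000, which is 0x03E8), the masking and shifting proceeds like this:

x = 1000         # 0x03E8
low = x & 0xFF   # 0xE8, the first data byte collected
x = x >> 8       # 0x03 remains
low = x & 0xFF   # 0x03, placed in front of 0xE8
x = x >> 8       # 0x00, so the loop stops
# Two data bytes were produced and the original value was positive,
# so the full encoding is [0x02, 0x03, 0xE8].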

Decoding is this process in reverse. We check and clear the sign bit on the length, remembering whether our result should be negative or not. Then we start with the integer 0 and, for each byte in the encoded bytes, we shift the integer 8 bits (1 byte) to the left and add the new byte as if it were an integer. When we're out of encoded bytes, we negate the result if the high bit of the length was set; otherwise we leave it positive.
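
Decoding the same example bytes [0x02, 0x03, 0xE8] runs the loop in the other direction:

length = 0x02                     # high bit clear: positive, 2 data bytes follow
integer = 0
integer = (integer << 8) + 0x03   # 0x03
integer = (integer << 8) + 0xE8   # 0x03E8, which is 1000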

This algorithm does not care about the memory layout of the integer, and so it avoids issues such as machine word size and endianness. It's also portable to any language that has an integer type. It doesn't require direct access to the memory representation of the integer at all. One limitation is that the one-byte length with the high bit used as the sign means that we are limited to integers that can be represented in 127 bytes or less. Since the current practical limit is 8 bytes (although I have seen some 16-byte machines proposed), this should be fine for now.

Here is a Python implementation of the algorithm:

# Encode an arbitrary precision integer as bytes, ignoring machine word size and machine endianness
# format:
# - length - 1 byte
# -- high bit indicates the sign of the integer. 0 for positive, 1 for negative.
# - data   - variable size, "length" without the high bit indicates the number of bytes
def squeeze(x):
    # A 0 length indicates the number 0
    if x == 0:
        return bytes([0])

    # If x is negative, we encode the data for abs(x) and encode the sign in the length
    negative = False
    if x < 0:
        x = -x
        negative = True

    # Loop until we have encoded all of the necessary bytes, eliding leading zeros
    # When our working integer is 0, we are out of bytes to encode; the rest of the bytes would all be 0.
    result = b''
    while x != 0:
        # Get the lowest byte and prepend it to the encoding (most significant byte first),
        # then remove that byte from the working integer
        b = x & 0xFF
        result = bytes([b]) + result
        x = x >> 8

    # Now we need to encode the length
    length = len(result)

    # A 1-byte length with a sign bit included means that we have a maximum size of 127 bytes (1016 bits), pretty big
    # This check ensures that the high bit of the length is currently set to zero.
    if length > 127:
        raise Exception("integer too long for squeeze")

    # If our integer was negative, set the high bit of the length to 1, otherwise leave it at zero.
    if negative:
        length = length | 0x80

    # Prepend the encoded bytes with the length byte that includes the sign
    return bytes([length]) + result

# Decode a squeeze-encoded integer, returning the integer and any leftover bytes that were not part of the encoding
def expand(x):
    # We need at least one byte to parse the format
    if len(x) == 0:
        raise Exception("Empty input")

    # Separate the length from the encoded bytes
    length = x[0]
    rest = x[1:]

    # If the high bit of the length is set, clear that bit and remember that the integer is negative
    negative = False
    if length & 0x80:
        length = length & 0x7F
        negative = True

    # Make sure the input actually contains as many bytes as the length claims
    if len(rest) < length:
        raise Exception("truncated input")

    # This extra slice allows us to decode an integer that is part of a larger stream of values, ignoring the rest
    encoded = rest[:length]
    rest = rest[length:]

    # Loop through the bytes, adding them to the working integer
    integer = 0
    for b in encoded:
        integer = (integer << 8) + b

    # If the high bit of the length was set previously, then our integer is negative, otherwise it's positive
    if negative:
        integer = -integer

    # We return both the integer and the extra bytes at the end, allowing us to handle encoding of multiple integers
    # Multiple integers are not covered here as that is handled by a higher layer of abstraction
    return integer, rest
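
Here are a few round trips through the functions above, including one that decodes a value out of a longer stream (the trailing b'extra' stands in for whatever comes next in that stream):

assert squeeze(0) == bytes([0x00])
assert expand(squeeze(-1000000)) == (-1000000, b'')
assert expand(squeeze(256) + b'extra') == (256, b'extra')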

Python does not have a machine word size. It supports bigints, so we have to explicitly check that the integer is not too big to encode with our 1-byte length. While some might say that 127-byte integers are "big enough", I sometimes use 256-byte bigints for calculations. So, we need to deal with this case eventually. This actually opens up a whole world of additional problems with integer representation. How do we decode an integer that is larger than our machine word size? Along those lines, how do we encode such an integer? We can, indeed, extend our algorithm to cover bigints of arbitrary precision. The implementation above, which transfers integers between machines without an agreed-upon precision, is the first step towards that goal.
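
For example, the length check in squeeze means that a value needing 127 data bytes still encodes, while one needing 128 raises an exception (the specific values here are just illustrations):

squeeze(1 << 1015)   # 1016 bits fit in 127 data bytes, so this works
squeeze(1 << 1016)   # needs 128 data bytes, raises "integer too long for squeeze"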

I should mention a couple of other caveats about this approach. One is that it does assume a shared bit order for bits within a byte. This has become standardized so that the ones place within a byte is on the right. However, this was not always the case. Historical computers have done a lot of unusual things. The CDC 6600 used a reverse bit ordering. The DEC PDP-10 used 36-bit words. The UNIVAC 1100 used 6-bit bytes. This sort of thing is why some people to this day insist on calling bytes "octets", meaning groups of eight bits. However, beyond setting historical context, these machine architectures bear little relevance to our actual goals here. On modern computers, bytes are octets and machine words are some number of bytes (a power of 2). However, issues such as endianness and varying machine word size show no sign of disappearing anytime soon.

The other thing to consider about this approach is that it is not designed to be as efficient at encoding and decoding as non-portable approaches that use direct memory access. In fact, no attempt has been made to optimize the performance of this algorithm, as optimization requires leveraging non-portable abstractions that treat integers as something other than numbers. Indeed, performance is a non-portable concept that has little bearing on design at our level of abstraction. Of course, for practical purposes any algorithm must be performant enough for the workload. Therefore, all discussions of performance should be based on observed performance numbers from real-world workloads. As we do not have such statistics gathered at this time, I have nothing more to say about performance.

In the next post, we'll look at how to evolve our approach from moving small integers between machines to moving big integers (specifically, bigints) between them.