Cross-Platform Encoding of Arbitrary-Precision Integers
In the previous post, we developed a format for sharing integers between machines, regardless of machine-specific architecture details such as endianness and machine word size. With this algorithm, we can convert an integer to bytes that can be sent over the network or saved to a file, and then convert those bytes back into an integer. However, while the file format is independent of machine architecture, the decoder algorithm was limited to the machine word size of the machine it runs on (for instance, 32-bit or 64-bit, depending on the specific microprocessor and operating system). In this post, we will update the algorithm to handle integers beyond this size, making it fully independent of machine architecture. The previous post is essential reading for understanding this one.
An arbitrary-precision integer (also known as a bigint) is a computer implementation of integer math that allows numbers to be any size that will fit into memory, as opposed to machine words, which are integers that must fit into a microprocessor register and so are limited to a much smaller size, for instance 32 or 64 bits. Of course, integers in math have no such limitations and can be any size. However, when we are doing math with computers, we are stuck with some specific implementation, and many programming languages do not even support bigints. For a portable language, bigint support is essential: the alternative, using only machine-word-sized integers in registers, is inherently not portable, since machine word size varies between machine architectures. The downside of bigint math is that it is slower than register math. Since microprocessors can only actually do register math, bigint math is a layer on top of that: it breaks down each bigint math operation into a series of register math operations. The bigger the bigint, the more register math operations are required to implement it, leading to potentially slow math operations. Due to this performance penalty, languages that have built-in support for bigints usually default to a register math implementation for integers that are small enough, switching to bigint math only when necessary. This feature is called autopromotion. Python has autopromotion, so the Python code we wrote in the previous post doesn't need to be updated to support bigints. It's no wonder that many people love Python!
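To make the limb idea concrete, here is a minimal sketch (my own illustration, not code from this project, and assuming 32-bit limbs): the value 2^40 + 7 does not fit in a 32-bit word, so a bigint implementation stores it as an array of words.
#include <cstdint>
#include <vector>
// With 32-bit limbs, most significant first, 2^40 + 7 becomes two limbs.
// value = limbs[0] * 2^32 + limbs[1] = 0x100 * 0x100000000 + 7 = 2^40 + 7
std::vector<uint32_t> limbs = { 0x00000100, 0x00000007 };
// Doing math on two such arrays means working limb by limb and carrying
// between limbs, which is why bigint math costs extra register operations.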
In order to show off how to handle this in a less convenient language, I have written a C++ implementation of the core of the upgraded decoder and related encoder. My C++ is not great, but fortunately it doesn't need to be in order to demonstrate the algorithm. I also use some shorthand defined elsewhere, so this code will not compile directly.
Here is the core of the decoder:
varint expand_int_from_bytes(bytes bytes)
{
    // We have to use an unsigned int here because, in the case that we end up
    // with a bigint, all of the limbs should be unsigned.
    // If we use a (signed) int, we will lose the top bit.
    ints integers = ints();
    for(size_t count = 1; count <= bytes.size(); count++)
    {
        // The queued integers are full, so prepend a new empty limb
        if(count % sizeof(int) == 1)
        {
            integers.insert(integers.begin(), 0);
        }
        // Shift all of the integers one byte left
        for(size_t index = 0; index < integers.size(); index++)
        {
            if(index == integers.size() - 1)
            {
                // The lowest limb takes the new byte into its bottom 8 bits
                integers.at(index) = static_cast<int>((static_cast<unsigned int>(integers.at(index)) << 8) | static_cast<unsigned int>(bytes.at(count - 1)));
            }
            else
            {
                // A higher limb takes the top byte of the limb below it
                unsigned int current = static_cast<unsigned int>(integers.at(index));
                unsigned int next = static_cast<unsigned int>(integers.at(index + 1));
                int shift = (sizeof(int) - 1) * 8;
                unsigned int nextHighByte = (next >> shift) & 0xFF;
                integers.at(index) = static_cast<int>((current << 8) | nextHighByte);
            }
        }
    }
    // In this special case, we might have a valid max size (signed) int, or we
    // might have a max size unsigned int that is too big to fit into a signed
    // int because of the sign bit.
    if(integers.size() == 1)
    {
        unsigned int unsignedInteger = static_cast<unsigned int>(integers.at(0));
        if(unsignedInteger <= static_cast<unsigned int>(std::numeric_limits<int>::max()))
        {
            return std::variant<int, ints>(static_cast<int>(unsignedInteger));
        }
    }
    return std::variant<int, ints>(integers);
}
With bigints, instead of a single machine-dependent integer (int), we have an array of them (ints). Each individual int is known as a "limb" in bigint terminology. We loop through the bytes to decode them. Whenever we have more bytes to decode than will fit in the limbs we already have, which happens every "sizeof(int)" bytes in C++, we prepend a new machine word to the array. In the previous implementation, the fundamental operation for decoding was to shift the machine word up by 8 bits and then add the new byte to the bottom 8 bits. However, when doing anything with bigints, we need to iterate over the whole array of machine words, and there is added complexity at the boundaries between them. In this case of shifting up by 8 bits, we need to take the top 8 bits of the lower word and put them into the bottom 8 bits of the higher word. All of this shifting is so that the lowest word has its bottom 8 bits free for the new byte. It's a little complicated to describe, but the code is not particularly long to read.
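To make the shifting concrete, here is a hand trace of the decoder on a five-byte input (my own example values, assuming a 32-bit int):
// bytes in: { 0x01, 0x02, 0x03, 0x04, 0x05 }
// count=1: prepend a limb -> { 0x00000000 }, shift in 0x01 -> { 0x00000001 }
// count=2: shift in 0x02 -> { 0x00000102 }
// count=3: shift in 0x03 -> { 0x00010203 }
// count=4: shift in 0x04 -> { 0x01020304 }
// count=5: prepend a limb -> { 0x00000000, 0x01020304 }; the top byte of the
//          lower limb carries into the higher limb, then 0x05 shifts in
//          -> { 0x00000001, 0x02030405 }
// The result represents 0x0102030405, which is too big for one 32-bit word,
// so the variant holds the limb array rather than a single int.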
There is one bit (literally) of added complexity here, which is based on an implementation choice I made. The logical choice for storing bigint limbs is an "unsigned int", which uses all available bits for the value, giving up the possibility of having any negative integers whatsoever. Since the limbs in bigints are not actually integers themselves, but merely data representing the different parts of an integer in terms of machine words, each individual machine word in the array does not need to have a sign associated with it. However, for this particular project I have chosen to use only the "int" type for my arrays, for reasons that are not relevant to this post. Therefore, we need to safely convert back and forth between "int" and "unsigned int", which adds a few casts. I mention this mainly because it is related to a specific edge case of interest that I noticed when testing the algorithm. You sometimes end up in a situation where your array of machine words contains only one element. It would seem obvious that this would be a case where you would just use a single machine word instead of a bigint. However, my choice of signed ints means that the overflow from machine word to bigint occurs one bit earlier, when the int's sign bit becomes needed to store the int's value. At this point, switching to an array representation (where all elements are implicitly unsigned) solves the issue. This case is handled in the last few lines of the algorithm, where we decide whether to return a machine word or a bigint. Something not visible in this code is how we encode the sign of the bigint into this array of machine words. We simply prepend another machine word for the sign, 0 for positive and 1 for negative.
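Here is a quick illustration of both the edge case and the sign word (my own hypothetical values, again assuming a 32-bit int):
// 0x7FFFFFFF (2^31 - 1) fits in a signed int, so the decoder returns an int.
// 0x80000000 (2^31) would need the sign bit as a value bit, so the decoder
// returns a one-element array instead: { 0x80000000 }, implicitly unsigned.
// With the sign word prepended, the bigint -0x0102030405 would be stored as
// { 1, 0x00000001, 0x02030405 }: the sign word, then the magnitude limbs.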
Now, this array-based representation is only internal to the C++ implementation. The format remains the same: a length followed by a number of bytes as specified by the length. So, to send a bigint, we need to turn it back into this format. Here is the encoder that we use when we have a bigint instead of a machine word:
bytes squeeze_bigint(ints integers)
{
    // Allocate the result
    bytes r;
    // Remember if the integer was positive or negative. If negative, we will
    // make the length negative later.
    // Regardless, we end up with a positive integer for the purpose of encoding.
    // Positive integers are more straightforward to encode.
    int negative = integers.at(0);
    integers.erase(integers.begin());
    for(int signedInteger : integers)
    {
        unsigned int unsignedInteger = static_cast<unsigned int>(signedInteger);
        bytes integerBytes = bytes();
        for(size_t offset = 0; offset < sizeof(unsignedInteger); offset++)
        {
            // Get each byte of the integer separately, encoding all bytes,
            // including zeros
            integerBytes.insert(integerBytes.begin(), unsignedInteger & 0xFF);
            unsignedInteger = unsignedInteger >> 8;
        }
        r.insert(r.end(), integerBytes.begin(), integerBytes.end());
    }
    byte length = static_cast<byte>(r.size());
    // If the integer is negative, we make the length byte negative as well
    // The length byte goes at the beginning of the encoding
    if(negative)
    {
        length = length | 0x80;
    }
    r.insert(r.begin(), length);
    return r;
}
The first thing we do is remove the first element of the array, which is the sign word, 0 for positive and 1 for negative. Then we iterate through the array, converting each element to an unsigned int and repeatedly taking its low byte and shifting it down until all of its bytes have been encoded. Then we encode the sign and length onto the front of the byte array as usual.
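As a worked example (again with my own hypothetical values and a 32-bit int), here is the encoding of the negative bigint from earlier:
// ints in: { 1, 0x00000001, 0x02030405 }  (sign word 1 means negative)
// After removing the sign word, each limb is emitted most significant byte
// first, all four bytes per limb, including zeros:
//   0x00000001 -> 0x00 0x00 0x00 0x01
//   0x02030405 -> 0x02 0x03 0x04 0x05
// length = 8; the integer is negative, so the top bit of the length byte
// is set: 8 | 0x80 = 0x88
// bytes out: { 0x88, 0x00, 0x00, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05 }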
So that's the algorithm. There is no need to update the format from the last post; we just decode in a way that works for our specific machine architecture. This code is portable to any machine architecture that has a compatible implementation of this version of C++. In this case, I used Arduino C++, which is very limited in its capabilities. It won't compile on every Arduino in the world, because the limited capabilities of Arduino C++ vary based on the specific microcontroller. This is because Arduino is a thin compatibility layer on top of platform-specific implementations and tools. Code written for Arduino is inherently not completely portable (oh no!), but it's more portable than using the vendor-specific tools for each microcontroller. At least this implementation of the algorithm is not limited by the machine architecture, just by whether the platform has a compatible version of C++. You could, of course, write the algorithm in a different language so as not to rely on these platform-specific C++ features. However, I know of no choice that would actually result in a higher level of portability, unfortunately.
Now that we have covered the algorithm, I'd like to discuss the question of "why bother?" I have noticed that code is, in general, extremely unportable. This is deeply inconvenient in many ways. I work on a lot of platforms and, specifically, not all of them are general-purpose computers with desktop-style operating systems. I am often constrained in which programming languages are available on the platform. Code written in an unavailable programming language is useless to me. Similarly, much code relies on libraries, and if those libraries are not available on the platform, then once again the code is useless to me. For many years, I have been very frustrated with the inability to use modern programming language features and tools on the platforms for which I need to write code. This is particularly frustrating considering that systems with modern features have been written for hardware that is much less capable than what we have today. Smalltalk-72 was written for the Xerox Alto, which had at most 512KB of RAM and 2.5MB of storage, comparable to modern microcontrollers, except the microcontrollers are also much faster. Why have we not upgraded our programming environments to a system with capabilities comparable to those that were available 50 years ago? Indeed, why can we not just run Smalltalk-72 on today's microcontrollers? Well, it's just not that portable.
Over the years, people have ported various systems with cool and convenient modern features to newer and newer machines (mostly with desktop operating systems). Then those machines go out of fashion and everything has to be ported to the new, new machines. Some systems get left out and are lost. You can run the old systems through emulation, an approach that has recently bridged the gap from desktop operating systems to microcontrollers. Now people are using microcontrollers like the RP2040 to emulate old computers like the BBC Micro. This is very cool if what you want to do is actually run old applications for the BBC Micro, most likely out of nostalgia or curiosity. However, emulation has some drawbacks, most significantly that it doesn't let you write new software that can access the new capabilities of the new machines. Therefore, for the most part, emulation is not a target for developers of new software (with the major exception of retro-style games).
What I think we need is to write portable software that can run on just about anything. Unfortunately, this is not how most code is written. Many programming languages never get past machine words, relegating cross-platform features such as bigints to libraries that also use non-portable implementations. So, while this is a small effort in the grand scheme of things, I figure we have to start somewhere, and so I have developed this simple, portable method for moving integers around. If you enjoy it, feel free to implement it for your programming language or platform of choice.
In my next post, I'll discuss portable sharing of floating-point numbers, a bizarre alternative to integers that is useful in audio processing.