In my last two posts, I covered how to move integers between machines, without regard to the specific machine architecture. In this post I'll cover doing something similar with floating-point numbers (also called floats).
Floating-point numbers can be used to poorly approximate the real numbers, as well as large integers. Mathematically, they are quite dissimilar to the reals, and there are many practical problems with using them. Irresponsible use can cause disaster. However, they have some pragmatic uses, for instance in audio processing. Like all numbers represented by computers, they are limited to a specific size. Both 32-bit and 64-bit variants are popular. However, instead of naively allocating this range of possibilities to a linear sequence, as is done with integers and machine words, the available space is divided into different parts. There is a linear part and an exponential part that are combined to make each number. Both very large numbers and very small numbers can be represented by varying just the exponent. Scientific notation is sometimes used to describe the concept behind floating-point numbers, but it's a tenuous metaphor because the exponent used by computers is binary, not decimal. In general, the only way to understand floating-point math is to read the specification and try out examples. Of course, there is more than one specification, and computations with 32-bit and 64-bit floating-point numbers will give different answers.
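If you want to poke at those parts yourself, here is a small sketch of my own (assuming the common IEEE 754 single-precision layout of 1 sign bit, 8 exponent bits, and 23 fraction bits) that pulls a 32-bit float apart:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// A rough illustration of the "parts" described above, assuming the common
// IEEE 754 single-precision layout: 1 sign bit, 8 exponent bits, 23 fraction bits.
int main() {
    float f = 6.5f;                        // 6.5 = 1.625 * 2^2
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // reinterpret the float's bytes

    unsigned sign     = bits >> 31;              // 0 for positive
    unsigned exponent = (bits >> 23) & 0xFF;     // stored with a bias of 127
    unsigned fraction = bits & 0x7FFFFF;         // the "linear" part

    std::printf("sign=%u exponent=%u (2^%d) fraction=0x%06X\n",
                sign, exponent, (int)exponent - 127, fraction);
}
```

For 6.5 this prints a biased exponent of 129 (so 2^2) and a fraction of 0x500000, which is the 0.625 part of 1.625 scaled into 23 bits.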
The pragmatic value of floating-point is to represent very big and very small numbers with one data type. You can do things like multiply 2,456,700 by 9,234,117, or 0.00324 by 0.00016, with just a float. This is something you can't do with machine word integers! Indeed, 0.00016 is not even an integer. Since floats are not actually real numbers at all, you can emulate similar operations using only integers if you want. This is called fixed-point and is used on microcontrollers that don't have any floating-point math capabilities. However, in fixed-point you must choose whether you want very large or very small numbers, so that's inconvenient. (You can think of this as moving the decimal point around, although again this is binary, not decimal, so there is no decimal point but a binary point.) You can also, of course, fully implement floating-point math in software on top of integer math. This can be slow, so I recommend just getting a microcontroller with floating-point capabilities if you have the option. Another caveat is that, while you can have either large or small numbers, mixing them together has unpredictable results. Sometimes you will get a suitable answer, particularly if the calculation just needs to move the exponent around. However, you only have a fixed number of actual bits to use for precision, so you can lose some of it. This is why you should never use floating-point numbers for financial data: the loss of precision can lose money. People do it anyway, irresponsibly, and cause great disasters.
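To make the fixed-point idea concrete, here is a toy 16.16 format sketch (my own illustration, not any particular library): the binary point is pinned at bit 16, so you have committed up front to a range of roughly ±32,768 with only about 1/65,536 of fractional resolution, which is exactly the inconvenient trade-off described above.

```cpp
#include <cstdint>
#include <cstdio>

// Toy 16.16 fixed-point: upper 16 bits are the integer part,
// lower 16 bits are the fractional part. The binary point never moves.
using fixed16 = std::int32_t;

fixed16 to_fixed(double x)   { return (fixed16)(x * 65536.0); }
double  to_double(fixed16 x) { return x / 65536.0; }

// Multiplying two 16.16 values gives 32 fractional bits, so widen to
// 64 bits and shift back down by 16 to restore the format.
fixed16 fixed_mul(fixed16 a, fixed16 b) {
    return (fixed16)(((std::int64_t)a * b) >> 16);
}

int main() {
    // 0.00324 is stored to the nearest 1/65536, so precision is already lost.
    std::printf("%f\n", to_double(fixed_mul(to_fixed(0.00324), to_fixed(123.5))));
    // ~0.3995, not the exact 0.40014

    // The tiny-times-tiny example from the text underflows to zero here.
    std::printf("%f\n", to_double(fixed_mul(to_fixed(0.00324), to_fixed(0.00016))));
}
```

The second result coming out as zero is the point: with the binary point fixed, very small values simply fall off the bottom, whereas a float would just lower its exponent.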
In audio processing, floats are neat because the fixed-point problem is very familiar to anyone who works in audio and deals with volume levels and gain. Too little gain and you can't hear anything; too much gain and you get clipping. With floating-point audio, you never get clipping. You can have quiet audio and loud audio in the same recording, as long as they aren't at the same time. This is quite convenient in some situations. However, you never get anything for free in computation: you're still limited to the same number of bits per sample whether you are using integer samples or floating-point samples. So what you are essentially doing to the audio is adding a specific kind of distortion. Whether you can hear this distortion or not is left as an exercise for the reader.
So, portable floats! It's pretty straightforward because floats are documented in a series of IEEE standards. However, we still need to deal with the different sizes, such as 32-bit and 64-bit. Python uses 64-bit floats. C++ has two float types: 32-bit floats are called float and 64-bit floats are called double. It is common for programming languages that support both float and double to just let you cast between them, with the compiler handling all of the complexity of that conversion. Unfortunately, we still have to deal with endianness. Even though the IEEE standards specify a format that does not rely on endianness, the generally accepted way to make a float from bytes is to just copy the bytes into the address of the float, and direct memory access relies on endianness. When serializing things, we also use network byte order, which is big endian, so all we need to do when we deserialize a float is to check if the machine doing the decoding is little endian and, if so, reverse the order of the bytes. You can do this with integers, too, but I think that approach lacks elegance and connection to mathematical integers. There is no such worry with floating-point numbers, as they possess little elegance and less connection to mathematical numbers. They are entirely a computer science creation, so copying bytes around and reversing them sometimes is the best we can do.
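Sketched out, the decode step might look something like the following (a minimal sketch rather than a full serializer; decode_double is my own hypothetical name, and it assumes the wire carries an IEEE 754 double in network byte order):

```cpp
#include <algorithm>
#include <cstring>

// Decode 8 bytes of big-endian (network byte order) IEEE 754 double.
// If this machine is little endian, reverse the bytes before copying
// them into the double.
double decode_double(const unsigned char* wire, bool machine_is_little_endian) {
    unsigned char bytes[sizeof(double)];
    std::memcpy(bytes, wire, sizeof(double));
    if (machine_is_little_endian) {
        std::reverse(bytes, bytes + sizeof(double));
    }
    double value;
    std::memcpy(&value, bytes, sizeof(double));
    return value;
}
```

Encoding is just the mirror image: copy the double's bytes out, reverse them on a little-endian machine, and write them to the wire.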
Unlike with the other examples, I will not spell out the full code, because the description above pretty much says it all. There is just one piece missing: how do you tell if the machine is little endian? Some libraries for different systems have a function to do this, but you can generally check by storing the number 1 in an integer and then accessing that integer as direct memory. If the first byte is 1, you are on a little-endian machine; on a big-endian machine it will be 0.
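For completeness, here is one common way to write that check (again just a sketch; machine_is_little_endian is my own name for it):

```cpp
#include <cstdint>
#include <cstring>

// Store 1 in a multi-byte integer and look at its lowest-addressed byte.
// On a little-endian machine the least significant byte comes first,
// so that byte will be 1; on a big-endian machine it will be 0.
bool machine_is_little_endian() {
    std::uint32_t one = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &one, 1);
    return first_byte == 1;
}
```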
In my next post, I'll talk about how to write portable data structures, specifically arrays of integers, floats, and of course arrays of arrays.