How Computers Use Numbers

With the invention of the vacuum tube, we have tricked electrons into doing simple logic. However, that's not too useful on its own. While logic alone is enough to perform basically any computation, actually having numbers would be very helpful — basically everything that a modern computer does is based on arithmetic in some way.

So, let's invent some!

The goal

First, we need to figure out what traits a good number format has, and thus what we should be aiming for. These could be anything, but here are some of the ones I think are important (and, coincidentally, the ones modern computers optimize for):

Usefulness in real programs
Perhaps the most important trait, as the whole point of having the capability to do math is for programs to use it. Additionally, fitting as many use cases into as few formats as possible would be ideal, as having more ways to store numbers than necessary would make processors a lot more expensive and power‐hungry.
Simplicity
Similarly to avoiding too many formats, straightforward logic would also save on price and power usage.
Efficient memory usage
Memory is a limited resource; therefore using it as efficiently as possible would be ideal, for example by avoiding duplicate values.

Wait, what even is memory?

Click on the bits to toggle them!

Memory is a collection of cells, each holding one of two values (i.e. a bit). These can represent basically anything that has two states, though usually they are visualized using ones and zeroes.

Note that bits are usually grouped in eights (called a byte), so using a multiple of 8 would be easiest to work with. We'll use just 8 bits for simplicity, but all of the following number formats come in a variety of sizes.

Counting with bits

Similarly to how we use the 10 digits to count in base 10, we can use bits to count in base 2. By assigning each bit to a power of 2, we can use our 8 bits to store numbers between 0 and 255 (2^8 − 1). This format is called an 8‐bit unsigned integer.
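
To make this concrete, here's a minimal sketch in Python of reading 8 bits as an unsigned integer (the bit ordering and function name are just illustrative choices):

```python
# A sketch of interpreting 8 bits as an unsigned integer.
# Bits are listed least significant first, so bit i contributes 2**i.
def decode_unsigned(bits):
    assert len(bits) == 8 and all(b in (0, 1) for b in bits)
    return sum(bit << i for i, bit in enumerate(bits))

print(decode_unsigned([1, 1, 1, 1, 1, 1, 1, 1]))  # 255, the largest 8-bit value
print(decode_unsigned([1, 0, 1, 0, 0, 0, 0, 0]))  # 5 = 1*1 + 0*2 + 1*4
```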

Negative numbers

An obvious way of defining negative numbers (signed integers) is to just use 1 bit as a sign.

However, this way of doing it has just one small issue: there are two ways to write 0, as it is neither positive nor negative.

Instead, we could interpret the number as if it was unsigned, but then offset it down by 128 (2^7) to be able to represent numbers from −128 to 127:
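
As a rough sketch (the function names here are purely illustrative), the offset scheme simply stores the value plus 128 as a plain unsigned byte:

```python
# A sketch of the offset idea: store (value + 128) as an ordinary unsigned
# byte, so the raw bytes 0..255 map onto -128..127 with only a single zero.
def encode_offset(value):
    assert -128 <= value <= 127
    return value + 128              # the raw byte actually stored

def decode_offset(byte):
    return byte - 128

print(encode_offset(-128))   # 0   (raw byte 00000000)
print(encode_offset(0))      # 128 (raw byte 10000000)
print(decode_offset(255))    # 127, the largest representable value
```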

At first glance this way of doing negatives seems perfect! After all it's very simple, it covers a useful range and all the bit patterns correspond to unique integers.

But we can do better.

Two's complement

While implementing negative numbers via a fixed bias seems relatively optimal, we can actually significantly cut down on the amount of circuitry we need via a neat trick!

A hypothetical computer which uses offset signed integers and regular unsigned integers would need to implement arithmetic and comparisons for both signed and unsigned integers, as well as conversions between the two.

Since arithmetic and conversions are only defined for where the ranges of the two formats overlap, could we have the 0–127 range use the same bit patterns for both signed and unsigned integers? It turns out that the answer is yes! There are several ways to achieve this, but by far the most common approach is making the most significant bit subtract 128 from the total instead of adding it.

Remember that “unsigned integer” and “two's complement” are just ways of interpreting a collection of bits. Because all the relevant bit patterns match, we can convert for free by just changing how we're interpreting the data.

(this toggle affects the demo above)
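
In code, the "free" conversion amounts to nothing more than deciding whether a byte's top bit adds 128 or subtracts it; a small illustrative sketch:

```python
# Reinterpreting a raw byte as two's complement: patterns 0-127 keep their
# unsigned meaning, while patterns 128-255 have their top bit count as -128
# instead of +128, which is the same as subtracting 256.
def as_twos_complement(byte):
    return byte - 256 if byte >= 128 else byte

print(as_twos_complement(0b00000101))   # 5, same as the unsigned reading
print(as_twos_complement(0b11111110))   # -2 (254 when read as unsigned)
print(as_twos_complement(0b10000000))   # -128
```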

Additionally, two's complement allows us to drop the circuitry for signed addition, subtraction and multiplication, and if you're familiar with the p‐adic numbers you might already have a hunch as to why.

Notice that if the result would need 9 bits to be stored fully, the extra bit is simply dropped. This is called overflow and is what allows us to share the signed and unsigned logic. Dropping that bit means that we're actually working in modular arithmetic (specifically mod 256), where numbers “wrap around”, so the number after 255 is 0. In this system, every number has an additive inverse even without including negative numbers: for example, the inverse of 2 is 254, since 2 + 254 ≡ 0 (mod 256). Because 254 behaves like −2, it may as well be −2.

If you're familiar with the p-adics, those share the same property of not having an explicit sign but arithmetic working the same as for positive numbers, which is why negative numbers in two's complement look like 2-adic integers with the infinite ones “chopped off”.
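
A quick illustration of the shared circuitry idea, using Python integers masked down to 8 bits to stand in for hardware registers:

```python
# Addition mod 256: add as plain unsigned bytes, drop any carry out of the
# 8th bit, and only reinterpret the result at the end.
def add_bytes(a, b):
    return (a + b) & 0xFF

def as_signed(byte):
    return byte - 256 if byte >= 128 else byte

print(add_bytes(2, 254))                      # 0: 254 really does behave like -2
print(as_signed(add_bytes(100, 0b11111110)))  # 98, i.e. 100 + (-2)
```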


A binary point in the middle

Since we have mastered integers, we can now go fractional. Let's start by taking a regular 32‐bit unsigned integer and putting a binary point in the middle:

This is called fixed point, and it's not terrible. We have drastically cut the range of numbers we can represent, without even gaining negative numbers in return (as an integer, 32 bits could represent the range 0–4,294,967,295, while this format covers only 0 to just under 65,536), and the precision we get out of that sacrifice isn't even that great. We could try putting the binary point somewhere else, but there really isn't a spot that gives both decent range and precision.

However, it is very simple, and the fact that addition and subtraction work the same as for integers does make fixed point a good option in some niche cases.
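
Here's a small sketch of such a format (a hypothetical "16.16" layout, with the point exactly in the middle of 32 bits):

```python
# 16.16 fixed point: a 32-bit integer whose low 16 bits are the fraction,
# so every stored pattern means (raw integer / 65536).
SCALE = 1 << 16

def to_fixed(x):
    return int(round(x * SCALE)) & 0xFFFFFFFF

def from_fixed(raw):
    return raw / SCALE

a, b = to_fixed(3.25), to_fixed(0.5)
print(from_fixed((a + b) & 0xFFFFFFFF))   # 3.75: addition is plain integer addition
print(from_fixed(0xFFFFFFFF))             # 65535.99998474121, just under 65536
```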

Better accommodating real‐world data

Having uniformly spaced values, despite being great for integers, hasn't worked out so well for fractions. So, how should we distribute them? Well, since we want to fit as many use cases as possible into one format, both atom‐sized and universe‐sized numbers should be supported. But in practice, universe‐sized numbers are less precise (in absolute terms) than atom‐sized ones. We can take advantage of this.

Moving the binary point

Since larger numbers are inherently less precise, could we skip defining all those decimals for large values? Well, yes! Let's dedicate, say, 5 bits to defining where the binary point is. For large values, we can move the point to the right to not store the unnecessary fractional part, and for small numbers, we can omit all the integer bits by moving the point to the left.

(green corresponds to point position)

Also, since 5 bits can encode more point positions (32) than the 27 remaining value bits actually have (28, counting the spots before and after all the bits), we can use the leftover ones for hypervalues, by putting extra virtual zeroes to fill in the space.

However, even with this range extension, the format is still extremely wasteful — for example, you can write 0 in 32 different ways as the point position does not matter. In fact, only ~52% of all bit patterns produce unique values in this system. Yikes!
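
To see the wastefulness concretely, here is a toy, scaled‐down version of this format (3 position bits plus 5 value bits, with the position simply counting how many value bits fall after the point); the exact percentage differs from the 32‐bit figure above, but the duplication is obvious:

```python
# Count how many distinct values a toy "movable point" format can express:
# 3 bits choose the point position (0-7 fractional bits), 5 bits hold the value.
from fractions import Fraction

values = set()
for position in range(8):        # how many bits sit after the point
    for bits in range(32):       # the 5 value bits, read as an integer
        values.add(Fraction(bits, 2 ** position))

print(len(values), "unique values out of", 8 * 32, "bit patterns")
# 0 alone is representable in 8 different ways (one per point position).
```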

Scientific notation

In order to ensure uniqueness, we can take inspiration from a way you might see numbers written down in the real world — scientific notation! We will need to adapt it to better suit our needs, though.

Remember that a number in scientific notation looks like 2.574 × 10^4, where the mantissa is a number with one digit in front of the decimal point and an arbitrary number of digits after, and the exponent is any integer.

For the exponent, let's start by dedicating 8 of our 32 bits to it. Since we need both positive and negative exponents to represent large and small numbers respectively, we need to use a signed format. We could try doing two's complement again, but we wouldn't actually be able to take advantage of its benefits very well in this case (as there is no existing unsigned logic we can reuse). Instead, let's just make it an unsigned integer and subtract 127 from it, as that would make the underlying implementation simpler.
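
A tiny sketch of the biased ("excess‐127") exponent encoding described above:

```python
# Store (true exponent + 127) as a plain unsigned byte, so the stored values
# 0..255 correspond to exponents -127..128.
def encode_exponent(e):
    assert -127 <= e <= 128
    return e + 127

print(encode_exponent(0))      # 127, stored as 01111111
print(encode_exponent(-126))   # 1,   stored as 00000001
print(encode_exponent(128))    # 255, stored as 11111111
```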

For the mantissa, let's just use the remaining 24 bits as a fixed‐point number with 1 binary digit before the binary point. As for its sign, note that two's complement or shifting would be incredibly awkward to use here. This is because we need to represent a balanced range, where ideally there would be an equal number of values on either side of 0. But that would make the total number of values odd (twice the number of positive values, plus one for zero), while the number of bit patterns we have to work with is even (2 raised to the number of bits). Therefore, since we can't use the extra value effectively, let's just waste it in the simplest way possible: dedicate one bit of the mantissa to the sign, negating the value when the bit is 1. (The extra value goes to −0.)

Exponent bit patterns
Bit pattern   Meaning
11111111      2^128
11111110      2^127
11111101      2^126
…             …
00000010      2^−125
00000001      2^−126
00000000      2^−127
(red – sign bit, green – exponent, blue – mantissa)

Actually ensuring uniqueness

Our format does not guarantee uniqueness yet. This is because we skipped one requirement of scientific notation: the first digit of the mantissa must be non‐zero (otherwise you can change the exponent and get the same value). So, we have to force the first digit of the mantissa to be non‐zero.

Notice that since we're working in binary, there is only one non‐zero digit: one! That means that we can omit storing it entirely, which allows us to double the mantissa's precision. Nice!
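
Putting the pieces so far together, here's a sketch of decoding a normal value, assuming the usual field order (sign, then exponent, then the 23 stored mantissa bits) with the leading 1 implied:

```python
# Decode a 32-bit pattern, normal values only: 1 sign bit, 8 exponent bits
# (biased by 127), 23 stored mantissa bits with an implicit leading 1.
def decode_normal(pattern):
    sign     = (pattern >> 31) & 0x1
    exponent = (pattern >> 23) & 0xFF
    mantissa = pattern & 0x7FFFFF
    value = (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return -value if sign else value

print(decode_normal(0b0_01111111_00000000000000000000000))  # 1.0
print(decode_normal(0b0_10000000_10000000000000000000000))  # 3.0 = 1.5 * 2^1
print(decode_normal(0b1_01111110_00000000000000000000000))  # -0.5
```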

Oh wait

Remember when I said that the first digit of the mantissa can't be zero? That's true, except for exactly one number: 0. Thus, we can't write it anymore. So, let's just define the all‐zeroes bit pattern (except for possibly the sign bit) to represent ±0.

Exponent bit patterns
Bit pattern   Meaning
11111111      2^128
11111110      2^127
11111101      2^126
…             …
00000010      2^−125
00000001      2^−126
00000000      2^−127 & special case: 0

Handling math with varying precision

Because our precision is limited, basically any kind of math will quite often produce results that don't have exact bit representations. (Indeed, the 4 billion representable values cover exactly 0% of the number line.)

The obvious solution to that is rounding: after each mathematical operation, if the result does not have a bit pattern in our system, it snaps to the closest available value. (Modern processors allow programs to choose the rounding mode, but rounding half to even is the most common default, as it minimizes error accumulation.)
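
You can watch this rounding happen by round‐tripping a value through a 32‐bit float with Python's struct module (2^24 + 1 is exactly halfway between two representable values, and the tie goes to the even one):

```python
import struct

# 16,777,217 (2^24 + 1) has no exact 32-bit representation; packing it as a
# single-precision float snaps it to the nearest (even) neighbour, 16,777,216.
rounded = struct.unpack('<f', struct.pack('<f', 16777217.0))[0]
print(rounded)                  # 16777216.0
print(rounded == 16777217.0)    # False: the result was rounded
```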

Notice that now, each bit pattern actually represents a range of values — the actual result that rounded to it could have been anywhere within the rounding window.

Also, remember how we had an extra zero? We can now have +0 represent the positive half of its rounding range and −0 the negative half.

A giant gap

Our format is currently not great at representing small numbers. Sure, we can store 0, but then we have a comparatively huge gap from anything else: the smallest non‐negative numbers in our format are 0, then 1.000000119 × 2^−127, then 1.000000238 × 2^−127, and so on. That means that the second gap is only 0.000012% of the first one.

To rectify our small number issues, we can make the whole lowest exponent have its mantissa start with a 0 instead of just one specific value. (We also need to increase that exponent's value by one, from 2^−127 to 2^−126, so we don't leave a hole where the 2^−127‐sized numbers originally were.) Let's call these subnormal numbers.

The smallest representable positive numbers are now 2^−149, then 2 × 2^−149, 3 × 2^−149 and so on. That means that not only do we have much more uniform gaps, but we even extended the range a little (though at the cost of some precision).

Exponent bit patterns
Bit pattern   Meaning
11111111      2^128
11111110      2^127
11111101      2^126
…             …
00000010      2^−125
00000001      2^−126
00000000      2^−126 (subnormals)
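
You can check the subnormal behaviour against real hardware floats; assuming the standard 32‐bit layout, the raw pattern 0x00000001 is the smallest positive value:

```python
import struct

# Reinterpret raw 32-bit patterns as floats: the lowest patterns are the
# subnormals, spaced uniformly 2^-149 apart.
def from_bits(pattern):
    return struct.unpack('<f', struct.pack('<I', pattern))[0]

print(from_bits(0x00000001))                 # 1.401298464324817e-45
print(from_bits(0x00000001) == 2**-149)      # True
print(from_bits(0x00000003) == 3 * 2**-149)  # True
```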

Extending the range further

Having a special value for numbers outside the normal range would be useful. For example, since a stored zero could actually be any number close to 0, dividing by it should result in an actual value! We can't know that value exactly (since we threw away that precision), so the division would just return our too‐big sentinel. This means that the sentinel effectively covers the part of the number line from the (new) largest regular value all the way to ∞.

So, let's dedicate the current largest regular value's bit pattern to it! Since our format is symmetric we'll also define a negative counterpart to have the same bit pattern as the positive one except with the sign bit being 1.

Properties of an indeterminate value

Since our new values cover infinitely large parts of the number line, the numbers they represent could be pretty much anywhere. So, how does math work with a number that we only know is very large? Well, in practice, it only really makes sense to give it the same properties that ∞ (or, more precisely, a limit that diverges to ∞) has traditionally, namely that it's contagious: most operations involving infinities will result in another infinity.

Since our new value behaves like an infinity, it would make sense to call it that as well.
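
A couple of quick examples of that contagiousness, using Python's built‐in floats:

```python
# Most arithmetic involving an infinity just yields another infinity.
inf = float('inf')
print(inf + 5)      # inf
print(inf * -2)     # -inf
print(inf + inf)    # inf

# Overflow produces it too: a result too large for the format rounds to
# infinity (shown here with Python's 64-bit floats).
print(1e308 * 10)   # inf
```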

Exponent bit patterns
Bit pattern   Meaning
11111111      2^128 & special case: infinities
11111110      2^127
11111101      2^126
…             …
00000010      2^−125
00000001      2^−126
00000000      2^−126 (subnormals)

Dealing with those question marks

Since two infinities could have been the result of rounding numbers of very different magnitudes, we can't assign any sensible value to ∞ − ∞. Similarly, ∞ × 0 can't have a definite value assigned to it — a very big number we don't know exactly times a very small number we don't know exactly could be literally anything.

So, what should happen when a question‐mark operation happens? Well, first of all, an error should probably be raised somewhere. However, some programs might not want to be interrupted when working with these kinds of values, so we can't do much more than setting a flag in the processor.

But, since a question‐mark operation can't suspend execution, it needs to return a value. Let's call it Not a Number, or NaN. Now, where do we put it?

Note that since there are several different operations that could produce NaNs, it might be useful to tell what produced a NaN from the NaN itself. In other words, we need to be able to attach extra data to a NaN.

Since we need to attach extra data to NaNs, let's dedicate an entire exponent to them, say, the highest one. We'll need to relocate the infinities, so let's redefine them to be an all 0 mantissa with a maximum exponent (and an arbitrary sign bit). All the other mantissa values can now correspond to NaNs.

Properties of an even more indeterminate value

A NaN, similarly to an infinity, could theoretically be an actual number where all of the information about it has been discarded. It could also be something which isn't a real number at all, such as a square root of a negative number. Therefore, NaNs are even more contagious than infinities: basically any operation involving a NaN returns NaN, and a NaN doesn't even compare equal to itself.
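
Here's what those properties look like in practice with Python's floats:

```python
# NaN propagates through arithmetic and is not equal to anything, even itself.
inf = float('inf')
nan = inf - inf          # one of the "question-mark" operations
print(nan)               # nan
print(nan + 1, nan * 0)  # nan nan
print(nan == nan)        # False
```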

Exponent bit patterns
Bit pattern   Meaning
11111111      Infinities and NaNs
11111110      2^127
11111101      2^126
…             …
00000010      2^−125
00000001      2^−126
00000000      2^−126 (subnormals)
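
To tie the whole construction together, here's a sketch of a complete decoder for our 32‐bit format, cross‐checked against the hardware's own interpretation via struct (assuming the standard field order: sign, exponent, mantissa):

```python
import struct

# Decode a raw 32-bit pattern using the rules built up in this article:
# normals, subnormals, signed zeroes, infinities and NaNs.
def decode_float32(pattern):
    sign     = (pattern >> 31) & 0x1
    exponent = (pattern >> 23) & 0xFF
    mantissa = pattern & 0x7FFFFF

    if exponent == 0xFF:                 # highest exponent: infinities and NaNs
        if mantissa:
            return float('nan')
        return float('-inf') if sign else float('inf')
    if exponent == 0:                    # lowest exponent: zero and subnormals
        value = (mantissa / 2**23) * 2.0**-126
    else:                                # normals, with the implicit leading 1
        value = (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return -value if sign else value

# Compare against the processor's interpretation of the same bit patterns.
for pattern in (0x3F800000, 0x40490FDB, 0x00000001, 0x80000000, 0x7F800000):
    hardware = struct.unpack('<f', struct.pack('<I', pattern))[0]
    print(hex(pattern), decode_float32(pattern), hardware)
```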

Conclusion

We have now reinvented the three most common number formats used by modern computers: unsigned integers, two's complement integers and IEEE 754 floating point numbers (well, mostly — some details have been left out, mostly regarding NaNs and some arithmetic edge cases).

Note that these formats come in a variety of sizes — 8‐bit, 16‐bit, 32‐bit and 64‐bit integers are common, and floats typically come in 32‐bit and 64‐bit (which has 11 bits of exponent and 52 bits of mantissa) varieties.

While the basic functionality of the above‐mentioned number formats has been explained, they also lend themselves to various bit‐manipulation tricks, which I won't be explaining here.

Also, while integers and floats are what the processor works with directly, programs can use more complex number formats based upon the fundamental ones: complex numbers are represented as a pair of floats; arbitrary‐precision integers are a big block of data interpreted as an integer in software; and decimal numbers have their digits represented with groups of bits (which is useful if you're working with something inherently base‐10 and don't want to lose precision, such as currency).
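
For instance, Python exposes all three of these software‐level formats directly:

```python
from decimal import Decimal

# A complex number is just a pair of floats under the hood.
z = complex(3.0, -4.0)
print(z.real, z.imag, abs(z))             # 3.0 -4.0 5.0

# Python's integers are arbitrary-precision, handled in software.
print(2 ** 200)

# Decimal keeps base-10 digits, avoiding binary rounding surprises.
print(0.1 + 0.2)                          # 0.30000000000000004
print(Decimal('0.10') + Decimal('0.20'))  # 0.30
```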

Lastly, if you ever want to play with floating-point numbers, check out float.exposed! It provides support for more floating-point formats and has additional visuals over the demos in this article.