# Editorials

Problems with the storage of decimal fractions in binary
Feb. 04, 2000 (Revised)

First, let's consider the representation of decimal integers in binary:

The decimal number 11 is 1011 in binary because, multiplying by the power of the columns, you get:

1*(2^3) + 0*(2^2) + 1*(2^1) + 1*(2^0) = 11

With each additional digit you double the size of the decimal number that can be stored because you raise the base of 2 by one more power. This is why 24 bits can represent is 16 million colors.

2^24 = 16,777,216.

Now let's consider the representation of decimal fractions in binary:

Each digit in a binary fraction represents an inverse power of 2. (Remember that the zeroeth power is taken for integers.) I'll use the standard convention of representing inverse powers with negative exponents. So,

2^(-1) = 1/(2^1) = .5
2^(-2) = 1/(2^2) = .25
2^(-3) = 1/(2^3) = .125
2^(-4) = 1/(2^4) = .0625

So the binary fraction .1101 represents the decimal fraction .8125 because, multiplying by the power of the columns, you get:

1*(2^(-1)) + 1*(2^(-2)) + 0*(2^(-3)) + 1*(2^(-4)) = .8125

With each additional digit you double the number of decimal fractions that can be represented in binary.

Putting integers and fractions together, the decimal number 11.8125 is represented in binary as 1011.1101

Unfortunately, the system of binary storage has some major pitfalls. Considering four bits of fractional binary precision, lets look at some of the decimal fractions that can be represented:

Binary .0000 is .0000 in decimal.
Binary .0001 is .0625 in decimal.
Binary .0010 is .1250 in decimal.
Binary .0011 is .1875 in decimal.
Binary .0100 is .2500 in decimal.
Binary .0101 is .3125 in decimal.

So if we had the decimal number .05 and we needed to represent it with four binary digits, we simply couldn't do it. Instead we would have to round it to the nearest decimal fraction that can be represented with four binary digits, which is .0625.

The is the infamous rounding error that causes problems with displaying nearly coincident faces in the viewport Z buffer and with booleans performed on objects that are really small. Those fractional sizes are raising utter hell with the binary calculations because of the rounding error.

Internally, fractions are stored as floating point binaries in memory. Ever wonder what "floating" means in the term "floating point"? Well, each number in the computer is represented by a fixed number of bits. But, depending on the need of the number at hand, the decimal point can "float" from one end to the other:

110110.10 would be a large number with low fractional accuracy, whereas,
1.1011010 would be a small number with high fractional accuracy.

Note that in both cases, exactly 8 bits were used, but the placement of the decimal had a big impact on the accuracy of the stored number. This is why we have an Origin slider in the Preference dialog of MAX 3.0. It allows you to visualize the tradeoff between the size of the number and its fractional accuracy. [Search on "General Preferences Settings" in the Online Reference for more info on this new feature.]

Now, of course, the decimal point isn't literally stored in memory like a mysterious dot. Here is William L. Bahn's explanation about how floats are actually stored internally:

--
 Author: William L. Bahn Organization: Dynasys Technical Services Date: 20 Nov 1997 11:25:42 GMT

It's basically the same as converting whole numbers into binary.

Using fixed point math, if I want to represent the number 5.6 in binary using the format

ABC.DEFGH

(i.e., one byte with five places to the right of the radix point)

The weightings are:

A= 2^(2) =4
B= 2^(1) =2
C= 2^(0) =1
D= 2^(-1) = 1/2 = 0.5
E = 2^(-2) = 1/4 = 0.25
F = 2^(-3) = 1/8 = 0.125
G = 2^(-4) = 1/16 =0.0625
H = 2^(-5) = 1/32 = 0.03125

So the answer would be 101.10011 = 5.59375 (Notice the rounding error. -- JC)

There are lot's of algorithms and most of the ones that work for whole numbers can be extended to do fixed point numbers. Conceptually, the easiest way to do this is to move the radix all the way to the right by multiplying by the appropriate power of two. In this case, multiply 5.6 by 32 and you get 179.2. Any fractional part is beyond your ability to represent with the number of digits in your representation, so either truncate them or figure out a way to round the one's digit. This leaves you with 179 and the binary representation for 179 is 10110011, which is what we've got above.

So how would we represent 5.6 in a two byte floating point representation where six bits are used for the exponent? If doing it by hand, the first step might be to convert the number to a fixed point number and then shift the radix point the proper number of places. If we are using 6 places for the exponent, than that leaves us with 10 for the mantissa - but we actually get 11 because we already know that the first digit will be a one, always. So why store it. We'll just make the hardware such that the leading one is always understood to be there.

So 5.6 as a fixed point number with 11 bits would be 101.10011010 = 5.6015625

In exponential notation, this would be

1.0110011010 x 2^(10)

So our floating point number would be

0110011010000010

Remember, the rightmost six bits are the exponent and the radix is to the far left with a missing, but understood 1 in the next position to the left.
- -

Another way to combat rounding error is to use more bits to describe the number and thereby increase the accuracy. This is where the term "double-precision" comes from, since a double-word (or 32 bits) is used to describe the floating point number instead of a singe-word (or 16 bits). In MAX 3.0 for example, the viewport z-buffer was upped to double-precision instead the single-precision in MAX 2.5. This means that MAX 3.0 will be better able to display nearly coincident faces correctly in the viewports.

Unfortunately, the penalty of double-precision is the doubled memory usage required to store it. There have been debates over pushing the storage of all floats in MAX from single- to double-precision. This would greatly benefit people performing booleans on small objects and/or pushing their objects great distances from the Origin. But it was deemed by the developers that this would push the memory requirements for scenes too high, and I tend to agree. MAX is RAM hungry enough as it is.

If you really want to delve deep into the binary representation of fractions, let me suggest this excellent page by William L. Bahn: Binary Representations - Floating Point.

James Coulter