Fixed Point Arithmetic in C Programming

深蓝 · Answer 1 · 2021-10-16T23:30:19+0000

The idea behind fixed-point arithmetic is that you store values multiplied by a fixed amount, use the multiplied values for all calculations, and divide by the same amount when you want the result. The purpose of this technique is to use integer arithmetic (int, long...) while still being able to represent fractions.

The usual and most efficient way of doing this in C is to use the bit-shift operators (<< and >>). Shifting bits is a simple and fast operation for the ALU, and each shift multiplies (<<) or divides (>>) the integer value by 2 (besides, several shifts can be done for the same price as a single one). Of course, the drawback is that the multiplier must be a power of 2 (which is usually not a problem in itself, as we don't really care about the exact multiplier value).
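To make the shift/multiply equivalence concrete, here is a minimal sketch. The helper names mul_pow2 and div_pow2 are made up for illustration, and it sticks to unsigned 32-bit values so that the shifts are well defined.

```c
#include <stdint.h>

/* Multiply by 2^n with a left shift. n must be less than 32; bits shifted
   past bit 31 are simply lost. */
static uint32_t mul_pow2(uint32_t x, unsigned n) { return x << n; }

/* Divide by 2^n with a right shift. The remainder is discarded, as with
   integer division. */
static uint32_t div_pow2(uint32_t x, unsigned n) { return x >> n; }
```

For example, mul_pow2(5, 3) gives 40 and div_pow2(41, 3) gives 5.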

Now let's say we want to use 32-bit integers for storing our values. We must choose a power-of-2 multiplier. Let's split the bits evenly and say 65536 (this is the most common case, but you can really use any power of 2 depending on your precision needs). This is 2¹⁶, and the 16 here means that we will use the 16 least significant bits (LSB) for the fractional part. The rest (32 - 16 = 16), the most significant bits (MSB), holds the integer part.

     integer (MSB)    fraction (LSB)
           v                 v
    0000000000000000.0000000000000000

Let's put this in code:

#define SHIFT_AMOUNT 16 // 2^16 = 65536
#define SHIFT_MASK ((1 << SHIFT_AMOUNT) - 1) // 65535 (all LSB set, all MSB clear)

int price = 500 << SHIFT_AMOUNT;

This is the value you must store (in a structure, a database, whatever). Note that int is not necessarily 32 bits in C, even though it mostly is nowadays. Also, without further declaration, it is signed by default. You can add unsigned to the declaration to be sure. Better than that, you can use uint32_t or uint_least32_t (declared in stdint.h) if your code depends heavily on the integer width (for instance, if it relies on bit-level tricks). When in doubt, use a typedef for your fixed-point type and you're safer.
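The typedef advice could look like the sketch below. The names ufixed_t, ufixed_from_uint, ufixed_whole and ufixed_frac are hypothetical, and the sketch assumes an unsigned 16.16 layout as in the diagram above.

```c
#include <stdint.h>

#define SHIFT_AMOUNT 16
#define SHIFT_MASK ((UINT32_C(1) << SHIFT_AMOUNT) - 1)

/* Hypothetical type name: unsigned fixed-point, 16 integer + 16 fraction bits. */
typedef uint32_t ufixed_t;

/* Convert a whole number (0..65535) to fixed-point. */
static ufixed_t ufixed_from_uint(uint32_t whole) { return whole << SHIFT_AMOUNT; }

/* Integer part of a fixed-point value. */
static uint32_t ufixed_whole(ufixed_t v) { return v >> SHIFT_AMOUNT; }

/* Raw fractional part, in units of 1/65536. */
static uint32_t ufixed_frac(ufixed_t v) { return v & SHIFT_MASK; }
```

If the bit width ever has to change, only the typedef and the two macros need to be touched.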

When you want to do calculations on this value, you can use the 4 basic operators: +, -, * and /. Keep in mind that when adding or subtracting a value (+ and -), that value must also be shifted. Let's say we want to add 10 to our 500 price:

price += 10 << SHIFT_AMOUNT;

But for multiplication and division (* and /) by a plain integer, the multiplier/divisor must NOT be shifted. Let's say we want to multiply by 3:

price *= 3;

Now let's make things more interesting by dividing the price by 4, so that we end up with a non-zero fractional part:

price /= 4; // now our price is ((500 + 10) * 3) / 4 = 382.5
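The multiply/divide rule above covers the case where the other operand is a plain integer. When both operands are fixed-point, the product carries the scale factor twice and the quotient loses it, so one rescaling step is needed. The following is only a sketch under that reasoning: fixed_mul and fixed_div are made-up names, a signed 16.16 layout is assumed, and a 64-bit intermediate is used to avoid overflow.

```c
#include <stdint.h>

#define SHIFT_AMOUNT 16

/* a * b is scaled by 2^16 * 2^16, so divide once by 2^16 to get back to
   16.16. Division is used instead of >> so negative values behave the same
   on every platform. */
static int32_t fixed_mul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) / ((int64_t)1 << SHIFT_AMOUNT));
}

/* a / b cancels the scale factor, so scale a by 2^16 first.
   b must not be zero. */
static int32_t fixed_div(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * ((int64_t)1 << SHIFT_AMOUNT)) / b);
}
```

With 1.5 stored as 98304 and 2.0 stored as 131072, fixed_mul returns 196608 (3.0), and dividing that by 2.0 with fixed_div gives 98304 again.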

That's all there is to the rules. When you want to retrieve the real price at any point, you must right-shift:

printf("price integer is %d\n", price >> SHIFT_AMOUNT);

If you need the fractional part, you must mask it out:

printf("price fraction is %d\n", price & SHIFT_MASK);

Of course, this value is not what we can call a decimal fraction, in fact it is an integer in the range [0 - 65535]. But it maps exactly with the decimal fraction range [0 - 0.9999...]. In other words, mapping looks like: 0 => 0, 32768 => 0.5, 65535 => 0.9999...

An easy way to see it as a decimal fraction is to resort to C built-in float arithmetic at this point:

printf("price fraction in decimal is %f\n", ((double)(price & SHIFT_MASK) / (1 << SHIFT_AMOUNT)));

But if you don't have floating-point support (either hardware or software), you can use your new skills to print the complete price like this:

printf("price is roughly %d.%05lld\n", price >> SHIFT_AMOUNT, (long long)(price & SHIFT_MASK) * 100000 / (1 << SHIFT_AMOUNT));

The number of 0's in the multiplier (100000 here) is the number of digits printed after the decimal point, and the field width in %05lld must match it so that leading zeros in the fraction are kept (otherwise 0.05 would print as .5000). Don't ask for more digits than your fraction precision can provide. Don't use plain long, as sizeof(long) can be equal to sizeof(int). Use long long if int is 32 bits, as long long is guaranteed to be at least 64 bits (or use int64_t, int_least64_t and such, declared in stdint.h). In other words, use a type twice the size of your fixed-point type. Finally, if you don't have access to types of 64 bits or more, maybe it's time to practice emulating them, at least for your output.
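The same idea can be wrapped in a small helper. This is only a sketch: fixed_format is a made-up name, it assumes unsigned 16.16 values, and it zero-pads the fraction to five digits so that a fraction like 0.05 is not printed as .5000.

```c
#include <stdint.h>
#include <stdio.h>

#define SHIFT_AMOUNT 16
#define SHIFT_MASK ((UINT32_C(1) << SHIFT_AMOUNT) - 1)

/* Formats an unsigned 16.16 value with 5 decimal digits (truncated, not
   rounded), using integer arithmetic only. Returns what snprintf returns. */
static int fixed_format(char *buf, size_t size, uint32_t v) {
    uint32_t whole = v >> SHIFT_AMOUNT;
    /* 65535 * 100000 does not fit in 32 bits, hence the 64-bit intermediate. */
    uint64_t frac = ((uint64_t)(v & SHIFT_MASK) * 100000) >> SHIFT_AMOUNT;
    return snprintf(buf, size, "%lu.%05llu",
                    (unsigned long)whole, (unsigned long long)frac);
}
```

Called with the raw value 25067520 (382.5), this writes "382.50000" into the buffer.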

These are the basic ideas behind fixed-point arithmetic.

Be careful with negative values. They can become tricky, especially when it's time to show the final value. Besides, C does not fully specify shifts on signed integers: right-shifting a negative value is implementation-defined, and left-shifting one is undefined behavior (even though platforms where this causes problems are very uncommon nowadays). You should always run minimal tests in your environment to make sure everything goes as expected. If not, you can work around it if you know what you are doing (I won't elaborate on this, but it has to do with arithmetic shift vs logical shift and 2's complement representation). With unsigned integers, however, you're mostly safe whatever you do, as their behavior is well defined.
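One way to sidestep the display problem is to format the magnitude and print the sign separately. The sketch below assumes a signed 16.16 value held in an int32_t; fixed_format_signed is a made-up name. On platforms with an arithmetic right shift, the naive shift-and-mask approach would split -0.5 into -1 and .5, which is what this avoids.

```c
#include <stdint.h>
#include <stdio.h>

#define SHIFT_AMOUNT 16
#define SHIFT_MASK ((UINT32_C(1) << SHIFT_AMOUNT) - 1)

/* Formats a signed 16.16 value with 5 decimal digits (truncated). */
static int fixed_format_signed(char *buf, size_t size, int32_t v) {
    /* Take the magnitude in unsigned arithmetic, which is well defined even
       for INT32_MIN, then shift and mask the unsigned value only. */
    uint32_t mag = (v < 0) ? (uint32_t)0 - (uint32_t)v : (uint32_t)v;
    uint32_t whole = mag >> SHIFT_AMOUNT;
    uint64_t frac = ((uint64_t)(mag & SHIFT_MASK) * 100000) >> SHIFT_AMOUNT;
    return snprintf(buf, size, "%s%lu.%05llu", (v < 0) ? "-" : "",
                    (unsigned long)whole, (unsigned long long)frac);
}
```

With the raw value -32768 (that is, -0.5) the buffer receives "-0.50000".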

Also note that while an unsigned 32-bit integer can represent values up to 2³² - 1, fixed-point arithmetic with a 2¹⁶ multiplier limits the integer part to 2¹⁶ - 1! With signed integers the range is halved, which in our example leaves a maximum integer part of 2¹⁵ - 1. The goal is then to choose a SHIFT_AMOUNT suitable for the situation. This is a tradeoff between integer part magnitude and fractional part precision.
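The range limit is easy to check in code. Below is a rough sketch for the signed 16.16 case; FIXED_INT_MAX, FIXED_INT_MIN, fits_in_fixed and fixed_add_checked are all names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHIFT_AMOUNT 16

/* Largest and smallest whole numbers a signed 16.16 value can hold. */
#define FIXED_INT_MAX (INT32_MAX >> SHIFT_AMOUNT) /* 32767 */
#define FIXED_INT_MIN (-FIXED_INT_MAX - 1)        /* -32768 */

/* True if the whole number can be converted without overflow. */
static bool fits_in_fixed(long whole) {
    return whole >= FIXED_INT_MIN && whole <= FIXED_INT_MAX;
}

/* Adds two raw fixed-point values; returns false (and leaves *out alone)
   if the sum does not fit in 32 bits. */
static bool fixed_add_checked(int32_t a, int32_t b, int32_t *out) {
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX || sum < INT32_MIN)
        return false;
    *out = (int32_t)sum;
    return true;
}
```

So a price of 30000 plus a price of 5000 is already out of range with this layout, even though both operands fit.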

Now for the real warnings: this technique is definitely not suitable in areas where precision is a top priority (finance, science, military...). The usual floating-point types (float/double) are also often not precise enough there, even though they have better properties than fixed-point overall. Fixed-point has the same absolute precision whatever the value (this can be an advantage in some cases), whereas the precision of floats depends on the magnitude of the value (i.e. the lower the magnitude, the more precision you get... well, it is more complex than that, but you get the point). Floats also cover a much greater range than integers of the same bit width (fixed-point or not), at the cost of a loss of precision with large values (you can even reach a magnitude where adding 1, or even greater values, has no effect at all, something that cannot happen with integers).
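The "adding 1 has no effect" case is easy to reproduce. A small sketch, assuming float is the usual IEEE 754 single-precision type with a 24-bit significand (true on most platforms, but not guaranteed by C); the function names are made up.

```c
#include <stdint.h>

/* Returns 1 if x + 1.0f rounds back to x. The volatile store forces the sum
   to be rounded to float even if the compiler computes it in a wider type. */
static int float_absorbs_one(float x) {
    volatile float sum = x + 1.0f;
    return sum == x;
}

/* Integer counterpart: below INT32_MAX, adding 1 always changes the value. */
static int int_absorbs_one(int32_t x) {
    return (x + 1) == x;
}
```

With IEEE 754 floats, float_absorbs_one(16777216.0f) returns 1 because 2^24 + 1 is not representable, while int_absorbs_one(16777216) returns 0.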

If you work in those sensitive areas, you're better off using libraries dedicated to arbitrary precision (take a look at gmplib, it's free). In computer science, gaining precision essentially comes down to the number of bits you use to store your values. You want high precision? Use more bits. That's all.
