James's Twisted Ape

Thoughts, code and anything else.

Comparing Floating Point Numbers

You often see the advice that you shouldn’t compare floating point numbers with the == operator, however it is rarely well explained, and often the alternative that is suggested to using == isn’t any better (and can be much worse).

The reason that people often say that you shouldn’t use the == operator is that floating point numbers don’t exactly represent the value that you set them to.

Looking at a decimal example, representing $\frac{1}{3}$ as a decimal floating point number gives $0.333333$. The floating point standard doesn’t have a notion of recurring numbers so the last place is rounded to a set precision. There may be an instance in our code where we want to find out if three times our number is equal to $1$, i.e. we may have code that does this:

1
2
3
4
5
6
7
8
decFloat a = 1.0 / 3;
...
if (3 * a == 1.0) {
  ...
} else {
  ...
}

Naïvely we may expect 3 * a to be 1.0 , however $0.333333 \cdot 3$ is $0.999999$ which is not equal to $1$ so the code will actually run into the else statement. In the case of binary floating point this example will actually work (i.e. (1.0f / 3) * 3 equals exactly 1.0). This is because multiplying by three causes our value to go above the next power of two, giving us more binary places of precision than a float can represent. This extra precision is used in the multiplication to round the answer up to 1.0. This isn’t the case for all examples of this type, for instance in binary (1.0f / 83) * 83 does not equal exactly 1.0 as the rounding ends up going in the other direction.

A solution to our problem is to choose a small value (often called epsilon) and say that if the difference between our values is less than epsilon then the test is passed. It’s very often the case that the results of floating point calculations are very close to the results that we expect (in fact right next to them on the floating point number line).

Choosing the Wrong Epsilon

As we only want the test to pass if we’re really close, choosing the smallest possible difference as our epsilon might lead to code that looks like this:

DON’T DO THIS IT’S WRONG!!

1
2
3
4
5
/* This is a broken floating point compare function, don't use it!! */
bool compareNearlyEqual (float a, float b) {
  return fabs (a - b) < FLT_EPSILON;
}

This will work great for values between $0.5$ and $2.0$, however it is broken for all numbers greater than $2$, for instance this will work:

1
2
3
4
5
6
7
float a = 1.0 / 83;
...
if (compareNearlyEqual (1, 83 * a)) {
  ...
} else {
  ...
}

However it won’t work for an example like this:

1
2
3
4
5
6
7
float a = 1.0 + 1.0 / 83;
...
if (compareNearlyEqual (84.0, 83 * a)) {
  ...
} else {
  ...
}

It fails to recognize 83 * a as close enough to 84.0. In this case there is no number between 83 * a and 84.0 i.e. 83 * a is the very next number that floating point numbers can represent, so we’d hope that our test would say that they are close enough.

The problem here is each time the you go past a power of $2$ the gap between each number that can be represented doubles. To see this imagine that we have a floating point representation that has a precision of 6 places. The number one would have a representation $1$ and the next value representable above one will have the value $1.00001$, so the gap between one and the next value is $0.00001$. Now if we do the same with two and the next number we get $10$ and $10.0001$. A gap of $0.0001$, which is twice as big as the gap between 1 and the next nearest value (which has four zeros before the $1$, not three).

In the coded example we’ve chosen FLT_EPSILON as our epsilon; this number is the gap between $1$ and the next nearest value. If the floating point numbers we are trying to compare are larger than $2$ the gap between any float and the next nearest float is larger than our FLT_EPSILON. This means that the numbers we are comparing are either identical (so we may as well have used ==) or the gap between them is larger than FLT_EPSILON so the test will fail (just as it would have done if we’d used ==). Here’s a piece of code demonstrating that this compare method is broken.

If your number is greater than 2 the above comparison function is identical to ==

So what Epsilon Should you Choose?

Due to floating point inaccuracy one of the most common problems is that you end up calculating a value that is right next to the value that you wanted. This means that the epsilon you want to choose is the gap between the larger float and its next nearest value. A value very close to this can be found easily by multiplying the larger of the two values by FLT_EPSILON. This leads to a floating point compare function like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
bool compareNearlyEqual (float a, float b) {
  float epsilon;
  /* May as well do the easy check first. */
  if (a == b)
    return true;

  if (a > b) {
    epsilon = a * FLT_EPSILON;
  } else {
    epsilon = b * FLT_EPSILON;
  }

  return fabs (a - b) < epsilon;
}

Here’s an example of this compare function working where the original compare function failed.

Why is the incorrect compare so prevalent?

The answer to this is that == works a lot better than most people think, so it usually doesn’t matter much that their compare function is almost always doing exactly the same thing as ==. For instance due to rounding the first time that (1.0f / n) * n doesn’t equal exactly $1$ is when n is $41$. There is a large and commonly used subset of floating point numbers where the representation of the number is exact. Any number of the form $\frac{a}{2^n}$ (note that $n$ can be zero) will have an exact representation (provided $a$ is representable within the precision of the standard being used). With 32 bit floats this includes all integers up to $2^{24}$. If your floating point number is exactly representable you can add, subtract and multiply them by other exact floating point numbers and the answer will remain exact (again provided the answer doesn’t go over your precision). It is also very common to be able to divide and have the result remain exact. With this important section of numbers you could use == without causing yourself any problems.

Basically == works so often that despite the wrong compare routine having exactly the same result as using == (for the values likely to be used) most of the time you’ll get exactly the result you expected, so you won’t notice.

What can still go wrong?

Floating point numbers can still end up farther apart than the epsilon that you have chosen. The easiest way to demonstrate this is with subtraction. For instance let’s work with 6 places of precision and we compare $10 \frac{1}{3} - 10$ with $\frac{1}{3}$. Working through $10.3333 - 10 = 0.333300$ comparing this with our compare function gives $0.333333 - 0.3333 = 0.000033$ this value is much bigger than the epsilon we’d calculate for value ($0.000001$). Really we need to use the epsilon that we would get for comparing the number $10.3333$, as when we do the subtraction we cut off the first few digits, and can’t just invent new digits to go at the end, so it is assumed that those digits are $0$. If possible it would be better to write this compare the other way round i.e. compare $10 \frac{1}{3}$ with $\frac{1}{3} + 10$.

You may decide that you want to increase the size of your epsilon so that it catches more cases. To do this it’s a good idea to multiply the epsilon in the function by a value (this could either be a constant or a value you pass in so you can change your epsilon for different situations).

No number is close to 0, this is in general true. The size of a value depends on your perspective of it. For instance if you’re dealing with numbers around a million, 1 looks really close to 0. If on the other hand you’re dealing with numbers close to one millionth 1 looks like a really enormous number. This means that if you want to compare numbers to 0 and you think the floats your comparing may not be exact, you need to choose a value which is you’re smallest non-zero value i.e., in this case you need to have a static epsilon. The epsilon for your code may be much smaller than FLT_EPSILON or it could be much larger, depending on the size of values that you are normally dealing with. As the floating point standard can’t represent all numbers it does actually have a value that is next to zero but this would normally be a very poor choice of epsilon for anything.

There are other times when doing a comparison that you may want to choose a static epsilon, particularly if you’re wanting to behave in a certain way depending on if you’re around a number. This isn’t really the same situation as trying to determine whether you have floating point inaccuracies.

Really Horrible Effects of Using a Constant Epsilon

It’s not great that using FLT_EPSILON as a constant epsilon reduces the compare function to == for values greater than 2 (I explored this because normally programmers are dealing with numbers of this sort of size). It’s even worse if your code has to deal with small numbers. The smaller your numbers get, the larger FLT_EPSILON is compared to these numbers. This means that if you’re doing calculations with numbers around $10^{-8}$ and smaller and you use a static epsilon of FLT_EPSILON all of your numbers will compare equal to each other so any comparison you do will be entirely pointless (it will always be true).

Conclusion

If you’re trying to solve floating point inaccuracy problems using an epsilon the epsilon has to be relative to the size of the numbers that you are comparing. If it isn’t then there is a good chance that your comparison function is pointless for most floating point numbers and dangerous to use with others.

There can be good reasons to use static epsilons, however you do need to think carefully about what the epsilon is going to be. The size and range of the floating point numbers you are using needs to have a strong bearing on the epsilon you choose. It’s incredibly rare that FLT_EPSILON is a good choice of static floating point epsilon. It may be OK if you’re only comparing numbers that are between $0.5$ and $2.0$ or as a close to zero number.