JoshJers' Ramblings

Infrequently-updated blog about software development, game development, and music

Posts tagged “floating point”

Emulating the FMAdd Instruction, Part 2: 64-bit Floats

(This post follows Part 1: 32-bit floats and will make very little sense without having read that one first. Honestly, it might make little sense having read that one first, I dunno!)

Last time we went over how to calculate the results of the FMAdd instruction (a fused-multiply-add calculated as if it had infinite internal precision) for 32-bit single-precision float values:

  • Calculate the double-precision product of a and b
  • Add this product to c to get a double-precision sum
  • Calculate the error of the sum
  • Use the error to odd-round the sum
  • Round the double-precision sum back down to single precision

This requires casting up to a 64-bit double-precision float to get extra bits of precision. But what if you can't do that? What if you're using doubles? You can't just (in most cases) cast up to a quad-precision float. So what do you do?

Double-Precision FMAdd

(Like the 32-bit version, this is based on Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd by Sylvie Boldo and Guillaume Melquiond; there are additional details in there)

To do this natively as doubles, we need to invent a new operation: MulWithError. This is the multiplication equivalent of the AddWithError function from the 32-bit solution:

(double prod, double err) MulWithError(double x, double y)
{
  double prod = x * y;
  double err = // ??? how do we do this
  return (prod, err);
}

We'll get to how to implement that in a moment, but first we'll walk through how to use that function to calculate a proper FMAdd.

We need to do the following:

  • Calculate the product of a and b and the error of that product
  • Calculate the sum of that product and c (giving us a * b + c) and the error of this sum
    • We're not using the error of the product ... yet
  • Add the two error terms (product error and sum error) together, rounding the result to odd
  • Add this summed error term to our actual result, which will round normally.

In code, that looks like this:

// Start with an "OddRoundedAdd" helper since we do
//  this operation frequently
double OddRoundedAdd(double x, double y)
{
  (double sum, double err) = AddWithError(x, y);
  return RoundToOdd(sum, err);
}

double FMAdd(double a, double b, double c)
{
  (double ab, double abErr) = MulWithError(a, b);
  (double abc, double abcErr) = AddWithError(ab, c);

  // Odd-round the sum of the two errors before 
  //  adding it in to the final result.
  double err = OddRoundedAdd(abErr, abcErr);
  return abc + err;
}

By keeping the error terms from both the product and the sum, we have kept all of the exact result. That is, we can assemble the mathematically-exact result given enough precision by doing abc + abErr + abcErr.

But we can't do infinite-precision addition of three values. However, we can odd-round an intermediate result, the same way we did with the single-precision case.

In this case, we know that abErr and abcErr both (necessarily) have much lower magnitudes than the final result, since each error value sits below the lowest bit of the mantissa of its respective operation. That means the odd-rounded sum of these two values effectively satisfies the "more bits of precision than the final result" condition: all of its bits live far below the mantissa of the result. So, if we add the error terms together with odd rounding, the odd-rounded fake-sticky final digit will be taken into account by the actual sticky bit used when doing the final sum of the result and error terms.

I hope that makes sense?

Breaking Down Multiplication With Error

(Like AddWithError, this is based on A Floating-Point Technique For Extending the Available Precision by T.J. Dekker)

So how do we calculate the error term of a 64-bit multiply? We can't use 128-bit values, but what we can do is break each 64-bit value up into two values, each with fewer bits of precision.

We'll break x and y (our two multiplicands) up into high and low values, where:

x = xh + xl;
y = yh + yl;

We do this by breaking the double's mantissa up:

  • xh contains the top 25 bits of x's mantissa (plus its implied 1, giving it 26 bits of precision).
  • xl contains the bottom 27 bits of x's mantissa. The highest-set '1' in this mantissa will become the implied 1 bit that's part of the standard floating-point format, so this value will have 27 bits of precision, max.
  • yh and yl are the same, but for y.

(double h, double l) Split(double v)
{
  // In C++ this Zero function can be implemented by  masking off
  //  the bottom 27 bits by casting to a 64-bit int:
  // constexpr uint64_t Mask = ~0x07ff'ffff;
  // double h = std::bit_cast<double>(
  //              std::bit_cast<uint64_t>(v) & Mask);
  double h = ZeroBottom27BitsOfMantissa(v);

  // We can get the lower bits of the mantissa (correctly normalized and
  //  with correct signs) by subtracting the extracted upper bits from 
  //  the original value.
  double l = v - h;
  return (h, l);
}

What does this split value give us? Well, we can now break the multiplication up into a sequence of multiplies and adds of values that each have fewer bits of precision, using our old friend from Algebra, FOIL:

    x * y 
= (xh + xl) * (yh + yl) 
= xh*yh + xh*yl + xl*yh + xl*yl;

We can't actually do those adds directly, but what we can do is similar to how we did AddWithError: use a sequence of precision-preserving operations to calculate the difference between that idealized result and our rounded product:

(double prod, double err) MulWithError(double x, double y)
{
  double prod = x * y;

  (double xh, double xl) = Split(x);
  (double yh, double yl) = Split(y);

  // Parentheses to demonstrate the precise order these 
  //  operations must occur in
  double err = (((xh*yh - prod) + xh*yl) + xl*yh) + xl*yl;
  return (prod, err);
}

It works like this:

  • Calculate the (rounded) product of x and y
  • Subtract that rounded product from the product of xh and yh
    • These should have roughly the same magnitude (and definitely the same sign) so this is a precision-preserving subtraction.
    • Since |xh * yh| <= rounded(|x * y|) (because xh and yh are truncated versions of x and y and thus have lower magnitudes), this is a smaller-minus-larger operation, and we'll get a result with a sign opposite that of the final product.
  • Keep adding in the next-lower-magnitude values, which continues to preserve precision
    • (because our running value has the opposite sign, these are effectively subtractions, in the same way that a + -b is)
    • It's also worth noting here that xh*yl and xl*yh will have equivalent magnitudes so the order that you add them in doesn't matter, as long as they're both after xh*yh and before xl*yl

Once you've done that, you have the computed product as well as the error term, and we can then follow our FMAdd algorithm above to calculate the FMAdd.

So, that's it, we're done, right?

Edge Cases

Nope! Well, yes if you just wanted the gist, but now it's time to get into all those annoying implementation details that the papers this is based on completely glossed over. Here's where it gets ugly (unless you thought it was already ugly, in which case, sorry, it's about to get worse somehow).

In our single-precision case, we didn't have to worry about exponent overflow or underflow because we were using double-precision intermediates, which not only have additional mantissa precision, but also additional exponent range.

It's possible that the product of a * b (an intermediate value in our calculation) goes out of range of what a double can represent, but that the addition of c might bring the final result back into range (which can happen when the sign of c is opposite the sign of a * b). This causes a different set of errors on either end:

  • If a*b is too large to be represented, it turns into infinity which means that adding c in will just leave it as infinity even though the final result should have been a representable value (albeit one with a very large magnitude)
  • If a*b is too small to be represented, it will go subnormal, which means bits of the intermediate result will slide off the bottom of the mantissa and we lose bits of information, which can cause us to round incorrectly in our final result.

To solve this, we'll introduce a bias into the calculation for when the value goes very small or very large:

double CalculateFMAddBias(double a, double b, double c)
{
  // Calculate what our final result would be if we just did it normally
  double testResult = Abs(a * b + c);

  if (testResult < Pow2(-500) && Max(a, b, c) < Pow2(800))
  {
    // Our result is very small and our maximum value is not so large 
    //  that we'll blow up with a bias, so bias our values up to
    //  ensure we don't go subnormal in our intermediate result
    return Pow2(110);
  }
  else if (IsInfinite(testResult))
  {
    // We hit infinity, but that might be due to exponent overflow,
    //  so bias everything down (this may cause c to go subnormal, 
    //  but if that's the case then a*b on its own is infinity and
    //  so it won't affect the final result in any way)
    return Pow2(-55);
  }
  else
  {
    // No bias needed
    return 1.0;
  }
}

For any results that aren't extreme, the bias will remain 1.0, but for values at the extremes we'll scale our intermediates (using powers of 2, which only affect the exponent and not the mantissa) into a range where they can't temporarily poke outside of what a double can represent. Also note that my choices of powers of 2 are not perfectly tuned; I didn't bother trying to figure out the exact right biases/thresholds, so I just picked ones that I knew were good enough.

So then we do our FMAdd calculation as before, but with the bias introduced (and then backed out at the end):

// Do our multiplication with the bias applied to 'a'
//  (the choice of applying it to 'a' vs 'b' is completely
//  arbitrary)
(double ab, double abErr) = MulWithError(a * bias, b);

// Then the sum with the bias applied to 'c'
(double abc, double abcErr) = AddWithError(ab, c * bias);

double err = OddRoundedAdd(abErr, abcErr);

// Calculate our final result then un-bias the result.
return (abc + err) / bias;

Alright, we've avoided both overflow and underflow and everything is great, right?

Two (Point Five) Last Annoying Implementation Details

Nope, sorry again! It turns out there are still two cases we need to deal with.

Case 1: Infinity or NaN even with the bias

If our result (without error applied) hits infinity even with the avoid-infinity bias applied, then we should just go ahead and return now to avoid Causing Problems Later (that is, turning what should be infinity into a NaN). And if it's already NaN we can just return now because it's going to be NaN forever:

if (IsInfiniteOrNaN(abc))
  { return abc; }

Case 2: Subnormal Results

If our result is subnormal (after the bias is backed out), then it's going to lose bits of precision as it shifts down (because the exponent can't go any lower so instead the value itself shifts down the mantissa), which means whoops here's another rounding step, and the dreaded double-rounding returns.

In this case we need to actually odd-round the addition of the error term as well, so that when the bias is backed out and it rounds, it does the correct thing:

// Multiply the smallest-representable normalized value by our avoid-
//  subnormal bias. Any (biased) value below this will go subnormal.
//  (In production code it'd be nicer to use something like
//  std::numeric_limits instead of hard-coding -1022)
const double SubnormThreshold = Pow2(-1022) * AvoidSubnormalBias;

if (bias == AvoidSubnormalBias && Abs(abc) < SubnormThreshold)
{
  // Odd-round the addition of the error so that the rounding that 
  //  happens on the divide by the bias is correct.
  (double finalSum, double finalSumErr) = AddWithError(abc, err);
  finalSum = RoundToOdd(finalSum, finalSumErr);
  return finalSum / bias;
}

And this almost works, except there's one more annoying case, and that's where our result is going subnormal, but only by exactly one bit. Remember that the odd-rounding trick only works if we have two or more extra bits so that the final rounding works properly, but in this case we're truncating the mantissa by exactly one bit, so we have to do even more work:

  • Split the value that will be shifting down into a high and low part (same as we did for the multiply)
  • Add our error term to the low part of it
    • This preserves additional bits of the error term since we gave ourselves more headroom by removing the upper half of its mantissa
  • Remove the bias from both the high and low parts separately
    • Removing the bias from the high part doesn't round since we know the lowest bit is 0
    • Removing the bias from the low part applies the actual final rounding (correctly) since we gave ourselves more bits to work with
  • Sum the halves back together and return that as our final result
    • This sum is (thankfully) perfectly representable by the final precision and doesn't introduce any additional error.

const double OneBitSubnormalThreshold = 
  SubnormThreshold * 0.5;
if (Abs(finalSum) >= OneBitSubnormalThreshold)
{
  // Split into halves
  (double rh, double rl) = Split(finalSum);

  // Add the error term into the low part of the split
  rl = OddRoundedAdd(rl, finalSumErr);

  // Scale them both down by the bias. Note that 
  //  the rh division cannot round since the lowest bit
  //  is 0
  rh /= bias;

  // This division is what actually introduces the final
  //  rounding (correctly, since we gave ourselves more
  //  bits to work with)
  rl /= bias;

  // This sum is perfectly representable by the final
  //  precision and will not introduce additional error.
  return rh + rl;
}

OMG Are We Done Now?

As far as I'm aware, those are all the implementation details to doing a 64-bit double-precision FMAdd implementation. It's conceptually not that much more complicated than the 32-bit one, but mechanically it's worse, plus there are those fun extra edge cases to consider.

Here's the final code:

(double h, double l) Split(double v)
{
  double h = ZeroBottom27BitsOfMantissa(v);
  double l = v - h;
  return (h, l);
}

(double prod, double err) MulWithError(
  double x, 
  double y)
{
  double prod = x * y;

  (double xh, double xl) = Split(x);
  (double yh, double yl) = Split(y);
  double err = 
    (((xh*yh - prod) + xh*yl) + xl*yh) + xl*yl;
  return (prod, err);
}

double OddRoundedAdd(double x, double y)
{
  (double sum, double err) = AddWithError(x, y);
  return RoundToOdd(sum, err);
}

double FMAdd(double a, double b, double c)
{
  const double AvoidSubnormalBias = Pow2(110);
  double bias = 1.0;
  {
    // Calculate our final result as if done normally
    double testResult = Abs(a * b + c);

    // Bias if the result goes too low or too high
    if (testResult < Pow2(-500) && Max(a, b, c) < Pow2(800))
      { bias = AvoidSubnormalBias; } // too low
    else if (IsInfinite(testResult))
      { bias = Pow2(-55); } // too high
  }

  // Calculate using our bias
  (double ab, double abErr) = MulWithError(a * bias, b);
  (double abc, double abcErr) = AddWithError(ab, c * bias);

  // Check for infinity or NaN and return early
  if (IsInfiniteOrNaN(abc))
    { return abc; }

  // Odd-round the intermediate error result
  double err = OddRoundedAdd(abErr, abcErr);

  // Multiply the smallest-representable normalized value by our avoid-
  //  subnormal bias. Any (biased) value below this will go subnormal
  const double SubnormThreshold = Pow2(-1022) * AvoidSubnormalBias;

  if (bias == AvoidSubnormalBias && Abs(abc) < SubnormThreshold)
  {
    (double finalSum, double finalSumErr) = AddWithError(abc, err);

    // This is half of SubnormThreshold. Any value between SubnormThreshold
    //  and this value will only lose a single bit of precision when
    //  the bias is removed, which requires some extra care
    const double OneBitSubnormalThreshold = 
      SubnormThreshold * 0.5;
    if (Abs(finalSum) >= OneBitSubnormalThreshold)
    {
      // Split into halves
      (double rh, double rl) = Split(finalSum);

      // Add the error term into the LOW part of our split value
      rl = OddRoundedAdd(rl, finalSumErr);

      // Divide out the bias from both halves (which will cause rl to
      //  round to its final, correctly-rounded value) then sum them 
      //  together (which is perfectly representable).
      rh /= bias;
      rl /= bias;
      return rh + rl;
    }
    else
    {
      // For more-than-one-bit subnormals, we do an odd-rounded addition of
      //  the error term and then divide out the bias, doing full rounding
      //  just once.
      finalSum = RoundToOdd(finalSum, finalSumErr);
      return finalSum / bias;
    }
  }
  else
  {
    // Not subnormal, so we can calculate our final result normally and un-
    //  bias the result.
    return (abc + err) / bias;
  }
}

Compare that to the 32-bit version and you can see why this one got its own post:

float FMAdd(float a, float b, float c)
{
  double product = double(a) * double(b);
  (double sum, double err) = AddWithError(product, c);
  sum = RoundToOdd(sum, err);
  return float(sum);
}
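
If you happen to have a native fused multiply-add available (in .NET that's Math.FusedMultiplyAdd, available in .NET Core 3.0 and later), a randomized comparison is a decent way to shake out bugs in an emulation. This is just a sketch - it assumes the double-precision FMAdd (and its helpers) from the full listing above are in scope, and that the runtime's FMA is itself correctly rounded:

// Rough validation sketch: compare the emulated FMAdd against the runtime's
//  native fused multiply-add on a pile of random inputs.
static void ValidateFMAdd()
{
  var rng = new Random(12345);
  for (int i = 0; i < 1_000_000; i++)
  {
    // Random values with a wide exponent spread, to poke at the edge cases
    double a = Math.ScaleB(rng.NextDouble() * 2.0 - 1.0, rng.Next(-600, 600));
    double b = Math.ScaleB(rng.NextDouble() * 2.0 - 1.0, rng.Next(-600, 600));
    double c = Math.ScaleB(rng.NextDouble() * 2.0 - 1.0, rng.Next(-600, 600));

    double expected = Math.FusedMultiplyAdd(a, b, c);
    double actual = FMAdd(a, b, c);

    // NaN payloads may legitimately differ, so don't flag those
    if (double.IsNaN(expected) && double.IsNaN(actual))
      { continue; }

    // Compare bit patterns so that 0.0 vs. -0.0 differences show up too
    if (BitConverter.DoubleToInt64Bits(expected) != BitConverter.DoubleToInt64Bits(actual))
      { Console.WriteLine($"Mismatch for ({a}, {b}, {c}): got {actual}, expected {expected}"); }
  }
}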

Hopefully you never have to actually implement this yourself, but if you do? I hope this helps.

Emulating the FMAdd Instruction, Part 1: 32-bit Floats

A thing that I had to do at work is write an emulation of the FMAdd (fused multiply-add) instruction for hardware where it wasn't natively supported (specifically I was writing a SIMD implementation, but the idea is the same), and so I thought I'd share a little bit about how FMAdd works, since I've already been posting about how float rounding works.

So, screw it, here we go with another unnecessarily technical, mathy post!

What is the FMAdd Instruction?

A fused multiply-add is basically doing a multiply and an add as a single operation, and it gives you the result as if it were computed with infinite precision and then rounded just once, at the final result. FMAdd computes (a * b) + c without intermediate floating-point error being introduced:

float FMAdd(float a, float b, float c)
{
  // ??? Somehow do this with no intermediate rounding
  return (a * b) + c; 
}

Computing it normally (using the code above) for some values will get you double rounding (explained in a moment) which means you might be an extra bit off (or, more formally, one ULP) from where your actual result should be. An extra bit doesn't sound like a lot, but it can add up over many operations.

Fused multiply-add avoids this extra rounding, making it more accurate than a multiply followed by a separate add, which is great! (It can also be faster if it's supported by hardware but, as you'll see, computing it without a dedicated instruction on the CPU is actually surprisingly spendy, especially once you get into doing it for 64-bit floats - but sometimes you need precision instead of performance.)
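
For what it's worth, if your runtime exposes a native fused multiply-add (e.g. MathF.FusedMultiplyAdd in .NET Core 3.0+), a quick count like this sketch shows how often the naive two-rounding version actually disagrees with the fused one:

// Sketch: count how often a plain multiply-then-add gives a different float
//  than a fused multiply-add, over a bunch of random inputs.
static void CountNaiveMismatches()
{
  var rng = new Random(1);
  int mismatches = 0;
  const int Trials = 1_000_000;
  for (int i = 0; i < Trials; i++)
  {
    float a = (float)(rng.NextDouble() * 2.0 - 1.0);
    float b = (float)(rng.NextDouble() * 2.0 - 1.0);
    float c = (float)(rng.NextDouble() * 2.0 - 1.0);

    float naive = a * b + c;                       // rounds after the *, then again after the +
    float fused = MathF.FusedMultiplyAdd(a, b, c); // rounds once, at the end

    if (naive != fused)
      { mismatches++; }
  }

  Console.WriteLine($"{mismatches} of {Trials} random inputs gave a different result");
}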

Double Rounding

Double rounding happens when the intermediate value rounds down (or up), then the final result also rounds in the same direction - but because of the first rounding, actually overshoots the correctly-rounded final value by a bit.

Here's an example using two successive sums of some 4-bit float values. We'll do the following sum (in top-down order):

  1.000 * 2^4
+ 1.001 * 2^0
+ 1.100 * 2^0

The first sum, done with "infinite" internal precision, looks like this:

  1.000 * 2^4
+ 1.001 * 2^0
  -----------
  1.000 0000 * 2^4
+ 0.000 1001 * 2^4
  ----------------
  1.000 1001 * 2^4

If we were to then use that result directly (with no intermediate rounding) and do the second sum, only rounding the final result:

   1.000 1001 * 2^4
 + 0.000 1100 * 2^4
   ----------
 = 1.001 0101 * 2^4
-> 1.001      * 2^4 // Rounds down

The final result rounds (to nearest) to 1.001.

However, if we were to round that intermediate value to 4 bits first, we'd get this:

   1.000 1001 * 2^4
-> 1.001      * 2^4 // Rounded up
 + 0.000 1100 * 2^4
   ----------
   1.001 1100 * 2^4
-> 1.010      * 2^4 // Up again

In this one, we end up with 1.010 instead of 1.001 because of the intermediate rounding, which pushed us past the correctly-rounded final result.

How to Pretend That You Have Infinite Precision

Okay, for FMAdd we want to calculate a multiply, and then somehow throw an add in there and have it act as if we didn't lose any precision on the multiply.

First we're going to handle the case of 32-bit floats (singles) because it's a wildly simpler case on CPUs that have 64-bit floats (doubles).

(also, sorry in advance, the term "double" for a "double-precision float" and the "double" in "double rounding" are two different instances of "double" but I've written so much of this post and like hell am I changing it now so hopefully it's not too confusing)

The immediately obvious thing to try to get an accurate single-precision FMA is "hey, what if we do the multiply and add as doubles and then round the result back down to a single":

float FMAdd(float a, float b, float c)
{
  // Do the math as 64-bit floats and truncate at the end. 
  //  Surely that's good enough, right?
  return float((double(a) * double(b)) + double(c)); 
}

While that gives a much better result than doing it as pure 32-bits, it actually can still have double rounding. But where does the extra rounding come from, in this case?

The multiply itself isn't the source of the first rounding. Surprisingly (to me, at least), casting two singles to doubles and multiplying those together always results in an exact answer - this is because each of the single-precision values has 24 bits of precision, but a double can store 53 bits of precision, which is more than enough to hold the result of multiplying two singles (at most 2 * 24 = 48 bits of precision). Since floats are stored as:

sign * 1.mantissa * 2^(exponent)

...it means we're multiplying two numbers of the form 1.xxxxxxxxxx and 1.yyyyyyyyy together and then adding their exponents to get the new number. So unlike addition and subtraction (where, say, 1 + 1*10^60 requires a ton of extra precision), it doesn't actually matter if two float numbers have wildly different exponents, because the exponents and significand values are handled separately.

To illustrate this, let's pretend we have two 4-digit (base 10) numbers and we multiply them and store the result using 8 digits (double precision):

  (1.234 * 2^1) * (1.457 * 2^100)
-> (1.2340000 * 1.4570000) * (2^1 * 2^100)
 =  1.7979380 * 2^101  // no rounding here!

Great, so the double-precision multiply is fine and introduces no rounding at all. So then how do we get double rounding?
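
(As an aside: if you do have a native double-precision FMA handy, it gives you an easy way to double-check that no-rounding claim. fma(x, y, -product) computes the exact difference between the true product x*y and the rounded product, so for two singles widened to doubles it should always come out to zero. A quick sketch using .NET's Math.FusedMultiplyAdd:)

// Sketch: the rounding error of (double)a * (double)b should always be zero
//  when a and b started life as singles.
static void CheckSingleProductsAreExact()
{
  var rng = new Random(42);
  for (int i = 0; i < 1_000_000; i++)
  {
    float a = (float)Math.ScaleB(rng.NextDouble() * 2.0 - 1.0, rng.Next(-60, 60));
    float b = (float)Math.ScaleB(rng.NextDouble() * 2.0 - 1.0, rng.Next(-60, 60));

    double product = (double)a * (double)b;

    // fma computes a*b - product with a single rounding; since the true error
    //  is itself exactly representable, this is the exact error of the product,
    //  and it had better be zero.
    double error = Math.FusedMultiplyAdd(a, b, -product);
    if (error != 0.0)
      { Console.WriteLine($"Inexact product?! {a} * {b}"); }
  }
}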

As mentioned above, an add (or subtract) can introduce rounding:

  (1.234 * 10^0) + (1.457 * 10^9)
-> (1.2340000 * 10^0) + (1.4570000 * 10^9)
 = (1.457000001234 * 10^9) // Too many digits!
-> 1.45700000 // Rounded to nearest here

This rounding happens at double precision (so well below the threshold of our target 32-bit result), but there's still rounding, and then the value is rounded again when converted back down to single precision. That's the double rounding and the source of a potential error.

Okay, so, double rounding is bad? Kinda! But it turns out there is a way to introduce a new rounding mode to use for the first rounding that, in the right situations, does not introduce any additional error and ensures that your final result is correct.

A New Rounding Mode?

(This technique is based off of the paper Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd by Sylvie Boldo and Guillaume Melquiond; if you want the full technical details of this whole process, that's where you'll find them. Believe it or not, I'm actually trying to go into less detail!)

The key to eliminating the extra precision loss is by using a non-standard rounding mode: rounding to odd. Standard floating point rounding calculates results with some additional bits of precision (three bits, to be precise), and then rounds based on the result (usually using "round to nearest with round to even on ties", although that detail doesn't end up mattering here - this technique works with any standard rounding mode).

So, assume that we have some way of calculating a double precision addition and also having access to the error between the calculated result and the mathematically exact result. Given those two values we can perform a Round To Odd step:

double RoundToOdd(double value, double errorTerm)
{
  if (errorTerm != 0.0 // if the result is not exact
    && LowestBitOfMantissa(value) == 0) //and mantissa is even
  {
    // We need to round, so round either up or down to odd
    if (errorTerm > 0)
    {
      // Round up to an odd value
      value = AddOneBitToMantissa(value);
    }
    else // (errorTerm < 0)
    {
      // Round down to an odd value
      value = SubtractOneBitFromMantissa(value);
    }
  }

  return value;
}

Basically: if we have any error at all, and the mantissa is currently even, either add or subtract a single bit's worth of mantissa, based on the sign of the error.

(In practice, I found that I also had to ensure the result was not Infinity before doing this operation, since I implemented this using some bitwise shenanigans that would end up "rounding" Infinity to NaN, so, you know, watch out for that).

Why Does Odd-Rounding the Intermediate Value Work?

Round to odd works as long as our intermediate value has more bits of precision than the final result - specifically, we need at least two extra bits. Standard float rounding makes use of something called a "sticky" bit - basically the lowest bit of the extra precision is a 1 if any of the bits below it would have been 1.

And, hey, that is basically what "round to odd" does!

  • If the mantissa is odd, regardless of whether there's error or not the lowest bit is already odd.
  • If the error was positive and the mantissa was even, we set the lower bit to 1 anyway, effectively stickying (yeah that's a word now) all the error bits below it.
  • If the error was negative and the mantissa was even, we subtract 1 from the mantissa, making the lower bit odd, and effectively sticky since some of the digits below it are also 1s.

Effectively, round to odd is just "emulate having a sticky bit at the bottom of your intermediate result" - that way, you have a guaranteed tiebreaker for the final rounding step.

But note that I said it requires you to have at least two extra bits. In the case of our using-doubles-instead-of-singles intermediate addition, good news: we have way more than two extra bits - our intermediate value is a whole-ass double-precision float, so we have 29 extra bits vs. our single-precision final value and (mathematically speaking) 29 is greater than 2.

So, for the true single-precision FMAdd instruction we need to do the following:

float FMAdd(float a, float b, float c)
{
  double product = double(a) * double(b); // No rounding here
  
  // Calculate our sum, but somehow get the error along with it
  (double sum, double err) = AddWithError(product, c);

  // Round our intermediate value to odd
  sum = RoundToOdd(sum, err);

  // Final rounding here, which now does the correct thing and gives us
  //  a properly-rounded final result (as if we'd used infinite bits)
  return float(sum);
}

That's it! ...wait, what's that AddWithError function, we haven't even--

Calculating An Exact Addition Result

Right, we need to calculate that intermediate addition along with an accurate error term. It turns out it's possible to calculate a pair of numbers, sum and error, where mathematicallyExactSum = sum + error.

For this, we have to dive back to July of 1971 and check out the actually typewritten paper A Floating-Point Technique For Extending the Available Precision by T.J. Dekker. Give that a read if you want way more details on this whole thing.

Calculating the error term of adding two numbers (I'll use x and y) is relatively straightforward if |x| >= |y|:

sum = x + y;
err = y - (sum - x);

(this is equation 4.14 in the linked paper)

This is just a different ordering of (x + y) - sum that preserves accuracy: due to the nature of the values involved in these subtractions (sum's value is directly related to those of x and y, and y is smaller than x), it turns out that each of those subtractions is an exact result (the paper has a proof of this, and it's a lot so I'm not going to expand on that here), so we get the precise difference between the calculated sum and the real sum.
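
Here's a toy example of what that gives you (just the |x| >= |y| version, with values picked so the lost part is obvious):

double x = 1.0;
double y = 1e-20;

double sum = x + y;          // rounds to exactly 1.0 - the 1e-20 is completely lost
double err = y - (sum - x);  // (sum - x) == 0.0, so err == 1e-20: the part that was lost

// sum + err is the mathematically exact result, just split across two doubles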

But this only works if you know that x's magnitude is larger than (or equal to) y's. If you don't know which of the two values has a larger magnitude, you can do a bit more work and end up with:

(double sum, double err) AddWithError(
  double x, 
  double y)
{
  double sum = x + y;
  double intermediate = sum - x;
  double err1 = y - intermediate;
  double err2 = x - (sum - intermediate);
  return (sum, err1 + err2);      
}

(This is effectively the expanded version of listing 4.16 from the linked paper)

  • err1 here is the same as the value in the first version we calculated (a precision-preserving rewrite of (x + y) - sum)
  • err2 is, mathematically, x - (sum - (sum - x)) or 0; its goal is to calculate the error involved in calculating err1, since without the |x| > |y| guarantee those subtractions might NOT be exact ... but these ones will be.
  • Thus, summing these two error terms together gives us a final, precise error term.

(More details in the paper, hopefully this isn't too glossed over that it loses any meaning)

Finally, the End (For Single-Precision Floats)

So, yeah, that's how you implement the FMAdd instruction for single-precision floats on a machine that has double-precision support:

  • Calculate the double-precision product of a and b
  • Add this product to c to get a double-precision sum
  • Calculate the error of the sum
  • Use the error to odd-round the sum
  • Round the double-precision sum back down to single precision

But what if you have to calculate FMAdd for double-precision floats? You can't easily just cast up to, like, quad-precision floats and do the work there, so what now? Can you still do this?

The answer is yes, but it's a lot more work, and that's what the next post is about.

C++17-Style Hex Floats (And How To Parse Them)

C++17 added support for hex float literals, so you can put more bit-accurate floating point values into your code. They're handy to have, and I wanted to be able to parse them from a text file in a C# app I was writing.

Some examples:

0x1.0p0         // == 1.0
0x1.8p1         // == 3
0x8.0p-3        // == 1.0
0x0.8p1         // == 1.0
0xAB.CDEFp-10   // == 0.16777776181697846
0x0.0000000ABp0 // == 2.4883775040507317E-09

What Am I Even Looking At Here?

(Before reading this, if you don't have a good feel for how floats work, maybe check my previous post about floats (and float rounding))

I had a bit of a mental block on this number format at first - like, what does it even mean to have fractional hex digits? But it turns out it's a concept that we already use all the time and my brain just needed some prodding to make the connection.

With our standard base 10 numbers, moving the decimal point left one digit means dividing the number by 10:

12.3 == 1.23 * 10^1 == 0.123 * 10^2

Hex floats? Same deal, just in 16s instead:

0x1B.C8 == 0x1.BC8 * 16^1 == 0x0.1BC8 * 16^2

Okay, so now what's the "p" part in the number? Well, that's the start of the exponent. A standard float has an exponent starting with 'e':

1.3e2 == 1.3 * 10^2

But 'e' is a hex digit, so you can't use 'e' anymore as the exponent starter, so they chose 'p' instead (why not 'x', the second letter? Probably because a hex number starts with '0x', so 'x' also already has a use - but 'p' is free so it wins)

The exponent for a hex float is in powers of 2 (so it corresponds perfectly to the exponent as it is stored in the value), so:

0x1.ABp3 == 0x1.AB * 2^3
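
(To check that by hand: 0x1.AB is just 0x1AB scaled down by two hex digits, so in plain C# terms it works out like this:)

// 0x1.ABp3: significand is 0x1AB / 16^2, exponent is 2^3
double value = (0x1AB / 256.0) * 8.0; // == 13.34375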

So that's how a hex float literal works! Here's a quick breakdown:

<hex-digits>[ '.' [fractional-hex-digits]] 'p' <exponent>

Okay Now What's This About Parsing Them?

C++ conveniently has functions to parse these (std::strtod/strtof handle this nicely). However, if you're (hypothetically) making a parser, and you happen to be writing it in C# which does not have an inbuilt way to parse these, then you'll have to parse your own.

It ended up being a little more complicated than I thought for a couple reasons.

  • Arbitrarily long hex strings seem to be supported, which means you need to track both where the top-most non-zero hex value starts (i.e. skip any leading zeros) and also properly handle bits that are extra tiny, which may affect rounding.
  • In order to properly handle said rounding, running through the hex digits in reverse and pushing them in from the top ends up being a nice strategy, because float rounding works via a sticky bit that stays set as things right-shift down through it.

Ultimately I ended up with the following algorithm (in C#) to parse a hex float, which I'm definitely sure is ~perfect~ and has absolutely no bugs whatsoever. It is also absolutely the most efficient version of this possible, with no room for improvement. Yep.

I'm throwing it in here in case anyone ever finds it useful.

static bool IsDigit(char c) => (c >= '0' && c <= '9');
static bool IsHexDigit(char c) 
  => IsDigit(c) 
    || (c >= 'A' && c <= 'F') 
    || (c >= 'a' && c <= 'f');

double ParseHexFloat(string s)
{ 
  // This doesn't handle a negative sign, if only because the parser I have only
  // needed to support positive values, but it'd be easy to add
  if (s.Length < 2 || s[0] != '0' || char.ToLowerInvariant(s[1]) != 'x')
    { throw new FormatException("Missing 0x prefix"); }

  if (s.Length < 3 || !IsHexDigit(s[2]))
    { throw new FormatException(
        "Hex float literal must contain at least one whole part digit"); }
    
  int i = 2;
  int decimalPointIndex = -1;
  int firstNonZeroHexDigitIndex = -1;

  // Scan through our digits, looking for the index of the first set (non-zero)
  //  hex value and the decimal point (if we have one).
  while (i < s.Length && (IsHexDigit(s[i]) || s[i] == '.'))
  {
    if (s[i] == '.')
    {
      // Found the decimal point! Hopefully there wasn't already one!
      if (decimalPointIndex >= 0)
        { throw new FormatException("Too many decimal points"); }

      decimalPointIndex = i;
    }
    else if (s[i] != '0' && firstNonZeroHexDigitIndex < 0)
      { firstNonZeroHexDigitIndex = i; } // Here's our top-most set hex value.

    i++;
  }

  // Also make a note of where our last hex digit was (usually the digit before
  //  the 'p' that we should be at right now)
  int lastHexDigitIndex = i - 1;

  // ... but if the previous character was the decimal point, the last hex
  //  digit is before that.
  if (lastHexDigitIndex == decimalPointIndex)
    { lastHexDigitIndex--; }

  // If we didn't find a decimal point, it's EFFECTIVELY here, at the 'p'
  if (decimalPointIndex < 0)
    { decimalPointIndex = i; }

  // Validate and skip the 'p' character
  if (i >= s.Length || char.ToLowerInvariant(s[i]) != 'p')
    { throw new FormatException("Missing exponent 'p'"); }
  i++;

  // Grab the sign if we have one
  bool negativeExponent = false;
  if (i < s.Length && (s[i] == '+' || s[i] == '-'))
  {
    negativeExponent = (s[i] == '-');
    i++;
  }

  if (i >= s.Length)
    { throw new FormatException("Missing exponent digits"); }

  // Parse the exponent!
  int exponent = 0;
  while (i < s.Length && IsDigit(s[i]))
  {
    if (int.MaxValue / 10 < exponent)
      { throw new FormatException("Exponent overflow"); }
    
    exponent *= 10;
    exponent += (int)(s[i] - '0');
    i++;
  }

  if (negativeExponent)
    { exponent = -exponent; }

  // If we had no non-zero hex digits, there's no point in continuing, it's
  //  zero. 
  if (firstNonZeroHexDigitIndex < 0)
    { return 0.0; }
    
  if (i != s.Length)
    { throw new FormatException("Unexpected characters at end of string"); }
    
  // We have the supplied exponent, but we want to massage it a bit. In a
  //  (IEEE) floating-point value, the mantissa is entirely fractional - that
  //  is, the value is 1.mantissa * 2^(exponent) - there's an implied 1
  //  (excluding subnormal floats, which we'll handle properly, but we can
  //  ignore for the moment). Two things need to happen here:
  //   1. We need to adjust the exponent based on the position of the first
  //      non-zero hex digit, to match the fact that we're parsing hex digits
  //      such that the top hex digit is sitting in the top 4 bits of our 64-
  //      bit int.
  //   2. But we EXPECT a single bit to be above the mantissa (the implied
  //      1) so subtract 1 from our adjustment to take into account that
  //      there will be 4 bits in that hex, so if we had parsed a single "1"
  //      (from, say, 0x1p0, which just equals 1.0) our effective exponent
  //      should be 3 (which we will later shift back down to 0 to position
  //      the 1s bit at the very top)
  if (decimalPointIndex >= firstNonZeroHexDigitIndex)
    { exponent += ((decimalPointIndex - firstNonZeroHexDigitIndex) * 4) - 1; }
  else
    { exponent += ((decimalPointIndex - firstNonZeroHexDigitIndex + 1) * 4) - 1; }

  // Now that we have the exponent and know the bounds of our hex digits,
  //  we can parse backwards through the hex digits, shifting them in from
  //  the top. We do this so that we can easily handle the rounding to the
  //  final 53 bits of significand (by ensuring that we don't ever shift
  //  any 1s off the bottom)
  ulong mantissa = 0;
  for (i = lastHexDigitIndex; i >= firstNonZeroHexDigitIndex; i--)
  {
    // Skip the '.' if there was one.
    if (s[i] == '.')
      { continue; }

    char c = char.ToLowerInvariant(s[i]);
    ulong v = (c >= 'a' && c <= 'f') 
      ? (ulong)(c - 'a' + 10) 
      : (ulong)(c - '0');

    // Shift the mantissa down, but keep any 1s that happen to be in the bottom
    //  4 bits (this is a reasonably-efficient emulation of the "sticky bit"
    //  that is used to round a floating point number properly).
    mantissa = (mantissa >> 4) | (mantissa & 0xf);

    // Add in our parsed hex value, putting its 4 bits at the very top of the
    //  mantissa ulong.
    mantissa |= (v << 60);
  }

  // We know the mantissa is non-zero (checked earlier), and we want to position
  //  the highest set bit at the top of our ulong so shift up until the top bit
  //  is set (and adjust our exponent down 1 to compensate).
  while ((mantissa & 0x8000_0000_0000_0000ul) == 0)
  {
    mantissa <<= 1;
    exponent--;
  }

  const int DoubleExponentBias = 1023;
  const int MaxBiasedDoubleExponent = 1023 + DoubleExponentBias;
  const ulong MantissaMask = 0x000f_ffff_ffff_fffful;
  const ulong ImpliedOneBit = 0x0010_0000_0000_0000ul;
  const int ExponentShift = 52;
  const int MantissaShiftRight = sizeof(double)*8 - ExponentShift - 1;

  // Exponents are stored in a biased form (they can't go negative) so add our
  //  bias now.
  exponent += DoubleExponentBias;

  if (exponent <= 0)
  {
    // We have a subnormal value, which means there is no implied 1, so first
    //  we need to shift our mantissa down by one to get rid of the implied 1.
    //  (note that we're not letting any 1s shift off the bottom, keeping them
    //  sticky)
    mantissa = (mantissa >> 1) | (mantissa & 1);

    // Continue to denormalize the mantissa until our exponent reaches zero
    while (exponent < 0)
    {
      mantissa = (mantissa >> 1) | (mantissa & 1);
      exponent++;
    }
  }

  // Now to do the actual rounding of the mantissa. Sometimes floating point
  //  rounding needs 3 bits (guard, round, sticky) to do rounding, but in our
  //  case, two will suffice: one bit that represents the uppermost bit that
  //  shifts right off of the edge of the mantissa (i.e. the "0.5" bit) and
  //  then "literally any 1 bit underneath that" (which is why we've been
  //  holding on to extra 1s when shifting right) that is the tiebreaker
  bool roundBit      = (mantissa & 0b10000000000) != 0;
  bool tiebreakerBit = (mantissa & 0b01111111111) != 0;

  // Now that we have those bits, we can shift our mantissa down into its
  //  proper place (as the lower 53 bits of our 64-bit ulong).
  mantissa >>= MantissaShiftRight;
  if (roundBit)
  {
    // If there's a tiebreaker, we'll increment the mantissa. Otherwise,
    //  if there's a tie (could round either way), we round so that the
    //  mantissa value is even (lowest bit in the double is 0)
    if (tiebreakerBit || (mantissa & 1) != 0)
      { mantissa++; }

    // If we have a subnormal float we may have overflowed into the implied 1
    //  bit, otherwise we might have overflowed into the ... I guess the
    //  implied *2* bit?
    ulong overflowedMask = ImpliedOneBit << ((exponent == 0) ? 0 : 1);
    if ((mantissa & overflowedMask) != 0)
    {
      // Shift back down one. This is not going to drop a 1 off the bottom
      //  because if we overflowed it means we were odd, and added one to
      //  become even.
      exponent++;
      mantissa >>= 1;
    }
  }

  // It's possible that after the truncation we ended up with a 0 mantissa after all,
  //  so our final value has rounded allll the way down to 0.
  if (mantissa == 0)
    { return 0.0; }

  // If our exponent is too large to be represented, this value is infinity.
  if (exponent > MaxBiasedDoubleExponent)
    { return double.PositiveInfinity; }

  // Mask off the implied one bit (if we have one)
  mantissa &= ~ImpliedOneBit;

  // Alright assemble the final double's bits, which means shifting and
  //  adding the exponent into its proper place.
  //  (if we had a sign to apply we'd apply it to the top bit). 
  ulong assembled = mantissa | (((ulong)exponent) << ExponentShift);
  double result = BitConverter.UInt64BitsToDouble(assembled);
  return result;
}
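
If you want to kick the tires on it, the examples from the top of the post make a decent smoke test (this sketch assumes ParseHexFloat and its helpers are accessible from wherever this runs; the expected values are the ones listed up top):

Console.WriteLine(ParseHexFloat("0x1.0p0"));         // == 1.0
Console.WriteLine(ParseHexFloat("0x1.8p1"));         // == 3
Console.WriteLine(ParseHexFloat("0x8.0p-3"));        // == 1.0
Console.WriteLine(ParseHexFloat("0x0.8p1"));         // == 1.0
Console.WriteLine(ParseHexFloat("0xAB.CDEFp-10"));   // == 0.16777776181697846
Console.WriteLine(ParseHexFloat("0x0.0000000ABp0")); // == 2.4883775040507317E-09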

Floating Point Numbers and Rounding

I was writing about how to parse C++17-style hex floating point literals, and in doing so I ended up writing a bunch about how floats work in general (and specifically how floating point rounding happens), so I opted to split it off from that post into its own, since it’s already probably way too many words as it is?

Here we go!

How Floats Work

[Diagram: a float's bit layout, from high bit to low: sign, exponent, mantissa]

If you don’t know, a floating point number (at least, an IEEE 754 float, which effectively all modern hardware supports) consists of three parts:

  • Sign bit – the upper bit of the float is the sign bit: 1 if the float is negative, 0 if it’s positive.
  • Exponent – the next few bits (8 bits for a 32-bit float, 11 bits for a 64-bit float) contain the exponent data, which is the power of two that the rest of the number gets multiplied by. (Note that the exponent is stored in a biased way – more on that in a moment)
  • Mantissa – the remaining bits (23 for 32-bit float, 52 for a 64-bit float) represent the fractional part of the float’s value.

In general (with the exception of subnormal floats and 0.0, explained in a bit) there is an implied 1 in the float: that is, if the mantissa has a value of “FC3CA0000” the actual float is 1.FC3CA0000 (the mantissa bits are all to the right of the decimal point) before the exponent is applied. Having this implied 1 gives an extra bit of precision to the value since you don’t even have to store that extra 1 bit anywhere – it’s implied! Clever.

The exponent represents the power of two involved (Pow2(exponent)), which has the nice property that multiplies or divides of a float by powers of two do not (usually, except at extremes) affect the precision of the number: dividing by 2 simply decrements the exponent by 1, and multiplying by 2 increments it by 1.

For a double-precision (64-bit) float, the maximum representable exponent is 1023 and the minimum is -1022. These are stored in 11 bits, and they’re biased (which is to say that the stored 11-bit value is actualExponent + bias, where the bias is 1023). That means that this range of [-1022, 1023] is actually stored as [1, 2046] (00000000001 and 11111111110 in binary). This range uses all but two of the possible 11-bit values, and those two are used to represent two sets of special cases:

  • Exponent value 00000000000b represents a subnormal float – that is, it still has the effective exponent of -1022 (the minimum representable exponent) but it does NOT have the implied 1 – values smaller than this start to lose bits of precision for every division by 2 as it can’t decrement the exponent any farther and so ends up sliding the mantissa to the right instead.
    • For this 00000000000b exponent, if the mantissa is 0, then you have a value of 0.0 (or, in a quirk of floating point math, -0.0 if the sign bit is set).
  • Exponent value 11111111111b represents one of two things:
    • If the mantissa is zero, this is infinity (either positive or negative infinity depending on the sign bit).
    • If the mantissa is non-zero, it’s NaN (not a number).
      • (There are two types of NaN, quiet and signaling. Those are a subject for another time, but the difference bit-wise is whether the upper bit of the mantissa is set: if 1 it’s quiet, if 0 then it’s signaling).

If you wanted to write a bit of math to calculate the value of a 64-bit float (ignoring the two special exponent cases) it would look something like this (where bias in this case is 1023):

(signBit ? -1 : 1) 
  * (1 + (mantissaBits / Pow2(52)))
  * Pow2(exponentBits - bias)
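
Here's a small C# sketch of that same formula, pulling the three fields out of a double's bits and reassembling the value (normal floats only - it ignores the two special exponent cases mentioned above, and subnormals along with them):

static double DecodeNormalDouble(double d)
{
  ulong bits = (ulong)BitConverter.DoubleToInt64Bits(d);

  bool signBit = (bits >> 63) != 0;
  int exponentBits = (int)((bits >> 52) & 0x7FF);      // 11 exponent bits
  ulong mantissaBits = bits & 0x000F_FFFF_FFFF_FFFFul; // 52 mantissa bits

  const int Bias = 1023;

  // (signBit ? -1 : 1) * (1 + mantissaBits / Pow2(52)) * Pow2(exponentBits - bias)
  double significand = 1.0 + mantissaBits / Math.Pow(2, 52);
  double value = Math.ScaleB(significand, exponentBits - Bias);
  return signBit ? -value : value;
}

// DecodeNormalDouble(Math.PI) == Math.PI, DecodeNormalDouble(-123.456) == -123.456, etc.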

Standard Float Rounding

Okay, knowing how floats are stored, clearly math in the computer isn’t done with infinite precision, so when you do an operation that drops some precision, how does the result get rounded?

When operations are done with values with mismatched exponents, the value with the lowest exponent is effectively shifted to the right by the difference to match the exponents.

For example, here’s the subtraction of two four-bit-significand (3 bits of mantissa plus the implied 1) floats:

  1.000 * 2^5
- 1.001 * 2^1

The number being subtracted has the smaller exponent, so we end up shifting it to the right to compensate (for now, doing it as if we had no limit on extra digits):

  1.000 0000 * 2^5
- 0.000 1001 * 2^5 // Shifted right to match exponents
------------------
  0.111 0111 * 2^5
  1.110 111  * 2^4 // shifted left to normalize (fix the implied 1)
  1.111      * 2^4 // round up since we had more than half off the edge

Note that in this example, the value being subtracted shifted completely off the side of the internal mantissa bit count. Since we can’t store infinite off-the-end digits, what do we do?

Float math uses three extra bits (to the “right” of the mantissa), called the guard bit, the round bit, and the sticky bit.

As the mantissa shifts off the end, it shifts into these bits. This works basically like a normal shift right, with the exception that the moment that ANY 1 bit gets shifted into the sticky bit, it stays 1 from that point on (that’s what makes it sticky).

For instance:

      G R S
1.001 0 0 0 * 2^1
0.100 1 0 0 * 2^2 // 1 shifts into the guard bit
0.010 0 1 0 * 2^3 // now into the round bit
0.001 0 0 1 * 2^4 // now into the sticky bit
0.000 1 0 1 * 2^5 // sticky bit stays 1 now

Note that the sticky bit stayed 1 on that last shift, even though in a standard right shift it would have gone off the end. Basically if you take the mantissa plus the 3 GRS bits (not to be confused with certain cough other meanings of GRS) and shift it to the right, the operation is the equivalent of:

mantissaAndGRS = (mantissaAndGRS >> 1) | (mantissaAndGRS & 1)

Now when determining whether to round, you can take the 3 GRS bits and treat them as GRS/8 (i.e. GRS bits of 100b are the equivalent of 0.5 (4/8), and 101b is 0.625 (5/8)), and use that as the fraction that determines whether/how you round.

The standard float rounding mode is round-to-nearest, even-on-ties: that is, if the value could round either way (think 1.5, which is equally close to either 1.0 or 2.0), you round to whichever of the neighboring values is even (so 1.5 and 2.5 would both round to 2).

Using our bits, the logic, then, is this:

  • If the guard bit is not set, then it rounds down (fraction is < 0.5), mantissa doesn’t change.
  • If the guard bit IS set:
    • If round bit or sticky bit is set, always round up (fraction is > 0.5), mantissa increases by 1.
    • Otherwise, it’s a tie (exactly 0.5, could round either way), so round such that the mantissa is even (the lower bit of the mantissa is 0), mantissa increments if the lower bit was 1 (to make it even).
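
Squashed down into code, the whole decision looks something like this (a sketch - the inputs are the three GRS bits plus the low bit of the mantissa):

// Returns true if the mantissa should be incremented (round up), false if it
//  should be left alone (round down).
static bool ShouldRoundUp(bool guard, bool round, bool sticky, bool mantissaIsOdd)
{
  if (!guard)
    { return false; }    // fraction < 0.5: round down

  if (round || sticky)
    { return true; }     // fraction > 0.5: round up

  return mantissaIsOdd;  // exactly 0.5: round to whichever neighbor is even
}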

Okay, so if all we care about is the guard bit and then whether anything at all is set below it, why even have three bits? Isn't two bits enough?

Nope! Well, sometimes, but not always. Turns out, some operations (like subtraction) can require a left shift by one to normalize the result (like in the above subtraction example), which means if you only had two bits of extra-mantissa information (just, say, a round and sticky bit) you’d be left with one bit of information after the left shift and have no idea if there’s a rounding tiebreaker.

For instance, here’s an operation with the proper full guard, round, and sticky bits:

  1.000 * 2^5
- 1.101 * 2^1

  // Shift into the GRS bits:
  1.000 000 * 2^5
- 0.000 111 * 2^5 // sticky kept that lowest 1
------------------
  0.111 001 * 2^5
  1.110 01  * 2^4 // shift left 1, still 2 digits left
  1.110     * 2^4 // Round down properly

If this were done with only two bits (round and sticky) we would end up with the following:

  1.000 * 2^5
- 1.101 * 2^1

  // Shift into just RS bits:
  1.000 00 * 2^5
- 0.000 11 * 2^5
------------------
  0.111 01 * 2^5
  1.110 1  * 2^4 // shift left 1, only 1 digit left

Once we shift left there, we only have one bit of data, and it’s set. We don’t know whether we had a fraction > 0.5 (meaning we have to round up) or <= 0.5 (meaning we round down in this case - either because the fraction was actually less than half, or because it was exactly half and the tie rounds to even, which here is down).

So the short answer is: three bits because sometimes we have to shift left and lose one bit of the information, and we need at least a “half bit” and a “tiebreaker bit” at all times. Given all the float operations, 3 suffices for this, always.