Posts tagged “programming”
A while back I was working on a programming language idea and while I haven't made any progress on it in ages, I really liked the string design that I came up with. I don't know that any of the ideas are original, but I haven't seen anything exactly like it so I figure I'd throw the idea out into the ether in case anyone else happens to do something similar in their own personal language that they definitely shouldn't be making 😆
(I'll note upfront: this design uses dollar signs ($
) and backticks(`
). There are many languages that do so (like Javascript!) but these keys are not universally on all keyboards internationally so for those locales it may be more difficult to type these out...my language design was pretty much just for me with my standard US keyboard so I didn't take this into account)
Basic Strings
There's nothing fancy about these, they're just like most other languages' basic strings:
"This is a string"
"This is a string with a newline at the end: \n"
"Quotes? \"Escape them\""
"Oops forgot to end this one, it's a compiler error
Just a pair of double quotes with everything between being non-quotes (or escaped quotes), contained within a single line.
Raw Strings
Raw strings in my language design are kind of a blend between C++11-style raw strings and C#11 raw strings (the elevens!), using a different delimiting character variation than I've seen elsewhere. In this case, the simplest one starts and ends with a pair of backticks:
``This is a single raw string
it is multiple lines long``
It can contain any character sequence except the delimiting sequence (again, at simplest a pair of backticks: ``)
But what if your string needs to contain a consecutive pair of backticks? This is where the C++11 raw string inspiration comes in: you can put any string of characters between the ticks (excluding ticks, obviously, or newlines), and then the start and end have to match.
`uniqueString`This is a single string
it contains backticks without terminating: `` ... see?
This is the last line and ends here:`uniqueString`
This one starts with `uniqueString`
, and so the only thing that will terminate it is that same sequence: `uniqueString`
(with the tick marks around it).
To add to this, cribbing from C#11 it will:
- Trim the very first newline if there is one
- Also trim the last newline if there is one
- Unindent every line of it based on the indentation of the final quote sequence:
myString = ``
This is actually the first line of the string, the newline was ignored
{
indented further
}
``;
which turns into the string (note the lack of being completely indented:
This is actually the first line, the newline was ignored
{
indented further
}
This makes it easier to generate code (or text files or whatever) that are properly indented, without having to make the indenting of the string in your code all weird.
Interpolated Strings
I'm additionally adding string interpolated string support (which is a weird term), using a mix of C# and Javascript's setup:
$"This string has a ${value} in it"
If a string starts with $
, it's treated as an interpolated string. A string-convertible expression can be inserted in-place in the string within ${}
.
But what if you need to have the character sequence ${
in your string? Add more dollar signs to the start, and you need that many dollar signs before a {
to enter the Interpolation Zone:
$$"This string has a $${value} in it, ${but this isn't one}"
Raw strings can also be used as interpolated strings (making for some nice codegen), same rules apply:
$$``
Interpolated string with
multiple lines and a $${value} in it.
${this is not a value because only one $}
``
If value == 5
this would turn into the following string (upon formatting):
Interpolated string with
multiple lines and a 5 in it.
${this is not a value because only one $}
I Just Think They're Neat
Anyway, I think this is a really nice combination of properties that make it easy to format strings nicely without being overly-complicated to actually use (unlike C++'s raw strings, which I have to look up literally every time I need to use one). Need a hardcoded regex or path with backslashes? In most cases, just use a raw string with ``
on either end:
``C:\Path\With\Single\Backslashes``
or
``^[\r\n \t]*Hi[\r\n \t]*$``
Hope this was at least mildly interesting to someone!
(This post follows Part 1: 32-bit floats and will make very little sense without having read that one first. Honestly, it might make little sense having read that one first, I dunno!)
Last time we went over how to calculate the results of the FMAdd instruction (a fused-multiply-add calculated as if it had infinite internal precision) for 32-bit single-precision float values:
- Calculate the double-precision product of
a
and b
- Add this product to
c
to get a double-precision sum
- Calculate the error of the sum
- Use the error to odd-round the sum
- Round the double-precision sum back down to single precision
This requires casting up to a 64-bit double-precision float to get extra bits of precision. But what if you can't do that? What if you're using doubles? You can't just (in most cases) cast up to a quad-precision float. So what do you do?
Double-Precision FMAdd
(Like the 32-bit version, this is based on Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd by Sylvie Boldo and Guillaume Melquiond, there are additional details in there)
To do this natively as doubles, we need to invent a new operation: MulWithError
. This is the multiplication equivalent of the AddWithError
function from the 32-bit solution:
(double prod, double err) MulWithError(double x, double y)
{
double prod = x * y;
double err =
return (prod, err);
}
We'll get to how to implement that in a moment, but first we'll walk through how to use that function to calculate a proper FMAdd.
We need to do the following:
- Calculate the product of
a
and b
and the error of that product
- Calculate the sum of that product and
c
(giving us a * b + c
) and the error of this sum
- We're not using the error of the product ... yet
- Add the two error terms (product error and sum error) together, rounding the result to odd
- Add this summed error term to our actual result, which will round normally.
In code, that looks like this:
double OddRoundedAdd(double x, double y)
{
(double sum, double err) = AddwithError(x, y);
return RoundToOdd(sum, err);
}
double FMAdd(double a, double b, double c)
{
(double ab, double abErr) = MulWithError(a, b);
(double abc, double abcErr) = AddWithError(ab, c);
double err = OddRoundedAdd(abErr, abcErr);
return abc + err;
}
By keeping the error terms from both the product and the sum, we have kept all of the exact result. That is, we can assemble the mathematically-exact result given enough precision by doing abc + abErr + abcErr
.
But we can't do infinite-precision addition of three values. However, we can odd-round an intermediate result, the same way we did with the single-precision case.
In this case, we know that abErr
and abcErr
both (necessarily) have much lower magnitudes than the final result, as each error value's highest bit is lower than the lowest bit of the mantissa of their respective operations. So, if we odd-round the sum of these two values, it actually effectively fulfills the condition of having more bits of precision than the final result. Thus, if we add the error terms together with odd rounding, the odd-rounded fake-sticky final digit will be taken into account by the actual sticky bit used when doing the final sum of the result and error terms.
I hope that makes sense?
Breaking Down Multiplication With Error
(Like AddWithError
, this is based on A Floating-Point Technique For Extending the Available Precision by T.J. Dekker)
So how do we calculate the error term of a 64-bit multiply? We can't use 128-bit values, but what we can do is break each 64-bit value up into two values, each with less bits of precision.
We'll break x
and y
(our two multiplicands) up into high and low values, where:
x = xh + xl;
y = yh + yl;
We do this by breaking the double's mantissa up:
xh
contains the top 25 bits of x
's mantissa (plus its implied 1, giving it 26 bits of precision).
xl
contains the bottom 27 bits of x
's mantissa. The highest-set '1' in this mantissa will become the implied 1
bit that's part of the standard floating-point format, so this value will have 27 bits of precision, max.
yh
and yl
are the same, but for y
.
(double h, double l) Split(double v)
{
double h = ZeroBottom27BitsOfMantissa(v);
double l = v - h;
return (h, l);
}
What does this split give us? Well, we can now break the multiplication up into a sum of multiplies that now each have enough bits of precision to be exactly representable (27BitValueA * 27BitValueB == 54BitValue
, which fits perfectly in a double (with the implied 1
bit), using our old friend from Algebra, FOIL:
x * y
= (xh + xl) * (yh + yl)
= xh*yh + xh*yl + xl*yh + xl*yl;
We can't actually do those adds directly, but what we can do is similar to how we did AddWithError
: use a sequence of precision-preserving operations to calculate the difference between that idealized result and our rounded product:
(double prod, double err) MulWithError(double x, double y)
{
double prod = x * y;
(xh, xl) = Split(x);
(yh, yl) = Split(y);
double err = (((xh*yh - prod) + xh*yl) + xl*yh) + xl*yl;
return (prod, err);
}
It works like this:
- Calculate the (rounded) product of
x
and y
- Subtract that rounded product from the product of
xh
and yh
- These should have roughly the same magnitude (and definitely the same sign) so this is a precision-preserving subtraction.
- Since
|xh * yh| <= rounded(|x * y|)
(because xh
and yh
are truncated versions of x
and y
and thus have lower magnitudes) this is a smaller - larger
operation and we'll get a result with a sign opposite that of the final product.
- Keep adding in next-lower-magnitudes of values, which will continue to preserve precision
- (because we have a value that is opposite-sign these are effectively subtractions, in the same way that
a + -b
is)
- It's also worth noting here that
xh*yl
and xl*yh
will have equivalent magnitudes so the order that you add them in doesn't matter, as long as they're both after xh*yh
and before xl*yl
Once you've done that, you have the computed product as well as the error term, and we can then follow our FMAdd algorithm above to calculate the FMAdd.
So, that's it, we're done, right?
Edge Cases
Nope! Well, yes if you just wanted the gist, but now it's time to get into all those annoying implementation details that the papers this is based on completely glossed over. Here's where it gets ugly (unless you thought it was already ugly, in which case, sorry, it's about to get worse somehow).
In our single-precision case, we didn't have to worry about exponent overflow or underflow because we were using double-precision intermediates, which not only have additional mantissa range, but also additional exponent range.
It's possible that the product of a * b
(an intermediate value in our calculation) goes out of range of what a double can represent, but that the addition of c
might bring the final result back into range (which can happen when the sign of c
is opposite the sign of a * b
). This causes a different set of errors on either end:
- If
a*b
is too large to be represented, it turns into infinity which means that adding c
in will just leave it as infinity even though the final result should have been a representable value (albeit one with a very large magnitude)
- If
a*b
is too small to be represented, it will go subnormal which means bits of the intermediate result will slide off the bottom of the mantissa and we lose bits of information, which can causes us to round incorrectly at our final result.
To solve this, we'll introduced a bias into the calculation, for when the value goes very small or very large:
double CalculateFMAddBias(double a, double b, double c)
{
double testResult = Abs(a * b + c);
if (testResult < Pow2(-500) && Max(a, b, c) < Pow2(800))
{
return Pow2(110);
}
else if (IsInfinite(testResult))
{
return Pow2(-55);
}
else
{
return 1.0;
}
}
For any results that aren't extreme, the bias will remain 1.0
but, for values at the extremes, we'll scale our intermediates down (using powers of 2 which only affect the exponent and not the mantissa) into a range such that we can't temporarily poke outside of range. Also note that my choices of powers of 2 are not perfectly chosen, I didn't bother trying to figure out the exact right biases/thresholds so I just picked ones that I knew were good enough.
So then we do our FMAdd calculation as before, but with the bias introduced (and then backed out at the end):
(double ab, double abErr) = MulWithError(a * bias, b);
(double abc, double abcErr) = AddWithError(ab, c * bias);
err = OddRoundedAdd(abErr, abcErr);
return (abc + err) / bias;
Alright, we've avoided both overflow and underflow and everything is great, right?
Two (Point Five) Last Annoying Implementation Details
Nope, sorry again! It turns out there are still two cases we need to deal with.
Case 1: Infinity or NaN even with the bias
If our result (without error applied) hits infinity even with the avoid-infinity bound, then we should just go ahead and return now to avoid Causing Problems Later (that is, turning what should be infinity into a NaN). And if it's already NaN we can just return now because it's going to be NaN forever.
Except, there's one additional necessary check here, for a case caught by dzaima over on Bluesky): in the event a
and b
are finite numbers but a * b
blows up to infinity, and then c
is the opposite infinity, the correct return value is whichever infinity (positive or negative) c
is, so in our early-out check has to catch that case as well:
if (IsInfiniteOrNaN(abc))
{
if (IsInfinite(c) && !IsInfiniteOrNaN(a) && !IsInfiniteOrNaN(b))
{ return c; }
return abc;
}
Case 2: Subnormal Results
If our result is subnormal (after the bias is backed out), then it's going to lose bits of precision as it shifts down (because the exponent can't go any lower so instead the value itself shifts down the mantissa), which means whoops here's another rounding step, and the dreaded double-rounding returns.
In this case we need to actually odd-round the addition of the error term as well, so that when the bias is backed out and it rounds, it does the correct thing:
const double SubnormThreshold = Pow2(-1022) * AvoidDenormalBias;
if (bias == AvoidSubnormalBias && Abs(abc) < SubnormThreshold)
{
(double finalSum, finalSumErr) = AddWithError(abc, err);
finalSum = RoundToOdd(finalSum, finalSumErr);
return finalSum / bias;
}
And this almost works, except there's one more annoying case, and that's where our result is going subnormal, but only by exactly one bit. Remember that the odd-rounding trick only works if we have two or more bits so that the final rounding works properly, but in this case we're truncating the mantissa by exactly one bit, so we have to do even more work:
- Split the value that will be shifting down into a high and low part (same as we did for the multiply)
- Add our error term to the low part of it
- This preserves additional bits of the error term since we gave ourselves more headroom by removing the upper half of its mantissa
- Remove the bias from both the high and low parts separately
- Removing the bias from the high part doesn't round since we know the lowest bit is 0
- Removing the bias from the low part applies the actual final rounding (correctly) since we gave ourselves more bits to work with
- Sum the halves back together and return that as our final result
- This sum is (thankfully) perfectly representable by the final precision and doesn't introduce any additional error.
const double OneBitSubnormalThreshold =
OneBitSubnormalThreshold * 0.5;
if (Abs(finalResult.result) >= k_oneBitDenormThreshold)
{
(rh, rl) = Split(finalSum);
rl = OddRoundedAddition(rl, finalSumErr);
rh /= bias;
rl /= bias;
return rh + rl;
}
OMG Are We Done Now?
As far as I'm aware, those are all the implementation details to doing a 64-bit double-precision FMAdd implementation. It's conceptually not that much more complicated than the 32-bit one, but mechanically it's worse, plus there are those fun extra edge cases to consider.
Here's the final code:
(double h, double l) Split(double v)
{
double h = ZeroBottom27BitsOfMantissa(v);
double l = v - h;
return (h, l);
}
(double prod, double err) MulWithError(
double x,
double y)
{
double prod = x * y;
(xh, xl) = Split(x);
(yh, yl) = Split(y);
double err =
(((xh*yh - prod) + xh*yl) + xl*yh) + xl*yl;
return (prod, err);
}
double OddRoundedAdd(double x, double y)
{
(double sum, double err) = AddwithError(x, y);
return RoundToOdd(sum, err);
}
double FMAdd(double a, double b, double c)
{
const double AvoidSubnormalBias = Pow2(110);
double bias = 1.0;
{
double testResult = Abs(a * b + c);
if (testResult < Pow2(-500) && Max(a, b, c) < Pow2(800))
{ bias = AvoidSubnormalBias; }
else if (IsInfinite(testResult))
{ bias = Pow2(-55); }
}
(double ab, double abErr) = MulWithError(a * bias, b);
(double abc, double abcErr) = AddWithError(ab, c * bias);
if (IsInfiniteOrNaN(abc))
{
if (IsInfinite(c) && !IsInfiniteOrNaN(a) && !IsInfiniteOrNaN(b))
{ return c; }
return abc;
}
double err = OddRoundedAdd(abErr, abcErr);
const double SubnormThreshold = Pow2(-1022) * AvoidSubnormalBias;
if (bias == AvoidSubnormalBias && Abs(abc) < SubnormThreshold)
{
(double finalSum, finalSumErr) = AddWithError(abc, err);
const double OneBitSubnormalThreshold =
OneBitSubnormalThreshold * 0.5;
if (Abs(finalSum) >= OneBitSubnormalThreshold)
{
(rh, rl) = Split(finalSum);
rl = OddRoundedAdd(rl, finalSumErr);
rh /= bias;
rl /= bias;
return rh + rl;
}
else
{
finalSum = RoundToOdd(finalSum, finalSumErr);
return finalSum / bias;
}
}
else
{
return (abc + err) / bias;
}
}
Compare that to the 32-bit version and you can see why this one got its own post:
float FMAdd(float a, float b, float c)
{
double product = double(a) * double(b);
(double sum, double err) = AddWithError(product, c);
sum = RoundToOdd(sum, err);
return float(sum);
}
Hopefully you never have to actually implement this yourself, but if you do? I hope this helps.
A thing that I had to do at work is write an emulation of the FMAdd (fused multiply-add) instruction for hardware where it wasn't natively supported (specifically I was writing a SIMD implementation, but the idea is the same), and so I thought I'd share a little bit about how FMAdd works, since I've already been posting about how float rounding works.
So, screw it, here we go with another unnecessarily technical, mathy post!
What is the FMAdd Instruction?
A fused multiply-add is basically doing a multiply and an add as a single operation, and it gives you the result as if it were computed with infinite precision and then rounded down at the final result. FMAdd computes (a * b) + c
without intermediate floating-point error being introduced:
float FMAdd(float a, float b, float c)
{
return (a * b) + c;
}
Computing it normally (using the code above) for some values will get you double rounding (explained in a moment) which means you might be an extra bit off (or, more formally, one ULP) from where your actual result should be. An extra bit doesn't sound like a lot, but it can add up over many operations.
Fused multiply-add avoids this extra rounding, making it more accurate than a multiply followed by a separate add, which is great! (It can also be faster if it's supported by hardware but, as you'll see, computing it without a dedicated instruction on the CPU is actually surprisingly spendy, especially once you get into doing it for 64-bit floats, but sometimes you need precision instead of performance).
Double Rounding
Double rounding happens when the intermediate value rounds down (or up), then the final result also rounds in the same direction - but because of the first rounding, actually overshoots the correctly-rounded final value by a bit.
Here's an example using two successive sums of some 4-bit float values. We'll do the following sum (in top-down order):
1.000 * 2^4
+ 1.001 * 2^0
+ 1.100 * 2^0
The first sum, done with "infinite" internal precision, looks like this:
1.000 * 2^4
+ 1.001 * 2^0
-----------
1.000 0000 * 2^4
+ 0.000 1001 * 2^4
----------------
1.000 1001 * 2^4
If we were to then use that result directly (with no intermediate rounding) and do the second sum, only rounding the final result:
1.000 1001 * 2^4
+ 0.000 1100 * 2^4
----------
= 1.001 0101 * 2^4
-> 1.001 * 2^4
The final result rounds (to nearest) to 1.001
.
However, if we were to round that intermediate value to 4 bits first, we'd get this:
1.000 1001 * 2^4
-> 1.001 * 2^4
+ 0.000 1100 * 2^4
----------
1.001 1100 * 2^4
-> 1.010 * 2^4
In this one, we end up with 1.010
instead of 1.001
because of the intermediate rounding, which pushed us past the correctly-rounded final result.
How to Pretend That You Have Infinite Precision
Okay, for FMAdd we want to calculate a multiply, and then somehow throw an add in there and have it act as if we didn't lose any precision on the multiply.
First we're going to handle the case of 32-bit floats (singles) because it's a wildly simpler case on CPUs that have 64-bit floats (doubles).
(also, sorry in advance, the term "double" for a "double-precision float" and the "double" in "double rounding" are two different instances of "double" but I've written so much of this post and like hell am I changing it now so hopefully it's not too confusing)
The immediately obvious thing to try to get an accurate single-precision FMA is "hey, what if we do the multiply and add as doubles and then round the result back down to a single":
float FMAdd(float a, float b, float c)
{
return float((double(a) * double(b)) + double(c));
}
While that gives a much better result than doing it as pure 32-bits, it actually can still have double rounding. But where does the extra rounding come from, in this case?
The multiply itself isn't the source of the first rounding: Surprisingly (to me, at least): casting two singles to doubles and multiplying those together always results in an exact answer - this is because each of the single-precision values has 24 bits of precision, but a double can store 53 bits of precision, which is more than enough to store the result of multipling two singles (2 * 24
bits of precision max). Since floats are stored as:
sign * 1.mantissa * 2^(exponent)
...it means we're multiplying two numbers of the form 1.xxxxxxxxxx
and 1.yyyyyyyyy
together then adding the exponents together to get the new number, so unlike addition and subtraction (where, say, 1 + 1*10^60
requires a ton of extra precision), if two float numbers have wildly different exponents it doesn't actually matter because the exponents and significand values are handled separately.
To illustrate this, let's pretend we have two 4-digit (base 10) numbers and we multiply them and store the result using 8 digits (double precision):
(1.234 * 2^1) * (1.457 * 2^100)
-> (1.2340000 * 1.4570000) * (2^1 * 2^100)
= 1.7979380 * 2^101
Great, so the double-precision multiply is fine and introduces no rounding at all. So then how do we get double rounding?
As mentioned above, an add (or subtract) can introduce rounding:
(1.234 * 10^0) + (1.457 * 10^9)
-> (1.2340000 * 10^0) + (1.4570000 * 10^9)
= (1.45700001234 * 10^9)
-> 1.45700000
This rounding happens at double precision (so well below the threshold of our target 32-bit result), but there's still rounding, and then the value is rounded again when converted back down to single precision. That's the double rounding and the source of a potential error.
Okay, so, double rounding is bad? Kinda! But it turns out there is a way to introduce a new rounding mode to use for the first rounding that, in the right situations, does not introduce any additional error and ensures that your final result is correct.
A New Rounding Mode?
(This technique is based off of the paper Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd by Sylvie Boldo and Guillaume Melquiond; if you want the full technical details of this whole process, that's where you'll find them. Believe it or not, I'm actually trying to go into less detail!)
The key to eliminating the extra precision loss is by using a non-standard rounding mode: rounding to odd. Standard floating point rounding calculates results with some additional bits of precision (three bits, to be precise), and then rounds based on the result (usually using "round to nearest with round to even on ties", although that detail doesn't end up mattering here - this technique works with any standard rounding mode).
So, assume that we have some way of calculating a double precision addition and also having access to the error between the calculated result and the mathematically exact result. Given those two values we can perform a Round To Odd step:
double RoundToOdd(double value, double errorTerm)
{
if (errorTerm != 0.0
&& LowestBitOfMantissa(value) == 0)
{
if (errorTerm > 0)
{
value = AddOneBitToMantissa(value);
}
else
{
value = SubtractOneBitFromMantissa(value);
}
}
}
Basically: if we have any error at all, and the mantissa is currently even, either add or subtract a single bit's worth of mantissa, based on the sign of the error.
(In practice, I found that I also had to ensure the result was not Infinity before doing this operation, since I implemented this using some bitwise shenanigans that would end up "rounding" Infinity to NaN, so, you know, watch out for that).
Why Does Odd-Rounding the Intermediate Value Work?
Round to odd works as long as we have more bits of value than the final result - specifically we need at least two extra bits. Standard float rounding makes use of something called a "sticky" bit - basically the lowest bit of the extra precision is a 1 if any of the bits below it would have been 1.
And, hey, that is basically what "round to odd" does!
- If the mantissa is odd, regardless of whether there's error or not the lowest bit is already odd.
- If the error was positive and the mantissa was even, we set the lower bit to 1 anyway, effectively stickying (yeah that's a word now) all the error bits below it.
- If the error was negative and the mantissa was even, we subtract 1 from the mantissa, making the lower bit odd, and effectively sticky since some of the digits below it are also 1s.
Effectively, round to odd is just "emulate having a sticky bit at the bottom of your intermediate result" - that way, you have a guaranteed tiebreaker for the final rounding step.
But note that I said it requires you to have at least two extra bits. In the case of our using-doubles-instead-of-singles intermediate addition, good news: we have way more than two extra bits - our intermediate value is a whole-ass double-precision float, so we have 29 extra bits vs. our single-precision final value and (mathematically speaking) 29 is greater than 2.
So, for the true single-precision FMAdd instruction we need to do the following:
float FMAdd(float a, float b, float c)
{
double product = double(a) * double(b);
(double sum, double err) = AddWithError(product, c);
sum = RoundToOdd(sum, err);
return float(sum);
}
That's it! ...wait, what's that AddWithError
function, we haven't even--
Calculating An Exact Addition Result
Right, we need to calculate that intermediate addition along with some accurate error term. It turns out it's possible to calculate a set of numbers, sum and error where mathematicallyExactSum = sum + error
.
For this, we have to dive back to July of 1971 and check out the actually typewritten paper A Floating-Point Technique For Extending the Available Precision by T.J. Dekker. Give that a read if you want way more details on this whole thing.
Calculating the error term of adding two numbers (I'll use x
and y
) is relatively straightforward if |x| > |y|
:
sum = x + y;
err = y - (sum - x);
(this is equation 4.14 in the linked paper)
This is just a different ordering of (x + y) - sum
that preserves accuracy: due to the nature of the values involved in these subtractions (sum
's value is directly related to those of x
and y
, and y
is smaller than x
), it turns out that each of those subtractions is an exact result (the paper has a proof of this, and it's a lot so I'm not going to expand on that here), so we get the precise difference between the calculated sum and the real sum.
But this only works if you know that x
's magnitude is larger than (or equal to) y
's. If you don't know which of the two values has a larger magnitude, you can do a bit more work and end up with:
(double sum, double err) AddWithError(
double x,
double y)
{
double sum = a + b;
double intermediate = sum - x;
double err1 = y - intermediate;
double err2 = x - (sum - intermediate);
return (sum, err1 + err2);
}
(This is effectively the expanded version of listing 4.16 from the linked paper)
err1
here is the same as the value in the first version we calculated (a precision-preserving rewrite of (x + y) - sum
)
err2
is, mathematically, x - (sum - (sum - x))
or 0
; its goal is to calculate the error involved in calculating err1, since without the |x| > |y|
guarantee those subtractions might NOT be exact ... but these ones will be.
- Thus, summing these two error terms together gives us a final, precise error term.
(More details in the paper, hopefully this isn't too glossed over that it loses any meaning)
Finally, the End (For Single-Precision Floats)
So, yeah, that's how you implement the FMAdd instruction for single-precision floats on a machine that has double-precision support:
- Calculate the double-precision product of
a
and b
- Add this product to
c
to get a double-precision sum
- Calculate the error of the sum
- Use the error to odd-round the sum
- Round the double-precision sum back down to single precision
But what if you have to calculate FMAdd for double-precision floats? You can't easily just cast up to, like, quad-precision floats and do the work there, so what now? Can you still do this?
The answer is yes, but it's a lot more work, and that's what the next post is about.
C++17 added support for hex float literals, so you can put more bit-accurate floating point values into your code. They're handy to have, and I wanted to be able to parse them from a text file in a C# app I was writing.
Some examples:
0x1.0p0
0x1.8p1
0x8.0p-3
0x0.8p1
0xAB.CDEFp-10
0x0.0000000ABp0
What Am I Even Looking At Here?
(Before reading this, if you don't have a good feel for how floats work, maybe check my previous post about floats (and float rounding))
I had a bit of a mental block on this number format for a bit - like, what does it even mean to have fractional hex digits? But it turns out it's a concept that we already use all the time and my brain just needed some prodding to make the connection.
With our standard base 10 numbers, moving the decimal point left one digit means dividing the number by 10:
12.3 == 1.23 * 10^1 == 0.123 * 10^2
Hex floats? Same deal, just in 16s instead:
0x1B.C8 == 0x1.BC8 * 16^1 == 0x0.1BC8 * 16^2
Okay, so now what's the "p
" part in the number? Well, that's the start of the exponent. A standard float has an exponent starting with 'e
':
1.3e2 == 1.3 * 10^2
But 'e
' is a hex digit, so you can't use 'e
' anymore as the exponent starter, so they chose 'p
' instead (why not 'x
', the second letter? Probably because a hex number starts with '0x
', so 'x
' also already has a use - but 'p
' is free so it wins)
The exponent for a hex float is in powers of 2 (so it corresponds perfectly to the exponent as it is stored in the value), so:
0x1.ABp3 == 0x1.AB * 2^3
So that's how a hex float literal works! Here's a quick breakdown:
<hex-digits>[ '.' [fractional-hex-digits]] 'p' <exponent>
(Parsing code below the fold)
Okay Now What's This About Parsing Them?
C++ conveniently has functions to parse these (std::strtod/strtof handle this nicely). However, if you're (hypothetically) making a parser, and you happen to be writing it in C# which does not have an inbuilt way to parse these, then you'll have to parse your own.
It ended up being a little more complicated than I thought for a couple reasons.
- Arbitrarily long hex strings seem to be supported, which means you need to track both where the top-most non-zero hex value starts (i.e. skip any leading zeros) but also properly handle bits that are extra tiny which may affect rounding.
- In order to properly handle said rounding, running through the hex digits in reverse and pushing them in from the top ends up being a nice strategy, because float rounding works via a sticky bit that stays set as things right-shift down through it.
Ultimately I ended up with the following algorithm (in C#) to parse a hex float, which I'm definitely sure is ~perfect~ and has absolutely no bugs whatsoever. It is also absolutely the most efficient version of this possible, with no room for improvement. Yep.
I'm throwing it in here in case anyone ever finds it useful.
static bool IsDigit(char c) => (c >= '0' && c <= '9');
static bool IsHexDigit(char c)
=> IsDigit(c)
|| (c >= 'A' && c <= 'F')
|| (c >= 'a' && c <= 'f');
double ParseHexFloat(string s)
{
if (s.Length < 2 || s[0] != '0' || char.ToLowerInvariant(s[1]) != 'x')
{ throw new FormatException("Missing 0x prefix"); }
if (s.Length < 3 || !IsHexDigit(s[2]))
{ throw new FormatException(
"Hex float literal must contain at least one whole part digit"); }
int i = 2;
int decimalPointIndex = -1;
int firstNonZeroHexDigitIndex = -1;
while (i < s.Length && (IsHexDigit(s[i]) || s[i] == '.'))
{
if (s[i] == '.')
{
if (decimalPointIndex >= 0)
{ throw new FormatException("Too many decimal points"); }
decimalPointIndex = i;
}
else if (s[i] != '0' && firstNonZeroHexDigitIndex < 0)
{ firstNonZeroHexDigitIndex = i; }
i++;
}
int lastHexDigitIndex = i - 1;
if (lastHexDigitIndex == decimalPointIndex)
{ lastHexDigitIndex--; }
if (decimalPointIndex < 0)
{ decimalPointIndex = i; }
if (i >= s.Length || char.ToLowerInvariant(s[i]) != 'p')
{ throw new FormatException("Missing exponent 'p'"); }
i++;
bool negativeExponent = false;
if (i < s.Length && (s[i] == '+' || s[i] == '-'))
{
negativeExponent = (s[i] == '-');
i++;
}
if (i >= s.Length)
{ throw new FormatException("Missing exponent digits"); }
int exponent = 0;
while (i < s.Length && IsDigit(s[i]))
{
if (int.MaxValue / 10 < exponent)
{ throw new FormatException("Exponent overflow"); }
exponent *= 10;
exponent += (int)(s[i] - '0');
i++;
}
if (negativeExponent)
{ exponent = -exponent; }
if (firstNonZeroHexDigitIndex < 0)
{ return 0.0; }
if (i != s.Length)
{ throw new FormatException("Unexpected characters at end of string"); }
if (decimalPointIndex >= firstNonZeroHexDigitIndex)
{ exponent += ((decimalPointIndex - firstNonZeroHexDigitIndex) * 4) - 1; }
else
{ exponent += ((decimalPointIndex - firstNonZeroHexDigitIndex + 1) * 4) - 1; }
ulong mantissa = 0;
for (i = lastHexDigitIndex; i >= firstNonZeroHexDigitIndex; i--)
{
if (s[i] == '.')
{ continue; }
char c = char.ToLowerInvariant(s[i]);
ulong v = (c >= 'a' && c <= 'f')
? (ulong)(c - 'a' + 10)
: (ulong)(c - '0');
mantissa = (mantissa >> 4) | (mantissa & 0xf);
mantissa |= (v << 60);
}
while ((mantissa & 0x8000_0000_0000_0000ul) == 0)
{
mantissa <<= 1;
exponent--;
}
const int DoubleExponentBias = 1023;
const int MaxBiasedDoubleExponent = 1023 + DoubleExponentBias;
const ulong MantissaMask = 0x000f_ffff_ffff_fffful;
const ulong ImpliedOneBit = 0x0010_0000_0000_0000ul;
const int ExponentShift = 52;
const int MantissaShiftRight = sizeof(double)*8 - ExponentShift - 1;
exponent += DoubleExponentBias;
if (exponent <= 0)
{
mantissa = (mantissa >> 1) | (mantissa & 1);
while (exponent < 0)
{
mantissa = (mantissa >> 1) | (mantissa & 1);
exponent++;
}
}
bool roundBit = (mantissa & 0b10000000000) != 0;
bool tiebreakerBit = (mantissa & 0b01111111111) != 0;
mantissa >>= MantissaShiftRight;
if (roundBit)
{
if (tiebreakerBit || (mantissa & 1) != 0)
{ mantissa++; }
ulong overflowedMask = ImpliedOneBit << ((exponent == 0) ? 0 : 1);
if ((mantissa & overflowedMask) != 0)
{
exponent++;
mantissa >>= 1;
}
}
if (mantissa == 0)
{ return 0.0; }
if (exponent > MaxBiasedDoubleExponent)
{ return double.PositiveInfinity; }
mantissa &= ~ImpliedOneBit;
ulong assembled = mantissa | (((ulong)exponent) << ExponentShift);
double result = BitConverter.UInt64BitsToDouble(assembled);
return result;
}