(This post follows Part 1: 32-bit floats and will make very little sense without having read that one first. Honestly, it might make little sense having read that one first, I dunno!)
Last time we went over how to calculate the results of the FMAdd instruction (a fused-multiply-add calculated as if it had infinite internal precision) for 32-bit single-precision float values:
Calculate the double-precision product of a and b
Add this product to c to get a double-precision sum
Calculate the error of the sum
Use the error to odd-round the sum
Round the double-precision sum back down to single precision
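Glued together, those five steps look roughly like this - a sketch with names of my own invention; it assumes doubles round to nearest-even (the IEEE 754 default) and skips edge cases like overflow and NaNs:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

float FMAddViaDouble(float a, float b, float c)
{
    // 1. The product of two 24-bit mantissas needs at most 48 bits, which
    //    fits in a double's 53-bit mantissa, so this multiply is exact.
    double prod = double(a) * double(b);

    // 2. One double-precision add - the only step here that can round.
    double sum = prod + double(c);

    // 3. Recover the exact rounding error of that add (the TwoSum trick,
    //    i.e. AddWithError inlined).
    double cPart = sum - prod;
    double err = (prod - (sum - cPart)) + (double(c) - cPart);

    // 4. Odd-round: if the add was inexact and the sum's low mantissa bit
    //    is even, nudge the sum one ULP toward the true result (that
    //    neighbor always has an odd low bit). This is what prevents
    //    double rounding when we shrink back down to float.
    if (err != 0.0)
    {
        uint64_t bits;
        std::memcpy(&bits, &sum, sizeof(bits));
        if ((bits & 1) == 0)
            sum = std::nextafter(sum, (err > 0.0) ? INFINITY : -INFINITY);
    }

    // 5. One final rounding down to single precision.
    return float(sum);
}
```

For inputs that don't trigger step 4, this is just "do the math in doubles and cast back down" - the odd-rounding only kicks in when the double add itself was inexact.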
This requires casting up to a 64-bit double-precision float to get extra bits of precision. But what if you can't do that? What if your inputs are already doubles? You can't (in most cases) cast up to a quad-precision float. So what do you do?
To do this natively as doubles, we need to invent a new operation: MulWithError. This is the multiplication equivalent of the AddWithError function from the 32-bit solution:
```
(double prod, double err) MulWithError(double x, double y)
{
    double prod = x * y;
    double err = // ??? how do we do this
    return (prod, err);
}
```
We'll get to how to implement that in a moment, but first we'll walk through how to use that function to calculate a proper FMAdd.
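As a preview (and not necessarily the exact route this series takes), one classical way to fill in that err line is Dekker's TwoProduct, built on Veltkamp splitting - here's a sketch, with the struct and helper names being mine:

```cpp
// prod + err equals x * y exactly, barring overflow/underflow in the
// intermediate terms.
struct ProdErr { double prod; double err; };

// Veltkamp splitting: break x into hi + lo where each half fits in
// ~26 mantissa bits, so products of halves are exact in a double.
static void Split(double x, double& hi, double& lo)
{
    const double splitter = 134217729.0; // 2^27 + 1
    double t = splitter * x;
    hi = t - (t - x);
    lo = x - hi;
}

ProdErr MulWithError(double x, double y)
{
    double xHi, xLo, yHi, yLo;
    Split(x, xHi, xLo);
    Split(y, yHi, yLo);

    double prod = x * y;

    // Subtract the partial products back off of the rounded product;
    // what's left over is exactly the rounding error.
    double err = ((xHi * yHi - prod) + xHi * yLo + xLo * yHi) + xLo * yLo;
    return { prod, err };
}
```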
A thing that I had to do at work is write an emulation of the FMAdd (fused multiply-add) instruction for hardware where it wasn't natively supported (specifically I was writing a SIMD implementation, but the idea is the same), and so I thought I'd share a little bit about how FMAdd works, since I've already been posting about how float rounding works.
So, screw it, here we go with another unnecessarily technical, mathy post!
What is the FMAdd Instruction?
A fused multiply-add is basically doing a multiply and an add as a single operation, giving you the result as if it were computed with infinite internal precision and then rounded just once at the end. FMAdd computes (a * b) + c without any intermediate floating-point error being introduced:
```
float FMAdd(float a, float b, float c)
{
    // ??? Somehow do this with no intermediate rounding
    return (a * b) + c;
}
```
Computing it normally (using the code above) for some values will get you double rounding (explained in a moment), which means you might be an extra bit off (or, more formally, one ULP off) from where your actual result should be. An extra bit doesn't sound like a lot, but it can add up over many operations.
Fused multiply-add avoids this extra rounding, making it more accurate than a multiply followed by a separate add, which is great! It can also be faster when it's supported by hardware. As you'll see, though, computing it without a dedicated CPU instruction is actually surprisingly spendy, especially once you get into doing it for 64-bit floats - but sometimes you need precision instead of performance.
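To make the double-rounding problem concrete, here's a small example - the inputs are ones I chose so that the product lands exactly on a rounding tie, and std::fmaf is the standard library's fused version:

```cpp
#include <cmath>

float a = 1.0f + 0x1p-12f;  // 1 + 2^-12
float b = 1.0f + 0x1p-12f;
float c = -1.0f;

// The exact product is 1 + 2^-11 + 2^-24, which sits exactly halfway
// between two floats, so the standalone multiply rounds (to even) and
// the trailing 2^-24 is gone before the add ever sees it.
float prod  = a * b;               // rounds to 1 + 2^-11
float naive = prod + c;            // == 0x1p-11

// The fused version rounds once, at the end, and keeps that bit.
float fused = std::fmaf(a, b, c);  // == 0x1p-11 + 0x1p-24
```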
C++17 added support for hex float literals, so you can put more bit-accurate floating point values into your code. They're handy to have, and I wanted to be able to parse them from a text file in a C# app I was writing.
I had a bit of a mental block on this number format for a bit - like, what does it even mean to have fractional hex digits? But it turns out it's a concept that we already use all the time and my brain just needed some prodding to make the connection.
With our standard base 10 numbers, moving the decimal point left one digit means dividing the number by 10:
12.3 == 1.23 * 10^1 == 0.123 * 10^2
Hex floats? Same deal, just in 16s instead:
0x1B.C8 == 0x1.BC8 * 16^1 == 0x0.1BC8 * 16^2
Okay, so now what's the "p" part in the number? Well, that's the start of the exponent. A standard float has an exponent starting with 'e':
1.3e2 == 1.3 * 10^2
But 'e' is a hex digit, so you can't use 'e' as the exponent marker anymore, and they chose 'p' instead. (Why not 'x', the second letter of "exponent"? Probably because a hex number already starts with '0x', so 'x' has a use too - but 'p' is free, so it wins.)
The exponent for a hex float is in powers of 2 (so it corresponds perfectly to the exponent as it is stored in the value), so:
0x1.ABp3 == 0x1.AB * 2^3
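For example (the specific values here are just mine), that literal works directly in C++17 source, and strtod parses the same format from text:

```cpp
#include <cstdlib>

// 0x1.AB is 1 + 0xAB/256 = 1.66796875, and p3 multiplies by 2^3 = 8.
double fromLiteral = 0x1.ABp3;  // == 13.34375

// strtod accepts the same hex-float format at runtime.
double fromText = std::strtod("0x1.ABp3", nullptr);  // == 13.34375

// The point-shifting equivalence from earlier holds too - the literal
// form just requires an explicit exponent, and 16^1 == 2^4:
bool sameValue = (0x1B.C8p0 == 0x1.BC8p4);
```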
So that's how a hex float literal works! Here's a quick breakdown:
I've ported the blog off of Wordpress onto a static site generator (specifically, 11ty/Eleventy). I have a bit more control over the format here, and it's easier for me to write pages (Wordpress was fighting me on all sorts of formatting that I can now just do directly).
What this means is I can finally start copying over the rest of my posts that were on (the now-defunct, sadly) cohost.org (rest easy, little eggbug).
Likely there are things on the new site that aren't set up correctly yet, so if you happen to notice anything, find me on Bluesky or Mastodon and let me know!
I was writing about how to parse C++17-style hex floating point literals, and in doing so I ended up writing a bunch about how floats work in general (and specifically how floating point rounding happens), so I opted to split it off from that post into its own, since it’s already probably way too many words as it is?
Here we go!
How Floats Work
If you don’t know, a floating point number (at least, an IEEE 754 float, which effectively all modern hardware supports) consists of three parts:
Sign bit – the upper bit of the float is the sign bit: 1 if the float is negative, 0 if it’s positive.
Exponent – the next few bits (8 bits for a 32-bit float, 11 bits for a 64-bit float) contain the exponent data, which is the power of two that the mantissa’s value gets multiplied by. (Note that the exponent is stored in a biased way – more on that in a moment.)
Mantissa – the remaining bits (23 for a 32-bit float, 52 for a 64-bit float) represent the fractional part of the float’s value.
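As a quick illustration (the struct and function names here are mine), those three fields can be pulled apart with a little bit-twiddling:

```cpp
#include <cstdint>
#include <cstring>

// The three fields of a 32-bit float, pulled out by bit position.
struct FloatParts
{
    uint32_t sign;      // 1 bit
    uint32_t exponent;  // 8 bits, stored biased by 127
    uint32_t mantissa;  // 23 bits of fraction
};

FloatParts Decompose(float f)
{
    // memcpy is the well-defined way to reinterpret the bits in C++.
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));

    return
    {
        bits >> 31,           // top bit: sign
        (bits >> 23) & 0xFF,  // next 8 bits: biased exponent
        bits & 0x7FFFFF,      // low 23 bits: mantissa
    };
}
```

So 1.0f, for instance, comes apart as sign 0, exponent 127 (the bias plus a true exponent of 0), and mantissa 0.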
A while back (over a year ago) this blog got hacked and so it’s been down for a hot minute. I’ve brought it back online and now I can finally post things on it again – I have a few things that I’ll post here and there but honestly it’s unlikely to ever be really FREQUENT around here.
Valve recently announced that Procyon is in the most-recent batch of titles to be given the green light for release on Steam! This is super-exciting news!
There are a few things that I want to add to Procyon before it’s ready for general Steam release: achievements, proper leaderboards, steam overlay support, etc. Basically, all of the steam features that make sense.
It’ll be a bit before it gets done, as we’re working on putting the finishing touches on Infamous: Second Son at work, so I’m a little dead by the time I get home at the moment. But once we’ve shipped, I’ll likely have the time and energy to get Procyon rolling out onto Steam.
It’s been a ridiculously long time coming, but I’m finally breaking my radio silence on this poor, poor blog to let you know that Procyon has, in fact, finally been released!
It’s available on Desura and IndieCity:
(Way-late 2022 edit: no it’s not, anymore – both of those storefronts are gone. But it’s on Steam!)
So apparently I forgot how to update my blog for over a year, making the previous post’s title even more accurate than it should have been.
What’s been going on, you ask? I’ll tell you.
I’ve switched jobs! Now I’m working at Sucker Punch Productions as a coder (working mostly on missions and the like). It’s a totally fantastic place to work. If you look ever-so-slightly close, you can find my name in the Infamous 2 credits 🙂
I’ve entered Procyon into DreamBuildPlay and this year’s PAX 10 (no love from the judges, though)
I’ve been lazy about updating my blog!
And I’ve updated the holy craps out of Procyon!
Entirely new enemies (and enemy art)
New levels
Updated special effects
A new level
All sorts of new craziness
Okay, the list isn’t as long as last post’s, but I have been busy. Some new videos: