Floating Point Numbers and Rounding

(originally posted over at Cohost)

I was writing about how to parse C++17-style hex floating point literals, and in doing so I ended up writing a bunch about how floats work in general (and specifically how floating point rounding happens), so I opted to split it off from that post into its own, since it’s already probably way too many words as it is 😅

Here we go!

How Do Floats Work

If you don’t know, a floating-point number (at least, an IEEE 754 float, which effectively all modern hardware supports) consists of three parts:

  • Sign bit – the upper bit of the float is the sign bit: 1 if the float is negative, 0 if it’s positive.
  • Exponent – the next few bits (8 bits for a 32-bit float, 11 bits for a 64-bit float) contain the exponent data, which is the power of two to multiply the mantissa value by. (Note that the exponent is stored in a biased way – more on that in a moment)
  • Mantissa – the remaining bits (23 for 32-bit float, 52 for a 64-bit float) represent the fractional part of the float’s value.

In general (with the exception of subnormal floats and 0.0, explained in a bit) there is an implied 1 in the float: that is, if the mantissa has a value of “FC3CA0000”, the actual value is 1.FC3CA0000 (the mantissa bits are all to the right of the radix point) before the exponent is applied. Having this implied 1 gives an extra bit of precision to the value, and you don’t even have to store that extra 1 bit anywhere – it’s implied! Clever.

The exponent represents the power of two involved (Pow2(exponent)), which has the nice property that multiplying or dividing a float by a power of two does not (usually, except at the extremes) affect the precision of the number: dividing by 2 simply decrements the exponent by 1, and multiplying by 2 increments it by 1.

For a double-precision (64-bit) float, the maximum representable exponent is 1023 and the minimum is -1022. These are stored in 11 bits, and they’re biased (which is to say that the stored 11 bits are actualExponent + bias, where the bias is 1023). That means that this range of [-1022, 1023] is actually stored as [1, 2046] (00000000001 and 11111111110 in binary). This range uses all but two of the possible 11-bit values, which are used to represent two sets of special cases:

  • Exponent value 00000000000b represents a subnormal float – that is, it still has the effective exponent of -1022 (the minimum representable exponent) but it does NOT have the implied 1 – values smaller than this start to lose bits of precision for every division by 2 as it can’t decrement the exponent any farther and so ends up sliding the mantissa to the right instead.
    • For this 00000000000b exponent, if the mantissa is 0, then you have a value of 0.0 (or, in a quirk of floating point math, -0.0 if the sign bit is set).
  • Exponent value 11111111111b represents one of two things:
    • If the mantissa is zero, this is infinity (either positive or negative infinity depending on the sign bit).
    • If the mantissa is non-zero, it’s NaN (not a number).
      • (There are two types of NaN, quiet and signaling. Those are a subject for another time, but the difference bit-wise is whether the upper bit of the mantissa is set: if 1 it’s quiet, if 0 it’s signaling.)
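
A quick Python sketch of those special cases (the helper names are mine, not from any particular library) – it pulls the fields out of a double’s raw bits and names which case applies:

```python
import struct

def float_to_bits(f):
    # Reinterpret the double's 8 bytes as an unsigned 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", f))[0]

def classify(bits):
    # Field layout: 1 sign bit, 11 exponent bits, 52 mantissa bits.
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0:              # the 00000000000b case
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == 0x7FF:          # the 11111111111b case
        if mantissa == 0:
            return "infinity"
        # Upper mantissa bit: 1 means quiet, 0 means signaling.
        return "quiet nan" if (mantissa >> 51) & 1 else "signaling nan"
    return "normal"
```

For example, `classify(float_to_bits(5e-324))` (the smallest positive subnormal) comes back as "subnormal". (Note that -0.0 also reports "zero" here; the sign bit is tracked separately.)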

If you wanted to write a bit of math to calculate the value of a 64-bit float (ignoring the two special exponent cases) it would look something like this (where bias in this case is 1023):

(signBit ? -1 : 1)
  * (1 + (mantissaBits / Pow2(52)))
  * Pow2(exponentBits - bias)
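
As a sanity check, that formula translates directly into Python (Pow2 becomes `2**`); for any normal double it reconstructs the original value exactly, since every step is a power-of-two scale:

```python
import struct

BIAS = 1023  # exponent bias for 64-bit floats

def decode_normal(f):
    # Extract the raw bit fields of the double.
    bits = struct.unpack("<Q", struct.pack("<d", f))[0]
    sign_bit = bits >> 63
    exponent_bits = (bits >> 52) & 0x7FF
    mantissa_bits = bits & ((1 << 52) - 1)
    # (signBit ? -1 : 1) * (1 + mantissaBits / Pow2(52)) * Pow2(exponentBits - bias)
    return (-1.0 if sign_bit else 1.0) \
        * (1 + mantissa_bits / 2**52) \
        * 2.0**(exponent_bits - BIAS)
```

Running `decode_normal(-0.75)` gives back exactly -0.75, and likewise for any other normal double.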

Standard Float Rounding

Okay, so now we know how floats are stored. Math in a computer clearly isn’t done with infinite precision, so when you do an operation that drops some precision, how does the result get rounded?

When an operation is done on values with mismatched exponents, the value with the lower exponent is effectively shifted to the right by the difference to match the exponents.

For example, here’s the subtraction of two four-bit-significand (3 bits of mantissa plus the implied 1) floats:

  1.000 * 2^5
- 1.001 * 2^1

The number being subtracted has the smaller exponent, so we end up shifting it to the right to compensate (for now, doing it as if we had no limit on extra digits):

  1.000 0000 * 2^5
- 0.000 1001 * 2^5 // Shifted right to match exponents
  0.111 0111 * 2^5
  1.110 111  * 2^4 // shifted left to normalize (fix the implied 1)
  1.111      * 2^4 // round up since we had more than half off the edge

Note that in this example, the value being subtracted shifted completely off the side of the internal mantissa bit count. Since we can’t store infinite off-the-end digits, what do we do?

Float math uses three extra bits (to the “right” of the mantissa), called the guard bit, the round bit, and the sticky bit.

As the mantissa shifts off the end, it shifts into these bits. This works basically like a normal shift right, with the exception that the moment ANY 1 bit gets shifted into the sticky bit, it stays 1 from that point on (that’s what makes it sticky).

For instance:

      G R S
1.001 0 0 0 * 2^1
0.100 1 0 0 * 2^2 // 1 shifts into the guard bit
0.010 0 1 0 * 2^3 // now into the round bit
0.001 0 0 1 * 2^4 // now into the sticky bit
0.000 1 0 1 * 2^5 // sticky bit stays 1 now

Note that the sticky bit stayed 1 on that last shift, even though in a standard right shift it would have gone off the end. Basically if you take the mantissa plus the 3 GRS bits (not to be confused with certain cough other meanings of GRS) and shift it to the right, the operation is the equivalent of:

mantissaAndGRS = (mantissaAndGRS >> 1) | (mantissaAndGRS & 1)
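
That one-liner is the whole trick; here’s a toy Python model of the shift (not how hardware does it, but bit-for-bit the same result) that reproduces the table above:

```python
def sticky_shift_right(mantissa_and_grs, shift):
    # Shift right one bit at a time, ORing whatever falls off the end
    # back into bit 0 (the sticky bit) so a lost 1 is never forgotten.
    for _ in range(shift):
        mantissa_and_grs = (mantissa_and_grs >> 1) | (mantissa_and_grs & 1)
    return mantissa_and_grs
```

Starting from 1.001 with GRS bits of 000 (0b1001000), four shifts give 0b0000101 – the 0.000 1 0 1 row from the table, sticky bit still set.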

Now when determining whether to round, you can take the 3 GRS bits and treat them as GRS/8 (i.e. GRS bits of 100b are the equivalent of 0.5 (4/8), and 101b is 0.625 (5/8)), and use that as the fraction that determines whether/how you round.

The standard float rounding mode is round-to-nearest, even-on-ties: that is, if the value could round either way (think 1.5, which is equally close to 1.0 and 2.0), you round to whichever of the neighboring values is even (so 1.5 and 2.5 would both round to 2).

Using our bits, the logic, then, is this:

  • If the guard bit is not set, then it rounds down (the fraction is < 0.5), and the mantissa doesn’t change.
  • If the guard bit IS set:
    • If the round bit or sticky bit is set, always round up (the fraction is > 0.5): the mantissa increases by 1.
    • Otherwise, it’s a tie (exactly 0.5, could round either way), so round such that the mantissa ends up even (its lowest bit is 0): the mantissa increments if its lowest bit was 1 (to make it even).
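
The bullet logic above, sketched as Python (the mantissa is treated as a plain integer; a carry out of the top bit, which would bump the exponent, is left to the caller):

```python
def round_grs(mantissa, guard, round_bit, sticky):
    # Round-to-nearest, ties-to-even, driven by the three GRS bits.
    if not guard:
        return mantissa            # fraction < 0.5: round down
    if round_bit or sticky:
        return mantissa + 1        # fraction > 0.5: round up
    # Exactly 0.5: round so the mantissa's lowest bit ends up 0 (even).
    return mantissa + (mantissa & 1)
```

Feeding it the earlier subtraction result – mantissa 0b1110 with GRS of 111 – rounds up to 0b1111, matching the worked example.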

Okay, so if all we care about is guard bit and then <round bit OR sticky bit>, why even have three bits? Isn’t two bits enough?

Nope! Well, sometimes, but not always. Turns out, some operations (like subtraction) can require a left shift by one to normalize the result (like in the above subtraction example), which means if you only had two bits of extra-mantissa information (just, say, a round and sticky bit) you’d be left with one bit of information after the left shift and have no idea if there’s a rounding tiebreaker.

For instance, here’s an operation with the proper full guard, round, and sticky bits:

  1.000 * 2^5
- 1.101 * 2^1

// Shift into the GRS bits:
  1.000 000 * 2^5
- 0.000 111 * 2^5 // sticky kept that lowest 1
  0.111 001 * 2^5
  1.110 01  * 2^4 // shift left 1, still 2 digits left
  1.110     * 2^4 // Round down properly (guard bit is 0)

If this were done with only two bits (round and sticky) we would end up with the following:

  1.000 * 2^5
- 1.101 * 2^1

// Shift into just RS bits:
  1.000 00 * 2^5
- 0.000 11 * 2^5
  0.111 01 * 2^5
  1.110 1  * 2^4 // shift left 1, only 1 digit left

Once we shift left there, we only have one bit of data, and it’s set. We don’t know whether we had a fraction > 0.5 (meaning we have to round up) or exactly 0.5 (a tie, meaning we round to even, which is down in this case).

So the short answer is: three bits because sometimes we have to shift left and lose one bit of the information, and we need at least a “half bit” and a “tiebreaker bit” at all times. Given all the float operations, 3 suffices for this, always.

Procyon is Greenlit!

Valve recently announced that Procyon is in the most-recent batch of titles to be given the green light for release on Steam! This is super-exciting news!

There are a few things that I want to add to Procyon before it’s ready for general Steam release: achievements, proper leaderboards, steam overlay support, etc. Basically, all of the steam features that make sense.

It’ll be a bit before it gets done, as we’re working on putting the finishing touches on Infamous: Second Son at work, so I’m a little dead by the time I get home at the moment. But once we’ve shipped, I’ll likely have the time and energy to get Procyon rolling out onto Steam.


Procyon is released!

It’s been a ridiculously long time coming, but I’m finally breaking my radio silence on this poor, poor blog to let you know that Procyon has, in fact, finally been released!

It’s available on Desura and IndieCity:

(Way-late 2022 edit: no it’s not, anymore – both of those storefronts are gone. But it’s on Steam!)

Also check out Procyon’s nifty homepage: http://procyongame.com!

Thanks to everyone who helped get this thing out the door!

The Procyon Update Post

So. Here I am, back again after another long hiatus in blog posting. But now that Procyon is almost done, I figured I should share some news!

First and foremost, Procyon has now been posted to Steam Greenlight!

Please go vote for it! Every vote gets the game that much closer to being able to release on Steam!

And, to share some of the fun pieces of video, here’s Procyon’s trailer:

Next up, and just as fun, here’s a look at the in-game intro to Procyon:

Procyon now has a website! You can check it out at http://procyongame.com

Additionally, you can now check out Procyon on Facebook!

Finally, you can also listen to (and purchase) Procyon’s soundtrack on Bandcamp!

Anyway, that’s what I’ve got for now!

Oh Hey, There’s A Blog Here

Is this thing on?


So apparently I forgot how to update my blog for over a year, making the previous post’s title even more accurate than it should have been.

What’s been going on, you ask? I’ll tell you.

  • I’ve switched jobs! Now I’m working at Sucker Punch Productions as a coder (working mostly on missions and the like). It’s a totally fantastic place to work. If you look ever-so-slightly close, you can find my name in the Infamous 2 credits 🙂
  • I’ve entered Procyon into DreamBuildPlay and this year’s PAX 10 (no love from the judges, though)
  • I’ve been lazy about updating my blog!
  • And I’ve updated the holy craps out of Procyon!
    • Entirely new enemies (and enemy art)
    • New levels
    • Updated special effects
    • A new level
    • All sorts of new craziness

Okay, the list isn’t as long as last post’s, but I have been busy. Some new videos:


Hopefully I’ll update a bit more often than once per year. Sorry for the radio silence. I have a lot of things that I could write about, if only I’d take the time to do so.

Ridiculously Sparse Update For A Relatively-Unloved Blog

I’ll go into more detail later, but here’s a SUPER quick synopsis of the last (yikes!) six months:

  • Completed the demo build of Procyon
    • Level 3!
    • New title screen!
    • Finished game flow!
    • Working local multiplayer!
    • New control scheme!
    • New font rendering mechanism!
    • New tutorial!
    • Another bullet point with an accompanying exclamation point!
  • Sent it through several (sometimes-painful) rounds of playtest on the creator’s club forums, and got some great feedback (especially the feedback from one Jason Doucette of Xona Games, who gave some painful-to-hear but necessary criticism)
  • Submitted to the PAX 10 competition
  • Learned that I was not accepted into the PAX 10 competition 🙁
  • Started work fixing the networked multiplayer that I broke while working towards the demo (I’d disabled it for the demo so that I could keep my focus elsewhere!)

Anyway, that’s my short update.  But fear not, for here, too, is a pair of videos!

Level 2 Boss-In-Progress

Just another quick update to show off a pair of videos of the in-progress level 2 boss.

First, note that the textures that I have on there currently are horrible, and I know this.  It’s okay.

The first video was simply to test the independent motion of the various moving parts of the boss:

aaaaand the second shows off the missile-launching capabilities of the rear hatches (complete with brand-new missile effects and embarrassing flights into the direct path of the missiles!):

That’s all for now!

Update For The Past N Months


It’s been a while.

What have I been up to since early June?  Quite a bit.  On the non-game-development side of things: work’s been rather busy (still).  Also, I now own (and have made numerous improvements on) a house!  So that’s been eating up time.  But, that’s not (really) what this site is about.

Invisible Changes

Much of my time working on Procyon recently has been spent doing changes deep in the codebase: things that, unfortunately, have absolutely no reflection in the user’s view of the product, but that make the code easier to work with or, more importantly, more capable of handling new things.  For instance, enemies are now based on components as per an article on Cowboy Programming, making it super easy to create new enemy behaviors (and combinations of existing behaviors).  It took quite a few days of work to do this, and when I was finished, the entire game looked and played exactly like it had before I started.  However, the upside is that the average enemy is easier to create (or modify).

Graphical Enhancements

I’ve also made a few enhancements to the graphics.  The big one is that I have a new particle system for fire-and-forget particles (i.e. particles that are not affected by game logic past their spawn time).  It’s allowed me to add some nice new explosion and smoke effects (among other things):

Boomf!

Turret Fires

Also, I used them to add some particles at the origins of enemies firing beams:

Beam Particles

Additionally, particles now render solely to an off-screen, lower-resolution buffer, which has allowed the game to run (thus far) at true 1080p on the Xbox 360.

Also, I decided that the Level 1 background (in the first two images in this post) was hideously bland (if such a thing is possible), so I decided to redo it as flying over a red desert canyon (incidentally, the walls of the canyon are generated using the same basic divide-and-offset algorithm as my lightning bolt generator):


Texture + Generator = Texture Generator

The big thing I’ve done, though, was put together a tool that will generate the HLSL required for my GPU-generated textures, so that I didn’t have to constantly tweak HLSL, rebuild my project, reload, etc.  Now I can see them straight in an editor (though not, yet, on the meshes themselves – that is on my list of things to do still).  It looks basically like this:

ShaderEditor-15-AllGenerated
ShaderEditor-16-Interesting Pattern
ShaderEditor-19-CircuitBoredom

The basic idea is, I have a snippet of HLSL, something that looks roughly like:

Name: Brick
Func: Brick
Category: Basis Functions
Input: Position, var="position"

Output: Result, var="brickOut"
Output: Brick ID, var="brickId"

Property: Brick Size, var="brickSize", type="float2", description="The width and height of an individual brick", default="3,1"
Property: Brick Percent, var="brickPct", type="float", description="The percentage of the range that is brick (vs. mortar)", default="0.9"
Property: Brick Offset, var="brickOffset", type="float", description="The horizontal offset of the brick rows.", default="0.5"
%%
 float2 id;
 brickOut = brick(position.xy, brickSize, brickPct, brickOffset, id);
 brickId = id.xyxy;

Above the “%%” is information on how it interacts with the editor, what the output names are (and which vars in the code they correspond to), and what the inputs and properties are.

Inputs are inputs on the actual graph, from previous snippets.  Properties, by comparison, are what show up on the property grid.  I simplified the inputs/outputs by making them always be float4s, which made the shaders really easy to generate.

Then, there’s a template file that is filled in with the generated data.  In this case, the template uses some structure I had in place for the hand-written ones.  The input and output nodes in the graph are based on data from this template, as well.  In my case, the inputs are position and texture coordinates, and the outputs are color and height (for normal mapping).

So a simple graph like this:


…would, once generated, be HLSL that looks like this:

#include "headers/baseshadersheader.fxh"

void FuncConstantVector(float4 constant, out float4 result)
{
 result = constant;
}

void FuncScale(float4 a, float factor, out float4 result)
{
 result = a*factor;
}

void FuncBrick(float4 position, float2 brickSize, float brickPct, float brickOffset, out float4 brickOut, out float4 brickId)
{
 float2 id;
 brickOut = brick(position.xy, brickSize, brickPct, brickOffset, id);
 brickId = id.xyxy;
}

void FuncLerp(float4 a, float4 b, float4 t, out float4 result)
{
 result = lerp(a, b, t.x);
}


void ProceduralTexture(float3 positionIn, float2 texCoordIn, out float4 colorOut, out float heightOut)
{
 float4 position = float4(positionIn, 1);
 float4 texCoord = float4(texCoordIn, 0, 0);

 float4 generated_result_0 = float4(0,0,0,0);
 FuncConstantVector(float4(1, 1, 1, 1), generated_result_0);

 float4 generated_result_1 = float4(0,0,0,0);
 FuncConstantVector(float4(0.5, 0.1, 0.15, 0), generated_result_1);

 float4 generated_result_2 = float4(0,0,0,0);
 FuncScale(position, float(5), generated_result_2);

 float4 generated_brickOut_0 = float4(0,0,0,0);
 float4 generated_brickId_0 = float4(0,0,0,0);
 FuncBrick(generated_result_2, float2(3, 1), float(0.9), float(0.5), generated_brickOut_0, generated_brickId_0);

 float4 generated_result_3 = float4(0,0,0,0);
 FuncLerp(generated_result_0, generated_result_1, generated_brickOut_0, generated_result_3);

 float4 color = generated_result_3;
 float4 height = float4(0,0,0,0);

 colorOut = color.xyzw;
 heightOut = height.x;
}

#include "headers/baseshaders.fxh"

Essentially, each snippet becomes a function (and again, if you look at FuncBrick compared to the brick snippet from earlier, you can see that the input comes first (and is a float4), the properties come next (with types based on the snippet’s declaration), followed finally by the outputs).  Once each function is in place, the shader itself (in this case, the actual shader is inside of headers/baseshaders.fxh, included at the end, which simply calls the ProceduralTexture function defined just before it) calls each function in the graph, storing the results in unique variables, and passes those into the appropriate functions later down the line.

More Content

In addition to all of that, I have also completed the first draft of level 2’s enemy layout (including the level’s new enemy type), and am working on the boss of the level:


Once I finish the boss of level 2, I’ll start on level 3’s gameplay (and boss).  Once those are done and playable, I’ll start adding the actual backgrounds into them (instead of them being a simple, static starfield).

Bug Tracking

Finally, I’ve started an actual bug tracking setup, based on the easy-to-use Flyspray, which I highly recommend for a quick, easy-to-setup web-based bug/feature tracking thing.

In the case of the database I have, I’ve linked to the roadmap, which is the list of the things that have to be done at certain stages.  I am going to try to submit my game for the PAX 10 this year, so I have my list of what must be done to have a demo ready (the “PAX Demo” version), and the things that, additionally, I’d really like to have ready (the “PAX Demo Plus” version).  Then, of course, there’s “Feature Complete”, which is currently mostly full of high-level work items (like “Finish all remaining levels”, which is actually quite a huge thing) that need to be done before the game is in a fully-playable beta stage.

In short: I’m still cranking away at my game, just more slowly than I’d like.