XNA – JoshJers' Ramblings

Networking Is Hard (Part 3)

In my previous two posts, I started looking into what it would take to code the networking for my game, and came up with a first draft, before realizing that floating-point discrepancies between systems totally threw my lockstepping idea for a loop.

Lockstepping With Collisions

In order to solve the issue with different systems having different floating-point calculation results, I decided to somewhat revamp the network design, and really leverage the fact that I don’t care so much about cheating – you could never get away with a networking scheme like this in a competetive game.

First, the timing of the scrolling, player bullet fire rates, enemy fire rates, etc were all modified to be integer-based instead of float-based. There are no discrepencies in the way that integer calculations happen from machine to machine, so the timings of things like enemy spawning, level scrolling, etc. are now all perfectly in sync, frame by frame.
Next, when the client detects a collision between entities, it sends a message to the server (which, you may recall, is running on the same system as the client – each machine gets one) notifying it of a collision. These messages are also synced across the network.
Thus, whenever an enemy does on any one client, it dies on both servers on the same tick (that is, the “first” client relative to the game clock to detect a collision determines when the collision took place). This means that there are no longer any timing differences between deaths on each server (and if a collision happens to be missed by one client, but hits on the other, it will eventually reach the other client).
Because players were already sending “I died!” flags as part of their network packets, these were already always perfectly in sync, so no change was needed there.

As an added bonus, since all collision detections are now handled by the client (and communicated to the server), the server never has to do any collision detection calculations on its own, which eases up on the CPU load somewhat (previously, the client and server were both doing collision calculations). All the server has to do now is apply collisions reported by either the local client or the remote client.

So now, the actual mechanism by which the game keeps in sync from system to system is set, but how does it handle the three main enemies of the network programmer?

Network Gaming’s Most Wanted #1 – Latency

“Lag” is one of the most dreaded words in the network gaming world. It’s always going to be present – nothing can communicate across the internet faster than the speed of light (and, because of transmission over copper, it’s really more like a sluggish 2/3rds the speed of light!). Routers and switches also add their own delays to the mix. According to statistics gathered by Bungie from Halo 3’s gameplay, most gamers (roughly 90%)end up with a round-trip latency of less than (or equal to) 250ms. That is, it takes a quarter of a second for data to go from System A to System B and back to A. That’s a long time for a fast-action networked game! Thankfully, because messages sent from system to system in this game’s network design are never dependent on messages from the other, nothing has to round trip, so the latencies can effectively be halved, making the system much better at handling lag (because, quite frankly, there’s just less of it!)

As discussed previously, because the client can run ahead of the server and, thus, process local player input immediately, there’s no latency in what the player presses and what actually happens on-screen. But what about how the remote player’s actions look? With a ping under 100ms, there are next to zero visible discrepancies on the system. That is, low-ping games are virtually indistinguishable from locally-played games.

At around the 400ms ping mark, it does start to become obvious that things aren’t quite right – due to the interpolation of remotely-shot bullets, they accelerate up to a certain part of the screen until they reach their known location then slow down to normal speed, which is fairly noticeable (I’m still trying to smooth this out a touch). When enemies get too close to the remote player as it fires, due to the delay, the bullets will collide with the enemy, but the enemy will live longer than it appears it should (because the local client does not reliably know that the remote bullet is actually still alive, it can’t deal actual damage to the enemy, it has to wait for the server to confirm).

Above 1-2 seconds of latency, all bets are off – the local player will find the game still perfectly playable, but the movements of the remote player will be completely erratic, and remotely fired bullets won’t act at all like they should. But, since 90% of gamers have much lower latencies, this is not really an issue. For the majority of gamers, the game will look and play pretty close to how it would if both players were in the same room.

Network Gaming’s Most Wanted #2 – Packet Loss

Latency’s lesser-known brother is packet loss, which is where data sent from one machine to another never makes it (due to routing hardware failure, power outage, NSA interception, alien abduction, etc). On a standard internet connection, you can generally assume that about 10% of the packets that you send will get lost along the way. Also, just because you send Packet A before Packet B doesn’t mean they’ll arrive in the right order – a machine might get a packet sent later before one that was sent earlier.

With the XNA runtime, there are four different methods that you can send packets with (obviously you can mimic these with any networking setup, but I’m using XNA so it’s my frame of reference here):

Unreliable – the other system will get these in potentially any order, or it may not even get them at all. The name says it all – you can’t rely on these packets. This is probably not the best option to use.
In-Order – These packets are for data for which you really only need the most recent data; you only care about the most recent score, for instance – not what the score was in a previous packet. Thus, these packets contain extra version information so that the XNA runtime can ensure that packets that arrive out of order don’t reach you. As soon as a new packet comes in, it becomes available to the game. If a packet that’s older than the most recent one comes in, it’s discarded. You immediately get new ones at the cost of never getting older ones. For many games, this is a perfect scenario.
Reliable – These packets will always arrive. When the XNA runtime receives one of these, it sends an acknowledgement to the other system that it received it. If the system that sent it doesn’t receive such an acknowledgement, it’ll resend (and resend and resend and…) until it finally arrives at the destination. Packets sent reliably are not vulnerable to packet loss; if you send it, as long as the connection remains valid you know it will reach the destination eventually. However, these packets may not arrive in the proper order (you may receive Packet C before Packets A or B).
Reliable, In-Order – On the surface, this sounds like the best choice! These packets always arrive in the right order, and they always arrive! That is, you will always get Packets A, B, C, and D, in that order. There’s a hidden downside, though: If the game recieves Packet C, but has not yet received Packets A or B, it has to hold onto that packet until both A and B arrive, which, if they need to be resent, can really ratchet up the latency. Any one packet that needs to be resent will hold up the whole line until it arrives. Clearly, this type of packet should only be used when absolutely necessary – for normal gameplay, it’s better to use In-Order or Reliable on their own.

Eventually, I decided to send packets in the Reliable way, but not In-Order. But, to minimize the amount that the game has to wait for resent packets to arrive, each packet contains eight frames worth of input/collision data. That way, as long as one out of every string of 8 packets arrives, the server will have all of the relevant information to sync up to that point. And if, for some reason, 8 packets in a row are all lost in transmission, they’ll be resent and make it eventually anyhow.

To handle this, the game essentially has a list of frames that it’s received data for (8 of which come in with each packet).

For each frame that a packet contains, if the frame already been simulated by the server (a frame from the past), that frame is ignored.
Similarly, if the frame is already in the list, it’s ignored.
If it’s not a past frame and it isn’t already in the list, add it to the list (in order – the list is sorted from earliest to latest).
After this is done, if the next frame that the server needs to simulate is in the list, remove it from the list and go! Otherwise, wait until it is.

The game doesn’t care which order the frames are received in – as long as it has the next one in the list, it’ll be able to continue on. Because of the redundancy, it rarely has to wait on a resend due to packet loss. In fact, using XNA’s built-in packet loss simulation (thank you, XNA team!), the packet loss has to be increased to over 90% before the latency of the simulation starts to increase (hypothetically, the magic number is above 87.5% packet loss – greater than 7/8 packets lost).

The disadvantage of this system is that it does add to the bandwidth use, as each packet now contains an average of eight times as much data as it would normally, which brings us to…

Network Gaming’s Most Wanted #3 – Bandwidth

Ah, bandwidth. There’s no point in having a low-latency connection between two systems if the game requires too much bandwidth for the connection to keep up. Because the Xbox Live bandwidth requirement is 8KB/s (that’s kilobytes), that became my goal as well.

This is where I overengineered my system a bit. I was estimating, as a worst case, an average of 10 collisions per frame. With packet header overhead, voice headset data, and everything, with 8 frames worth of data in each packet, I expected to be just BARELY below the 8KB/s limit.

When I finally got the system up and running, it turned out the game was using less than 4KB/s. The average amount of collisions per frame is closer to TWO than it is to 10 (unfunny side note: I had the right number, but the wrong numerical base. My answer was perfect in binary), even with a lot of stuff going on (though an individual frame may have many more, there are usually large spaces between collisions as waves of bullets smack into enemies). The most I’ve been able to get it up to with this system is about 5KB/s, which means the game still has a delightful 3KB/s of breathing room. I think I’ll keep it that way!

Final Remarks

Hopefully this has been an informative foray into the design of a network protocol for an arcade-style shoot-em-up game. I’m no network professional – in fact, this is the first network design I’ve ever done, so I’m sure people who do this stuff for a living are laughing at my pathetic framework. If anyone has any suggestions as to how I might improve my network model, I’m all ears – while it works pretty well, I’m always open to ideas!

Collision Detection Performance (Volume 1)

I have been hard at work on my game (in my ridiculously limited spare time) for the last month and a half. One major hurdle that I’ve had to overcome was collision detection code. Specifically, my collision detection performed great on my PC, but when running it on the Xbox 360, everything would slow to a crawl (in certain situations).

The types of collision detection I have to deal with are varied, due to the weird way that I handle certain classes of obstacle (like walls):

Player bullets vs. Enemy – Player bullets are, for simplicity, treated as spheres, so sphere/mesh testing works here.
Enemy bullets vs. Player – Same as above.
Player vs. Wall – Because the game’s playing field is 2D, the walls in-game are treated as 2D polygons, so it boils down to a 2D mesh vs. polygon test.
Player vs. Enemy – Mesh vs. Mesh here
Beam vs. Enemy – The player has a bendy beam weapon. I divide the curved beam up into line segments, and do ray/mesh tests.

The worst performance offender was, surprisingly, the sphere vs. mesh test, which will be the subject of this article. Before optimizing, when I’d shoot a ton of bullets in a certain set of circumstances, the framerate would drop well into the single digits, because the bullet vs. mesh (sphere vs. mesh) collision couldn’t keep up. Here are the things that I changed to get this test working much, much faster.

When using Value Types, Consider Passing As References

One thing that was slowing my code down was all of the value type copying that my code was doing. Take the following function:

public static bool SphereVsSphere(Vector3 centerA, float radiusA, Vector3 centerB, float radiusB)
{
  float dist = radiusA+radiusB;

  Vector3 diff = centerB-centerA;
  return diff.LengthSquared() < dist*dist;
}

Simple, yes? This function, however, falls prey to reference type copying. You see, “centerA” and “centerB” are both passed in by value, which means that a copy of the data is made. It’s not an issue when done infrequently, but with the number of SphereVsSphere calls that were happening during a given frame, the copies really started to add up.

There’s also a hidden set of copies: the line “Vector3 diff = centerB-centerA” also contains a copy, as it passes centerB and centerA into the Vector3 subtraction operator overload, and they get passed in by value. Also, a new Vector3 gets created inside of the operator then returned, which, I believe, also copies the data into diff.

To eliminate these issues, you should pass all of your non-basic value types (that is, anything that’s not an int, bool, float, anything like that) by reference instead of by value. This eliminates all of the excess copies. It does come at a price, though: in my opinion, it does make the code considerably uglier.

Here’s the updated routine:

public static bool SphereVsSphere(ref Vector3 centerA, float radiusA, ref Vector3 centerB, float radiusB)
{
  Vector3 diff;
  Vector3.Subtract(ref centerA, ref centerB, out diff);
  float dist = radiusA+radiusB;

  return diff.LengthSquared() < dist*dist;
}

Instead of having a nice-looking overloaded subtraction, now there’s a call to Vector3.Subtract. While it’s not so bad in the case of a simple subtraction, when you have a more complicated equation, they pile up pretty quickly. However, given the speed boost just making this change can give you, it’s totally worth it.

Use Hierarchical Collision Detection (But Use a Good Bounding Volume)

Heirarchical collision detection is a good thing.

For those of you that DON’T know, basically instead of testing your collider against every triangle in a mesh, you have a tree in which each node has a bounding volume, and the leaves contain the actual triangles. The idea is that, by doing a much simpler collider vs. bounding volume test, you can elminiate large amoungs of triangles before you ever have to test them.

In this case, I was using a sphere tree, where each node in the tree has a bounding sphere, and the leaves of the tree contain actual mesh triangles. I used spheres instead of AABBs (Axis-aligned bounding boxes) because transforming AABBs is expensive (and they become Oriented bounding boxes after the transform). Transforming a sphere is easy, however. None of my object transforms have scale data, so it’s a simple matter of transforming the sphere’s center.

However, the use of bounding spheres has a dark side. Unless all of your heirarchy levels is roughly sphere-shaped, a sphere is a terribly inefficient bounding volume. They’re usually much larger than the geometry that they contain, so there are more recursions into lower levels of the tree (think of there as being more dead space around the geometry when using spheres than bounding boxes).

By also adding bounding boxes to the data, I could use them where I’m not having to transform them. For instance, because this is sphere vs. mesh, and the entire mesh is rigid, I can take the mesh’s world 4×4 matrix, and transform the sphere by the INVERSE of it. This way, the sphere is in model space, and I can use the bounding volumes without having to do any transformations at lower levels.

But now I needed a sphere vs. AABB test. However, I didn’t much care if it was exact or not, so instead I used a simple test where I expand the box by the radius of the sphere, then test whether the sphere is inside of the box or not. Near the corners (surely this is where the term “corner case” comes from), this can give false positives, but it will never say the sphere DOESN’T intersect the box when it should say it does. This is an acceptable trade-off.

public static bool SphereVsAABBApproximate(ref Vector3 sphereCenter, float sphereRadius, ref Vector3 boxCenter, ref Vector3 boxExtent)
{
  Vector3 relativeSphereCenter;
  Vector3.Subtract(ref sphereCenter, ref boxCenter, out relativeSphereCenter); // Get the sphere center relative to the box's center.
  Vector3Helper.Abs(ref relativeSphereCenter); // Per-component absolute value.

  return (relativeSphereCenter.X <= boxExtent.X+sphereRadius && relativeSphereCenter.Y <= boxExtent.Y+sphereRadius && relativeSphereCenter.Z <= boxExtent.Z+sphereRadius);
}

Simple, but effective. Converting from using a sphere bounding volume to AABBs cut down the number of recursions (and triangle comparisons) being done dramatically, since the AABBs are a much tighter fit to the geometry.

Recursion Is Weird

One suggestion I got, also, was to eliminate recursion. The heirarchical nature of the algorithm meant that my test was recursive. Here was the test as originally written:

public static bool SphereVsAABBTree(ref Vector3 sphereCenter, float sphereRadius, CollisionTreeNode node)
{
  if (!SphereVsAABBApproximate(ref sphereCenter, sphereRadius, ref node.BoxCenter, ref node.BoxExtent))
    return false; // No collision with this node, return false

  if (node.Left != null)
  {
    // This means that there are Left and Right children (either they're both null, or both set).
    if(SphereVsAABBTree(ref sphereCenter, sphereRadius, node.Left))
      return true;
    return SphereVsAABBTree(ref sphereCenter, sphereRadius, node.Right); // A node with child nodes can't have triangles, so just return this result.
  }

  // This node has triangles, so test against them.  If any of them intersects the sphere, return success.
  for(int i = 0; i < node.Indices.Length; i+=3)
  {
    if(SphereVsTriangle(ref sphereCenter, sphereRadius, ref node.Vertices[node.Indices[i+0]], ref node.Vertices[node.Indices[i+1]], ref node.Vertices[node.Indices[i+2]]))
      return true;
  }
  return false;
}

As you can see, it recurses into child nodes until it either gets a false test out of both of them, or it reaches triangles. But how do you eliminate recursion in a tree such as this? More specifically, how do you do it while using a constant (non-node-count dependent) amount of memory?

The trick is as follows: Assuming your nodes contain Parent pointers in addition to Left and Right pointers (where the Parent of the trunk of the tree is null), you can do it with no issue. You track the node that you’re currently visiting (“cur”), and the node that you previously visited (“prev”, initialized to null). When you reach a node, test as follows:

if you came to it via its parent (that is, prev == cur.Parent), you’ve never visited it before.
- At this point, you should do your collision tests. It’s a newly-visited node.
- prev = cur
- cur = cur.Left — This is basically “recursing” into the Left node.
If you arrived from its Left child, you visited it before and recursed into its left side
- Since this node has already been visited, do not do the collision tests. They’ve already been shown to be successful.
- prev = cur
- cur = cur.Right — Since we recursed into Left last time we were here, recurse into Right this time.
If you arrived via its Right child, you’ve visited both of its children, so you are done with this node.
- Again, this node has already been visited, so do not run the collision tests.
- prev = cur
- cur = cur.Parent — We’re done with this node, so go back to its parent.
When doing the collision tests, if a Sphere vs. AABB bounding volume test ever fails, we don’t have to “recurse” so go back to its parent.
- prev = cur
- cur = cur.Parent
Finally, if you do a Sphere vs. Triangle collision test and it succeeds, we can immediately return, as we have a guaranteed collision, and no more triangles or nodes need to be tested.

Doing all of this makes the routine bigger, but no recursion is necessary, so there’s no additional stack space generated per node visited (and no function call overhead, either). The finished code is as follows (note that I also added a quick sphere vs. sphere test right at the outset, because it’s a very quick early out if the sphere is nowhere near the mesh):

public static bool SphereVsAABBTree(ref Vector3 sphereCenter, float sphereRadius, CollisionTreeNode node)
{
  CollisionTreeNode prev = null, cur = node;

  // At the top level, just do a sphere/sphere test for a super-quick out.
  if (!SphereVsSphere(ref sphereCenter, sphereRadius, ref node.Sphere.Center, node.Sphere.Radius))
    return false;

  while(cur != null)
  {
    if(prev == cur.Parent) // Only do the tests if we JUST got here.
    {
      if (!SphereVsAABBApproximate(ref sphereCenter, sphereRadius, ref cur.BoxCenter, ref cur.BoxExtent))
      {
        // No intersection?  Go ahead and just back out of this node now.
        prev = cur;
        cur = cur.Parent;
        continue; // By continuing, we bypass the rest of this code and re-visit the parent immediately.
      }

      for(int i = 0; i < cur.Indices.Length; i+=3)
      {
        if(SphereVsTriangle(ref sphereCenter, sphereRadius, ref cur.Vertices[cur.Indices[i+0]], ref cur.Vertices[cur.Indices[i+1]], ref cur.Vertices[cur.Indices[i+2]]))
          return true;
      }
    }

    // "Recurse"
    if (cur.Left != null)
    {
      if (prev == cur.Parent) // If this is the first visit to the node, recurse left.
      {
        prev = cur;
        cur = cur.Left;
        continue;
      }
      if (prev == cur.Left) // If this is the second visit, recurse right.
      {
        prev = cur;
        cur = cur.Right;
        continue;
      }
    }

    // If there are no child nodes or prev == cur.Right, return to the parent.
    prev = cur;
    cur = cur.Parent;
  }
  return false;
}

Mission Complete

After making that set of changes, the sphere vs. mesh tests no longer bog down on the Xbox, even in a degenerate case such as when there are tens or hundreds of bullets well inside of the mesh’s area.

Getting the sphere vs. mesh test working was a great accomplishment, but as much as I thought it was already working well, it turns out that mesh vs. mesh testing was a much bigger problem. However, that’s another story for another day.

Understanding Half-Pixel and Half-Texel Offsets

For those of you not using Direct3D 9 or XNA, you can safely ignore this post (OpenGL and Direct3D 10 are immune to this particular oddity). However, if you are, it’s likely that you’ve had to deal with the dreaded half-texel offset. Today, after I don’t know how many years of using Direct3D, I came to realize that I really didn’t understand what the source of the issue was. Now that I’ve sort of gotten a handle on it, I figured I’d post it to my super new journal. Consider it a test run.

Coordinate Spaces

The first thing to note is the basic coordinate space. I’m going to be referring to texture space and clip space a lot, so I thought I’d just do a quick refresher here on what I mean (mostly to make sure you’re thinking with the same terminology that I’m using).

Clip space – the post-projection half-cube of space where X,Y in [-1..1] and Z in [0..1]
- X = -1 is the far left edge of the screen
- X = 1 is the far right edge
- Y = -1 is the bottom edge
- Y = 1 is the top
- Z = 0 is near (the near plane)
- Z = 1 is far (the far plane)
Texture space – the area where u,v in [0..1] on a texture map making up a single end-to-end repeat of the texture.
- 0,0 represents the upper-left coordinate on the texture
- 1,1 is the lower-right.

For simplification, we can ignore the Z direction in clip space, since this article is really about the screen-space aspects of it.

Texture Sampling

Texture sampling is pretty straightforward.

In this diagram (and future diagrams), I’m using a 4×4 texture. The [0..1] range in both X and Y map exactly to the range of the texture. Note, here, that the center of the upper-left texel does not lie at 0,0; instead, 0,0 refers to the upper-left corner of that texel. This is an important thing: if you are using bilinear filtering and you want to sample color data from a single texel, you need to sample from the center of the texel instead of the corner. To do that, you’d sample from a half-texel in (that is, for this 4×4 texel, you’d want to sample from (0.125, 0.125), because a texel’s width in texture space is .25 (each texel is 1/4 of texture space), and half of that is 0.125).

A side effect of this is that, if you have bilinear sampling and wrapping on, sampling along the left edge gives you exactly the same data as sampling along the right edge at the same y coordinate. That is, in both cases, you get a perfect blend of the texels on either end of the texture. This is true for the borders between texels in general: sampling where the “dot” is on the image above is the only way to get a pure sample of a texel with bilinear filtering. Anywhere else will be a blend between two. Edges (the lines) are where the two texels neighboring the edge are weighted evenly.

The upshot of this is that the official center of a texel is located where the center SHOULD be: half of a texel from the edge. However, as we’ll see in a moment, the rule for sampling texels does not apply to pixels.

Pixel Coordinates Are Weird And Unintuitive

When referring to “pixel coordinates” vs. “texel coordinates,” pixel coordinates are, for lack of a better way to explain it, the coordinates used when rendering to the screen (or a render target). The pixels are the destination coordinates.

This is where clip space comes in: When you render your geometry, it goes through the gauntlet of matrices, ending with the projection matrices, and ends up in clip space (with x and y both being in [-1..1] for the visible area of the screen).

This diagram is the way that clip space maps to pixels. note that the upper-left clip space coordinate (-1, +1) is actually RIGHT ON the pixel center. This is the root cause of the whole offset problem.

Basically, say you have a screen-sized texture. if you draw it to the screen using a full-screen quad (clip space from upper-left (-1,1) to lower-right (1, -1)), the upper-left pixel would sample the very upper-left corner of your quad, which would give you texture coordinates of (0,0). But remember: that’s not the center of the texel! It’s actually the upper-left corner of the texel. With bilinear filtering, instead of getting a pure sample of your texture, you have a perfect blend of all 4 surrounding texels.

In simpler terms: the half-pixel/half-texel offset exists because there’s a discrepancy between how the centers of pixels and the centers of texels are computed.

To fix this, you could simply offset your texture coordinates by a half-texel before sampling (that is, when your shader gets its uv coordinate, you’d add a half texel ( (.125, .125) for our 4×4 example) before sampling, which would then give you a perfectly lined-up texture sample, and you’d get a great 1:1 mapping of pixels to texels.

However, there’s another problem. What happens when you’re using multisample anti-aliasing (MSAA)?

Weird Pixel Centers and MSAA

So, we now know that the center of the origin pixel is actually the upper-left coordinate of clip space. This causes problems when MSAA is turned on. With MSAA, the geometry is sampled multiple times per pixel, using a set of points that exist within the pixel’s square (while a pixel is not really a square, but a point sample within a square, MSAA effectively treats a pixel as multiple samples in a square).

As an (important) aside: with MSAA on, textures are still only sampled once, from the pixel center. MSAA does not affect the UV coordinates you get for a given pixel, it merely affects geometry coverage.

Continuing with our example, here’s a hypothetical, simple MSAA strategy wherein a pixel is sampled four times (4x MSAA) in an aligned grid:

This diagram illustrates the issue with the standard [-1..1] full screen quad, D3D’s choice of pixel center, and MSAA. Even with a full-clip-space (and, thus, traditionally full-screen) quad, along the top and left edges, some samples no longer touch the quad, and thus the edges are not entirely filled in (a pure white quad drawn to a render target that had been cleared to black would have grey edges along the top and left).

You’ve likely seen this in a ton of games: when you turn on full-screen anti-aliasing, there’s a weird line down the side of the screen during fadeouts and the like. This is the pixel offset problem rearing its head.

How do you fix this? Turns out, it’s simple. When drawing a full-screen quad, instead of adding a half-texel to the uv coordinates, you should instead SUBTRACT a half-pixel from the position (not the uv coordinates). This shifts the quad so that it lies perfectly within the grid in the diagram, so that every MSAA sample hits the geometry. As an added bonus, it means that the UV samples that you get will already be in the texel centers; no more adding a half-texel required!

Note, that when I say “subtract a half-pixel from position,” what I mean is “move the position of the quad one half-pixel towards the upper-left. To do this, you actually add (-1/width, +1/height). The reason for the signs (-, +) is that you’re moving left and up. In post-projection space, -x is left, and +y is up. The reason that it’s 1/width instead of 0.5/width is because, in clip space, the coordinates range from -1 to 1 (width 2), and not from 0 to 1 (width 1) like they do in textures, so you need to double the movement to account for that.

Of course, if you’re drawing world geometry and using its screen coordinates to index into a texture map (like, for instance, if you’re using light pre-pass rendering), you’ll still want to add the half-texel instead. In fact, here is a great reference on how to handle this case.

Take Us Home, Article

The half-texel/half-pixel offset is a bizarre feature in Direct3D 9 (and, by extension, XNA). In order to properly handle it when using full-screen-sized textures:

When rendering a full-screen (or otherwise screen-aligned quad), subtract a half-pixel’s size from the output vertex position from your vertex shader. This will ensure both that the texel centers line up with the pixel centers (for proper texture sampling) AND that the quad will play nice along the left and top edges of the screen with MSAA.
When rendering normal geometry, the geometry is already in the proper place (i.e. it already plays fine with MSAA). Consequently, you should add a half-texel to the uv coordinates for your full-screen sample. This will allow you to sample from the texel centers as desired.

While this article refers mostly to full-screen effects, this information is generally more useful when downsampling textures, as you need to know where your sample points on the source texture will actually hit when rendering (and you’ll want bilinear filtering for, for instance, a 4-tap 4×4 average downsample).

Hopefully, if you were as confused by the half-texel offset as I was, this helps clear things up.

Additional references from MSDN:

Directly Mapping Texels to Pixels (Direct3D 9)

Coordinate Systems (Direct3D 10)