networking – JoshJers' Ramblings

Networking Is Hard (Part 3)

In my previous two posts, I started looking into what it would take to code the networking for my game, and came up with a first draft, before realizing that floating-point discrepancies between systems totally threw my lockstepping idea for a loop.

Lockstepping With Collisions

In order to solve the issue with different systems having different floating-point calculation results, I decided to somewhat revamp the network design, and really leverage the fact that I don’t care so much about cheating – you could never get away with a networking scheme like this in a competetive game.

First, the timing of the scrolling, player bullet fire rates, enemy fire rates, etc were all modified to be integer-based instead of float-based. There are no discrepencies in the way that integer calculations happen from machine to machine, so the timings of things like enemy spawning, level scrolling, etc. are now all perfectly in sync, frame by frame.
Next, when the client detects a collision between entities, it sends a message to the server (which, you may recall, is running on the same system as the client – each machine gets one) notifying it of a collision. These messages are also synced across the network.
Thus, whenever an enemy does on any one client, it dies on both servers on the same tick (that is, the “first” client relative to the game clock to detect a collision determines when the collision took place). This means that there are no longer any timing differences between deaths on each server (and if a collision happens to be missed by one client, but hits on the other, it will eventually reach the other client).
Because players were already sending “I died!” flags as part of their network packets, these were already always perfectly in sync, so no change was needed there.

As an added bonus, since all collision detections are now handled by the client (and communicated to the server), the server never has to do any collision detection calculations on its own, which eases up on the CPU load somewhat (previously, the client and server were both doing collision calculations). All the server has to do now is apply collisions reported by either the local client or the remote client.

So now, the actual mechanism by which the game keeps in sync from system to system is set, but how does it handle the three main enemies of the network programmer?

Network Gaming’s Most Wanted #1 – Latency

“Lag” is one of the most dreaded words in the network gaming world. It’s always going to be present – nothing can communicate across the internet faster than the speed of light (and, because of transmission over copper, it’s really more like a sluggish 2/3rds the speed of light!). Routers and switches also add their own delays to the mix. According to statistics gathered by Bungie from Halo 3’s gameplay, most gamers (roughly 90%)end up with a round-trip latency of less than (or equal to) 250ms. That is, it takes a quarter of a second for data to go from System A to System B and back to A. That’s a long time for a fast-action networked game! Thankfully, because messages sent from system to system in this game’s network design are never dependent on messages from the other, nothing has to round trip, so the latencies can effectively be halved, making the system much better at handling lag (because, quite frankly, there’s just less of it!)

As discussed previously, because the client can run ahead of the server and, thus, process local player input immediately, there’s no latency in what the player presses and what actually happens on-screen. But what about how the remote player’s actions look? With a ping under 100ms, there are next to zero visible discrepancies on the system. That is, low-ping games are virtually indistinguishable from locally-played games.

At around the 400ms ping mark, it does start to become obvious that things aren’t quite right – due to the interpolation of remotely-shot bullets, they accelerate up to a certain part of the screen until they reach their known location then slow down to normal speed, which is fairly noticeable (I’m still trying to smooth this out a touch). When enemies get too close to the remote player as it fires, due to the delay, the bullets will collide with the enemy, but the enemy will live longer than it appears it should (because the local client does not reliably know that the remote bullet is actually still alive, it can’t deal actual damage to the enemy, it has to wait for the server to confirm).

Above 1-2 seconds of latency, all bets are off – the local player will find the game still perfectly playable, but the movements of the remote player will be completely erratic, and remotely fired bullets won’t act at all like they should. But, since 90% of gamers have much lower latencies, this is not really an issue. For the majority of gamers, the game will look and play pretty close to how it would if both players were in the same room.

Network Gaming’s Most Wanted #2 – Packet Loss

Latency’s lesser-known brother is packet loss, which is where data sent from one machine to another never makes it (due to routing hardware failure, power outage, NSA interception, alien abduction, etc). On a standard internet connection, you can generally assume that about 10% of the packets that you send will get lost along the way. Also, just because you send Packet A before Packet B doesn’t mean they’ll arrive in the right order – a machine might get a packet sent later before one that was sent earlier.

With the XNA runtime, there are four different methods that you can send packets with (obviously you can mimic these with any networking setup, but I’m using XNA so it’s my frame of reference here):

Unreliable – the other system will get these in potentially any order, or it may not even get them at all. The name says it all – you can’t rely on these packets. This is probably not the best option to use.
In-Order – These packets are for data for which you really only need the most recent data; you only care about the most recent score, for instance – not what the score was in a previous packet. Thus, these packets contain extra version information so that the XNA runtime can ensure that packets that arrive out of order don’t reach you. As soon as a new packet comes in, it becomes available to the game. If a packet that’s older than the most recent one comes in, it’s discarded. You immediately get new ones at the cost of never getting older ones. For many games, this is a perfect scenario.
Reliable – These packets will always arrive. When the XNA runtime receives one of these, it sends an acknowledgement to the other system that it received it. If the system that sent it doesn’t receive such an acknowledgement, it’ll resend (and resend and resend and…) until it finally arrives at the destination. Packets sent reliably are not vulnerable to packet loss; if you send it, as long as the connection remains valid you know it will reach the destination eventually. However, these packets may not arrive in the proper order (you may receive Packet C before Packets A or B).
Reliable, In-Order – On the surface, this sounds like the best choice! These packets always arrive in the right order, and they always arrive! That is, you will always get Packets A, B, C, and D, in that order. There’s a hidden downside, though: If the game recieves Packet C, but has not yet received Packets A or B, it has to hold onto that packet until both A and B arrive, which, if they need to be resent, can really ratchet up the latency. Any one packet that needs to be resent will hold up the whole line until it arrives. Clearly, this type of packet should only be used when absolutely necessary – for normal gameplay, it’s better to use In-Order or Reliable on their own.

Eventually, I decided to send packets in the Reliable way, but not In-Order. But, to minimize the amount that the game has to wait for resent packets to arrive, each packet contains eight frames worth of input/collision data. That way, as long as one out of every string of 8 packets arrives, the server will have all of the relevant information to sync up to that point. And if, for some reason, 8 packets in a row are all lost in transmission, they’ll be resent and make it eventually anyhow.

To handle this, the game essentially has a list of frames that it’s received data for (8 of which come in with each packet).

For each frame that a packet contains, if the frame already been simulated by the server (a frame from the past), that frame is ignored.
Similarly, if the frame is already in the list, it’s ignored.
If it’s not a past frame and it isn’t already in the list, add it to the list (in order – the list is sorted from earliest to latest).
After this is done, if the next frame that the server needs to simulate is in the list, remove it from the list and go! Otherwise, wait until it is.

The game doesn’t care which order the frames are received in – as long as it has the next one in the list, it’ll be able to continue on. Because of the redundancy, it rarely has to wait on a resend due to packet loss. In fact, using XNA’s built-in packet loss simulation (thank you, XNA team!), the packet loss has to be increased to over 90% before the latency of the simulation starts to increase (hypothetically, the magic number is above 87.5% packet loss – greater than 7/8 packets lost).

The disadvantage of this system is that it does add to the bandwidth use, as each packet now contains an average of eight times as much data as it would normally, which brings us to…

Network Gaming’s Most Wanted #3 – Bandwidth

Ah, bandwidth. There’s no point in having a low-latency connection between two systems if the game requires too much bandwidth for the connection to keep up. Because the Xbox Live bandwidth requirement is 8KB/s (that’s kilobytes), that became my goal as well.

This is where I overengineered my system a bit. I was estimating, as a worst case, an average of 10 collisions per frame. With packet header overhead, voice headset data, and everything, with 8 frames worth of data in each packet, I expected to be just BARELY below the 8KB/s limit.

When I finally got the system up and running, it turned out the game was using less than 4KB/s. The average amount of collisions per frame is closer to TWO than it is to 10 (unfunny side note: I had the right number, but the wrong numerical base. My answer was perfect in binary), even with a lot of stuff going on (though an individual frame may have many more, there are usually large spaces between collisions as waves of bullets smack into enemies). The most I’ve been able to get it up to with this system is about 5KB/s, which means the game still has a delightful 3KB/s of breathing room. I think I’ll keep it that way!

Final Remarks

Hopefully this has been an informative foray into the design of a network protocol for an arcade-style shoot-em-up game. I’m no network professional – in fact, this is the first network design I’ve ever done, so I’m sure people who do this stuff for a living are laughing at my pathetic framework. If anyone has any suggestions as to how I might improve my network model, I’m all ears – while it works pretty well, I’m always open to ideas!

Networking Is Hard (Part 2)

In my previous post, I weighed the advantages and disadvantages of my game vs. a standard FPS with regard to networking. After doing so, I came up with an initial network design.

The biggest issue, personally, was dealing with the causality of the network game. Each player gets information about the other with a delay, so neither player is ever seeing exactly the same set of circumstances (perfectionism and network lag don’t mix very well). I think Shawn Hargreaves describes it best in one of his networking presentations: he says to treat each player’s machine as a parallel world. They’re not exact, but the idea is to try to make them look as close as possible. It doesn’t matter if things don’t happen exactly the same, but you don’t ever want the following conversation to occur:

Player 1: “Wow, can you believe I killed that giant enemy at the last second? That was amazing!”

Player 2: “…What giant enemy?”

Procyon’s Initial Design

The design I started with was a hybrid of both the peer-to-peer lockstep and client-server models.

networkmodel1

Basically, within each system there is a client/server pair. In addition, each server runs in lock-step with the other, so that they both always tick with the exact same player inputs for a given frame, keeping the two machines’ servers perfectly in sync. Since the only data being sent across the network is player inputs, bandwidth use was ridiculously low. And since nothing ever has to round-trip from the client to the server and back, the system still gets the effectively-halved-ping that a pure lockstep setup gets. Also, because the client and the server are running on the same machine, communications between the two (mostly, the server notifying the client of events that happened based on the other player’s input) becomes very trivial – the server can just call callbacks to the client, and no sort of RPC mechanism is necessary.

Ignoring the clients for a moment, the servers are a very traditional lockstep system. Each normal tick, the player input is polled and then sent across the network. Once the server receives the input from the remote system for the next frame, it ticks forward. As you can imagine, each machine’s server is a bit lagged, because it has to wait for inputs across the network. That’s where the client comes in.

Each system also runs a client, which runs ahead of the server. This client always ticks every (in Procyon’s case) 60th of a second, processing the local player’s inputs right away, and running prediction on the remote player based on the last set of inputs and the last known position that the server knows about. That way, the local player’s ship always reacts right away to player inputs (on the client, which is represented on-screen).

Essentially:

The current local player’s input is polled and sent to the server.
The input is also sent immediately across the network to the other machine.
The client ticks using this input – the position and actions of the remote player are predicted based on the last-known input and position from the server.
If the server has the remote input for the next frame that it has to do (which is always an eariler frame than the client), it also ticks (no prediction necessary on the server, as it has up-to-date information about both players for the given frame it’s simulating). Again, there is a server on both machines, but they’ll both end up with the exact same simulation.

(Mostly-) Deterministic Enemies

Here’s where one of the advantages of the game comes into play. In my last post, I mentioned that enemies are deterministic in their movements. This is actually a key part of the networking. What it means is, there is no prediction required for simulating enemy behaviors ahead of the server. For instance, say that the client is currently 5 frames ahead of the server. While the client is ticking frame 450, the server is only on 445 (because it hasn’t received network input for frame 446 from the other machine yet). Even though the server hasn’t simulated enemy movements yet, their movements are very strictly defined, so the client doesn’t have to guess – it knows exactly where an enemy is on frame 450. Consequently, the client running ahead of the server is not a problem with regards to enemy positions – when you see an enemy in a specific place on the screen, you know that’s exactly where it will be on the server when the server finally simulates that same frame.

Unless the other player kills it first.

That’s right – there is exactly one case in which an enemy’s behavior is non-deterministic, and it’s most-easily described as follows: your local client is currently simulating frame 450, the server has simulated frame 445. You see a large cannon ship start to charge up its beam weapon for a massive attack. However, the server gets remote player input from frame 446 (the next frame it has to simulate), and when simulating it, realizes that the other player dealt the killing shot to the ship. Suddenly, the client’s view of that enemy is wrong – it should have died four frames prior.

In essence, the only way that a client’s view of an enemy is ever wrong is if the other player has killed it in the past, and the server hasn’t caught up. This is a very important property: the only time, ever, that an enemy is not where you see it as is when it’s not anywhere at all (because it already died). However, this is where it starts to get tricky.

Most of the time, it’s not a big issue. The client kills the enemy as soon as the server tells the client that the enemy should be dead, and it just dies a few frames too late. But when the enemy has fired bullets (or any other type of weapon), suddenly there’s a problem. What has to happen in that case is that the client has to remove the bullets that shouldn’t really have been spawned. With reasonable pings, the bullets will generally be close enough to the enemy’s death explosion that they won’t really be noticeable, they’ll blink up in the middle of the explosion and disappear by the time it’s done. With larger lag times, though, bullets for dying enemies might spontaneously disappear. Ah, lag. How networked games love you so.

The real problem, though, is when, using the frame numbers above, you crash your ship into an enemy on tick 450 that actually died on tick 446. What then? You crashed into something that shouldn’t even have been there! After much internal debate, I decided there’s a really simple way to arbitrate this: on your screen, you crashed, so you still get the consequences. Even though you hit something (either a should-have-been-dead enemy or a bullet spawned from such an enemy) that shouldn’t have even been there, you still ran into something on your screen, and still pay the price. So, in addition to player inputs that get sent to the server (and across the network), the client also sends a flag that says “I died!”

As an added bonus, the server no longer needs to calculate collisions against either player – when a player dies, the client signals it (This plays into another advantage of my game – in a competetive multiplayer game, you would never, ever trust a client machine to tell you whether or not its player got hit by something).

Random Number Generation

One issue has to do with random number generation. Obviously, when connecting the two machines across the network, the machines have to agree on a random number seed so that random numbers generated on each are the same. In fact, not only do the machines on either side of the network have to have the same random number seeds, but the client and server also have to start with the same seeds, or they’ll get out of sync (a machine getting out of sync with itself is always funny).

However, what happens when an enemy (Smallship 5) shoots a bullet in a random direction, and then dies “in the past” based on a server correction? On the server(s), this random number generation never happened. On the client, however, a bullet was shot, and a random number was created. With a single random number generator, that means all of the random numbers from that point on on the client are going to be incorrect. Thankfully, there’s an easy way around that.

Give every enemy its own random number generator!

I found a very lightweight (and very high-quality) random number generator: Marsaglia’s mutliply-with-carry generator (or, if you prefer, the infinitely more difficult-to-read Wikipedia version). This generator only requires two unsigned integers for each generator to do its magic, so there wasn’t much overhead to hand one of these off to each enemy. So, the initial random number seed is decided upon before the level begins to load. As the level begins to load, and enemies are created, a new generator is created for each entity, using random seeds generated from the original generator. This way, each enemy gets its own generator, and if an enemy ever generates any extra random numbers before it dies, it doesn’t affect any of the other enemies at all! Problem solved.

Remote Player Bullets

Again, pretend the client is ticking frame 450. When the server reaches 445, it notices that the remote player fired a bullet. So it notifies the client: “Hey, 5 frames ago, the other player fired a shot!.”

The client knows that it needs to spawn a new remote player bullet.
It also knows exactly where that bullet will be at the current tick (since it was spawned at tick 445, but it’s now tick 450, it can tick the bullet forward 5 frames).
It does not, however, know that the bullet will actually survive until tick 450. It’s possible that anywhere from tick 446 until 449, that it might hit something.

Even though it knows exactly where the bullet should be, it doesn’t display it there. It actually starts the bullet in front of its current estimate of where the remote player is (the predicted position), and interpolates it into the correct spot. That way, remote player’s bullets don’t seem to appear way out in front of the remote player, they still start right where they seem like they should.

The client knows exactly where the bullet should be, but not that it’s actually there (because it might have died on, say, tick 448). So while it does collision detection against enemies, if the remote bullet hits an enemy, it doesn’t actually do any damage on the client – it just removes the bullet. If it turns out it dealt damage after all, the server will eventually send a correction to the client.

However, eventually, the server reaches frame 450 (when the client first learned about the bullet). If the bullet is still alive on the server, then we know that it never hit anything from frame 445 until then, so it was a live bullet when the client found out about it. Also, if it’s still alive on the client (it didn’t collide with anything while the server caught up to frame 450), that means the client knows that the bullet is still alive.

Now that it knows that it has a bullet that is guaranteed to still be alive, and is at the exact position that it’s supposed to be, the bullet can flip from being treated as a remote bullet to being treated exactly like a locally-fired bullet. Basically, this bullet now will deal damage and act exactly like a bullet fired by this machine’s own player! It’s one less thing that will act different on the other machines, further strengthening the illusion that both players are seeing the same thing.

Floating Points Are Sharper Than Expected

However, there was one big issue with this entire design: floating-point errors. Due to minute differences in the way floating point numbers are calculated on different systems (due to different CPUs, code optimizations, quantum entanglement, etc), the collision code on each system acted slightly differently. Consequently, the two servers running in lockstep weren’t actually in sync. This was causing all sorts of issues – slight timing differences on enemy deaths caused discrepencies in scores and energy amounts…and the real kicker is that it was possible that two objects would collide on one system and never collide at all on the other simulation (a grazing collision on one might not trigger as a collision on the other system). Added bonus when an enemy died on one server just after shooting a ton of bullets, but just before shooting them on the other server. Suddenly, one player’s screen would be full of enemy bullets, and the other’s would be clear as the summer sky.

This became a real problem, and it took a while to solve it (again, perfectionism and networking don’t mix)…and next time, I’ll present the solution.

Networking Is Hard (Part 1)

I haven’t had much development time in the last (nearly two) month(s), but I have had time to nearly finish up the networking code for the game, so I thought I would describe some of the work that went into the design of the system.

Resources

I tried to find some references on how most shmups (shoot-em-ups) such as my current game handle their networking, but I didn’t really find anything. There are a lot of resources on how RTS games do it (such as this great article about Age of Empires’ networking), and even more articles about how to write networking for an FPS game (the best of which, in my opinion, are the articles about the Source engine’s networking and the Unreal engine’s networking). There’s also a great series of articles on the Gaffer on Games site. And no list of references would be complete without the Gamedev.net Multiplayer and Networking forum.

One of the most seemingly-relevant things to my game that I read was that, in current system emulators (for instance, SNES emulators) that have network play, they work in lock step with each other…that is, each system sends the other system(s) its inputs, and when all inputs from all systems have arrived, they can tick the simulation forward. I did an early experiment to see how well this would work with my game, and it turns out, not so well – the latency introduced with even a moderate ping (100ms) is rather prohibitive. The player would press up on the gamepad and have to wait for the ship to start moving…and it would only get worse with higher latencies! Obviously, this wouldn’t work.

Since there were many articles about the FPS model of networking, I opted to use it as a starting point, as my game is also a fast-action game. First off, I decided to make a (partial) list of the advantages and disadvantages of my game type vs. the FPS model:

Advantages of Procyon vs. an FPS

Only two players are supported, which greatly simplifies the bandwidth restrictions and the communication setup.
There’s no need to support joining/quitting in the middle of a game (again, simplifying game communications and state management)
Since it’s cooperative, I’m not really worried about players cheating, so I can allow clients to be authoritative about their actions moreso than the average FPS can.
Players can’t collide with each other, so there’s no need to worry about such things (in fact, a player basically can’t affect the other player at all – a player can only affect enemies)
Enemies are deterministic in their movements – none of my enemy designs have movement variation based on the player’s actions, so collision vs. enemies in a given frame is always accurate (unless the enemy was dead “in the past” due to the actions of the other player, which is discussed later).

Disadvantages

In an FPS, the view is more limited – it’s rare that a player sees all of the action at once due to the limited field of view. However, in Procyon, everything that is happening in the game is completely visible on both screens, meaning the game generally has to be less lenient about discrepancies.
Most FPS games are “instant hit.” For instance, when you fire a pistol in a FPS game, there’s no visible bullet that streaks across the screen, so it’s easier to fudge the results on the server. In Procyon, however, all of the weapons are on-screen and visible, so fudging them in a minimally-noticeable way becomes very difficult.
In many FPS games (though not all), there are WAY less entities active at a time than there are in a shoot-em-up.
This falls into the realm of “not a lot of references for this type of game,” but many of the “cheats” employed by FPS game programmers are really well-known, but very domain-specific; with a shmup, I’d have to make up my own network fakery to hide latency.

Network Strategies

There are two main network strategies used in games (at least, in those that I researched): lockstep and client-server.

Lockstep keeps all systems involved perfectly in sync, though at the cost of making it more susceptible to latency, because the game doesn’t step until the other player’s inputs have traversed the network. Generally, every so often (say, every 1/30th of a second), the current input is polled and then sent across the network. Then, once both/all inputs have reached the local machine, it ticks the game. This is the type of networking used by emulators (out of necessity – there’s no reliable way to do any sort of prediction with an emulator anyway) and some RTS games (such as Age of Empires, linked above). RTS games can hide the latency a bit by causing all commands to execute with a bit of a delay. One interesting property of lockstep networking is that nothing ever has to round-trip across the network. That is, nothing that server B ever sends back relies on something that Server A sent previously. Because of this, the game’s latency is effectively halved, as the meaningful measurement is one-way latency, not a full there-and-back ping. Lockstep games are frequently done peer-to-peer: each system sends its inputs to each other system…there’s no central entity in charge of all communications.

Conversely, a client-server architecture (which is used by an overwhelming majority of FPS games) puts one system in charge and generally gets authority over the activities of all players. In this case, the client sends its information to the server (inputs, positions, etc), and the server sends back any corrections that need to be made (locations/motion of the other players, shots fired, things hit/killed, etc). The client can run ahead of the server based on a prediction model, where it gives its best guess as to where things are currently based on the last communication from the server. This makes game lag much less apparent (At least, in the player’s own movements), as the player’s inputs can be processed immediately on the local machine.

In the next installment, I’ll talk through Procyon’s initial network design, which is a bit of a hybrid between the client-server and lockstep architectures…stay tuned!