Fuzzing Dose Response

programming, rust

This is a culmination of multiple things that finally came together in Dose Response. Long story short, I’ve written a script that plays the game and helps me detect crashes and issues with the replay functionality.

Replay

Since its earliest development, Dose Response supported a deterministic replay. When you start a new game, a replay file is created. It contains a number representing the random seed for that world and all commands the player does. The game is then able to replay whatever happened and always arrive at the same destination.

This has been one of the best decisions I’ve made in this project.

The game is procedurally generated, so if it happens to crash or bug out due to a rare combination of events, I don’t have to hunt around trying figure out how to repeat this. I replay it and get a perfect reproducer of the bug.

Most of my debugging and cosmetic work happens in a make run / poke around / make replay / make changes loop. I’ve added debug information, key shortcuts to pause the replay, step through it turn by turn or run it with no delays. This works perfectly.

Except when it doesn’t.

The Problem

At its core, Dose Response is a realtime game. I’ve always wanted to have animations that run regardless of whether the player moves or not. The most notable of these is the intoxication animation where each tile starts oscillating between two colours, producing an almost hypnotic effect.

This means that even between the player and monster turns, there’s game code being processed. And unless the replay records the precise delay between the pressed keys, there is a fundamental disconnect between the replay and the original game. The player can take multiple seconds between key presses, while the computers can process one every display frame.

And fundamentally, I really like the idea of having the entire gameplay defined by a single number (the seed) and the key presses.

This led to situations where the replay got derailed from the original gameplay, losing all its advantages. At the beginning, these issues happened rarely and were relatively easy to fix, but things got worse as the game became more complex.

It all really blew up when I made the world “infinite” – instead of being generated and processed all in one go, it’s split into chunks that are created as needed. And if these come in in a different order or different time, things derail really quickly.

No more

Of course, since I wasn’t replaying every game I played, I tended to notice the replay desynchronisation when something bad happened only to find out that the replay was useless.

I’d wanted a way to verify the replays automatically for a really long time, but it always seemed like too much work and resulted in a ton of weird questions (wait so how is the game going to be controlled? Do I have to write an AI that knows how to play it? What if we change the game’s systems? It’s still being developed after all and so on).

The last time this has happened (about a month ago) I started thinking about how to solve this once and for all.

Validating the replays

The first step was to add more info into the replay log that can be used to detect when exactly does the replay diverge. Initially, this was just the turn count, the player position and the number of the world chunks currently generated. Later on, I’ve added the positions of all the monsters in the world.

This in an of itself was a huge step forward. One of the problem with debugging replay desync was that it relied on my memory. I’d play the game, then ran a reply, trying to remember whether it matched the reality. Sometimes this was easy, but other times – especially if the desync became obvious only later in the game – it was not.

With the verification in place, we could figure out that a desync happened automatically and get a better information about what exactly went wrong and when.

Drunkard’s walk

But what I wanted ultimately was a script I could run that would just play and verify a game after a game. I’d let it run against a release candidate build and be reasonably confident that the replay still works. It would also be useful for verifying that any changes I did to fix the replay issues actually worked.

But I didn’t want to spend a year turning this into an AI research project.

So my first thought was: can we just generate a “gameplay” file with a bunch of random commands and delays, have the game interpret it and see the results?

I worried that this wouldn’t get us very far, because a lot of the areas in the game are not walkable, so the “player” would not only engage in a drunkard’s walk, but most of it would be spent bumping into the walls (i.e. what me being drunk would probably look like).

The benefits to the solution were that it would be quick and easy to implement. So I gave it a go: a tiny Python script that wrote commands and in a file and an option to read this and treat it as an input instead of the keyboard in the game.

We’re often taught not to give in to our intuitions and instead experiment. Try things out, do careful measurements, collect data and then make conclusions. Because more often than not, our intuition towards complex systems ends up being wrong.

It was totally dead on in this case, though. Even after 300 commands, the player barely drifted further than 10 tiles from their starting position.

Luckily, this was really easy to write (and scrap) so there was little time wasted and if it were enough, the wins would have been huge.

Backends to the rescue

So turns out, the script that generates the input needs to be able to know at least a little of what’s going on. This complicates the idea somewhat (and was why I was putting this off for such a long time) – we’ll need a two-way communication between the test script and the game. This means changes to the game code that could produce other bugs and generally not really be useful for the game itself.

When I was thinking about it this time around though, I’ve realised that all the difficult pieces are basically there already!

A few months back, I’ve decided to replace libtcod with opengl as the rendering backend. But instead of just ripping it out, I wanted to be able to switch between them to compare them side by side. In addition, I’d like to provide a traditional terminal-based backend for people who can’t or don’t want to use the graphical one. This would also make the game accessible to blind people who use screen readers to play.

And so it happened that the game already had a way of working with isolated modules that provided keyboard input and handled the rendering output.

So all we needed to do was to add a new backend that was talking to the Python test script and passed its inputs as keys and the rendering data as surroundings information.

This was more complicated than the “generate all input upfront” idea, but not by much and it wouldn’t affect the main game’s code. Instead, it would be just another input/output backend that can be compiled out without the rest of the game caring.

So all I had to figure out was a good way of having two processes talk to each other (the game is in Rust, the script is in Python although I would have used whatever worked). Turns out, there’s this thing called ZeroMQ that’s easy to learn and that’s designed pretty much for this purpose!

Sober but simple AI

Putting this all together wasn’t that difficult. Since this is all testing code, I mostly focused on getting something that works done. This should be mostly a write-once-reap-benefits-forever type of scenario so whatever works in the least amount of time is good.

I still wanted to avoid writing a complicated AI that would try to play the game. It’s much more important that it moves around the world and does stuff.

So this is the AI’s entire behaviour:

If an adjacent tile has food on it, walk there
If an adjacent tile has a dose on it, walk there
If you can walk in the same direction as before, walk there 75% of the time
If it’s blocked but the adjacent tile in the same direction (e.g. you want to go north which is blocked but north-west or north-east are okay), go there
Pick a random direction otherwise

This works surprisingly well. At a glance, the replays look a lot like the ones I create when actually playing the game. With a small but significant difference:

The AI lasts about 10 turns before dying of delirium tremens, overdose or being killed by a monster. Oops.

This is a roguelike. You don’t get far if you’re not careful.

Which brings us to the final question: how can we make the AI player last more than a handful of turns without going back to needing a university research grant?

Superbot

Answer: by cheating. Being a robot in this world is not easy.

Epilogue

And that’s it. I wrapped the “AI” code in a loop that compiles the game, runs it, waits for it to end (or detects a crash), runs the replay, records the results and prints them out at the end.

Then ran it, fixed the bug, found out there’s been another, subtler bug underneath, scratched my head a little, realised what was wrong, banged my head on the wall and fixed it, too.

And finally, ran the script again on a fine Saturday morning, got back to it the next day and found that all 250 cases passed \o/.

If you’re interested, the full test script is in scripts/fuzz.py and the game counterpart is is src/engine/remote.rs.

by Tomas Sedovic on 9th April, 2017