One of the things I learned from developing my first game was that a lot of the work involved is very tedious. It’s essentially busy-work. Most of that involved writing the code to do the menus, lining things up so they looked good, and tweaking the pace of the game.
There’s another class of work involved in game programming – the hard stuff. This is my technical term for the parts where you don’t know how to do it but you need to do it anyway. What a great definition, huh?
This game is a word game, and its central element is words. The basic gameplay mechanic requires us to check whether a given “word” is in fact a valid one. It is essentially the same problem a spellchecker solves (although we aren’t interested in finding suggestions). This lookup needs to be very fast against a very large word set.
I was able to find a free word list online that contained some 172,000+ words. That is an extremely large number of words. I started thinking about what data structure could handle this well, and I settled on a DFA (deterministic finite automaton) as the best way to represent the words.
Here are some other requirements I thought of:
- The word list can dynamically change over the course of development. I want to be able to add or remove words and re-compile.
- We need to be able to take a string of some text and see if it is a recognized word. We need to be able to do this as quickly as possible.
- We need the storage of the words to be very compact so that it doesn’t bloat my game and doesn’t take a long time to load.
All of these things can be solved by tapping into the content pipeline capabilities of XNA Game Studio. I decided that I would create a custom content importer and processor.
The importer opens the text file containing the raw word list. It trims any whitespace, uppercases every word, and removes duplicates to ensure the words are all nice and clean. It then stores the word list in a managed object to pass to the content processor.
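The cleanup step amounts to very little code. Here is a minimal sketch in Python for illustration (the actual importer would be C# inside the content pipeline, and the function name is my own):

```python
def clean_word_list(lines):
    """Normalize a raw word list: trim whitespace, uppercase,
    and drop duplicates."""
    words = set()
    for line in lines:
        word = line.strip().upper()
        if word:  # skip blank lines
            words.add(word)
    return sorted(words)  # sorted output keeps builds reproducible

# Messy input collapses to a clean, unique, uppercase list:
print(clean_word_list(["  cat\n", "Dog", "CAT", "", "dog "]))
# → ['CAT', 'DOG']
```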
The content processor is where most of the hard work is done. The DFA is built here. The DFA is basically just a list of integers encoding the nodes and a list of edges from each node. Each node encodes the letter contained at it, a flag indicating if it is a word terminator, and other required information. The edges then connect the nodes together. Nodes are shared as much as possible so that common prefixes and suffixes are eliminated. This leads to a very compact representation of the word list.
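To make the idea concrete, here is a toy sketch of the structure and the lookup walk, again in Python for illustration. For simplicity it puts letters on the edges rather than in the nodes, uses objects instead of packed integer arrays, and only shares prefixes; the real processor also merges identical suffixes to shrink the automaton further. All names here are my own:

```python
class DfaNode:
    """One node: a terminal flag (can a word end here?) and
    edges to child nodes keyed by letter."""
    __slots__ = ("terminal", "edges")
    def __init__(self):
        self.terminal = False
        self.edges = {}

def build_dfa(words):
    """Build a prefix-sharing automaton from a clean word list."""
    root = DfaNode()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, DfaNode())
        node.terminal = True
    return root

def is_word(root, text):
    """Walk the automaton one letter at a time; text is a valid
    word only if every letter has an edge and we stop on a
    terminal node."""
    node = root
    for ch in text:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.terminal

root = build_dfa(["CAT", "CATS", "CART"])
print(is_word(root, "CAT"))   # True
print(is_word(root, "CAR"))   # False: a prefix of CART, not a word
```

Note that lookup cost depends only on the length of the query, not on the size of the word list, which is what makes this representation fast even with 172,000+ words.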
At the end of the processor’s work, it stores the data in a temporary object that then gets passed along to the ContentTypeWriter. I just take the nodes and edges data and write it to disk as arrays of integers.
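The on-disk format can be as simple as two length-prefixed integer arrays. A round-trip sketch of that idea, with Python's `struct` module standing in for the ContentTypeWriter/ContentTypeReader pair (the little-endian int32 layout is my assumption, not the XNB format):

```python
import struct

def write_int_arrays(path, nodes, edges):
    """Write the node and edge arrays as length-prefixed
    little-endian int32 blocks."""
    with open(path, "wb") as f:
        for arr in (nodes, edges):
            f.write(struct.pack("<i", len(arr)))
            f.write(struct.pack(f"<{len(arr)}i", *arr))

def read_int_arrays(path):
    """Read the two arrays back in the same order."""
    arrays = []
    with open(path, "rb") as f:
        for _ in range(2):
            (n,) = struct.unpack("<i", f.read(4))
            arrays.append(list(struct.unpack(f"<{n}i", f.read(4 * n))))
    return tuple(arrays)
```

Loading at runtime is then just two bulk reads into arrays, with no per-word parsing, which is part of why startup stays fast.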
On my PC, it takes about 55 seconds to process the entire word list (about 172,500 words in total). Since this is compilation time, we don’t really care how long it takes. What I really care about is how long it takes to load at runtime and how fast word lookup is. Luckily, this representation leads to both a small footprint and fast lookups. In fact, with XNB compression turned on, practically the entire English language is represented in just 315K! Although I haven’t done any clocking, the lookup time is very snappy too. I imagine that thousands of lookups can be performed per frame if necessary (and that might be necessary for the AI later down the road…).
Perhaps I will post more about this DFA business later when I have more time.