How to test XCOM “dice rolls” for fairness
Win-Vector Blog 2013-03-15
XCOM: Enemy Unknown is a turn-based video game where the player chooses among actions (for example, shooting an alien) that are labeled with a declared probability of success.
[Image copyright Firaxis Games]
A lot of gamers, after missing an 80% chance-of-success shot, start asking if the game's pseudo-random number generator is fair. Is the game really rolling the dice as stated, or is it cheating? Of course the matching question is: are player memories at all fair; would they remember the other 4 out of 5 times they made such a shot?
This article is intended as an introduction to the methods you would use to test such a question (be it in a video game, in science, or in a business application such as measuring advertisement conversion). There are already some interesting articles on collecting and analyzing XCOM data, finding and characterizing the actual pseudo-random generator code in the game, and discussing the importance of repeatable pseudo-random results. But we want to add a discussion aimed a bit more at analysis technique in general. We emphasize methods that are efficient in their use of data. This is a statistical term meaning that a maximal amount of learning is gained from the data. In particular we do not recommend data binning as a first choice for analysis, as it cuts down on sample size and thus is not the most efficient estimation technique.

In this article we are going to ignore issues that are unique to pseudo-random number generators, such as "save scumming" and solving for the hidden generator state.
Save scumming is noticing that the sequence of coin flips is in fact deterministic, so re-starting from a save and using up a bad flip on an event the player doesn't care about lets them move a good flip onto an event they do care about. Statisticians are fairly clever about avoiding this by ensuring that separate processes use separate random number sources, so a change in behavior of one process can't introduce a change in behavior in another by changing which random numbers the second process sees.
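As a minimal sketch of that separate-sources idea (not how XCOM itself manages its rolls), R's built-in parallel package provides independent L'Ecuyer-CMRG streams; drawing extra numbers from one stream never changes what another stream will produce:

# Minimal sketch: give each simulated subsystem its own random stream, so that
# consuming extra draws in one subsystem cannot shift the draws seen by another.
RNGkind("L'Ecuyer-CMRG")                      # a generator that supports independent streams
set.seed(2013)
streamA <- .Random.seed                       # stream for one process (say, player shots)
streamB <- parallel::nextRNGStream(streamA)   # an independent stream for a second process

drawFrom <- function(stream, n) {
  # temporarily install the given stream, draw n uniforms, then restore the old state
  old <- .Random.seed
  assign(".Random.seed", stream, envir = .GlobalEnv)
  x <- runif(n)
  assign(".Random.seed", old, envir = .GlobalEnv)
  x
}

drawFrom(streamA, 5)   # extra draws here ...
drawFrom(streamB, 3)   # ... have no effect on what this stream produces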
Solving for the hidden state of the generator is when, after watching a sequence of outputs of the generator, you collect enough information to efficiently recover the complete state of the generator. From that point forward no coin flip will ever be surprising. For example, see "Reconstructing Truncated Integer Variables Satisfying Linear Congruences," Frieze, Hastad, Kannan, Lagarias, Shamir, SIAM J. Comput., Vol. 17, No. 2, April 1988. These are indeed powerful and interesting questions, but they are too tied to computer games and simulations to apply to data that comes from real world situations (such as advertisement conversion rates). So we will leave these to the experts.
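Still, as a toy illustration of why a recovered state kills all surprise (this is a textbook "minimal standard" LCG, not XCOM's actual generator):

# Toy linear congruential generator; knowing the current state predicts all future outputs.
lcgNext <- function(state) (16807 * state) %% 2147483647

state <- 20130315            # suppose an observer has recovered this hidden state
rolls <- numeric(5)
for(i in 1:5) {
  state <- lcgNext(state)
  rolls[i] <- state / 2147483647   # a uniform-looking value in (0,1)
}
rolls   # anyone holding the recovered state reproduces these exactly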
A proper analysis needs at least: a goal, a method, and data. Our goal is to see if there is any simple systematic net bias in the XCOM dice rolls. We are not testing for a bias varying in a clever way depending on situation or history. In particular we want to see if we are missing "near sure thing" shots more than we should (so we want to know if the bias varies as the reported probability of success changes). Our method will be to test whether observed summary statistics have surprisingly unlikely values. We collected data on one partial "classic ironman" run of XCOM: Enemy Unknown on an Xbox 360. The data is about 250 rows, covers all of the first 20 missions, and can be found in TSV format here: XCOMEUstats.txt. We will use R to analyze the data.
A foremost question for the analyst is: is the data appropriate for the questions I need to answer? For example, all of this data is from a single game, so it is completely inappropriate for testing whether there is any per-game bias (some state set that makes some play-throughs "net lucky" and others "net unlucky"). There are, however, around 250 usable rows in the data set: so the data should be sufficient to test for a large unchanging bias (one assumed not to depend on the play-through, game state, or history). To test for smaller biases or more complicated theories you would need more data and to record more facts. As an aside: notice that I do not talk about a treatment and control set. I have found that a slavish experimental set-up (I won't call it design) that always appeals to "treatment and control" is absolutely no substitute for actually taking the responsibility of thinking through whether your data actually supports the type of analysis you are attempting. Just because you "have a control" does not mean you have a usable experimental design, and many legitimate experiments do not have a useful group labeled as "control."
To make the data collection simple and reliable I recorded only a few facts per row:
- mission number: this is which mission we are on
- shot number: this was easier to track than player turn, as it is the data-set row number
- hit probability: the game-reported chance of success of the shot, reported as a percentage from 0 to 100; this is the quantity we are trying to validate
- hit: the actual outcome of the shot, 1 if we hit and 0 if we missed.
- grenade: 1 if the action was "throws grenade," blank otherwise (could be used to exclude these rows from later analysis).
- rocket: 1 if the action was "rocket firing," blank otherwise (could be used to exclude these rows from later analysis).
- headshot: 1 if the action was a "sniper headshot," blank otherwise (could be used to exclude these rows from later analysis).
- weapon type: what type of weapon was used; right now one of "projectile," "laser," and "arc thrower" (I have not yet unlocked plasma weapons).
The initial goal was to get about 100 observations per major weapon type (arc thrower is a specialist weapon, so it would take a very long time to collect a lot of data on it) from about 10 missions. No analysis was performed prior to stopping data collection at ten missions. This is a simple (but not entirely necessary) method of avoiding a "stopping bias," as we would expect even a fair coin sequence to appear somewhat unfair on some prefixes (see, for example, the law of the iterated logarithm). So an inspection that played "until something looked off" would have a large bias toward false alarms (this is in fact, unfortunately, how most commercial research is done: see Why Most Published Research Findings Are False). We will mention the nature of the false alarm effect when we discuss significance. Like "control groups," this stopping bias isn't something mystical that can only be avoided through certain rituals; it is a real and measurable effect that you need to account for.
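To make the stopping-bias point concrete, here is a small simulation (not part of the collected data; the 75% hit chance and the "10 points low" alarm rule are made-up thresholds): perfectly fair games often look unfair at some prefix if you are allowed to stop whenever things look off.

# Fraction of perfectly fair play-throughs that would "look unfair" at some point
# to an inspector who stops whenever the running hit rate drifts well below the claim.
set.seed(1234)
p <- 0.75          # assumed true (and reported) hit probability
n <- 250           # shots per play-through, roughly our data size
nGames <- 1000
falseAlarm <- replicate(nGames, {
  hits <- rbinom(n, size = 1, prob = p)
  runningRate <- cumsum(hits) / seq_len(n)
  # alarm if, at any point after 20 shots, the running rate is 10+ points below p
  any(runningRate[20:n] < p - 0.10)
})
mean(falseAlarm)   # a nontrivial fraction of fair games trip the alarm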
First we load the data into R:
d <- read.table('http://www.win-vector.com/dfiles/XCOM/XCOMEUstats.txt',
   header=TRUE, sep='\t')
d[is.na(d)] <- 0  # replace all NA (which came from blanks) with 0
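It is worth a quick look at the loaded data to confirm the columns match the field list above (a routine check; output not reproduced here):

# quick sanity checks on the loaded data
str(d)                        # column names and types
summary(d$hit.probability)    # reported hit chances, as percentages 0-100
table(d$hit)                  # counts of misses (0) and hits (1)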
The basic plan for analysis is: choose a summary statistic and compute the significance of the value you observe for that statistic. For our first summary statistic we just use "total number of hits."
sum(d$hit)
Which turns out to be 191. In our data set “hit” is a variable that is written as 1 if we hit and 0 if we missed. We chose this representation because if hit.probability were the actual correct percent chance of hitting then we should have:
sum(d$hit) nearly equals sum(d$hit.probability/100.0).
That is because the probability of a hit is just the expected value of the process that gives you 1 point for a hit and 0 for a miss, plus the remarkable fact that expected values always add. The fact that expected values always add is both remarkable and an immediate consequence of the definition of expected value ("The Probabilistic Method" by Noga Alon and Joel H. Spencer calls this the "Linearity of Expectation" and devotes an entire chapter to clever uses of this fact). So what is the sum of reported hit probability in our data set?
sum(d$hit.probability/100.0)
Which turns out to be 179.73. So in my single game I actually hit a bit more often (191 times) than the game claimed I would (179.73 times). A quick question: could this be due to rounding or truncation? We check the difference in percentage points:
(sum(100*d$hit) - sum(d$hit.probability))/length(d$hit)
Which is 4.49, too large for rounding (which would be at most +/-1, and hopefully +/-0.5 on average).
This brings us to significance. What we want to know is: is this difference of about 11 hits large or small? In fact we want to know if it was large or small in the special sense: was such a sum likely or unlikely to happen (and this is significance)? The question is usually formed as follows: if I assume exactly what I am trying to disprove (that the game is fair), how often when I played would I see a difference (from an assumed fair game, also called the null hypothesis) as large as what I saw? If what I saw is rare (or hard to produce from a fair game), then I may reject the null hypothesis and say I don't believe my original assumption that the game is fair (which was my intent in setting up the experiment in the first place). Note you can never "prove the null hypothesis" with this sort of experimental design (you can only reject the null hypothesis or fail to reject it). If the null hypothesis were in fact true, every time you collected more data you would get another equivocal result where you can't quite reject the null hypothesis yet. "But more data may help." However, for a true null, each time you collect more data you will likely get yet another non-definitive result. So the data scientist will have to use judgement and decide where to stop at some point.
This standard interpretation of significance is why you don't want to allow "venue shopping" or "data scumming." Suppose I secretly played 30 different games of XCOM: Enemy Unknown and then showed you only the one play-through where "wow, that set of coin flips was only 1 in 20 likely; the game must be unfair." If you know only about the game I showed you, the claim is that you are seeing something that is only 1/20 likely under the null hypothesis (so a p-value of 0.05) and perhaps decent evidence against the null hypothesis (that the game is fair). However, if you are then informed I had to play 30 games to find the bad example (and I only showed you the worst), the response would be: of course in 30 plays you would expect to see something that only happens one time in twenty by random chance, as you took more than 20 trials.

Of course data scientists always perform more than one analysis. If it were always a priori obvious what the exact right analysis would be, the job would be a lot easier. The saving fact is that we can use a very crude significance correction: if we ran k experiments and the best one had a significance of p (small being more interesting), then the significance of the "cherry pick adjusted" experiment is no more than k*p. So if we run 100 experiments and the best has a p-value of 0.0001, then even after the cherry-picking correction we know we have a significance of at least 100*0.0001 = 0.01, which is good. The second saving grace is that p-values decrease rapidly when you add more data. If we know we want to try k experiments, then collecting a log(k) multiple more data is enough to defend against data scumming or venue shopping. The thing that is expensive in data is attempting to measure smaller clinical effect sizes. If you halve the size of the (possibly non-existent) effect you claim to be measuring (like ESP: "oops, I didn't say I had a 5% advantage guessing wavy cards, I meant a 2.5% advantage") you need to quadruple the amount of data collected. The effect size you are trying to measure enters your required sample size as an inverse square. This is why it is easy for somebody defending a non-effect to run a cooperating data scientist ragged by revising their claimed expectations.
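The crude cherry-pick correction is one line of arithmetic; as a sketch (the k = 30 and p = 0.05 values are just the example numbers from above):

# Crude "best of k experiments" correction: report no better than min(1, k*p).
k <- 30
bestP <- 0.05
min(1, k * bestP)                             # 1: no evidence at all after correction
p.adjust(bestP, method = "bonferroni", n = k) # R's built-in helper gives the same answer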
Back to our XCOM analysis. We said the strategy is to propose a summary and compute its significance. There are a few great ways to do this: empirical re-sampling, permutation tests and simulation. We will use simulation. We will write new code to generate hit outcomes directly from the published probabilities:
library(ggplot2)

simulateC <- function(x) {  # x = probabilities
  simHits <- ifelse(runif(length(x))<=x,1,0)
  sum(simHits)
}

simulateC(d$hit.probability/100.0)
drawsC <- sapply(1:10000,function(x) simulateC(d$hit.probability/100.0))
mean(drawsC)
sC <- sum(d$hit)
ggplot() + geom_histogram(aes(x=drawsC)) + geom_vline(xintercept=sC)
The above R-code runs the simulation 10,000 times and plots the histogram of how often different numbers of hits show up. Our game experience is added to the graph as a vertical line. The graph is given below:
In the above graph the mass to the right of the vertical line is how often a random re-simulation saw a count of at least as many hits as we did. This is called the "one sided tail": if there is a lot of mass in this tail then our result was not that unlikely (not very significant), and if there is not much mass in this tail then our measurement was rare and very significant. The R commands to compute the mass in the tail are easy:
eC <- ecdf(drawsC)
1 - eC(sC)
This turns out to be 2.78%. The R command ecdf() returns a function that computes the amount of mass at or below a given threshold. So eC(S) gives us the amount of mass not more than S (a "left tail" if S is small), 1-eC(L) gives us the right tail, and eC(S) + 1 - eC(L) gives us the mass in both tails (the two-sided tail).
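A toy example (unrelated to the XCOM data) of this tail arithmetic with ecdf():

# ecdf() on a tiny sample: tails are just differences of the empirical CDF
x <- c(1, 2, 3, 4, 5)
e <- ecdf(x)
e(2)               # mass at or below 2: 0.4
1 - e(4)           # mass above 4 (right tail): 0.2
e(2) + (1 - e(4))  # mass in both tails: 0.6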
Note: trusting the simulation significance results means you are trusting the pseudo-random generator used to produce them (in this case R's generator). The only ways to avoid trusting your test pseudo-random generator are to use a trusted true-random entropy source or to deliberately pick a test where you know the exact expected theoretical shape of the cumulative distribution. Statisticians are the masters of exact theoretical tests and usually pick from a very limited set of summary statistics (counts, means, standard deviations) so they can apply known theoretical test distributions (t-tests, F-tests, and so on).
Our p-value of 0.0278 is considered significant (the usual rule of thumb is that p ≤ 0.05 is considered significant). Notice we are using an empirical p-value (re-simulating generation of hits from the assumed distribution) instead of a parametric p-value (assuming a distribution of the outcomes and using a theoretical mean and a theoretical variance). Empirical p-values are much easier to explain (they are a sampling of what would exactly happen if you repeated the null experiment again and again) and so easy to compute that there is really no reason to use the distributional methods (Normal, Student-t, chi-square, and so on) until you are repeating the calculation very many times. It saves one level of explanation to directly estimate the significance through re-simulation rather than to bring in "the standard approximations" (and their attendant assumptions).
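For comparison only (this is not part of the original analysis, just the standard Normal approximation for a sum of independent, differently-weighted coin flips), the parametric version of the same one-sided test looks like this; it should give an answer in the same neighborhood as the simulated estimate:

# Normal-approximation p-value for the total hit count under the null (game is fair):
# mean = sum of probabilities, variance = sum of p*(1-p) for independent shots.
p <- d$hit.probability/100.0
mu <- sum(p)
sigma <- sqrt(sum(p*(1-p)))
z <- (sum(d$hit) - mu)/sigma
pnorm(z, lower.tail = FALSE)   # one-sided upper-tail p-value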
One important consideration is that we didn't specify before running this experiment that we thought we would experience above-average luck (in fact we came in thinking we were getting ripped off, so we were looking for a low hit count). So we should either be looking at "two-sided tails" (accepting mass from both tails of the distribution, i.e., measuring how far we were from the mean in absolute value terms) or at least double our p-value to 0.0556 to respect that we implicitly ran two experiments. The p-value for the two-sided tail is gotten as follows:
expectation <- sum(d$hit.probability/100.0)
diff <- abs(sum(d$hit)-expectation)
eC(expectation-diff) + (1-eC(expectation+diff))
Which is 0.0646 (even worse than the 2*p correction). What this means is: if we had started the experiment with the hypothesis that XCOM was under-reporting hit probabilities (or, equivalently, cheating in our favor), we would have collected just enough data to reject the null hypothesis (that XCOM is perfectly fair) according to standard clinical standards (which I have never liked, as they are far too lenient). However, we started with the hypothesis that XCOM was over-reporting hit probabilities (or cheating in its own favor) and switched hypotheses when we saw our hit count was high. Under this situation we did not collect enough data to reject the null hypothesis, as the two-sided p-value is 0.0646 and the corrected one-sided p-value becomes 0.0556 (both above the middling 0.05 standard). We would not expect to have to double our data to get better p-values (as p-values fall fast when you add data), but if we were to continue to collect data we should acknowledge that our hypothesis has now been taken from the data (so we should probably use the two-sided p-value and still multiply by an additional 2, as we have already run a few experiments or done some venue shopping on this data). Also, remember: if XCOM is fair all experiments will look equivocal; they will fail to prove it is unfair but not quite look fair. So really we have seen nothing to be suspicious about at this point. It is a strange but true fact that statistics is an intentional science: what you know and how much of the data you have snooped really does affect the actual objective significances you experience. If you fail to put in some sort of compensation for how many experiments you have run and how often you switched measurement or hypothesis, you will mis-report the ideal theoretical significance of a single clean-room experiment (that you really did not run) as the significance of the entangled combination of measurements you actually did implement.
Part of the reason we are being so cagey about accepting differences (though you always should be) is that we strongly suspect (due to the forensic study by Yawning Angel) that the generator is in fact fair. At least it is fair in a total sense (we are not testing for state-driven cheating or streaks).
Another summary we could look at (instead of total counts) is total surprise. This is a metric more sensitive to effects like "I swear I miss 80% shots half the time, how is that fair?" The surprise of an outcome is the negative of the logarithm (base 2) of the probability of the given outcome. Hitting an 80% shot has low surprise: -log_2(0.8) = 0.32, whereas missing an 80% shot has high surprise: -log_2(1-0.8) = 2.32. The total surprise for the shot sequence I observed is given by:
surprise <- function(x,o) {  # x = probabilities, o = actual outcomes
  sum(ifelse(o>=0.5,-log(x,base=2),-log(1-x,base=2)))
}
s <- surprise(d$hit.probability/100.0,d$hit)
This turns out to be 153.7. So we have a new summary statistic; we now need to know if it is significantly large or small. The theoretical expected surprise of a sequence of probabilities is a quantity called the entropy, which is given by:
entropy <- function(x) {  # x = probabilities
  ifelse(x<=0,0.0,ifelse(x>=1,0,-x*log(x,base=2) - (1-x)*log(1-x,base=2)))
}
sum(entropy(d$hit.probability/100.0))
The information theoretic entropy is 164.8. So our experienced surprise is in fact lower than expected; outcomes tended to go the majority direction slightly more often than expected (not less, as missing a lot of near sure things would entail). We can again use empirical simulation to get the distribution of the surprise statistic and estimate the significance:
simulate <- function(x) {
  simHits <- ifelse(runif(length(x))<=x,1,0)
  surprise(x,simHits)
}

simulate(d$hit.probability/100.0)
draws <- sapply(1:10000,function(x) simulate(d$hit.probability/100.0))
ggplot() + geom_density(aes(x=draws),adjust=0.5) + geom_vline(xintercept=s)
Again we see that we are not a very rare event in terms of the possible distributions of surprise:
In fact even the one-sided p-value is quite large (and poor) at 0.1 (computed as e <- ecdf(draws); e(s)), let alone the more appropriate two-sided tail probability.
An additional thing to look for is: can we build a useful probability re-mapping table for the reported probabilities? We know the totals are mostly right and the outcomes of near-certain and rare events are largely right. Could there be some band of predictions that is biased (say the 70% to 80% range)? This is also easy to do in R:
ggplot(data=d,aes(x=hit.probability/100.0,y=hit)) +
  geom_point(size=5,alpha=0.5,position=position_jitter(width=0.01,height=0.01)) +
  geom_smooth() +
  geom_abline(slope=1,intercept=0) +
  theme(aspect.ratio=1) +
  scale_x_continuous(limits=c(0,1)) +
  scale_y_continuous(limits=c(0,1))
This produces the following figure:
The x-axis is the game-reported hit probability, the y-axis is the observed outcome (always either 0 or 1, as each shot either hits or does not). Each black circle represents one of our recorded observations. The blue line with error band is the smoothed fit; it is estimating the fraction of shots that hit as a function of the stated predicted hit probability. Early on the blue curve is low because most black dots are at y=0; for higher x the curve pulls up in proportion to the fraction of points at y=1. Notice how close the blue curve is to the black line y=x; the error band hardly pulls off the black line except in the 0.5 to 0.7 region. So maybe mid-values are slightly under-predicted, but we don't have enough data to say so (and more data would probably just show a new, tighter correspondence instead of confirming this divergence). A similar plot can be made using a GAM package, but it is harder to get the error bars.
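As a sketch of that alternative (assuming the mgcv package; the 100-point prediction grid and the approximate +/- 2 standard error bands are choices made here, not the exact geom_smooth() bands):

# Fit a logistic GAM relating outcomes to the reported probability, then
# plot the estimated hit rate with approximate +/- 2 standard error bands.
library(mgcv)
fit <- gam(hit ~ s(hit.probability), family = binomial, data = d)
grid <- data.frame(hit.probability = seq(min(d$hit.probability),
                                         max(d$hit.probability), length.out = 100))
pred <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
inv <- binomial()$linkinv
grid$est <- inv(pred$fit)
grid$lower <- inv(pred$fit - 2*pred$se.fit)
grid$upper <- inv(pred$fit + 2*pred$se.fit)
ggplot(grid, aes(x = hit.probability/100.0, y = est)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0) +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1))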
This graph (the calibration plot above) is the kind of thing the data scientist should look at, and it points out yet another data deficiency in our study. The distribution of shot probabilities attempted is whatever was best for play, not necessarily best for analysis (a property of all real data when you don't get complete control over the experimental design). The distribution (again represented as a density) of the shots I attempted is given below:
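The density figure can be produced with a plot along these lines (a sketch; the bandwidth is left at its default):

# Density of the game-reported hit probabilities for the shots actually attempted
ggplot(data=d) + geom_density(aes(x=hit.probability/100.0))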
(for how to read density plots see My Favorite Graphs).
The core purpose of this article hasn't been so much the analysis of XCOM itself as to show how to analyze this type of data. We have emphasized methods that can deal with many different probabilities at the same time (as opposed to binning) in the interest of "statistical efficiency": that is, to get the most results out of what little data we have. This is always important when you are producing annotated data, which is always going to be per-unit expensive, even in this "age of big data." Finding usable relations and biases is the exciting part of data science, but one of the responsibilities of the data scientist is protecting the rest of their organization from the ruinous effect of pursuing spurious relations. You really don't want to report a relation where there was none.