What is a win vector?

Win-Vector Blog 2014-09-04

From time to time we are asked “what is the company name Win-Vector LLC referring to?” It is a cryptic pun trying to be an encoding of “we deliver victory.”

The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”

2829551 full of win

If we treat “win” as a concrete noun (say something you can put in a sack) and “vector” in its non-mathematical sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempt to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.

640px Nike of Samothrake Louvre Ma2369 n4

Winged Victory: from Wikipedia

Let’s take this as an opportunity to describe what a win vector is.

We take the phrase “win vector” from a technical article titled “The Polytope of Win Vectors” by J.E. Bartles, J. Mount, and D.J.A. Welsh (Annals of Combinatorics I, 1997, pp. 1-15. The topic of this paper concerns the possible outcomes of game tournaments (or other things that can be expressed as tournaments). For example: we could have four teams (A,B,C, and D) scheduled to play each other a number of times, as indicated in the diagram below.

This graph is just saying in the tournament: A will play B 5 times, B will not play C, and so on. We assume each game can end in a win for one team (given them 1 point), or a loss or tie (giving them zero points). We can record a summary of the tournament outcomes as a vector (vector now back to its mathematical sense) that just records how often each team won. For example the vector [10,1,1,0] is a win vector compatible with the above diagram (it encodes A winning all matches and D losing all matches). The vector [0,0,0,5] is not a valid win vector for the digram as D did not play 5 games (so can not have 5 wins). (The Win-Vector LLC logo is itself a stylized single game tournament diagram, with the directed arrow representing both victory and reminiscent of vectors in the mathematical sense.)

The idea is that a win vector might be treated as a sufficient statistic for the tournament. Or more accurately the win vector may be all that is known about a previously run tournament. Such censored observations may be all that is possible in field biology where wins represent territory or offspring. The question is then: given knowledge of the tournament structure (the graph) and the summary of outcomes (the win vector) is there evidence one team is dominant, or are the effects random? So we have well-formed statistical questions about effect strength and significance.

The question of significance is: when we introduce a notion of effect strength how likely are we to see an effect of that size assuming identical players. For example if we make our notion of effect strength the maximum ratio of wins to plays seen in the win vector should we consider this evidence of a strong player, or is it to be expected by random fluctuation? We need to estimate how strong a conditioning effect our tournament constraints impose on unobserved outcomes (to determine if irregularities in distribution are from player strengths our tournament mis-design).

Relating distributions of unobserved details to observed totals (or margins) is one of the most fundamental problems in statistics. We have written on it many times (two examples: Google ad market reporting and checking scientific claims). In all cases you would be better off with direct detailed observations (i.e. without the censorship); but often you have to work with the data you have instead of the experiment you would design.

The math is a little easier to explain for a related problem: working out the number of ways to fill in a matrix with non-negative integers to meet given row and column totals. I’ll move on to discuss this contingency table problem a bit.

The statistical ideas largely come from “Testing for Independence in a Two-Way Table: New Interpretations of the Chi-Square Statistic”, Persi Diaconis and Bradley Efron, Ann. Statist., Vol. 13, No. 3, 1985 pp. 845-874. A contingency table is a matrix of non-negative integers, and the statistical problem is relating known row and column totals to possible fill-ins. In this paper the authors criticize some of the standard significance tests (chi-square, Fisher’s exact test) and propose a parameterized family of tests that at the extreme end considers a null-model of uniform fill-ins (each possible fill in equally likely). Obviously a uniform model is very different than the more standard distributions which tend to have cell counts more highly concentrated around their means. But the idea is: this proposed test takes more of the structure of the margin totals into account (or equivalently assumes away less of the margin mediated cell dependencies) and has its own merits.

However, we are actually describing the work of mathematicians and theoretical computer scientists. In that style you only speak with “applied types” (such as theoretical statisticians) to justify working on a snappy math problem. In this case: counting the number of ways to fill in a contingency table or the number of detailed results compatible with a given win vector (the link between counting, and generation having been strongly established in “Randomised Algorithms for Counting and Generating Combinatorial Structures”, A.J. Sinclair, Ph.D. thesisUniversity of Edinburgh (1988) and related works).

The contingency table problem is partially solved in:

“Sampling contingency tables” Martin Dyer, Ravi Kannan, John Mount, Random Structures and Algorithms Vol. 10, no. 4, July 1997 pp. 487-506.
“Fast Unimodular Counting” John Mount, Combinatorics Probability and Computing, Vol. 9, No. 3, May 2000, pp 277-285.

The second paper (strengthening some results from my Ph.D. thesis) lets you calculate that the number of ways to fill in the following four by four contingency table with non-negative integers to meet the shown row and column totals is exactly 350854066054593772938684218633979710637454260 (about 3.508541e+44).

x(0,0)x(0,1)x(0,2)x(0,3)154179x(1,0)x(1,1)x(1,2)x(1,3)255424x(2,0)x(2,1)x(2,2)x(2,3)277000x(3,0)x(3,1)x(3,2)x(3,3)160179191780288348165221201433

The point being: the table could arise as the summary from a data set with 846782 (=191780 + 288348 + 165221 + 201433) items; to characterize probabilities over such a tables you need good methods to sample over the astronomical family of potential alternate fill-ins (and this is where you apply the link between counting and sampling for self-reducible problem families). We have example code, notes, improved runtime proof, and results here.

“The Polytope of Win Vectors” introduced additional ideas from integral polymatroids to more strongly relate volume to number of integer vectors (and gets more complete theoretical results for its problem).

All the “big hammer” math is trying to extend some of the beauty of G.H. Hardy and J.E. Littlewood, “Some problems of Diophantine approximation: the lattice points of a right-angled triangle,” Hamburg. Math.Abh., 1 (1921) 212–249 to more general settings.

Or more succinctly: we just like the word “win.”

Somuchwin