R Scrabble: Part 2
chartsnthings 2014-06-25
Summary:
Ivan Nazarov and Bartek Chroł left very interesting comments on my last post about counting the number of subwords in NGSL words. In particular, they proposed large speedups of my code, so I thought I would try a larger data set. Today I will work with TWL2006, the official word authority for tournament Scrabble in the USA. The question is whether the exponential relationship between the number of letters in a word and the number of its subwords, observed in the NGSL data set, still holds for TWL2006. The challenge is that NGSL has 2801 words, while TWL2006 is much larger with 178691 words. You can download the file TWL2006.txt, which contains the words and was prepared (converted to lowercase and sorted by word length) using data from the Internet Scrabble Club website. You could run the code from the comments on my last post to obtain the results, but it takes ages to compute (over 1 hour). Therefore I have written the data preparation step in Java, which reduced the time needed to perform the analysis to around 1 minute.

So first let us start with the Java code that computes the number of subwords for each word in the TWL2006 dictionary:

```java
package scrabble;

import java.io.*;

public class Scrabble implements Runnable {

    // Letter counts for each word: b[i][k] is the number of occurrences
    // of letter ('a' + k) in word i
    public static int[][] b = new int[178691][26];
    // Length of each word
    public static int[] wlen = new int[178691];
    // Set number of threads to match your hardware
    public static final int MAX_THREADS = 6;

    public static void main(String[] args) throws FileNotFoundException, IOException {
        String file = "TWL2006.txt";
        int i;
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            i = 0;
            for (String line; (line = r.readLine()) != null; i++) {
                for (char c : line.toCharArray()) {
                    b[i][c - 'a']++;
                }
                wlen[i] = line.length();
            }
        }
        for (i = 0; i < MAX_THREADS; i++) {
            new Thread(new Scrabble(i)).start();
        }
    }

    private final int counter;

    public Scrabble(int counter) {
        this.counter = counter;
    }

    @Override
    public void run() {
        String filename = "result_" + counter + ".txt";
        try (PrintWriter w = new PrintWriter(filename)) {
            w.println("length, subwords");
            // Each thread processes every MAX_THREADS-th word
            for (int i = counter; i < b.length; i += MAX_THREADS) {
                int[] base = b[i];
                int subwordcount = 0;
                // Words are sorted by length, so we can stop as soon as
                // candidate words become longer than the current word
                for (int j = 0; (j < b.length) && (wlen[j] <= wlen[i]); j++) {
                    int[] subword = b[j];
                    boolean issubword = true;
                    for (int k = 0; k < 26; k++) {
                        if (subword[k] > base[k]) {
                            issubword = false;
                            break;
                        }
                    }
                    if (issubword) {
                        subwordcount++;
                    }
                }
                w.println(wlen[i] + ", " + subwordcount);
            }
        } catch (FileNotFoundException ex) {
            System.out.println("Failed to open file " + filename);
        }
    }
}
```

The code uses an idea proposed in the comments on my last post: keep a table with the number of occurrences of each letter in each word (array b). To speed up the computations, the code takes advantage of the fact that the words in TWL2006.txt are sorted by length, and it is additionally parallelized (6 threads by default), so it produces 6 files named result_[i].txt, where i ranges from 0 to 5. If you do not have a Java compiler you can download Scrabble.jar here (but it is better to build the jar file from the source, as your browser might complain about downloading jar files from untrusted sources).

And now the code that runs the analysis. It comes in three parts. First we plot the relationship between word length and the logarithm of the mean subword count.
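As a rough guide to this first step, here is a minimal R sketch that reads the six result_*.txt files produced by the Java program above, averages the subword counts by word length, and plots the logarithm of the means; the aggregation and plotting details are my own and may differ from the original analysis code:

```r
# Read the six per-thread result files written by the Java program;
# each file starts with a "length, subwords" header line
files <- paste0("result_", 0:5, ".txt")
read_one <- function(f) {
  d <- read.csv(f, header = TRUE)
  names(d) <- c("length", "subwords")
  d
}
results <- do.call(rbind, lapply(files, read_one))

# Mean subword count for each word length
means <- aggregate(subwords ~ length, data = results, FUN = mean)

# Word length vs. logarithm of the mean subword count;
# a roughly straight line indicates an exponential relationship
plot(means$length, log(means$subwords),
     xlab = "word length", ylab = "log(mean subword count)")
abline(lm(log(subwords) ~ length, data = means), col = "red")
```

If the exponential relationship seen for NGSL also holds for TWL2006, the points on this plot should lie close to the fitted line.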