R Scrabble: Part 2
chartsnthings 2014-06-25
(This article was first published on R snippets, and kindly contributed to R-bloggers)
Ivan Nazarov and Bartek Chroł left very interesting comments on my last post about counting the number of subwords in NGSL words. In particular, they proposed large speedups of my code, so I decided to try a larger data set. Today I will work with TWL2006, the official word authority for tournament Scrabble in the USA. The question is whether the exponential relationship between the number of letters in a word and the number of its subwords, observed in the NGSL data set, still holds for TWL2006.
The challenge is that NGSL has 2801 words, while TWL2006 is much larger at 178691 words. You can download the file TWL2006.txt, which contains the words and was prepared (converted to lowercase and sorted by word length) using data from the Internet Scrabble Club website. You could run the code from the comments to my last post to obtain the results, but that takes ages to compute (over 1 hour). Therefore I have written the data preparation step in Java, which reduced the time needed for the analysis to around 1 minute.
So let us start with the Java code that computes the number of subwords for each word in the TWL2006 dictionary:

    package scrabble;

    import java.io.*;

    public class Scrabble implements Runnable {
        // b[i][k] is the number of times letter ('a' + k) occurs in word i
        public static int[][] b = new int[178691][26];
        // wlen[i] is the length of word i
        public static int wlen[] = new int[178691];
        // Set number of threads to match your hardware
        public static final int MAX_THREADS = 6;

        public static void main(String[] args)
                throws FileNotFoundException, IOException {
            String file = "TWL2006.txt";
            int i;
            // Read the dictionary and store letter counts for every word
            try (BufferedReader r = new BufferedReader(new FileReader(file))) {
                i = 0;
                for (String line; (line = r.readLine()) != null; i++) {
                    for (char c : line.toCharArray()) {
                        b[i][c - 'a']++;
                    }
                    wlen[i] = line.length();
                }
            }
            // Each thread handles every MAX_THREADS-th word
            for (i = 0; i < MAX_THREADS; i++) {
                new Thread(new Scrabble(i)).start();
            }
        }

        private final int counter;

        public Scrabble(int counter) {
            this.counter = counter;
        }

        @Override
        public void run() {
            String filename = "result_" + counter + ".txt";
            try (PrintWriter w = new PrintWriter(filename)) {
                w.println("length, subwords");
                for (int i = counter; i < b.length; i += MAX_THREADS) {
                    int[] base = b[i];
                    int subwordcount = 0;
                    // Words are sorted by length, so we can stop at the
                    // first word longer than the base word
                    for (int j = 0;
                         (j < b.length) && (wlen[j] <= wlen[i]); j++) {
                        int[] subword = b[j];
                        boolean issubword = true;
                        for (int k = 0; k < 26; k++) {
                            if (subword[k] > base[k]) {
                                issubword = false;
                                break;
                            }
                        }
                        if (issubword) {
                            subwordcount++;
                        }
                    }
                    w.println(wlen[i] + ", " + subwordcount);
                }
            } catch (FileNotFoundException ex) {
                System.out.println("Failed to open file " + filename);
            }
        }
    }
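The core test in the program above compares letter counts: a word counts as a subword of a base word when it never needs more copies of any letter than the base word contains. The following is a minimal sketch of that test in isolation. The class name `SubwordDemo`, its method names, and the toy word list are assumptions made for illustration; they are not part of the original program and the words are not drawn from TWL2006 in any systematic way.

```java
// Hypothetical demo class: names are illustrative, not from the original post.
class SubwordDemo {

    // Count how many times each letter 'a'..'z' occurs in a lowercase word.
    static int[] letterCounts(String word) {
        int[] counts = new int[26];
        for (char c : word.toCharArray()) {
            counts[c - 'a']++;
        }
        return counts;
    }

    // A candidate is a subword if, for every letter, it uses no more
    // copies of that letter than the base word has.
    static boolean isSubword(int[] candidate, int[] base) {
        for (int k = 0; k < 26; k++) {
            if (candidate[k] > base[k]) {
                return false;
            }
        }
        return true;
    }

    // Count dictionary entries that are subwords of `word`.
    // The word itself is included if it appears in the dictionary.
    static int countSubwords(String word, String[] dictionary) {
        int[] base = letterCounts(word);
        int count = 0;
        for (String candidate : dictionary) {
            if (candidate.length() <= word.length()
                    && isSubword(letterCounts(candidate), base)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy word list for illustration only.
        String[] dictionary = {"at", "ta", "ate", "eat", "tea", "seat", "teas"};
        // Same output layout as the result files: "length, subwords".
        for (String word : dictionary) {
            System.out.println(word.length() + ", "
                    + countSubwords(word, dictionary));
        }
    }
}
```

Unlike the full program, this sketch recomputes letter counts for every comparison and does not rely on the list being sorted by length, which keeps it short but would be too slow for 178691 words.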