Parallelism in Haskell

Wildon's Weblog 2022-05-01

The purpose of this post is to give the simplest way I’ve found to get parallel execution to work in ghc 8.10.1.

The unhelpful cabal

I’ll begin by getting rid of the trivial difficulties. I’m obliged to use Stack (and the related cabal system), as it’s the only way I can find to get ghc onto recent versions of Mac OS X. I am unable to make stack work in any simple way. Here is the complicated way I’ve settled on. For each project:

  1. Make a directory with all .hs source files.
  2. Create a .cabal file as the example below.
  3. Run stack init.
  4. Edit stack.yaml so that the resolver field reads resolver: lts-18.3.
  5. Run stack ghci.
  6. Build with stack build.
  7. Run the executable with stack exec Main <arguments>.

For (1) use a separate Main.hs module declaring main. Do not expect the ghc option -main-is to work. A suitable .cabal file for (2) is

name:               ParTest
version:            0.0.0.0
cabal-version:      >= 1.22
build-type:         Simple

executable Main
  main-is:          Main.hs
  other-modules:    ParTest
  build-depends:    base >= 4.14.1.0, array, deepseq, parallel
  default-language: Haskell98
  ghc-options:      -threaded -O2 -with-rtsopts=-N

Note that the build-depends field is separate from any import declarations in the .hs files and uses a different naming convention. For instance the package parallel above corresponds to import Control.Parallel and import Control.Parallel.Strategies in the module file. Fortunately if you omit a required package from build-depends then there is a very helpful error message that includes the package name you need. The purpose of (4) is to prevent stack from downloading a more recent version of ghc than it supports: ‘Stack has not been tested with GHC versions above 8.10, and using 9.0.2, this may fail’ — indeed, fail it does. (Thanks to stack's insistence on downloading an entirely fresh ghc for each project, I suspect I have now downloaded over 30Gb worth of copies of ghc trying to chase down this glitch.) At least once I’ve found that (6) fails unless one first does (5): I have no idea why.

More positively, while I’ve had no success at all in using --ghc-options="-threaded" to pass options to ghc, the ghc-options field in the cabal file automates all of this, with the relevant run-time option included to boot. And I will concede that despite all the fuss of using cabal, it does in my experience avoid the dreaded ‘dependency hell’.

Note that ghc has to be explicitly told to compile threaded code. If one omits -threaded then par and the related constructs from Control.Parallel appear to have no effect. (When testing this, I discovered that it's often essential to run stack clean after editing the .cabal file to make the changes ‘stick’.) This behaviour of ghc seems unnecessarily unhelpful to me: why can’t the compiler issue a warning?

Benchmarking with lazy evaluation

A recurring issue with Haskell benchmarks is that lazy evaluation means that it’s very easy to write code that executes quickly, but only because it doesn’t do what strict imperative languages such as C might lead one to expect. For a simple example, suppose expensive :: Integer -> Integer is a function such that expensive k takes time \mathrm{O}(2^k) to compute. We can build a list of the first n values of expensive using

testExpensive n = length [expensive k | k <- [1..n]]

Irrespective of how expensive is defined, evaluating testExpensive n at the Haskell prompt takes time \mathrm{O}(n). The reason is that under lazy evaluation, only the ‘existence’ of each entry of the list has to be established; each expensive k sits within the list as an unevaluated thunk. This can be seen interactively using the very useful :sprint command in ghci. Try for instance

let xs = [expensive k | k <- [1..n]]

for the expensive function of your choice (for instance collatzSum below), followed by :sprint xs, length xs, :sprint xs, sum xs, :sprint xs.
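The same point can be seen in a compiled program. In the small sketch below (my own illustration, not part of the benchmarks), every element of the list is undefined, yet length succeeds, because only the spine of the list is ever forced:

```haskell
-- A minimal sketch: length forces only the spine of the list, so the
-- deliberately undefined elements are never evaluated and no error occurs.
main :: IO ()
main = do
    let xs = [undefined, undefined, undefined] :: [Integer]
    print (length xs)   -- prints 3 without touching any element
```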

An expensive function the compiler can’t magic away

Open mathematical problems are a rich source of such functions. Here I’ll go for the Collatz conjecture, on which Paul Erdős said “Mathematics is not yet ripe enough for such questions”; despite an astonishing partial result due to Terence Tao, his assessment still stands today. Let us define:

collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n | n `mod` 2 == 0 = n : collatzList (n `div` 2)
              | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum $ collatzList n

While ghc is a deeply impressive piece of software, I think we can safely assume that ghc does not have, deeply embedded in its internals, an effective proof of the Collatz conjecture that gives a short-circuited evaluation of collatzSum. We can therefore take collatzSum as the expensive function we require.
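As a quick sanity check of these definitions (repeated below so the snippet compiles on its own):

```haskell
-- Collatz helpers as defined above, plus a small demonstration in main.
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n | n `mod` 2 == 0 = n : collatzList (n `div` 2)
              | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum $ collatzList n

main :: IO ()
main = do
    print (collatzList 6)   -- [6,3,10,5,16,8,4,2,1]
    print (collatzSum 6)    -- 55
```

Note that collatzList stops before reaching 1, so the trailing 1 is not included in the sum.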

Parallelising the Collatz sum of sums

Evaluating sum [collatzSum k | k <- [1..1000000]], using the module Main.hs at the end of this post, takes about 9.5 seconds on my computer. The code below parallelises this computation by splitting the list in two, and evaluating the sum of each half in parallel. Its running time is about 5 seconds.

collatzSumOfSumsP :: Integer -> Integer
collatzSumOfSumsP r = sA `par` (sB `par` sA + sB)
    where sA = sum [collatzSum n | n <- [1..r], n `mod` 2 == 0]
          sB = sum [collatzSum n | n <- [1..r], n `mod` 2 == 1]

Here par is a Haskell function with the type a -> b -> b. The unusual type is a clue that it breaks the normal rules of evaluation. (Indeed, the reasoning used in Wadler’s remarkable paper Theorems for free! shows that the only strict function with this type must discard the first variable and return the second unmodified.) The effect of par is to evaluate the first argument to weak head normal form (WHNF), and then, in parallel, but otherwise following the usual rules of Haskell evaluation, evaluate the second argument. (Compare seq :: a -> b -> b which does exactly the same, but in series.) In particular, since sA is defined in the where clause, it is evaluated exactly once, in parallel with sB.

In fact the simpler definition collatzSumOfSumsP r = sA `par` (sA + sB) also works: the compiler is smart enough to avoid the ‘not-parallel’ reading that computes sA and sA + sB in parallel, but hangs on sA + sB waiting for sA, rather than getting on with sB.

The variation below evaluates the lists in parallel and then takes the sum. Its running time is (up to random noise) the same as that of collatzSumOfSumsP.

interleave :: [a] -> [a] -> [a]
interleave xs [] = xs
interleave [] ys = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

parallelInterleave :: (NFData a) => [a] -> [a] -> [a]
parallelInterleave xs ys =
    runEval $ do
        xsP <- rpar (force xs)
        ysP <- rpar (force ys)
        return $ interleave xsP ysP

collatzSumOfSumsPL :: Integer -> Integer
collatzSumOfSumsPL r = sum $ parallelInterleave ssA ssB
    where ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]
          ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]

This version is more typical of the kind of problem I really want to parallelise, in which a huge collection of candidate solutions is generated, and then some further computation is done on this collection to decide which solutions work.
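Leaving the parallelism aside, it is worth checking on small inputs that interleave behaves as its name suggests. A Prelude-only sketch of my own, with the definition repeated so it runs standalone:

```haskell
-- interleave as above: alternate elements from the two lists until one
-- is exhausted, then append the remainder of the other.
interleave :: [a] -> [a] -> [a]
interleave xs []             = xs
interleave [] ys             = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

main :: IO ()
main = do
    print (interleave [0, 2, 4] [1, 3, 5 :: Int])   -- [0,1,2,3,4,5]
    print (interleave "ab" "wxyz")                  -- "awbxyz"
```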

Parallel interleave

The engine driving collatzSumOfSumsPL is the function parallelInterleave, which uses rpar from Control.Parallel.Strategies and force from Control.DeepSeq to evaluate the lists in parallel.

The use of force is essential because of a very well-known Haskell ‘gotcha’. Consider for instance let xs = collatzList 10; seq xs "Now xs has been evaluated?!". One can check using :sprint that xs has indeed been evaluated, but only to WHNF: the output is xs = 10 : _. The correct implementation of force (for lists) is recursive:
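The same gotcha can be shown in a compiled program (a sketch of my own): the list below has a defined head but an undefined tail, and seq raises no error because it stops at the outermost constructor.

```haskell
-- seq forces its first argument only to WHNF, i.e. to the first (:) cell,
-- so the undefined tail of xs is never inspected.
main :: IO ()
main = do
    let xs = (10 :: Integer) : undefined
    xs `seq` print (head xs)   -- prints 10
```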

forceList :: [a] -> [a]
forceList []       = []
forceList (x : xs) = x `seq` forceList xs `seq` (x : xs)
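To see the difference in action, here is a standalone check of my own (using only Control.Exception from base): seq alone is happy with an undefined element, while evaluating forceList reaches it and raises the error, which we catch.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- forceList as above: evaluating the result to WHNF forces every element.
forceList :: [a] -> [a]
forceList []       = []
forceList (x : xs) = x `seq` forceList xs `seq` (x : xs)

main :: IO ()
main = do
    -- seq stops at the outermost constructor, so the undefined survives:
    ([1, undefined, 3] :: [Integer]) `seq` putStrLn "seq: WHNF only"
    -- forceList evaluates every element, so the undefined is reached:
    r <- try (evaluate (forceList [1, undefined, 3 :: Integer]))
             :: IO (Either SomeException [Integer])
    case r of
        Left _   -> putStrLn "forceList: undefined element forced"
        Right ys -> print ys
```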

The problem with collatzSumOfSumsPL is that since it evaluates both the lists ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0] and ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1] in full before taking the sum of each list (and then the sum of the sums), it uses memory proportional to the running time of the entire program, rather than memory bounded by the largest value of collatzSum n as in collatzSumOfSumsP. Our use of par and seq has forced us to confront the messy reality that semantically equivalent programs can have very different space requirements.

More on space

The code below computes the sum of Collatz sums using an array, storing the value collatzSum n for each relevant n. Since a Haskell array is allocated in full as soon as it is constructed, each call collatzSumOfSumsA r has to allocate an array of size \mathrm{O}(r).

hugeCollatz :: Integer -> Array Integer Integer
hugeCollatz r = array (1, r) [(i, collatzSum i) | i <- [1..r]]

sumArray :: Array i Integer -> Integer
sumArray arr = sum $ elems arr

collatzSumOfSumsA :: Integer -> Integer
collatzSumOfSumsA r = sumArray $ hugeCollatz r

Using collatzSumOfSumsA as the expensive function in our benchmarks, the extra space required to interleave the strictly evaluated lists before taking the sum can easily be seen using top.

collatzSumOfSumOfSums :: Integer -> Integer
collatzSumOfSumOfSums t = sum [collatzSumOfSums r | r <- [1..t]]

collatzSumOfSumOfSumsAPHuge :: Integer -> Integer
collatzSumOfSumOfSumsAPHuge t = sum [sumArray arr | arr <- arrays]
    where arraysA = [hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
          arraysB = [hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
          arrays  = parallelInterleave arraysA arraysB

collatzSumOfSumOfSumsAPSmall :: Integer -> Integer
collatzSumOfSumOfSumsAPSmall t = sum ss
    where ssA = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
          ssB = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
          ss  = parallelInterleave ssA ssB

On my machine, evaluating these three functions for t = 2500 takes, respectively:

  • 14 seconds, memory 20Mb
  • 8 seconds, memory growing to 320Mb
  • 8 seconds, memory 20Mb

The saving in collatzSumOfSumOfSumsAPSmall from summing the arrays as they are produced is shown in the diagram below.

Main

For the record here is the Main.hs module I used to get these timings.

module Main where

import System.Environment
import ParTest

-- Tests with r = 1000000
testSumOfSums :: IO ()
testSumOfSums = do
    putStrLn "testSumOfSums"
    rStr : _ <- getArgs
    let r = read rStr :: Integer
    -- putStrLn $ show $ collatzSumOfSums r   -- 316% cpu, 9.582 total
    -- putStrLn $ show $ collatzSumOfSumsP r  -- 304% cpu, 4.962 total
    putStrLn $ show $ collatzSumOfSumsPL r    -- 342% cpu, 4.998 total

-- Tests with t = 2500
testSumOfSumOfSums :: IO ()
testSumOfSumOfSums = do
    tStr : _ <- getArgs
    let t = read tStr :: Integer
    -- putStrLn $ show $ collatzSumOfSumOfSums t       -- 312% cpu 13.837 total, memory 20Mb
    -- putStrLn $ show $ collatzSumOfSumOfSumsAPHuge t -- 435% cpu 7.902 total, memory to 320Mb
    putStrLn $ show $ collatzSumOfSumOfSumsAPSmall t   -- 377% cpu 7.077 total, memory 20Mb

main :: IO ()
main = testSumOfSums
-- main = testSumOfSumOfSums