Parallelism in Haskell
Wildon's Weblog 2022-05-01
The purpose of this post is to give the simplest way I’ve found to get parallel execution to work in ghc 8.10.1.
The unhelpful cabal
I’ll begin by getting rid of the trivial difficulties. I’m obliged to use Stack (and the related cabal system), as it’s the only way I can find to get ghc
onto recent versions of Mac OS X. I am unable to make stack
work in any simple way. Here is the complicated way I’ve settled on. For each project:
1. Make a directory containing all the `.hs` source files.
2. Create a `.cabal` file as in the example below.
3. Run `stack init`.
4. Edit `stack.yaml` so that the resolver field reads `resolver: lts-18.3`.
5. Run `stack ghci`.
6. Build with `stack build`.
7. Run the executable with `stack exec Main <arguments>`.
For (1), use a separate `Main.hs` module declaring `main`. Do not expect the ghc option `-main-is` to work. A suitable `.cabal` file for (2) is:
```
name:          ParTest
version:       0.0.0.0
cabal-version: >= 1.22
build-type:    Simple

executable Main
  main-is:          Main.hs
  other-modules:    ParTest
  build-depends:    base >= 4.14.1.0, array, deepseq, parallel
  default-language: Haskell98
  ghc-options:      -threaded -O2 -with-rtsopts=-N
```
Note that the `build-depends` field is separate from any import declarations in the `.hs` files and uses a different naming convention. For instance, the package `parallel` above corresponds to `import Control.Parallel` and `import Control.Parallel.Strategies` in the module file. Fortunately, if you omit a required package from `build-depends`, there is a very helpful error message that includes the package name you need. The purpose of (4) is to prevent `stack` from downloading a more recent version of `ghc` than it supports: ‘Stack has not been tested with GHC versions above 8.10, and using 9.0.2, this may fail’. Indeed, fail it does. (Thanks to `stack`’s insistence on downloading an entirely fresh `ghc` for each project, I suspect I have now downloaded over 30Gb worth of copies of `ghc` trying to chase down this glitch.) At least once I’ve found that (6) fails unless one first does (5): I have no idea why.
More positively, while I’ve had no success at all in using `--ghc-options="-threaded"` to pass options to `ghc`, the `ghc-options` field in the cabal file automates all of this, with the relevant run-time option included to boot. And I will concede that, despite all the fuss of using `cabal`, in my experience it does avoid the dreaded ‘dependency hell’.
Note that `ghc` has to be explicitly told to compile threaded code. If one omits `-threaded`, then `par` and the related constructs from `Control.Parallel` appear to have no effect. (When testing this, I discovered that it’s often essential to run `stack clean` after editing the `.cabal` file to make the changes ‘stick’.) This behaviour of `ghc` seems unnecessarily unhelpful to me: why can’t the compiler issue a warning?
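One quick sanity check (my own sketch, not part of the original setup) is to ask the run-time system how many capabilities it was given: with `-threaded` and `-with-rtsopts=-N` in `ghc-options`, this should match the number of cores, whereas without them it is stuck at 1 and `par` can do no real work.

```haskell
import GHC.Conc (numCapabilities)

main :: IO ()
main = do
  -- numCapabilities reflects the -N run-time option; without
  -- -threaded (or with -N1) it is 1, so sparks never run in parallel
  putStrLn $ "capabilities: " ++ show numCapabilities
```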
Benchmarking with lazy evaluation
A recurring issue with Haskell benchmarks is that lazy evaluation makes it very easy to write code that executes quickly, but only because it doesn’t do what strict imperative languages such as C might lead one to expect. For a simple example, suppose `expensive :: Integer -> Integer` is a function such that `expensive k` takes a long time to compute. We can build a list of the first `n` values of `expensive` and take its length using

```haskell
testExpensive n = length [expensive k | k <- [1..n]]
```
Irrespective of how `expensive` is defined, evaluating `testExpensive n` at the Haskell prompt takes hardly any time. The reason is that under lazy evaluation, only the ‘existence’ of each entry of the list has to be established; each `expensive k` sits within the list as an unevaluated thunk. This can be seen interactively using the very useful `sprint` command in `ghci`. Try for instance `let xs = [expensive k | k <- [1..n]]` for the `expensive` function of your choice (for instance `collatzSum` below), followed by `:sprint xs`, `length xs`, `:sprint xs`, `sum xs`, `:sprint xs`.
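The same point can be made in compiled code (a minimal sketch of my own): `length` forces only the spine of the list, so even elements that would crash if evaluated are left untouched as thunks.

```haskell
-- Each element is a thunk that would crash if forced; length
-- only walks the spine, so none of them is ever evaluated.
xs :: [Integer]
xs = [error "thunk forced!" | _ <- [1 .. 10 :: Int]]

main :: IO ()
main = print (length xs)  -- prints 10 without touching any element
```

Replacing `length xs` with `sum xs` makes the program crash immediately, since `sum` demands every element.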
An expensive function the compiler can’t magic away
Open mathematical problems are a rich source of such functions. Here I’ll go for the Collatz conjecture, on which Paul Erdős said “Mathematics is not yet ripe enough for such questions”; despite an astonishing partial result due to Terence Tao, his assessment still stands today. Let us define:
```haskell
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum $ collatzList n
```
While `ghc` is a deeply impressive piece of software, I think we can safely assume that it does not have, deeply embedded in its internals, an effective proof of the Collatz conjecture that gives a short-circuited evaluation of `collatzSum`. We can therefore take `collatzSum` as the `expensive` function we require.
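As a quick sanity check on these definitions (values computed by hand, not from the post): the Collatz trajectory of 6 is 6, 3, 10, 5, 16, 8, 4, 2, stopping just before it reaches 1, so `collatzSum 6 = 54`.

```haskell
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum $ collatzList n

main :: IO ()
main = do
  print (collatzList 6)  -- [6,3,10,5,16,8,4,2]
  print (collatzSum 6)   -- 54
```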
Parallelising the Collatz sum of sums
Evaluating `sum [collatzSum k | k <- [1..1000000]]`, using the module `Main.hs` at the end of this post, takes about 9.5 seconds on my computer. The code below parallelises this computation by splitting the list in two and evaluating the sum of each half in parallel. Its running time is about 5 seconds.
```haskell
collatzSumOfSumsP :: Integer -> Integer
collatzSumOfSumsP r = sA `par` (sB `par` sA + sB)
  where
    sA = sum [collatzSum n | n <- [1..r], n `mod` 2 == 0]
    sB = sum [collatzSum n | n <- [1..r], n `mod` 2 == 1]
```
Here `par` is a Haskell function with the type `a -> b -> b`. The unusual type is a clue that it breaks the normal rules of evaluation. (Indeed, the reasoning used in Wadler’s remarkable paper ‘Theorems for free!’ shows that the only strict function with this type must discard the first argument and return the second unmodified.) The effect of `par` is to evaluate the first argument to weak head normal form (WHNF) and then, in parallel, but otherwise following the usual rules of Haskell evaluation, evaluate the second argument. (Compare `seq :: a -> b -> b`, which does exactly the same, but in series.) In particular, since `sA` is defined in the `where` clause, it is evaluated exactly once, in parallel with `sB`.
In fact the simpler definition `collatzSumOfSumsP r = sA `par` (sA + sB)` also works: the compiler is smart enough to avoid the ‘not-parallel’ code that in theory computes `sA` and `sA + sB` in parallel, but hangs on `sA + sB` waiting for `sA`, rather than getting on with `sB`.
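An alternative that does not rely on the compiler being smart is the pattern from the `Control.Parallel` documentation, which uses `pseq` to guarantee that the main thread works on `sB` while the spark evaluates `sA`. A sketch of the same even/odd split, with a cheap stand-in for `collatzSum` (here `par` and `pseq` are imported from `GHC.Conc` in `base`, which also exports them, so the sketch needs no extra package):

```haskell
import GHC.Conc (par, pseq)

-- Sum of squares split into even and odd parts: sA is sparked
-- for parallel evaluation while pseq forces the main thread to
-- finish sB before combining the two results.
parSumOfSquares :: Integer -> Integer
parSumOfSquares r = sA `par` (sB `pseq` (sA + sB))
  where
    sA = sum [n * n | n <- [1 .. r], even n]
    sB = sum [n * n | n <- [1 .. r], odd n]

main :: IO ()
main = print (parSumOfSquares 10)  -- 385
```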
The variation below evaluates the lists in parallel and then takes the sum. Its running time is (up to random noise) the same as `collatzSumOfSumsP`.
```haskell
interleave :: [a] -> [a] -> [a]
interleave xs [] = xs
interleave [] ys = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

parallelInterleave :: (NFData a) => [a] -> [a] -> [a]
parallelInterleave xs ys = runEval $ do
  xsP <- rpar (force xs)
  ysP <- rpar (force ys)
  return $ interleave xsP ysP

collatzSumOfSumsPL :: Integer -> Integer
collatzSumOfSumsPL r = sum $ parallelInterleave ssA ssB
  where
    ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]
    ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]
```
This version is more typical of the kind of problem I really want to parallelise, in which a huge collection of candidate solutions is generated, and then some further computation is done on this collection to decide which solutions work.
Parallel interleave
The engine driving `collatzSumOfSumsPL` is the function `parallelInterleave`, which uses `rpar` from `Control.Parallel.Strategies` and `force` from `Control.DeepSeq` to evaluate the lists in parallel.
The use of `force` is essential because of a very well-known Haskell ‘gotcha’. Consider for instance `let xs = collatzList 10` followed by `xs `seq` "Now xs has been evaluated?!"`. One can check using `:sprint` that `xs` is indeed evaluated, but only to WHNF: the output is `xs = 10 : _`. The correct implementation of `force` (for lists) is recursive:
```haskell
forceList :: [a] -> [a]
forceList [] = []
-- evaluating the result to WHNF forces every element of the list
forceList (x : xs) = x `seq` forceList xs `seq` (x : xs)
```
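The difference can be demonstrated directly (my own sketch): `seq` on a list stops at the first cons cell, so a hidden `error` in the elements goes unnoticed, whereas a recursive force triggers it.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Recursive force: evaluating the result to WHNF forces every element.
forceList :: [a] -> [a]
forceList [] = []
forceList (x : xs) = x `seq` forceList xs `seq` (x : xs)

main :: IO ()
main = do
  let xs = [1, error "boom", 3] :: [Integer]
  -- seq only reaches WHNF, i.e. the first cons cell: no crash
  _ <- evaluate (xs `seq` ())
  putStrLn "seq: stopped at the first cons cell"
  -- forceList reaches every element, so the hidden error is raised
  r <- try (evaluate (forceList xs)) :: IO (Either SomeException [Integer])
  putStrLn $ case r of
    Left _  -> "forceList: forced the hidden error"
    Right _ -> "unexpected: error not forced"
```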
The problem with `collatzSumOfSumsPL` is that, since it evaluates both the lists `ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]` and `ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]` in full before taking the sum of each list (and then the sum of the sums), it uses memory proportional to the running time of the entire program, rather than memory bounded by the largest value of `collatzSum n`, as in `collatzSumOfSumsP`. Our use of `par` and `seq` has forced us to confront the messy reality that semantically equivalent programs can have very different space requirements.
More on space
The code below computes the same quantity as `collatzSumOfSums` by using an array, storing the value `collatzSum n` for each relevant `n`. Since Haskell arrays are strict (in their spine, at least), each call `collatzSumOfSumsA r` has to fully allocate an array of size `r`.
```haskell
hugeCollatz :: Integer -> Array Integer Integer
hugeCollatz r = array (1, r) [(i, collatzSum i) | i <- [1..r]]

sumArray :: Array i Integer -> Integer
sumArray arr = sum $ elems arr

collatzSumOfSumsA :: Integer -> Integer
collatzSumOfSumsA r = sumArray $ hugeCollatz r
```
Using collatzSumOfSumsA
as the expensive
function in our benchmarks, the extra space required to interleave the strictly evaluated lists before taking the sum can easily be seen using top
.
```haskell
collatzSumOfSumOfSums :: Integer -> Integer
collatzSumOfSumOfSums t = sum [collatzSumOfSums r | r <- [1..t]]

collatzSumOfSumOfSumsAPHuge :: Integer -> Integer
collatzSumOfSumOfSumsAPHuge t = sum [sumArray arr | arr <- arrays]
  where
    arraysA = [hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
    arraysB = [hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
    arrays  = parallelInterleave arraysA arraysB

collatzSumOfSumOfSumsAPSmall :: Integer -> Integer
collatzSumOfSumOfSumsAPSmall t = sum ss
  where
    ssA = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
    ssB = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
    ss  = parallelInterleave ssA ssB
```
On my machine, evaluating these three functions for `t = 2500` takes respectively:
- 14 seconds, memory 20Mb
- 8 seconds, memory growing to 320Mb
- 8 seconds, memory 20Mb
The saving in collatzSumOfSumOfSumsAPSmall
from summing the arrays as they are produced is shown in the diagram below.
Main
For the record, here is the `Main.hs` module I used to get these timings.
```haskell
module Main where

import System.Environment
import ParTest

-- Tests with r = 1000000
testSumOfSums :: IO ()
testSumOfSums = do
  putStrLn "testSumOfSums"
  rStr : _ <- getArgs
  let r = read rStr :: Integer
  -- putStrLn $ show $ collatzSumOfSums r    -- 316% cpu, 9.582 total
  -- putStrLn $ show $ collatzSumOfSumsP r   -- 304% cpu, 4.962 total
  putStrLn $ show $ collatzSumOfSumsPL r     -- 342% cpu, 4.998 total

-- Tests with t = 2500
testSumOfSumOfSums :: IO ()
testSumOfSumOfSums = do
  tStr : _ <- getArgs
  let t = read tStr :: Integer
  -- putStrLn $ show $ collatzSumOfSumOfSums t        -- 312% cpu, 13.837 total, memory 20Mb
  -- putStrLn $ show $ collatzSumOfSumOfSumsAPHuge t  -- 435% cpu, 7.902 total, memory to 320Mb
  putStrLn $ show $ collatzSumOfSumOfSumsAPSmall t    -- 377% cpu, 7.077 total, memory 20Mb

main :: IO ()
main = testSumOfSums
-- main = testSumOfSumOfSums
```