Yesterday morning there was an interesting question on Haskell Cafe:
Hello anyone, I’ve written a snippet which generates a file full of random strings. When compiled with -O2 on ghc-7.6, the generation speed is about 2Mb per second which is on par with interpreted php. That’s the fact I find rather disappointing. Maybe I’ve missed something trivial? Any suggestions and explanations are welcome.
It seemed an interesting exercise and we can’t put up being on par with PHP, so I decided to dig in. The code was the following:
import qualified Data.Text as T
import System.Random
import Control.Exception
import Control.Monad
import System.IO
import qualified Data.Text.IO as TI
= let (len, g') = randomR (50, 450) g
genString g in T.unfoldrN len rand_text (len, g')
where rand_text (0,_) = Nothing
= let (c, g') = randomR ('a','z') g
rand_text (k,g) in Just (c, ((k-1), g'))
= bracket (openFile file WriteMode) hClose $ \h -> do
writeCorpus file let size = 100000
sequence $ replicate size $ do
<- newStdGen
g let text = genString g
TI.hPutStrLn h text
= do
main putStrLn "generating text corpus"
"test.txt" writeCorpus
The first thing I thought was to switch to ByteString
: since we don’t need to deal with UTF-8
encoding (the strings to generate were plain ASCII strings in the range a-z
) we can avoid paying the overhead Data.Text
introduces. After that, the code was a slightly faster, but nothing thrilling (about 7 seconds
on my machine).
The crucial hint was gave from Gregory Collins, who suggested that System.Random
was very slow. He proposed, as better alternative, the mwc-random package, and the fact Bos was the author was a guarantee. After a bit of struggling I ended up with the following code:
import System.Random.MWC
import Control.Monad
import Control.Monad.Primitive
import System.IO
import Data.ByteString as B
import Data.Word (Word8)
import Data.ByteString.Char8 as CB
{- | Converts a Char to a Word8. Took from MissingH -}
c2w8 :: Char -> Word8
= fromIntegral . fromEnum
c2w8
------------------------------------------------------------------------------
charRangeStart :: Word8
= c2w8 'a'
charRangeStart
------------------------------------------------------------------------------
charRangeEnd :: Word8
= c2w8 'z'
charRangeEnd
------------------------------------------------------------------------------
genString :: Gen (PrimState IO) -> IO B.ByteString
= do
genString g <- uniformR (50 :: Int, 450 :: Int) g
randomLen <- replicateM randomLen $ uniformR (charRangeStart, charRangeEnd) g
str return $ B.pack str
------------------------------------------------------------------------------
writeCorpus :: FilePath -> IO ()
= withFile file WriteMode $ \h -> do
writeCorpus file let size = 100000
$ \gen ->
withSystemRandom $ do
replicateM_ size <- genString gen :: IO B.ByteString
text
CB.hPutStrLn h text
main :: IO ()
= writeCorpus "test.txt" main
The are a couple of interesting points:
As stated inside the docs withSystemRandom
is a somewhat expensive function, and is intended to be called only occasionally (e.g. once per thread). You should use the Gen
it creates to generate many random numbers). What we are doing is confining the Gen
generation is the “inside loop”. This guarantee that we get a brand new Gen
to pass to each invocation of genString
.
I’ve extrapolated c2w8
from the MissingH library from John Goerzen. We need this because uniformR
expects a type which is instance of Variate
, which is defined for Word8
but not for Char
.
As stated on Reddit and in the comments below, unfoldrN
would have been a great alternative to replicateM
to build our random string. Unfortunately, the signature of unfoldrN
reveals a pure nature: unfoldr :: (a -> Maybe (Word8, a)) -> a -> ByteString
, whereas uniformR
returns a monadic computation. In my opinion, albeit slower, replicateM
allows the creation of a new string without any visual clutter, keeping our code clean.
Not only is our code cleaner, but is also a great deal faster! On my machine generating 10000 words takes more or less half a second!
Well, a lot faster than PHP, as expected.