Paginators are Mealy Machines in disguise
Paginators are Mealy Machines in disguise
Summary: At work I needed to stream some data out from a service which returned data in paginated chunks.
Using a very simple data type based on Mealy Machines worked surprisingly well.
One of the aspect I enjoy most of programming is when you have the chance of applying something you have learned in the real world. A couple of weeks ago I needed to create a tool to “garbage collect” old ECR images. Very simply put, ECR stands for EC2 Container Registry and is no more, no less, a private Docker Registry you can use as part of the impressive AWS (Amazon Web Service) toolkit.
ECR works by having a set of repositories, and for each of them you can upload up to 500 Docker images.
If you exceed this limit, you will have either to delete some images to create free space, or contact
the Amazon customer service to dump the upper limit up. My team uses ECR to store the images associated to
the Haskell micro services we deploy using Elastic Beanstalk,
and each time we do a deploy, we create and upload a new versioned image for each micro service, so it comes as no
surprise that space in the repositories is going to finish sooner or later. So the problem is simple: “I have some
images stored in the cloud and I need to delete them”.
Now, we could have used the quite impressive amazonka-ecr to solve
our problem, but historically we have been using the aws-cli even before Amazonka
was around, and sometimes is just super easy to slurp the output of a cli application and to decode that from JSON.
Short story short, this is why we didn’t piggyback on this excellent third-party library, but that’s a bit of an OT.
What’s important is that to solve our problem, we only need two commands from the
aws-cli: list-images
to retrieve all our images, and
batch-delete-image to
delete the ones which meet our criteria (in our case that would be deleting anything older than 2 months).
list-image doesn’t return the whole set of images, as this would be a beefy JSON! What it does instead,
is to return a JSON packet with a Token identifying if we have more data to fetch.
This is a standard pagination technique: other strategies would be, for example, to return the current page
and the total number of pages, so that the user can advance forward or backward.
We can easily model an ECRImage as a simple data type by following the specification on the list-image page:
-- Useful import to use in the rest of the post
import qualified Data.Aeson as JSON
import Data.Aeson.TH
import Data.Functor.Identity
import qualified Data.List as List
import Data.String.Conv
import qualified Data.Text as T
import Shelly
data ECRImage = ECRImage {
imageDigest :: T.Text
, imageTag :: Maybe T.Text
} deriving (Show, Eq)
deriveFromJSON defaultOptions { omitNothingFields = True } ''ECRImage
type NextToken = T.Text
data ECRListImages = ECRListImages {
nextToken :: Maybe NextToken
, imageIds :: [ECRImage]
}
deriveFromJSON defaultOptions { omitNothingFields = True } ''ECRListImagesThe ECRListImages is an umbrella type we define to parse the raw JSON that AWS gives us, which will include the
token AND the data fetched so far. When I approached this problem, I knew two things for sure:
- I didn’t want to fetch the whole dataset into memory, but rather processing it in chunks
- The problem itself was screaming “streaming!”
Although I could have simply written a recursive function which would fetch the current data and the token,
process it, and recur in case we still had data to fetch, that stroke me as a poor solution. Not bacause it
was intrinsically bad, but only because it felt a bit ad-hoc and didn’t compose very well. What if I wanted
to step through the data “one chunk” at the time? What if I wanted to filter each chunk according to a predicate
and retain only a subset of it? Sure, I could extend my function which a predicate to filter on, but that felt
even more ad-hoc. So I took a step backward and wondered if I could come up with a super tiny abstraction to
“step” through the data whilst retaining code reuse and composition. After some failing attempt, I came up with
this small data structure, which I’m calling here ForwardPaginator to stress the fact we cannot iterate
backward (yet), which is something I didn’t need to support anyway:
data ForwardPaginator m i o =
PaginatorLeaf o
| PaginatorFetch (Maybe i -> m (Maybe i, o, ForwardPaginator m i o))A ForwardPaginator effectively models a tree of computations; we can have a leaf, meaning we have just
started our machine and are at step zero, or a fetch step that, given an input i will produce a
triple (newInput, output, paginator), doing (or not) some monadic effect in the process
(thus the m wrapping). Due to the fact we could have exhausted our input, we encapsulate this
possibility in a Maybe, which explains the presence of those Maybe i. We can even define what seems
to be a legal Functor instance for it:
instance Functor m => Functor (ForwardPaginator m i) where
fmap f (PaginatorLeaf a) = PaginatorLeaf (f a)
fmap f (PaginatorFetch g) = PaginatorFetch $ \nextToken -> (\(x,y,z) -> (x, f y, fmap f z)) <$> g nextToken(Note to the reader: I have the intuition we should be able to define a Contravariant instance for our
ForwardPaginator, but I’m not 100% sure as i appears both in positive and negative position. A Profunctor
even? Please comment below or on Reddit if you think this is possible, I simply haven’t tried yet.)
If you squint hard, you will recognise that what we have in the PaginatorFetch step is essentially Mealy machine!
This is not very surprising;
Neil Mitchell used it in Shake
only to discover his data structure was indeed a Mealy machine and his definition was almost verbatim to the
one included in the machine package.
What I find cool is that both me and Neil went through the same creative process; we modeled our solution using an
abstraction we later found out be something already present in literature! I find both depressing and invigorating to
discover that your clever idea is something someone thought about a long time before you! Oh well, at least that
gave me the confidence I was on the right track. Incidentally,
Ollie blogged
in 2013 about FRP and Netwire, and guess what his Auto type looks like ;)
The reader might be thinking by now “Ok, but what can you do with this?” A Mealy machine is something very simple at
its heart, and copying its definition from Wikipedia “…is a finite-state machine whose output values are
determined both by its current state and the current inputs.[…]”. Simply put, we can use the current state and
the current input(s) to decide where to go next (which could be advance the machine or stop altogether). To be
completely honest with you, whilst writing this blog post, I was on the fence about considering what’s inside a
PaginatorFetch a Mealy or a Moore machine, as it resemble a bit of both, but I eventually settle on the former,
as effectively it’s the input (the “token”) which determines if we can step further or not.
Armed with our ForwardPaginator, let’s generalise it to our domain problem:
type NextToken = T.Text
type Repository = T.Text -- Will use this later
type ECRPaginator m a = ForwardPaginator m NextToken aNow, let’s get the elephant out of the room and let me give you the (rather) uninteresting definition of our ECR paginator. I personally think that the semantic of the data structure and the operations we can perform of it are much more interesting, but I wanted to post a “real world” paginator just to prove this stuff can also pay the bills ;)
ecrListImagesPaginated :: Repository -> ECRPaginator Sh [ECRImage]
ecrListImagesPaginated repo = PaginatorFetch $ \_ -> do
initialState <- run "aws" cmd
case JSON.eitherDecode (toS initialState) of
Left ex -> yieldZero ex
Right (ECRListImages Nothing items) -> return (Nothing, items, PaginatorLeaf mempty)
Right (ECRListImages mbToken items) -> return (mbToken, items, fetch)
where
yieldZero ex = do
echo "aws ecr list-images failed to decode to valid JSON. Error was: "
echo (toS . show $ ex)
return (Nothing, mempty, PaginatorLeaf mempty)
cmd = [ "ecr"
, "list-images"
, "--region"
, "eu-west-1"
, "--repository-name"
, repo
]
fetch :: ECRPaginator Sh [ECRImage]
fetch = PaginatorFetch $ \token -> case token of
Nothing -> return (Nothing, mempty, PaginatorLeaf mempty)
Just t -> do
rawJson <- run "aws" cmd
case JSON.eitherDecode (toS rawJson) of
Left ex -> yieldZero ex
Right (ECRListImages mbToken items) -> return (mbToken, items, fetch)The caveat here is that we need to repeat the call to aws ecr twice as we need to call it at least once to
acquire a valid token, so that externally we will be able to pass Nothing to our paginator to start it. I have
chosen Sh as my monad of choice (from the shelly package), so that
I can run bash commands easily.
Now the fun begins! What we can do with this paginator and more generally with a ForwardPaginator?
The first operation we can think of is effectively “stepping” the paginator, and implementing this function is not very hard:
next :: Monad m
=> ForwardPaginator m i a
-> Maybe i
-- ^ The initial input state to use.
-> m (Maybe i, a, ForwardPaginator m i a)
next (PaginatorLeaf i) _ = return (Nothing, i, PaginatorLeaf i)
next (PaginatorFetch cont) tkn = cont tknNote how this function is completely generic in terms of m, i and a, apart from the Monad constraint,
which means I can “step” arbitrary paginators – talk about code reuse! Another thing we might want is to be
evil and fold all the data returned from the paginator into a giant collection. Not hard as well:
foldPaginator :: (Monad m, Monoid a) => ForwardPaginator m i a -> Maybe i -> a -> m a
foldPaginator (PaginatorLeaf items) _ acc = return (items `mappend` acc)
foldPaginator (PaginatorFetch cont) tkn acc = do
(t', acc', res) <- cont tkn
case res of
leaf@(PaginatorLeaf _) -> foldPaginator leaf Nothing acc
nextFetch -> foldPaginator nextFetch t' acc'Again, the only constrain is that our accumulator must be a Monoid, so that we can effectively
concatenate all the results together. This would also effectively allow us to return all the ECRImage(s)
at once, but beware that this would load them into memory – not recommended for your production services!
ecrListImages :: Repository -> IO [ECRImage]
ecrListImages repo = shelly $ foldPaginator (ecrListImagesPaginated repo) Nothing memptySomething nice we can do with a ForwardPaginator is being able to find a particular element matching
a predicate, short-circuiting our paginator as soon as we find a match, in order to avoid work and more
generally expensive calls to external services:
findPaginator :: Monad m
=> ForwardPaginator m i [a]
-> Maybe i
-> (a -> Bool)
-> m (Maybe a)
findPaginator (PaginatorLeaf v) _ prd = return $ List.find prd v
findPaginator (PaginatorFetch cont) tkn prd = do
(t',items,cont') <- cont tkn
case List.find prd items of
Just i -> return $ Just i
Nothing -> findPaginator cont' t' prdPure or impure? Pick your monad!
To wrap up this blog post, I also wanted to show you how we are not bounded to use a “impure” monad for
our ForwardPaginator: we could use something like Identity, State, Reader and so on and so forth. As
an example, we will create a ForwardPaginator which can be built out of a pure function (full disclosure: fib,
the classic, hehe) and everything will be pure to please the Haskell gods. Let’s start by defining both our
pure function and the associated paginator:
fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)
fibPaginator :: ForwardPaginator Identity Int Integer
fibPaginator = PaginatorFetch $ \continue -> case continue of
Nothing -> return (Just 1, fib 0, fibPaginator)
Just i -> return (Just $ i + 1, fib i, fibPaginator)The slight twist is that in case we have no initial input, we return the base case of the recursion, otherwise
we iterate in an infinite fashion, exactly like the original fib function. Now we can easily step the paginator
using next to get one result at time, or create a convenient take function to get values out our infinite
stream:
takePaginator :: (Monad m) => ForwardPaginator m i a -> Maybe i -> Int -> m [a]
takePaginator (PaginatorLeaf v) _ _ = return [v]
takePaginator (PaginatorFetch cont) tkn n
| n <= 0 = return []
| otherwise = do
(newToken, o, newPaginator) <- cont tkn
(o :) <$> takePaginator newPaginator newToken (n - 1)Using it is simple enough:
ghci> runIdentity $ takePaginator fibPaginator Nothing 10
[0,1,1,2,3,5,8,13,21,34]As a bonus, as ForwardPaginator is a functor, we can easily map a function on the output values as we stream them:
ghci> runIdentity $ takePaginator ((*2) <$> fibPaginator) Nothing 10
[0,2,2,4,6,10,16,26,42,68]Stepping backwards
A bit of a pet peeve the reader migth have with this paginator is that is lacks the ability to step
backward, and that would certainly be a valid concern. I still think though that adding the ability to
iterate backward should be possible provided that we create a function back which bound the paginator
monad to be a MonadState (Maybe i), so that we can store the previous token and go backward and forward
as we please. Something like this, for example:
prev :: MonadState (Maybe i) m
=> ForwardPaginator m i a
-> Maybe i
-- ^ The initial input state to use.
-> m (Maybe i, Maybe a, ForwardPaginator m i a)I think we need to yield a Maybe a as output in case we want to step backward but we are already at the first
“page”: in that case, we should yield no result. Maybe, if the readers are interested, I could explore this
possibility in a subsequent blog post, which should effectively give us a Paginator worth its name, to be
used in each scenario which requires bidirectional pagination.
Conclusions
The ideas presented here are very simple but at the same time quite effective; They allowed me to solve my
original problem in a nice compact way. Using findPaginator and the Functor instance I was able to first
stop as soon as the current result set contained values I was interested in, and I was able to “zoom” only on
pieces of the ECRImage data structure to extract things like the ImageDigest. So, next time you need to
implement some form of pagination, remember you have the arsenal of Mealy and Moore Machines at your disposal:
it’s not surprising they are called stream transducers!