Last time I had lots of success using Control.Applicative with Data.Binary parsers. Now I’m onto parsing XML. I’ve done this a few times in Haskell using HXT and its funky Arrows. However, this time I was keen to see if I could factor out Yet More Code in an applicative stylee … and indeed I can.
Let’s start with an example of a round-trip XML parse-then-print. We’ll need all the imports and the arrow language extension eventually, so let’s get them out of the way now.
{-# LANGUAGE Arrows #-}
import Control.Applicative
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath.Arrows
import Text.Printf
xml = "<menu>\
\ <course name='starter'>Soup</course>\
\ <course name='mains'>Pie</course>\
\ <course name='dessert'>Cake</course>\
\</menu>"
testRoundtrip = runX $ readString [(a_remove_whitespace,v_1)] xml >>> writeDocumentToString []
Now lets parse the xml properly. Firstly, we’ll have a datatype to represent the menu itself:
data Menu = Menu { starter :: String, mains :: String, dessert :: String } deriving (Show)
The parsing code itself starts off by converting the String into a Tree, skipping down over the root node, and then trying to parse out a menu:
testParse = runX $ readString [(a_remove_whitespace,v_1)] xml >>> getChildren >>> parseMenu
Since we want to do multiple things with the “menu” node, we’ll use the special arrow syntax to name things:
parseMenu :: ArrowXml a => a XmlTree Menu
parseMenu = proc menu -> do
starter < - parseCourse "starter" -< menu
mains <- parseCourse "mains" -< menu
dessert <- parseCourse "dessert" -< menu
returnA -< Menu starter mains dessert
parseCourse name = getChildren >>> hasName "course" >>> hasAttrValue "name" (==name) >>> getChildren >>> getText
BTW, whilst the implementation of parseCourse is very combinatorically worthy, in practise I think I’d probably use the following version instead:
parseCourse name = getXPathTrees (printf "//course[@name='%s']/text()" name) >>> getText
Anyhow, looking back at the implementation of parseMenu, something bugs me. I had to give names to “starter”, “mains” and “dessert” despite only ever intending to use them to construct the final return value.
Sounds like a job for Control.Applicative!
However, sadly typing “:i Arrow” into ghci told me that my Arrow types weren’t Functors and therefore I couldn’t immediately perform the < $> and < *> trick to make a one-liner.
Boo!
But, a quick glance at Control.Applicative showed that the WrappedArrow type was a Functor. So, all I need to do is use the WrapArrow constructor to package up my arrows and, hey presto, applicative city here I come!
parseCourse name = WrapArrow $ getXPathTrees (printf "//course[@name='%s']/text()" name) >>> getText
parseMenu = unwrapArrow $ Menu < $>
(parseCourse "starter") < *>
(parseCourse "mains") < *>
(parseCourse "dessert")
Okay, so that’s all nice and concise. But it occurs to me that Control.Applicative is still all about *sequencing* computations. But why does parsing XML require sequencing? Surely it’s pure?
Indeed, this is where some of the flexibility of HXT kicks in. HXT supplies a variety of arrow types to fit your needs. If you want to be able to do some disk IO mid-parse, there’s an ArrowIO for you. If you want to maintain some user state during the parse, ArrowState is your friend. But you have the freedom to choose which one you want to use. If you are NOT performing IO and don’t need state, an ArrowList will do you nicely.
Furthermore, most the HXT combinators are uber-flexible. Their types are minimally constrained, allowing you to choose which arrow you want to work in (ie. an IO arrow, a State arrow or a pure List arrow). For example, the “getChildren” function works with any of those three. The HXT combinators which actually do IO (such as readDocument) understandably have a more specific type.
In the running Menu example, parseCourse is pretty flexible. It works with any arrow in the ArrowXml typeclass, which in turns only requires ArrowList behaviour (and a few others which aren’t relevant here).
Therefore, we can use parseCourse as a pure ArrowList, and do some ArrowList-plumbing to supply the three arguments to the Menu constructor. The end result will be a pure function between XmlTree’s and Menu’s. Furthermore, since we’re not using any effectful arrow combinators, we don’t need to nail down the sequencing.
The arr3 combinator converts a 3-arg function into an arrow whose input is a 3-tuple. Well, actually it’s a tuple that looks like (a,(b,c)). But that’s happily what the ‘plumbing’ combinators which we’ll use to plug in the three input arrows produce. So, without further ado:
parseMenuPure :: LA XmlTree Menu
parseMenuPure = arr3 Menu <<< (parseCourse "starter") &&& (parseCourse "mains") &&& (parseCourse "dessert")
testParsePure = do
[xml] < - runX $ readString [(a_remove_whitespace,v_1)] xml
return $ runLA (getChildren >>> parseMenuPure) xml
So, I’ve come full circle. Arrow syntax looks like monad syntax, and therefore suggests an ordering (I always make this mistake). Control.Applicative provides a concise way to express some common ordering patterns. But not all arrow types have effects which need ordering. If your arrows are pure, you can plumb them directly.
Update: I later realised where my confusion came from. I still had an early version of “testParse” in my file which did the xml reading and the call to parseMenu in one pipeline. This caused type inference to decide that the type of parseMenu was IOSLA (IO + State + List) just like readString is. I somehow assumed this was due to the explicit ordering demanded by the applicative operators – wrong! Once I noticed and removed my original “testParse”, the type of parseMenu reverted back to its natural type of “ArrowXml a => a XmlTree Entry). And that’s the case whether I use the arr3 version, the applicative version or the arrow syntax version. Hmm, so this tells me that I need to play around with Control.Applicative for some non-Monad types … I have an overly specific notion of “sequencing” which is tricking me.