Parsing and the use of GADTs

I've ran into a problem while writing a parser. Specifically, I want to be return values of different types. For example, I have two different data types FA and PA to represent two different lipid classes -

data FA = ClassLevelFA IntegerMass
        | FA           CarbonChain
        deriving (Show, Eq, Ord)

data PA   = ClassLevelPA       IntegerMass
          | CombinedRadylsPA   TwoCombinedRadyls
          | UnknownSnPA        Radyl Radyl
          | KnownSnPA          Radyl Radyl
          deriving (Show, Eq, Ord)

Using attoparsec, I have built parsers to parse lipid shorthand notation. For the data types above, I have the parsers faParser and paParser. I would like to be able to parse for either an FA or PA. However, since FA and PA are different data types, I cannot do the following -

inputParser =  faParser
           <|> paParser

I have recently learnt about GADTs and I thought this would solve my problem. Consequently I went and made a GADT and an eval function and changed the parsers faParser and paParser. -

data ParsedLipid a where
  ParsedFA :: FA -> ParsedLipid FA
  ParsedPA :: PA -> ParsedLipid PA

eval :: ParsedLipid a -> a
eval (ParsedFA val) = val
eval (ParsedPA val) = val

This gets me close but it appears as if ParsedFA and ParsedPA are different data types? For example, parsing "PA 17:1_18:1" gives me a value of type ParsedLipid PA (not just ParsedLipid as I was expecting). Therefore, the parser inputParser still does not type check.

let lipid = use "PA 17:1_18:1"
*Main> :t lipid
lipid :: ParsedLipid PA

Any advice on how to get around this problem?

回答1:

@MathematicalOrchid points out that you probably don't need GADTs and a plain sum type might be enough. You may have an XY problem but I don't know enough about your use case to make a firm judgment. This answer is about how to turn untyped data into a GADT.

As you note, your parsing function can't return a ParsedLipid a because that leaves the calling context free to choose a which doesn't make sense; a is in fact determined by the input data. And you can't return a ParsedLipid FA or a ParsedLipid PA, because the input data may be of either type.

Therefore, the standard trick when building a GADT from runtime data - when you don't know the type of the index in advance - is to use existential quantification.

data AParsedLipid = forall a. AParsedLipid (ParsedLipid a)

The type parameter a appears on the right-hand side of AParsedLipid but not on the left. A value of AParsedLipid is guaranteed to contain a well-formed ParsedLipid, but its precise type is kept secret. An existential type is a wrapper which helps us to translate from the messy, untyped real world to a clean, strongly-typed GADT.

It's a good idea to keep the existentially quantified wrappers pushed to the edges of your system, where you have to communicate with the outside world. You can write functions with signatures like ParsedLipid a -> a in your core model and apply them to data under an existential wrapper at the edges. You validate your input, wrap it in an existential type, and then process it safely using your strongly-typed model - which doesn't have to worry about errors, because you already checked your input.

You can unpack an AParsedLipid to get your ParsedLipid back, and pattern-match on that to determine at runtime what a was - it'll either be FA or PA.

showFA :: FA -> String
showFA = ...
showPA :: PA -> String
showPA = ...

showLipid :: AParsedLipid -> String
showLipid (AParsedLipid (ParsedFA x)) = "AParsedLipid (ParsedFA "++ showFA x ++")"
showLipid (AParsedLipid (ParsedPA x)) = "AParsedLipid (ParsedPA "++ showPA x ++")"

You'll note that a can't appear in the return type of a function taking an AParsedLipid either, for the reasons outlined above. The return type must be known to the compiler; this technique does not allow you to define a "function with different return types".

When you're constructing an AParsedLipid, you have to generate enough evidence to convince the compiler that the wrapped ParsedLipid is well-formed. In your example, this comes down to parsing a well-typed PA or FA and then wrapping it up.

parser :: Parser AParsedLipid
parser = AParsedLipid <$> (fmap ParsedFA faParser <|> fmap ParsedPA paParser)

GADTs are a bit awkward when used with runtime data. The existential wrapper effectively erases the extra compile-time information in a ParsedLipid - AParsedLipid is isomorphic to Either FA PA. (Proving that isomorphism in code is a good exercise.) In my experience GADTs are much better at structuring programs than at structuring data - they excel at implementing strongly-typed embedded domain-specific languages for which the types' indices can be known at compile time. For example, Yampa and extensible-effects both use a GADT as their central data type. This helps the compiler to check that you're using the domain-specific language correctly in the code you write (and in some cases allows certain optimisations). It's pretty unlikely that you'll be building FRP networks or monadic effects at runtime from real-world data.

回答2:

What exactly do you want?

If you know at compile time whether you want an FA or a PA, then GADTs are a good way to do that.

If you want to decide at run-time to parse either an FA or a PA, you could use... Either FA PA.