Extracting from .bib file with Perl 6

Posted 2019-03-12 13:18

Question:

I have this .bib file for reference management while writing my thesis in LaTeX:

@article{garg2017patch,
  title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
  author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
  journal={Journal of Cosmetic Dermatology},
  year={2017},
  publisher={Wiley Online Library}
}

@article{hauso2008neuroendocrine,
  title={Neuroendocrine tumor epidemiology},
  author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
  journal={Cancer},
  volume={113},
  number={10},
  pages={2655--2664},
  year={2008},
  publisher={Wiley Online Library}
}

@article{siperstein1997laparoscopic,
  title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
  author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
  journal={Surgery},
  volume={122},
  number={6},
  pages={1147--1155},
  year={1997},
  publisher={Elsevier}
}

If anyone wants to know what a .bib file is, it is described in detail here.

I'd like to parse this with Perl 6 to extract the key along with the title like this:

garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study

hauso2008neuroendocrine: Neuroendocrine tumor epidemiology

siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases

Can you please help me to do this, maybe in two ways:

  1. Using basic Perl 6
  2. Using a Perl 6 Grammar

Answer 1:

This answer is aimed at being both:

  • An introductory general answer to "I want to parse X with Perl 6. Can anyone help?"

  • A complete and detailed answer that does exactly as @Suman asks.


In a single statement (power user)

"$_[0]: $_[1]\n" .put
  for (slurp 'derm.bib')
    ~~ m:g/ '@article{' (<-[,]>+) ',' \s+ 'title={' ~ '}' (<-[}]>+) /

(Run this code in tio.run.)

I decided to start with the sort of thing a dev familiar with P6 would write in a few minutes to do just the simple task you've specified in your question if they didn't much care about readability for newbies.

I'm not going to provide an explanation of it. It just does the job. If you're a P6 newbie it could well be overwhelming. If so, please read the rest of my answer -- it takes things slower and has comprehensive commentary. Perhaps return here and see if it makes more sense after reading the rest.

A "basic Perl 6" solution

my \input      = slurp 'derm.bib' ;

my \pattern    = rule { '@article{'       ( <-[,]>+ ) ','
                          'title={' ~ '}' ( <-[}]>+ ) }

my \articles   = input.match: pattern, :global ;

for articles -> $/ { print "$0: $1\n\n" }

This is almost identical to the "single statement (power user)" code -- broken into four statements rather than one. I could have made it copy the first version more closely, but I've instead made a few changes that I'll explain. I've done this to make it clearer that P6 deliberately makes its features a scalable and refactorable continuum, so one can mix and, er, match whichever features best fit a given use case.

my \input      = slurp 'derm.bib' ;

Perls are famous for their sigils. In P6, if you don't need them you can "slash" them out. Perls are also famous for having terse ways of doing things. slurp reads a file in its entirety in one go.

my \pattern    = rule { '@article{'       ( <-[,]>+ ) ','
                          'title={' ~ '}' ( <-[}]>+ ) }

Perl 6 patterns are generically called regexes or Rules. There are several types of regexes/rules. The pattern language is the same; the distinct types just direct the matching engine to modify how it handles a given pattern.

One regex/rule type is the P6 equivalent of classic regexes. These are declared with either /.../ or regex {...}. The regex in the opening "power user" code was one of these regexes. Their distinction is that they backtrack when necessary, just like classic regexes.

There's no need for backtracking to match the .bib format. Unless you need backtracking, it's wise to consider using one of the other rule types instead. I've switched to a rule declared with the keyword rule.

A rule declared with rule is identical to one declared with regex (or /.../) except that A) it doesn't backtrack and B) it interprets spaces in its pattern as corresponding to possible spaces in the input. Did you spot that I'd dropped the \s+ from the pattern immediately before 'title={'? That's because a rule takes care of that automatically.
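Here is a minimal sketch of those two differences. It uses a made-up one-line string rather than the question's file, and the variable names are purely illustrative:

```perl6
# Made-up input for illustration; not taken from derm.bib.
my $text = "key,   title";

my $as-regex = regex { 'key,' \s+ 'title' };  # whitespace matched explicitly
my $as-rule  = rule  { 'key,' 'title' };      # space between atoms matches whitespace

say so $text ~~ $as-regex;   # True
say so $text ~~ $as-rule;    # True

# No backtracking: \w+ grabs "abc" and a rule or token never hands the "c" back.
say so 'abc' ~~ regex { \w+ c };   # True
say so 'abc' ~~ token { \w+ c };   # False
```

The last two lines are the reason to prefer a rule or token when the format doesn't need backtracking: a failing match gives up quickly instead of retrying shorter alternatives.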

The other difference is that I wrote:

'title={' ~ '}' ( ... )

instead of:

'title={' ( ... ) '}'

i.e. moving the pattern matching the bit between the braces after the braces and putting a ~ in between the braces instead. They match the same overall pattern. I could have written things either way in the power user /.../ pattern and either way in this section's rule pattern. But I wanted this section to be a bit more "best practice" oriented. I'll defer a full explanation of this difference and all the other details of this pattern until the Explanation of 'bib' grammar section below.
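To see that the two spellings capture the same thing, here is a small sketch run against a single line copied from the question's sample (the variable names are mine):

```perl6
my $line = 'title={Neuroendocrine tumor epidemiology}';

# Closing delimiter written last ...
my $plain = $line ~~ / 'title={' (<-[}]>+) '}' /;
# ... versus the ~ form: opener ~ closer, then the inner pattern.
my $tilde = $line ~~ / 'title={' ~ '}' (<-[}]>+) /;

say $plain[0].Str;        # Neuroendocrine tumor epidemiology
say $tilde[0].Str;        # Neuroendocrine tumor epidemiology
say ~$plain eq ~$tilde;   # True
```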

my \articles   = input.match: pattern, :global ;

This line uses the method form of the m routine used in the earlier "power user" version.

:global is the same as :g. I could have written it either way in both versions.

Add :global (or :g) to the argument list when invoking the .match method (or m routine) if you want to search the entire string being matched, finding as many matches as there are, not just the first. The method (or m routine) then returns a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
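A short sketch of the difference, using a made-up string of year fields so that it runs without any file on disk:

```perl6
# Made-up input; the years are the ones from the question's three entries.
my $text = 'year={2017} year={2008} year={1997}';

my $first = $text.match: / \d+ /;            # a single Match
my @all   = $text.match: / \d+ /, :global;   # a list of Match objects

say $first.Str;                   # 2017
say @all.elems;                   # 3
say @all.map(*.Str).join(',');    # 2017,2008,1997

# The m routine with :g behaves the same way.
say ($text ~~ m:g/ \d+ /).elems;  # 3
```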

for articles -> $/ { print "$0: $1\n\n" }

Per P6 doc on $/, "$/ is the match variable ... so usually contains objects of type Match." It also provides some other conveniences, and we take advantage of one of them here, as explained next.

The for loop successively binds each of the overall Match objects (corresponding to each of the articles in your sample file that were successfully matched by the pattern) to the symbol $/ inside the for's block.

The pattern contains two pairs of parentheses. These generate "Positional captures". The overall Match object provides access to its two Positional captures via Positional subscripting (postfix []). Thus, within the for block, $/[0] and $/[1] provide access to the two Positional captures for a given article. But so do $0 and $1 -- because standard P6 aliases these latter symbols to $/[0] and $/[1] for your convenience.
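The aliasing can be checked with a self-contained sketch. It feeds one entry from the question (shortened to two fields, inlined as a heredoc instead of being read from derm.bib) to the same pattern:

```perl6
# One entry from the question, shortened and inlined so no file is needed.
my $entry = q:to/END/;
    @article{hauso2008neuroendocrine,
      title={Neuroendocrine tumor epidemiology},
      journal={Cancer}
    }
    END

my $pattern = rule { '@article{'       ( <-[,]>+ ) ','
                       'title={' ~ '}' ( <-[}]>+ ) };

for $entry.match($pattern, :global) -> $/ {
    say $/[0].Str;   # explicit Positional subscript on the Match
    say $0.Str;      # $0 is the same capture as $/[0]
    say "$0: $1";    # both captures interpolated into a string
}
```

The first two lines of output should be identical (the entry's key), and the third is the key: title line the question asks for.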


Still with me?

The latter half of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above.

But first...

A "boring" practical answer

I want to parse this with Perl 6. Can anyone help?

P6 may make writing parsers less tedious than with other tools. But less tedious is still tedious. And P6 parsing is currently slow.

In most cases, the practical answer when you want to parse anything beyond the most trivial of file formats -- especially a well known format that's several decades old -- is to find and use an existing parser.

You might start with a search for 'bib' on modules.perl6.org in the hope of finding a publicly shared 'bib' parsing module. Either a pure Perl 6 one or some P6 wrapper around a non-P6 library. But at the time of writing this there are no matches for 'bib'.

There's almost certainly a 'bib' parsing C library already available. And it's likely to be the fastest solution. It's also likely that you can easily and elegantly use an external parsing library packaged as a C lib, in your own P6 code, even if you don't know C. If NativeCall is either too much or too little explanation, consider visiting the freenode IRC channel #perl6 and asking for whatever NativeCall help you need or want.

If a C lib isn't right for a particular use case then you can probably still use packages written in Perl 5, Python, Ruby, Lua, etc. via their Inline::* language adapters. Just install the Perl 5, Python or whatever package that you want; make sure it runs using that other language; install the appropriate language adapter; then use the package and its features as if it were a P6 package containing exported P6 functions, classes, objects, values, etc.

The Perl 5 adapter is the most mature so I'll use that as an example. Let's say you use Perl 5's Text::BibTeX packages and now wish to use Perl 6 with the existing Text::BibTeX::BibFormat module from Perl 5. First, set up the Perl 5 packages as they are supposed to be per their READMEs etc. Then, in Perl 6, write something like:

use Text::BibTeX::BibFormat:from<Perl5>;
...
@blocks = $entry.format;

The first line is how you tell P6 that you wish to load a P5 module. (It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Rakudo Perl 6 bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)

The last line is just a mechanical P6 translation of the @blocks = $entry->format; line from the SYNOPSIS of the Perl 5 Text::BibTeX::BibFormat.

Creating a P6 grammar / parser

OK. Enough "boring" practical advice. Let's now try to have some fun creating a P6 parser good enough for the example from your question.

# use Grammar::Tracer;

grammar bib {

    rule TOP           { <article>* }

    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }

    rule kv-pairs      { <kv-pair>* % ',' }

    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

}

With this grammar in place, we can now write something like:

die "Maybe use Grammar::Tracer?" unless bib.parsefile: 'derm.bib';

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

to generate exactly the same output as with the earlier "power user" and "basic Perl 6" solutions -- but using a grammar / parser approach.

Explanation of 'bib' grammar

# use Grammar::Tracer;

If a parse fails, the return value is Nil. P6 won't tell you how far it got. You'll have zero clue why your parse failed.

If you don't have a better option (?), then, when your grammar fails, use Grammar::Tracer to help debug (installing it first if you don't already have it installed).
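This is why the earlier code guards the parse with die ... unless. A tiny sketch with a hypothetical one-token grammar (Year is my own name, not part of the bib grammar) shows how little a failed parse tells you:

```perl6
# Hypothetical grammar, just big enough to succeed or fail.
grammar Year {
    token TOP { 'year={' \d+ '}' }
}

say so Year.parse('year={2017}');          # True:  a Match object came back
say so Year.parse('year={MMXVII}');        # False: nothing usable came back
say Year.parse('year={MMXVII}').defined;   # False
```

Nothing in the failed result says where or why the match stopped, which is the gap Grammar::Tracer fills.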

grammar bib {

The grammar keyword is like class, but a grammar can contain not just named methods as usual but also named regexes, tokens, and rules.

    rule TOP           {

Unless you specify otherwise, parsing routines start out by calling the rule (or token, regex, or method) named TOP.

As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't backtrack so they eliminate the risk of unnecessarily running slowly due to backtracking.)

But in this case I've used a rule. Like token patterns, rules also avoid backtracking. But in addition they take whitespace following any atom in the pattern to be significant in a natural manner. This is typically appropriate towards the top of the parse tree. (Tokens, and the occasional regex, are typically appropriate towards the leaves.)

    rule TOP           { <article>* }

The space at the end of the rule means the grammar will match any amount of whitespace at the end of the input.

<article> invokes another named rule (or token/regex/method) in this grammar.

Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.

    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }

I sometimes lay rules out to resemble the way typical input looks. I tried to do so here.

<[...]> is the P6 syntax for a character class, like [...] in traditional regex syntax. It's more powerful but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in ye olde [^,] syntax. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
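A quick sketch of a negated and a positive character class, applied to the first line of one of the question's entries:

```perl6
my $header = '@article{garg2017patch,';

# <-[,]>+ : one or more characters that are not a comma
say ($header ~~ / '{' (<-[,]>+) /)[0].Str;   # garg2017patch

# <[0..9]>+ : a positive class with a range, for comparison
say ($header ~~ / <[0..9]>+ /).Str;          # 2017
```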

$<id>=<-[,]>+ tells P6 to attempt to match the quantified atom on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key 'id' within the current Match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.

    rule kv-pairs      { <kv-pair>* % ',' }

This regex code illustrates one of several convenient P6 regex features. It says you want to match zero or more kv-pairs separated by commas.

(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
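The separator behaviour is easiest to see outside the grammar. In this sketch the field names are just sample strings, and the %% variant (which additionally tolerates a trailing separator) is my own addition; the grammar above doesn't use it:

```perl6
# One or more words, separated by commas, anchored to the whole string.
my $list = rx/ ^ [\w+]+ % ',' $ /;

say so 'title,author,year'  ~~ $list;                      # True
say so 'title,author,year,' ~~ $list;                      # False: trailing comma
say so 'title,author,year,' ~~ rx/ ^ [\w+]+ %% ',' $ /;    # True
```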

    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be much clearer.

An explanation of the parse tree's construction/deconstruction

The $<article> and .<id> etc. bits in the last line (for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }) refer to Match objects that are stored in the parse tree that's generated and returned from a successful parse.

Returning to the top of the grammar:

    rule TOP           {

If a parse is successful, a single 'TOP' level Match object, the one corresponding to the top of the parse tree, is returned. (It's also made available to code immediately following the parse method call via the variable $/.)

But before that final return from parsing happens, many other Match objects, representing sub parts of the overall parse, will have been generated and added to the parse tree. Addition of Match objects to a parse tree is done by assigning either a single generated Match object, or a list of them, to either a Positional or Associative element of a "parent" Match object, as explained next.

    rule TOP           { <article>* }

A rule invocation like <article> has two effects. First, P6 tries to match the rule. Second, if it matches, P6 generates a corresponding Match object and adds it to the parse tree.

If the successfully matched pattern had been just <article>, rather than <article>*, then only one match would have been attempted and only one value, a single Match object, would have been generated and added to the parse tree.

But the pattern was <article>*, not merely <article>. So P6 attempts to match the article rule multiple times. If it matches at least once then it generates and stores a corresponding list of one or more Match objects. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)

So a list of Match objects is assigned to the 'article' key of the TOP level Match object. (If the matching regex had been just <article> rather than <article>* then a match would result in just a single Match object being assigned to the 'article' key rather than a list of them.)
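The single-versus-list distinction can be observed directly. The two grammars below are hypothetical stand-ins (Single and Many are my names), differing only in whether <word> is quantified:

```perl6
# Hypothetical grammars that differ only in the quantifier on <word>.
grammar Single {
    token TOP  { <word> }
    token word { \w+ }
}
grammar Many {
    token TOP  { <word>* % ' ' }
    token word { \w+ }
}

my $one  = Single.parse('alpha');
my $many = Many.parse('alpha beta gamma');

say $one<word> ~~ Match;    # True:  a single Match object
say $many<word> ~~ Match;   # False: a list of them instead
say $many<word>.elems;      # 3
say $many<word>[1].Str;     # beta
```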

So now I'll try to explain the $<article> part of the last line of code, which was:

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

$<article> is short for $/.<article>.

Per P6 doc on $/, "$/ is the match variable. It stores the result of the last Regex match and so usually contains objects of type Match."

The last Regex match in our case was the TOP rule from the bib grammar.

So $<article> is the value under the 'article' key of the TOP level Match object returned by the parse. This value is a list of 3 'article' level Match objects.

    rule article       { '@article{' $<id>=<-[,]>+ ','

The article regex in turn contains $<id> on the left side of an assignment. This corresponds to assigning a Match object to a new 'id' key added to the article level Match object.

Hopefully this is enough (perhaps too much!) and I can now explain the last line of code, which, once again, was:

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

The for iterates over the list of 3 Match objects (corresponding to the 3 articles in the input) that were generated during the parse and stored under the 'article' key of the TOP level Match object.

(This iteration automatically assigns each of these three sub Match objects to $_, aka "it" or "the topic", and then, after each assignment, does the code in the block ({ ... }). The code in the block will typically refer, either explicitly or implicitly, to $_.)

The .<id> bit in the block is equivalent to $_.<id>, i.e. it implicitly refers to $_. As just explained, $_ is the article level Match object being processed this time around the for loop. The <id> bit means .<id> returns the Match object stored under the 'id' key of the article level Match object.

Finally, the .<kv-pairs><kv-pair>[0]<value> bit is easiest to read left to right, starting from an article level Match object: <kv-pairs> selects the Match object corresponding to the kv-pairs rule; <kv-pair> selects the list of Match objects stored under that object's 'kv-pair' key; [0] selects the first (0th) element of that list; and <value> selects the Match object stored under that element's 'value' key.
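To poke at the tree yourself without needing derm.bib, here is a sketch that runs the same bib grammar over a single inlined entry from the question (shortened to three fields) and walks the path one step at a time. The $tree and $article names are mine:

```perl6
grammar bib {
    rule TOP      { <article>* }
    rule article  { '@article{' $<id>=<-[,]>+ ','
                       <kv-pairs>
                    '}'
    }
    rule kv-pairs { <kv-pair>* % ',' }
    rule kv-pair  { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
}

# One shortened entry from the question, inlined so no file is needed.
my $input = q:to/END/;
    @article{hauso2008neuroendocrine,
      title={Neuroendocrine tumor epidemiology},
      journal={Cancer},
      year={2008}
    }
    END

my $tree = bib.parse($input) or die 'parse failed; maybe use Grammar::Tracer?';

my $article = $tree<article>[0];                  # first article level Match
say $article<id>.Str;                             # hauso2008neuroendocrine
say $article<kv-pairs><kv-pair>.elems;            # number of key/value pairs found
say $article<kv-pairs><kv-pair>[0]<value>.Str;    # Neuroendocrine tumor epidemiology
```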

Phew!

When the automatically generated parse tree isn't what you want

As if all the above were not enough, I need to mention one more thing.

The parse tree strongly reflects the implicit tree structure of the grammar. But getting this structure as a result of a parse is sometimes inconvenient -- one may want a different tree structure instead, perhaps a much simpler tree, perhaps some non-tree data structure.

The primary mechanism for generating exactly what you want from a parse when the automatic results aren't suitable is use of make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)

In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree.

Finally, the primary use case for these sparse trees is storing an AST.
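As a rough sketch of make with a separate actions class, here is a deliberately tiny, hypothetical grammar (KV and KV-Actions are my names; it is not the bib grammar above) that turns a run of key={value} fields into a plain Hash instead of a tree of Match objects:

```perl6
# Hypothetical mini-grammar: comma-separated key={value} fields.
grammar KV {
    token TOP  { <pair>+ % ',' }
    token pair { $<key>=\w+ '={' $<value>=<-[}]>* '}' }
}

class KV-Actions {
    # Attach a Pair of plain strings to each matched pair.
    method pair($/) { make (~$<key> => ~$<value>) }
    # Collect the Pairs made by the children into one Hash.
    method TOP($/)  { my %fields = $<pair>.map(*.made); make %fields }
}

my $made = KV.parse('title={Neuroendocrine tumor epidemiology},year={2008}',
                    actions => KV-Actions).made;

say $made<title>;   # Neuroendocrine tumor epidemiology
say $made<year>;    # 2008
```

Each action method has the same name as the rule it decorates; .made retrieves whatever that method passed to make.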