I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently.
Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical entries have little to no effect on performance.
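For reference, the setup looks roughly like the following sketch (a made-up toy grammar, assuming NLTK 3's feature-grammar API; the real grammar and lexicon are of course far larger):

```python
import nltk

# Toy 1-level feature-structure CFG, for illustration only; the real
# grammar has ~450 rules plus a 500k-entry lexicon.
grammar = nltk.grammar.FeatureGrammar.fromstring(r"""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n] NP
Det[NUM=sg] -> 'this' | 'the'
Det[NUM=pl] -> 'these' | 'the'
N[NUM=sg] -> 'dog' | 'cat'
N[NUM=pl] -> 'dogs' | 'cats'
V[NUM=sg] -> 'chases'
V[NUM=pl] -> 'chase'
""")

# This is the kind of call whose running time varies from 2 to 30 seconds
# on the full grammar.
parser = nltk.parse.FeatureChartParser(grammar)
for tree in parser.parse("this dog chases these cats".split()):
    print(tree)
```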
Another problem is that loading the (25MB) grammar+lexicon at the beginning can take up to a minute.
From what I can find in the literature, the running time of the algorithms used to parse such a grammar (Earley or CKY) should be linear in the size of the grammar and cubic in the length of the input token list. My experience with NLTK indicates that ambiguity is what hurts performance most, not the absolute size of the grammar.
So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY, but I can't tell whether it supports feature structures in CFGs, which are required in my case, and the examples I've seen seem to do a lot of procedural parsing rather than just specifying a grammar (roughly the style sketched below). Can anybody show me an example of PLY both supporting feature structures and using a declarative grammar?
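To illustrate what I mean by procedural parsing, the PLY examples I've come across look roughly like this toy (invented for illustration, not from my project): the grammar rules are docstrings, but tokenizing and anything like feature or agreement handling would be hand-written code in the rule actions.

```python
# Illustrative toy only: PLY grammar rules are docstrings on action
# functions; there is no built-in notion of feature structures, so any
# agreement checking has to be coded by hand inside these actions.
import ply.lex as lex
import ply.yacc as yacc

tokens = ("DET", "NOUN", "VERB")

def t_DET(t):
    r"(this|these|the)\b"
    return t

def t_NOUN(t):
    r"(dogs?|cats?)\b"
    return t

def t_VERB(t):
    r"(chases?|sees?)\b"
    return t

t_ignore = " \t"

def t_error(t):
    t.lexer.skip(1)

def p_sentence(p):
    "sentence : np vp"
    p[0] = ("S", p[1], p[2])

def p_np(p):
    "np : DET NOUN"
    # a feature check (e.g. number agreement) would have to go here, by hand
    p[0] = ("NP", p[1], p[2])

def p_vp_transitive(p):
    "vp : VERB np"
    p[0] = ("VP", p[1], p[2])

def p_vp_intransitive(p):
    "vp : VERB"
    p[0] = ("VP", p[1])

def p_error(p):
    print("syntax error at", p)

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("this dog chases these cats", lexer=lexer))
```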
I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.
Somewhat late on this, but here are two more options for you:
Spark is an Earley parser written in Python.
Elkhound is a GLR parser written in C++. Elkhound uses a Bison-like syntax.
I think ANTLR is the best parser generator that I know of for Java. I don't know whether Jython would provide you with a good way for Python and Java to interact.
I've used pyparsing for limited vocabulary command parsing, but here is a little framework on top of pyparsing that addresses your posted example:
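(A minimal sketch of the idea, with a toy vocabulary invented for illustration rather than the original framework; number agreement is enforced by splitting each rule into singular and plural variants, and the output after "Prints:" is what this sketch produces.)

```python
from pyparsing import oneOf, Group, ParseException

# Toy vocabulary, invented for illustration
det_sg  = oneOf("a this the")
det_pl  = oneOf("these the")
noun_sg = oneOf("dog cat")
noun_pl = oneOf("dogs cats")
verb_sg = oneOf("chases sees")
verb_pl = oneOf("chase see")

# Encode the NUM feature by splitting each rule into singular and plural
# variants, so agreement is enforced by the grammar itself
np_sg = Group(det_sg + noun_sg)
np_pl = Group(det_pl + noun_pl)
np    = np_sg | np_pl

vp_sg = Group(verb_sg + np)
vp_pl = Group(verb_pl + np)

sentence = (np_sg + vp_sg) | (np_pl + vp_pl)

for s in ["this dog chases these cats", "these dog chase the cat"]:
    try:
        print(s, "->", sentence.parseString(s).asList())
    except ParseException:
        print(s, "-> no parse")
```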
Prints:
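```
this dog chases these cats -> [['this', 'dog'], ['chases', ['these', 'cats']]]
these dog chase the cat -> no parse
```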
This is likely to get slow as you expand the vocabulary. Half a million entries? I thought that a reasonable functional vocabulary was on the order of 5-6 thousand words. And you will be pretty limited in the sentence structures that you can handle - natural language is what NLTK is for.
Tooling aside...
You may remember from theory that there are infinitely many grammars that define the same language. There are criteria for categorizing grammars and determining which is the "canonical" or "minimal" one for a given language, but in the end, the "best" grammar is the one that is most convenient for the task and tools at hand (remember the transformations of CFGs into LL and LR grammars?).
Then, you probably don't need a huge lexicon to parse a sentence in English. There's a lot to be known about a word in languages like German or Latin (or even Spanish), but not in English, which is often arbitrary and ambiguous. You should be able to get away with a small lexicon that contains only the key words necessary to arrive at the structure of a sentence. In any case, the grammar you choose, no matter its size, can be cached in a way that the tooling can use directly (i.e., you can skip parsing the grammar), as sketched below.
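For instance, the minute-long load time can usually be cut with a cache along these lines (a sketch; the names are made up, and it assumes your grammar object pickles cleanly):

```python
import os
import pickle

def load_cached(source_path, cache_path, build):
    """Build an object from source_path once, then reuse a pickled copy.

    `build` is whatever function turns the grammar file into the parser's
    grammar object (e.g. your NLTK loader); rebuild only if the source
    file is newer than the cache.
    """
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build(source_path)
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj
```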
Given that, it could be a good idea to take a look at a simpler parser already worked on by someone else. There must be thousands of those in the literature. Studying different approaches will let you evaluate your own, and may lead you to adopt someone else's.
Finally, as I already mentioned, interpreting natural languages is much more about artificial intelligence than about parsing. Because structure determines meaning and meaning determines structure, you have to play with both at the same time. An approach I've seen in the literature since the '80s is to let different specialized agents take shots at solving the problem against a "blackboard". With that approach, syntactic and semantic analysis happen concurrently.
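A toy sketch of that blackboard control structure (every name here is invented for illustration):

```python
# Toy blackboard sketch: specialized agents repeatedly inspect a shared
# blackboard and post new hypotheses until nobody has anything to add.

def syntactic_agent(bb):
    # propose a crude phrase-structure hypothesis from the tokens
    if "tokens" in bb and "phrases" not in bb:
        bb["phrases"] = [("NP", bb["tokens"][:2]), ("VP", bb["tokens"][2:])]
        return True
    return False

def semantic_agent(bb):
    # build a reading from the phrase hypotheses (or veto them)
    if "phrases" in bb and "reading" not in bb:
        np, vp = bb["phrases"]
        bb["reading"] = {"predicate": vp, "argument": np}
        return True
    return False

def run(blackboard, agents):
    changed = True
    while changed:  # keep cycling while any agent contributes something new
        changed = any(agent(blackboard) for agent in agents)
    return blackboard

print(run({"tokens": "the dog chased the cat".split()},
          [syntactic_agent, semantic_agent]))
```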
By all means take a look at Pyparsing. It's the most pythonic implementation of parsing I've come across, and it's a great design from a purely academic standpoint.
I used both ANTLR and JavaCC to teach translator and compiler theory at a local university. They're both good and mature, but I wouldn't use them in a Python project.
That said, unlike programming languages, natural languages are much more about the semantics than about the syntax, so you could be much better off skipping the learning curves of existing parsing tools, going with a home-brewed (top-down, backtracking, unlimited lookahead) lexical analyzer and parser, and spending the bulk of your time writing the code that figures out what a parsed, but ambiguous, natural-language sentence means.
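A skeleton of that kind of home-brewed parser could look like this (toy grammar and lexicon invented for illustration; feature checks and the semantic work would hang off the same recursion):

```python
# A toy recursive-descent, backtracking CFG parser over a grammar stored
# as a dict; backtracking over all alternatives gives the effect of
# unlimited lookahead and yields every analysis of an ambiguous sentence.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": {"the", "a"},
    "N":   {"dog", "cat"},
    "V":   {"sees", "sleeps"},
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover tokens[pos:]."""
    if symbol in LEXICON:                      # pre-terminal case
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            yield (symbol, tokens[pos]), pos + 1
        return
    for production in GRAMMAR.get(symbol, []): # try each alternative (backtracking)
        yield from expand(production, tokens, pos, (symbol,))

def expand(production, tokens, pos, partial):
    if not production:                         # all children found: emit the subtree
        yield partial, pos
        return
    head, rest = production[0], production[1:]
    for subtree, next_pos in parse(head, tokens, pos):
        yield from expand(rest, tokens, next_pos, partial + (subtree,))

if __name__ == "__main__":
    sentence = "the dog sees a cat".split()
    for tree, end in parse("S", sentence, 0):
        if end == len(sentence):               # keep only complete parses
            print(tree)
```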
I would recommend using bitpar, a very efficient PCFG parser written in C++. I've written a shell-based Python wrapper for it, see https://github.com/andreasvc/eodop/blob/master/bitpar.py