I am a college student getting my Computer Science degree. A lot of my fellow students really haven't done a lot of programming. They've done their class assignments, but let's be honest here those questions don't really teach you how to program.
I have had several other students ask me questions about how to parse things, and I'm never quite sure how to explain it to them. Is it best to start just going line by line looking for substrings, or just give them the more complicated lecture about using proper lexical analysis, etc. to create tokens, use BNF, and all of that other stuff? They never quite understand it when I try to explain it.
What's the best approach to explain this without confusing them or discouraging them from actually trying.
What is parsing?
In computer science, parsing is the process of analysing text to determine if it belongs to a specific language or not (i.e. is syntactically valid for that language's grammar). It is an informal name for the syntactic analysis process.
For example, suppose the language
a^n b^n
(which means same number of characters A followed by the same number of characters B). A parser for that language would acceptAABB
input and reject theAAAB
input. That is what a parser does.In addition, during this process a data structure could be created for further processing. In my previous example, it could, for instance, to store the
AA
andBB
in two separate stacks.Anything that happens after it, like giving meaning to
AA
orBB
, or transform it in something else, is not parsing. Giving meaning to parts of an input sequence of tokens is called semantic analysis.What isn't parsing?
What is the simplest way to understand it?
I think the best way for understanding the parsing concept is to begin with the simpler concepts. The simplest one in language processing subject is the finite automaton. It is a formalism to parsing regular languages, such as regular expressions.
It is very simple, you have an input, a set of states and a set of transitions. Consider the following language built over the alphabet
{ A, B }
,L = { w | w starts with 'AA' or 'BB' as substring }
. The automaton below represents a possible parser for that language whose all valid words starts with 'AA' or 'BB'.It is a very simple parser for that language. You start at
(q0)
, the initial state, then you read a symbol from the input, if it isA
then you move to(q1)
state, otherwise (it is aB
, remember the remember the alphabet is onlyA
andB
) you move to(q2)
state and so on. If you reach(qf)
state, then the input was accepted.As it is visual, you only need a pencil and a piece of paper to explain what a parser is to anyone, including a child. I think the simplicity is what makes the automata the most suitable way to teaching language processing concepts, such as parsing.
Finally, being a Computer Science student, you will study such concepts in-deep at theoretical computer science classes such as Formal Languages and Theory of Computation.
I'd explain parsing as the process of turning some kind of data into another kind of data.
In practice, for me this is almost always turning a string, or binary data, into a data structure inside my Program.
For example, turning
into (C)
Parsing is about READING data in one format, so that you can use it to your needs.
I think you need to teach them to think like this. So, this is the simplest way I can think of to explain parsing for someone new to this concept.
Generally, we try to parse data one line at a time because generally it is easier for humans to think this way, dividing and conquering, and also easier to code.
We call field to every minimum undivisible data. Name is field, Age is another field, and Surname is another field. For example.
In a line, we can have various fields. In order to distinguish them, we can delimit fields by separators or by the maximum length assign to each field.
For example: By separating fields by comma
Paul,20,Jones
Or by space (Name can have 20 letters max, age up to 3 digits, Jones up to 20 letters)
Paul 020Jones
Any of the before set of fields is called a record.
To separate between a delimited field record we need to delimit record. A dot will be enough (though you know you can apply CR/LF).
A list could be:
Michael,39,Jordan.Shaquille,40,O'neal.Lebron,24,James.
or with CR/LF
Michael,39,Jordan
Shaquille,40,O'neal
Lebron,24,James
You can say them to list 10 nba (or nlf) players they like. Then, they should type them according to a format. Then make a program to parse it and display each record. One group, can make list in a comma-separated format and a program to parse a list in a fixed size format, and viceversa.
Parsing to me is breaking down something into meaningful parts... using a definable or predefined known, common set of part "definitions".
For programming languages there would be keyword parts, usable punctuation sequences...
For pumpkin pie it might be something like the crust, filling and toppings.
For written languages there might be what a word is, a sentence, what a verb is...
For spoken languages it might be tone, volume, mood, implication, emotion, context
Syntax analysis (as well as common sense after all) would tell if what your are parsing is a pumpkinpie or a programming language. Does it have crust? well maybe it's pumpkin pudding or perhaps a spoken language !
One thing to note about parsing stuff is there are usually many ways to break things into parts.
For example you could break up a pumpkin pie by cutting it from the center to the edge or from the bottom to the top or with a scoop to get the filling out or by using a sledge hammer or eating it.
And how you parse things would determine if doing something with those parts will be easy or hard.
In the "computer languages" world, there are common ways to parse text source code. These common methods (algorithims) have titles or names. Search the Internet for common methods/names for ways to parse languages. Wikipedia can help in this regard.
Have them try to write a program that can evaluate arbitrary simple arithmetic expressions. This is a simple problem to understand but as you start getting deeper into it a lot of basic parsing starts to make sense.
In linguistics, to divide language into small components that can be analyzed. For example, parsing this sentence would involve dividing it into words and phrases and identifying the type of each component (e.g.,verb, adjective, or noun).
Parsing is a very important part of many computer science disciplines. For example, compilers must parse source code to be able to translate it into object code. Likewise, any application that processes complex commands must be able to parse the commands. This includes virtually all end-user applications.
Parsing is often divided into lexical analysis and semantic parsing. Lexical analysis concentrates on dividing strings into components, called tokens, based on punctuationand other keys. Semantic parsing then attempts to determine the meaning of the string.
http://www.webopedia.com/TERM/P/parse.html