I want to create a program that reads text from a file and points out when "a" and "an" are used incorrectly. The general rule, as far as I know, is that "an" is used when the next word starts with a vowel. But it should also take into consideration that there are exceptions, which should also be read from a file.
Could someone give me some tips and tricks on how I should get started with this? Functions or the like that could help.
I would be very glad :-)
I'm quite new to Python.
I'd probably start with an approach like:
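A minimal sketch of such a first approach, assuming the rule is decided purely by the first letter of the next word and that exceptions are stored one word per line. The names `load_exceptions`, `expected_article` and `find_errors`, and the two exception sets, are hypothetical and only there for illustration:

```python
import re

VOWELS = "aeiou"

def load_exceptions(path):
    """Read one word per line (assumed file layout); blank lines are ignored."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def expected_article(word, an_exceptions=frozenset(), a_exceptions=frozenset()):
    """Return the article the naive letter rule (plus exceptions) expects."""
    w = word.lower()
    if w in an_exceptions:      # e.g. "hour": consonant letter, vowel sound
        return "an"
    if w in a_exceptions:       # e.g. "university": vowel letter, consonant sound
        return "a"
    return "an" if w[0] in VOWELS else "a"

def find_errors(text, an_exceptions=frozenset(), a_exceptions=frozenset()):
    """Return (article, next_word) pairs where the article looks wrong."""
    errors = []
    # Match a standalone "a" or "an" followed by whitespace and a word.
    for m in re.finditer(r"\b(an?)\s+([A-Za-z]+)", text, flags=re.IGNORECASE):
        article, word = m.group(1), m.group(2)
        if article.lower() != expected_article(word, an_exceptions, a_exceptions):
            errors.append((article, word))
    return errors
```

You would call it with something like `find_errors(open("input.txt").read(), load_exceptions("an_exceptions.txt"))`, where both file names are made up. It ignores sentence structure entirely, so treat it as a starting point only.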
Maybe this can give you a rough guideline:
You need to parse the input text into prosodic units, as I doubt that the rules for "a/an" apply across prosodic boundaries (e.g. "We have found a (obviously not optimal) solution." vs. "We have found an obvious solution.").
Next you need to parse each prosodic unit into phonological words.
Now you somehow need to identify the words that represent the indefinite article ("a house" vs. "grade A product").
Once you have identified the articles, look at the next word in your prosodic unit and determine (here be dragons) the syllabic feature of the first phoneme of this word.
If it has [+syll], the article should be "an". If it has [-syll], the article should be "a". If the article is at the end of the prosodic unit, it should probably be "a" (but what about ellipses: "Wait, I will give you an... -- he shouted, but dropped dead before he could utter the last word."). All of this is subject to historical exceptions as mentioned by abanert, dialectal variance, etc.
If the article found doesn't match the expected one, mark it as "incorrect".
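The steps above could be sketched roughly like this. It is only an approximation under loud assumptions: punctuation and brackets stand in for prosodic boundaries, whitespace-separated tokens stand in for phonological words, and the first letter stands in for the [±syll] feature of the first phoneme. All function names are made up for the example:

```python
import re

# Assumption: punctuation, brackets and "--" approximate prosodic boundaries.
BOUNDARY = re.compile(r"[.,;:!?()\[\]\"]+|--")
VOWEL_LETTERS = set("aeiou")

def split_units(text):
    """Split text into rough 'prosodic units', each a list of words."""
    return [unit.split() for unit in BOUNDARY.split(text) if unit.strip()]

def is_article(word, index):
    """Crude heuristic: a capitalised "A"/"An" only counts at the unit start,
    so that "grade A product" is not treated as containing an article."""
    return word in ("a", "an") or (index == 0 and word in ("A", "An"))

def starts_with_vowel_sound(word):
    # Assumption: spelling approximates sound (wrong for "hour", "university").
    return word[0].lower() in VOWEL_LETTERS

def check_articles(text):
    """Return (article, next_word, is_correct) for every article found."""
    results = []
    for words in split_units(text):
        # words[:-1]: an article at the very end of a unit is skipped.
        for i, word in enumerate(words[:-1]):
            if not is_article(word, i):
                continue
            nxt = words[i + 1]
            expected = "an" if starts_with_vowel_sound(nxt) else "a"
            results.append((word, nxt, word.lower() == expected))
    return results
```

With this, "We have found a (obviously not optimal) solution." produces no findings, because the "a" sits at the end of its unit and is never compared with "obviously".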
Here is some pseudocode:
That should get you started; however, it is not a complete solution.
Here's a solution where correctness is defined as: "an" comes before a word that starts with a vowel sound, otherwise "a" may be used:
Example input (one sentence per line)
Output
It is not obvious why the last pair is invalid; see "Why is it “an yearly”?"
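To illustrate the "vowel sound" definition, here is a minimal sketch that looks at the first phoneme instead of the first letter. The `PRONUNCIATIONS` table is a tiny hand-written excerpt in the style of the CMU Pronouncing Dictionary (ARPAbet symbols, where vowels carry a stress digit); it is an assumption for the example, and a real program would load a full pronunciation dictionary instead. The function names are hypothetical as well:

```python
# Hypothetical hand-written excerpt, CMUdict-style: vowels end in a stress digit.
PRONUNCIATIONS = {
    "apple": ["AE1", "P", "AH0", "L"],
    "hour": ["AW1", "ER0"],
    "honest": ["AA1", "N", "AH0", "S", "T"],
    "house": ["HH", "AW1", "S"],
    "university": ["Y", "UW2", "N", "AH0", "V", "ER1", "S", "AH0", "T", "IY0"],
    "yearly": ["Y", "IH1", "R", "L", "IY0"],
}

def starts_with_vowel_sound(word):
    """True if the first phoneme is a vowel (marked by a trailing stress digit)."""
    phones = PRONUNCIATIONS.get(word.lower())
    if phones is None:
        # Assumption: fall back to spelling for words missing from the table.
        return word[0].lower() in "aeiou"
    return phones[0][-1].isdigit()

def is_correct(article, word):
    """Check a single (article, next word) pair against the vowel-sound rule."""
    expected = "an" if starts_with_vowel_sound(word) else "a"
    return article.lower() == expected
```

Under this table, `is_correct("an", "yearly")` is false because "yearly" begins with the consonant sound Y, even though the letter "y" can look vowel-like, while `is_correct("an", "hour")` is true because the "h" is silent.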