The English language has a number of contractions. For instance:
you've -> you have
he's -> he is
These can sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions?
I made that Wikipedia contraction-to-expansion page into a Python dictionary (see below).
Note, as you might expect, that you definitely want to use double quotes when querying the dictionary:
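A minimal sketch of what such a dictionary and lookup might look like. The handful of entries, the " / " separator for multiple options, and the helper name `expansions` are assumptions for illustration, not the full mapping from the Wikipedia page:

```python
# Illustrative subset only; the real table has many more entries.
# Keys contain an apostrophe, so double-quoted string literals avoid escaping.
contractions = {
    "you've": "you have",
    "he's": "he has / he is",      # ambiguous: both options are kept
    "ain't": "am not / are not",
    "can't": "cannot",
}


def expansions(word):
    """Return every candidate expansion for a contraction.

    Words that are not in the table are returned unchanged, as a
    single-element list. `expansions` is a hypothetical helper name.
    """
    entry = contractions.get(word.lower())
    if entry is None:
        return [word]
    return [option.strip() for option in entry.split("/")]


print(contractions["you've"])
print(expansions("he's"))
```

Picking the right candidate for an ambiguous key such as "he's" is left to the caller.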
Also, I've left multiple options in, as on the Wikipedia page. Feel free to modify it as you wish. Note that disambiguating to the right expansion would be a tricky problem!
The answers above will work perfectly well and could be better for ambiguous contractions (although I would argue that there aren't that many ambiguous cases). I would use something more readable and easier to maintain:
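One way such a rule-based approach could look is sketched below. The function name `decontracted` and the exact set of rules are assumptions for illustration; the idea is a couple of irregular forms handled first, followed by general suffix rules:

```python
import re


def decontracted(phrase):
    """Expand common English contractions with plain regex substitutions."""
    # Irregular forms first, because the general "n't" rule would
    # otherwise turn "won't" into "wo not" and "can't" into "ca not".
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "cannot", phrase)

    # General suffix rules. Some are guesses: "'s" may also be a
    # possessive or "has", and "'d" may also be "had".
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase


print(decontracted("I won't say you've failed, but they didn't finish."))
```

The patterns are case-sensitive and only match the straight apostrophe, so capitalized irregular forms such as "Won't" and typographic apostrophes are not handled correctly by this sketch.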
It might have some flaws I didn't think about though.
Reposted from my other answer
This is a very cool and easy-to-use library for the purpose: https://pypi.python.org/pypi/pycontractions/1.0.1.
Example of use (detailed in link):
You will also need GoogleNews-vectors-negative300.bin; a download link is on the pycontractions page linked above. The example code is for Python 3.
Even though this is an old question, I figured I might as well answer since there is still no real solution to this as far as I can see.
I had to work on this for a related NLP project, and I decided to tackle the problem since there didn't seem to be anything here. You can check my expander GitHub repository if you are interested.
It's a fairly badly optimized (I think) program based on NLTK, the Stanford Core NLP models, which you will have to download separately, and the dictionary in the previous answer. All the necessary information should be in the README and the lavishly commented code. I know commented code is dead code, but this is just how I write to keep things clear for myself.
The example input in expander.py is the following set of sentences:
To which the output is:
So for this small set of test sentences, which I came up with to test some edge cases, it works well.
Since this project has lost importance right now, I am no longer actively developing it. Any help on this project would be appreciated. Things to be done are written in the TODO list. And if you have any tips on how to improve my Python, I would also be very thankful.
You don't need a library; it is possible to do this with regular expressions, for example.
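A minimal sketch of the regex idea, assuming a small lookup table. The four entries and the names `contractions_dict` and `expand_contractions` are chosen for illustration; a real table would be much longer:

```python
import re

# Illustrative subset; extend with as many contractions as you need.
contractions_dict = {
    "didn't": "did not",
    "don't": "do not",
    "you've": "you have",
    "he's": "he is",
}

# One alternation built from all keys. Longest keys come first so a
# longer contraction is never shadowed by a shorter one.
_keys = sorted(contractions_dict, key=len, reverse=True)
contractions_re = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in _keys) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace every known contraction in `text` with its expansion."""
    def replace(match):
        # Keys are lowercase, so normalize the matched text before lookup.
        return contractions_dict[match.group(0).lower()]
    return contractions_re.sub(replace, text)


print(expand_contractions("You've said he's here, didn't you?"))
# prints: you have said he is here, did not you?
```

Because the lookup is case-insensitive but the expansions are stored in lowercase, the original capitalization is lost ("You've" becomes "you have"); restoring it would need an extra step.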
I would like to add a little to alko's answer here. If you check Wikipedia, the number of English contractions listed there is fewer than 100. Granted, in real-world text this number could be higher. Still, I am pretty sure that 200-300 words are all you will have for English contractions. Now, do you want a separate library for those? (I don't think what you are looking for actually exists, though.) You can easily solve this problem with a dictionary and a regex. I would recommend using a nice tokenizer such as the Natural Language Toolkit (NLTK); the rest you should have no problem implementing yourself.