How to handle slang words and short forms in tweets

Posted 2019-04-02 01:45

I am preprocessing tweets using Python. However, many of the words are short forms of other words, like luv, kool, etc., and there are also abbreviations like brb, ttyl, etc.

Right now, I can only think of keeping a huge hashmap with the short forms as keys and the actual words or expansions as values. Is there a better way to approach this using NLP?

NOTE: I know the question seems too vague, but please don't report it. I have asked it so that amateurs can benefit from this knowledge.

PS: Is there a nicely formatted text list that I can download and use? The links posted are good, but when I copy and paste them, the result is not in an easily parsable format.

Tags: twitter nlp
1 answer
Bombasti
#2 · 2019-04-02 02:00

The only reliable way to decipher abbreviations is to use external resources, which is why so many abbreviation dictionaries exist for human readers. Humans can sometimes guess a meaning from common-sense knowledge and abbreviations they already know, but even they do it badly, so there is little hope of NLP doing it automatically at this time.

Sometimes an abbreviation is defined in the same text where it is used, but that is not the case for Twitter or for slang.

So yes, you have to store a mapping from acronyms to their expansions. To obtain one, search for an acronym dictionary, e.g. this slang dictionary, or that, or that, or that, which seems to be the easiest to parse.
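A minimal sketch of the mapping approach in Python follows. The three entries in `SLANG` are hypothetical samples taken from the question, and `expand_slang` is an illustrative helper name, not part of any library; a real table would be loaded from whichever dictionary you download.

```python
import re

# Hypothetical sample entries; a real table would come from a downloaded dictionary.
SLANG = {
    "brb": "be right back",
    "ttyl": "talk to you later",
    "luv": "love",
}

# Runs of letters and apostrophes are treated as candidate tokens.
TOKEN = re.compile(r"[A-Za-z']+")

def expand_slang(text, mapping=SLANG):
    """Replace every token found in `mapping` (case-insensitively) with its expansion."""
    def repl(match):
        word = match.group(0)
        # Unknown tokens are returned unchanged.
        return mapping.get(word.lower(), word)
    return TOKEN.sub(repl, text)

print(expand_slang("brb, luv u! ttyl"))
# be right back, love u! talk to you later
```

This simple tokenizer also matches letters inside URLs, hashtags and @mentions, so you would probably want to strip or protect those before expanding.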

As for other slang like 'kool', you can try spelling-correction algorithms; see the related question.
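As a rough illustration of the spelling-correction idea, the sketch below uses `difflib.get_close_matches` from the Python standard library to pick the most similar word from a vocabulary. This is a stand-in for a proper spelling corrector, not the method from the related question; `VOCAB`, `correct` and the `cutoff` value of 0.7 are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical tiny vocabulary; in practice use a full word list.
VOCAB = ["cool", "love", "great", "night"]

def correct(word, vocab=VOCAB, cutoff=0.7):
    """Return the closest vocabulary word, or the input if nothing is similar enough."""
    lower = word.lower()
    if lower in vocab:
        return word  # already a known word
    matches = get_close_matches(lower, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("kool"))
# cool
```

With a large vocabulary, several words can be equally similar to the input (for example 'tool' and 'cool' for 'kool'), so a real corrector would also need word frequencies or context to choose between them.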
