I have a collection of text in which each sentence is entirely in English, Hindi, or Marathi, with an id (0, 1, or 2 respectively) attached to each sentence indicating its language.
The text, irrespective of language, may contain HTML tags, punctuation, etc.
I was able to clean the English sentences with the code below:
import HTMLParser
import re
from nltk.corpus import stopwords
from collections import Counter
import pickle
from string import punctuation
#creating html_parser object
html_parser = HTMLParser.HTMLParser()
cachedStopWords = set(stopwords.words("english"))
def cleanText(text, lang_id):
    if lang_id == 0:
        str1 = ''.join(text).decode('iso-8859-1')
    else:
        str1 = ''.join(text).encode('utf-8')
    str1 = html_parser.unescape(str1)
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', str1)
    #print "cleantext before puncts removed : " + cleantext
    clean_puncts = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
    cleantext = re.sub(clean_puncts, ' ', cleantext)
    #print " cleantext after puncts removed : " + cleantext
    cleanest = cleantext.lower()
    if lang_id == 0:
        cleanertext = ' '.join([word for word in cleanest.split() if word not in cachedStopWords])
        words = re.findall(r"[\w']+", cleanertext)
        words_final = [x.encode('UTF8') for x in words]
    else:
        words_final = cleanest.split()
    return words_final
but for Hindi and Marathi text it gives me the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 104: ordinal not in range(128)
It also removes all the words.
The Hindi text looks like this:
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
How can I do the same for Hindi or Marathi text?
Without the full text file, any solution we can provide will only be a shot in the dark.
Firstly, check the types of the strings you're reading into cleanText(): are they really unicode, or are they byte strings? See Byte string vs. Unicode string in Python.

So if you've read your file properly and ensured that everything is unicode, there should be no problem in how you manipulate the strings (on both python2 and python3). The following example confirms this:
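A minimal sketch of that kind of check, taking the Hindi sentence from the question as a unicode literal:

# -*- coding: utf-8 -*-
hindi_sent = u'भारत का इतिहास काफी समृद्ध एवं विस्तृत है।'
print(type(hindi_sent))           # <type 'unicode'> on python2, <class 'str'> on python3
print(len(hindi_sent.split()))    # 8 -- whitespace splitting works just as for English
print(hindi_sent.split())         # the tokens, printed as a list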
Even with regex manipulation, there's no problem:
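For example, a sketch reusing the same regexes as the question's cleanText(), with re.UNICODE passed so that \w also matches Devanagari letters on python2:

# -*- coding: utf-8 -*-
import re

hindi_sent = u'<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>'
no_tags = re.sub(r'<.*?>', '', hindi_sent)            # strip the HTML tags
words = re.findall(r"[\w']+", no_tags, re.UNICODE)    # \w matches Devanagari here
print(len(words))                                     # 8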
Take a look at Ned Batchelder's "How to stop the pain" talk; following the best practices in it will most likely resolve your unicode problems. Slides are at http://nedbatchelder.com/text/unipain.html.
Also take a look at this PyCon 2014 talk: https://www.youtube.com/watch?v=Mx70n1dL534 (only applicable to python2.x).

Opening the file as utf8, like this, might resolve your problem too:
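A sketch of that (the filename myfile.txt is just a placeholder for your own file):

import io

# io.open decodes while reading, so every line arrives as unicode on python2
# (str on python3) and the 'ascii' codec never gets involved.
with io.open('myfile.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        print(type(line))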
If STDIN and STDOUT are giving you problems, see https://daveagp.wordpress.com/2010/10/26/what-a-character/
See also: