I have a pandas DataFrame that contains review texts. After text preprocessing, each row holds a list of strings. I want to iterate over these lists, check whether each string is an English word, and count the non-English words per row to create another column, "Occurrences".
For the English check I will use the pyenchant library.
Something similar to the code below:
                                         review_text  sentiment  error_related
0           [simple, effective, way, new, word, kid]          1            NaN
1                                      [fh, fcfatgv]          1            NaN
2  [son, loved, easy, even, though, son, first, g...          1            NaN
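import enchant

english_dict = enchant.Dict("en_US")

def english_counter(df, df_text_column):
    # One count of non-English words per review
    number_of_non_english_words = []
    for review in df_text_column:
        a = 0  # reset the counter for each review
        for word in review:
            if not english_dict.check(word):
                a += 1
        number_of_non_english_words.append(a)
    # Store the per-review counts in the new column
    df["Occurrences"] = number_of_non_english_words
    return df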
You didn't include example data, so I constructed it manually. Note that my DataFrame format may differ from yours.
import pandas as pd
import enchant

english_dict = enchant.Dict("en_US")

# Construct the dataframe
words = ['up and vote', 'wet 0001f914 turtle 0001f602', 'thumbnailшщуй',
         'lobby', 'mods saffron deleted iâ', 'â', 'itâ donâ edit', 'thatâ',
         'didnâ canâ youâ']
# DataFrame.append was removed in pandas 2.0, so build the frame directly
df = pd.DataFrame({'text': words})

# Get texts column
for text in df['text']:
    # Counters
    eng_words = 0
    non_eng_words = 0
    # For every word in text
    for word in text.split(' '):
        # Check if it is english
        if english_dict.check(word):
            eng_words += 1
        else:
            non_eng_words += 1
    # Print the result
    # NOTE that these results are discarded each new text
    print('EN: {}; NON-EN: {}'.format(eng_words, non_eng_words))
If you want to modify your dataset, you should wrap this code into a function:
def create_occurences(df):
    eng_words_list = []
    non_eng_words_list = []
    for text in df['text']:
        eng_words = 0
        non_eng_words = 0
        for word in text.split(' '):
            if english_dict.check(word):
                eng_words += 1
            else:
                non_eng_words += 1
        eng_words_list.append(eng_words)
        non_eng_words_list.append(non_eng_words)
    df['eng_words'] = pd.Series(eng_words_list, index=df.index)
    df['non_eng_words'] = pd.Series(non_eng_words_list, index=df.index)

create_occurences(df)
df
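Since the column in the question holds lists of tokens rather than plain strings, here is a minimal sketch of the same counting applied to that shape. The names `count_non_english`, `toy_is_english` and `TOY_VOCABULARY` are made up for illustration, and the tiny vocabulary is only a stand-in so the snippet runs without pyenchant; it is not how pyenchant decides what is English.

```python
from typing import Callable, Iterable, List


def count_non_english(tokens: Iterable[str],
                      is_english: Callable[[str], bool]) -> int:
    """Count the tokens that is_english rejects, skipping empty strings."""
    return sum(1 for token in tokens if token and not is_english(token))


# Hypothetical stand-in for a real dictionary so the sketch runs without
# pyenchant; with pyenchant you would pass enchant.Dict("en_US").check instead.
TOY_VOCABULARY = {"simple", "effective", "way", "new", "word", "kid"}


def toy_is_english(word: str) -> bool:
    return word.lower() in TOY_VOCABULARY


# Rows shaped like the question's review_text column: one list of tokens each.
reviews: List[List[str]] = [
    ["simple", "effective", "way", "new", "word", "kid"],
    ["fh", "fcfatgv"],
    ["new", "", "wrd"],
]

occurrences = [count_non_english(review, toy_is_english) for review in reviews]
print(occurrences)
```

With pyenchant installed, something like `df["Occurrences"] = df["review_text"].apply(lambda tokens: count_non_english(tokens, english_dict.check))` should fill the column the question asks for. The empty-string guard is there because, as far as I know, pyenchant's `check` raises a `ValueError` on an empty string, which `text.split(' ')` can produce when a text contains consecutive spaces.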