Unable to use Pandas and NLTK to train Naive Bayes

Posted 2019-06-10 05:42

Here is what I am trying to do. I have a csv. file with column 1 with people's names (ie: "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 with people's ethnicity (ie: English, French, Chinese).

In my code, I create the pandas data frame using all the data. Then create additional data frames: one with only Chinese names and another one with only non-Chinese names. And then I create separate lists.

The three_split function extracts the feature of each name by splitting them into three-character substrings. For example, "Katy Perry" into "kat", "aty", "ty ", "y p" ... etc.

Then I train with Naive Bayes and finally test the results.

There aren't any errors when running my code, but when I use non-Chinese names straight from the dataset and expect the program to return False (not Chinese), it returns True (Chinese) for every name I test. Any ideas?

import pandas as pd
from nltk.classify import PositiveNaiveBayesClassifier

# Read the csv file into a data frame (raw string, so "\U..." in the
# Windows path is not treated as an escape sequence)
df = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv",
                 encoding="utf-8")
df.columns = ["name", "ethnicity"]

# Recategorize ethnicities into 1) Chinese or 2) non-Chinese, then build separate lists
is_chinese = df["ethnicity"].str.lower() == "chinese"
chinese_names = list(df.loc[is_chinese, "name"])
nonchinese_names = list(df.loc[~is_chinese & df["ethnicity"].notnull(), "name"])

# Split a name into overlapping three-character substrings, e.g.
# "katy perry" -> {"contains(kat)": True, "contains(aty)": True, ...}
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    return {"contains(%s)" % word[start:start + 3]: True
            for start in range(len(word) - 2)}

# Training naive bayes machine learning algorithm
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)

# Test the results
name = "Hubert Gillies"  # A non-Chinese name from the dataset
print(classifier.classify(three_split(name)))
# prints True -- wrong output; expected False

1 Answer

对你真心纯属浪费
#2 · 2019-06-10 06:08

There could be many reasons why you don't get the desired results; most often it's one of:

  • Features are not strong enough
  • Not enough training data
  • Wrong classifier
  • Code bugs in NLTK classifiers

For the first three reasons, there's no way to verify or resolve them unless you post a link to your dataset so we can take a look at how to fix it. As for the last reason, there shouldn't be one for the basic NaiveBayes and PositiveNaiveBayes classifiers.
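For reference, PositiveNaiveBayesClassifier only sees positive and *unlabeled* examples, and its `positive_prob_prior` keyword (default 0.5) sets the assumed prior probability of the positive class, which can dominate when the trigram features are sparse. A minimal self-contained run with hypothetical toy names standing in for the dataset:

```python
from nltk.classify import PositiveNaiveBayesClassifier

def three_split(word):
    # Same trigram feature extractor as in the question
    word = str(word).lower().replace(" ", "_")
    return {"contains(%s)" % word[i:i + 3]: True for i in range(len(word) - 2)}

# Hypothetical toy data in place of the CSV rows
positive = [three_split(n) for n in ["Li Wei", "Wang Fang", "Chen Jing"]]
unlabeled = [three_split(n) for n in ["John Smith", "Marie Dubois", "Hans Muller"]]

# positive_prob_prior is the assumed P(positive); with a high value and
# sparse features, almost everything comes back True
classifier = PositiveNaiveBayesClassifier.train(positive, unlabeled,
                                                positive_prob_prior=0.5)
result = classifier.classify(three_split("John Smith"))
print(result)
```

Lowering `positive_prob_prior` (or balancing the sizes of the two featureset lists) is worth experimenting with before switching classifiers.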

So the questions to ask are:

  • How many training instances (i.e. rows) do you have?
  • Why didn't you normalize your labels (i.e. chinese|Chinese -> chinese) after reading the dataset and before extracting the features?
  • What other features could you consider?
  • Have you considered using NaiveBayes instead of PositiveNaiveBayes?
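On the last two points, a minimal sketch of normalizing labels and training the standard NaiveBayesClassifier on *both* classes, so the model sees explicit negative examples (hypothetical toy names stand in for the CSV rows):

```python
from nltk.classify import NaiveBayesClassifier

def three_split(word):
    # Same trigram feature extractor as in the question
    word = str(word).lower().replace(" ", "_")
    return {"contains(%s)" % word[i:i + 3]: True for i in range(len(word) - 2)}

# Hypothetical toy rows standing in for the CSV; note the inconsistent labels
rows = [("Li Wei", "Chinese"), ("Wang Fang", "chinese"),
        ("John Smith", "english"), ("Marie Dubois", "french")]

# Normalize labels to lowercase, then collapse to a binary chinese/other label
labeled = [(three_split(name), "chinese" if eth.lower() == "chinese" else "other")
           for name, eth in rows]

classifier = NaiveBayesClassifier.train(labeled)
print(classifier.classify(three_split("Li Wei")))     # -> "chinese"
print(classifier.classify(three_split("John Smith"))) # -> "other"
```

Because both classes are explicitly labeled, the classifier is no longer free to assume a fixed prior for the positive class the way PositiveNaiveBayes does.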