Unable to use Pandas and NLTK to train Naive Bayes

Posted 2019-06-10 05:46

Question:

Here is what I am trying to do. I have a CSV file in which column 1 holds people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 holds their ethnicity (e.g. English, French, Chinese).

In my code, I load all the data into a pandas data frame, then create two additional data frames: one with only Chinese names and one with only non-Chinese names. From each of these I create a separate list of names.

The three_split function extracts the features of each name by lowercasing it, replacing spaces with underscores, and splitting it into three-character substrings. For example, "Katy Perry" becomes "kat", "aty", "ty_", "y_p", etc.

Then I train with Naive Bayes and finally test the results.

The code runs without errors, but when I test non-Chinese names taken directly from the dataset, expecting the program to return False (not Chinese), it returns True (Chinese) for every name I test. Any idea why?

import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier

# Get csv file into data frame
# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv", 
    encoding="utf-8")
df = DataFrame(data)
df.columns = ["name", "ethnicity"]

# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])

df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & (df["ethnicity"].notnull() == True)]
nonchinese_names = list(df_nonchinese["name"])

# Function to split word string into three-character substrings
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True) 
        for start in range(0, len(word)-2))

# Training naive bayes machine learning algorithm
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)

# Testing results
name = "Hubert Gillies" # A non-Chinese name from the dataset
print(classifier.classify(three_split(name)))
# Output: True (wrong; expected False)

Answer 1:

There are many possible reasons why you are not getting the desired results. Most often it is one of the following:

  • Features are not strong enough
  • Not enough training data
  • Wrong classifier
  • Code bugs in NLTK classifiers

For the first three reasons, there is no way to verify or resolve the issue unless you post a link to your dataset so that we can take a look. As for the last reason, there shouldn't be any such bugs in the basic NaiveBayes and PositiveNaiveBayes classifiers.

So the questions to ask are:

  • How many training data instances (i.e. rows) do you have?
  • Why didn't you normalize your labels (i.e. chinese|Chinese -> chinese) after reading the dataset and before extracting the features?
  • What other features could you consider?
  • Have you considered using NaiveBayes instead of PositiveNaiveBayes?
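
As a rough illustration of the questions above, here is a minimal sketch that normalizes the labels first, counts the training instances per class, and trains the standard NaiveBayesClassifier on explicitly labeled featuresets instead of PositiveNaiveBayesClassifier. The rows list is a hypothetical stand-in for the CSV, since the real dataset was not posted, and three_split is a compact copy of the function from the question. Whether this fixes the problem on the real data depends on the dataset itself.

```python
from collections import Counter

from nltk.classify import NaiveBayesClassifier

# Hypothetical rows standing in for the CSV; the real data was not posted.
rows = [
    ("Li Wei", "Chinese"),
    ("Wang Fang", "chinese"),
    ("Zhang Min", "Chinese "),
    ("Hubert Gillies", "English"),
    ("Pierre Dupont", "French"),
    ("Katy Perry", "english"),
]


def three_split(word):
    """Trigram features, equivalent to three_split in the question."""
    word = str(word).lower().replace(" ", "_")
    return {"contains(%s)" % word[i:i + 3]: True for i in range(len(word) - 2)}


# Normalize labels once, right after loading, so that "chinese", "Chinese"
# and "Chinese " all map to the same class. With the DataFrame from the
# question this could be done with df["ethnicity"].str.strip().str.lower().
labeled = [(name, eth.strip().lower() == "chinese") for name, eth in rows]

# Check how many training instances each class has.
class_counts = Counter(label for _, label in labeled)
print(class_counts)

# Train on (featureset, label) pairs, so the non-Chinese names are used as
# explicit negative examples rather than as unlabeled data.
train_set = [(three_split(name), label) for name, label in labeled]
classifier = NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(three_split("Hubert Gillies"))
print(prediction)
```

If the per-class counts turn out to be very small or heavily skewed, that points to the "not enough training data" reason rather than to the classifier itself.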