I am trying to use a Naive Bayes classifier to detect fraudulent transactions. I have sample data of around 5000 records in an Excel sheet, which I will use for training the classifier, and test data of around 1000 records on which I will test the classifier.
My problem is that I don't know how to train the classifier. Do I need to transform my training data into some specific format before passing it to the training step? How will the training step know which column is my target value and which are its features?
Can someone please help me?
In order to train and test a classifier, you need to make sure your training set has labels, or has been divided into chunks based on some features that you used during data collection. I am unsure how you have organized your data, but you need to split your data set so that records with similar features are grouped together.
Once you have created your splits based on your criteria, check the creation of your input data. You can verify files using:
Train your classifier using:
Test the classifier using:
NOTE: During data collection, make sure you assign weights to certain data values, if they exist. Data cleaning is also needed to normalize errors introduced during the experimental setup or data collection. You can use a technique such as multiplicative scatter correction to correct your data set.
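To make the "which column is the target" part concrete, here is a minimal sketch in plain Python of a categorical Naive Bayes classifier with add-one smoothing. It is not Mahout's implementation. The column names (channel, country, is_fraud), the toy rows, and the helper names are all hypothetical; with real data the rows would come from a CSV export of the Excel sheet.

```python
import math
from collections import Counter, defaultdict

def split_features_and_label(rows, label_column):
    """Separate each row into a feature dict and its label."""
    features, labels = [], []
    for row in rows:
        labels.append(row[label_column])
        features.append({k: v for k, v in row.items() if k != label_column})
    return features, labels

def train(features, labels):
    """Collect the counts a categorical Naive Bayes model needs."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, column) -> counts of each value
    vocab = defaultdict(set)             # column -> distinct values seen in training
    for x, y in zip(features, labels):
        for col, val in x.items():
            value_counts[(y, col)][val] += 1
            vocab[col].add(val)
    return label_counts, value_counts, vocab

def predict(model, x):
    """Return the label with the highest log-probability for feature dict x."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for col, val in x.items():
            count = value_counts[(label, col)][val]
            # Add-one (Laplace) smoothing so unseen values do not zero out the score.
            score += math.log((count + 1) / (n + len(vocab[col])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy rows; "is_fraud" plays the role of the target column.
rows = [
    {"channel": "online", "country": "foreign", "is_fraud": "yes"},
    {"channel": "online", "country": "foreign", "is_fraud": "yes"},
    {"channel": "store", "country": "domestic", "is_fraud": "no"},
    {"channel": "store", "country": "domestic", "is_fraud": "no"},
    {"channel": "online", "country": "domestic", "is_fraud": "no"},
]
features, labels = split_features_and_label(rows, label_column="is_fraud")
model = train(features, labels)
print(predict(model, {"channel": "online", "country": "foreign"}))
```

The classifier only "knows" the target because the label column is pulled out before training; every remaining column is treated as a feature. Numeric columns such as a transaction amount would need to be binned first, or handled with a Gaussian variant instead.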
Firstly, have a file called training-categories.txt that contains the categories for your classifier. You can use a simple text editor to do this.
Now that we have a list of categories we're interested in, run the ExtractTrainingData class using the category list.
This command will read documents and search for matching categories in the category and source fields. When one of the categories listed in training-categories.txt is found in one of these documents, the terms will be extracted from term vectors stored in the title and description fields. These terms will be written to a file in the category-bayes-data directory. There will be a single file for each category. Each is a plain text file that can be viewed with any text editor or display utility.
The category name appears in the first column, while each of the terms that appear in the document is contained in the second column. The Mahout Bayes classifiers expect the input fields to be stemmed, so you will see this reflected in the test data. The --tv argument to the ExtractTrainingData command causes the stemmed terms from each document's term vector to be used.
When the ExtractTrainingData class has completed its run, it will output a count of documents found in each category.
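As a rough illustration of the two-column layout described above, the following Python sketch reads such lines and counts documents per category. It assumes the two columns are separated by a tab and the terms by spaces, which the description above does not confirm. The sample lines, category names, and function names are hypothetical.

```python
from collections import Counter

def parse_training_line(line):
    """Split one line into (category, list of terms).

    Assumes a tab between the category column and the terms column.
    """
    category, _, terms = line.rstrip("\n").partition("\t")
    return category, terms.split()

def count_documents_per_category(lines):
    """Count how many documents (lines) belong to each category."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        category, _ = parse_training_line(line)
        counts[category] += 1
    return counts

# Hypothetical sample lines with already-stemmed terms.
sample_lines = [
    "fraud\ttransfer urgent wire account\n",
    "fraud\taccount verifi password\n",
    "legit\tinvoic payment receiv\n",
]
counts = count_documents_per_category(sample_lines)
for category, n in sorted(counts.items()):
    print(category, n)
```

A quick check like this on the generated files can confirm that each category has enough documents before training.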