Python file format for email classification with svmlight

Posted 2019-08-16 08:31

Question:

I am working with email subjects: I have 20 emails I want to classify, and a file with 20 lines, one email subject per line. I have been working on it, but I am unable to figure out what the features refer to and what the input file format for svmlight should be. Any tips on how to proceed would be helpful. Thanks in advance!

Edit: As a trial, I have taken the tf-idf of the first 500 subject lines. However, according to the svm-light format, we need:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

I have only the tf-idf features for these 500 lines. Sadly, svm-light does not read this file, as it needs feature/value pairs. Any ideas on what the value could be, or how I can change the file so that it can be read?

An idea of the file I have (features for the first 5 emails):

1 201 1.0
2 280 0.123165672613
2 313 0.343915400191
2 515 0.157569797284
2 588 0.343915400191
2 652 0.343915400191
2 657 0.343915400191
2 774 0.23622904941
2 921 0.283118375032
2 1158 0.254849368195
2 1240 0.343915400191
2 1348 0.343915400191
2 1362 0.222321349873
3 57 0.342220321154
3 185 0.391349077827
3 244 0.391349077827
3 300 0.391349077827
3 693 0.391349077827
3 730 0.342220321154
3 1391 0.391349077827
4 57 0.342220321154
4 185 0.391349077827
4 244 0.391349077827
4 300 0.391349077827
4 693 0.391349077827
4 730 0.342220321154
4 1391 0.391349077827
5 32 0.323558487577
5 102 0.323558487577
5 157 0.364177022553
5 160 0.364177022553
5 718 0.151013895297
5 1171 0.364177022553
5 1277 0.323558487577
5 1308 0.364177022553
5 1336 0.364177022553
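
For reference, a minimal sketch (assuming scikit-learn is available; the file name "subjects.txt" and the class labels are placeholders, not part of the data above) of how tf-idf features can be written directly in the <target> <feature>:<value> format:

# Sketch: compute tf-idf for the subject lines and dump them in SVM-light format.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

with open("subjects.txt") as f:                  # placeholder path: one subject per line
    subjects = [line.strip() for line in f]

labels = [1] * 250 + [-1] * 250                  # placeholder class labels, one per subject

X = TfidfVectorizer().fit_transform(subjects)    # sparse matrix: rows = subjects, columns = terms

# zero_based=False writes 1-based feature indices, as SVM-light expects.
dump_svmlight_file(X, labels, "train.dat", zero_based=False)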

Please help!

Answer 1:

If you make a feature out of each word, first create a list of all unique words w(1)..w(n). Feature(i) then gets the value 1 if w(i) occurs in the sample you are examining. (You could also make the value equal to the number of occurrences, so that a word which occurs multiple times gets more weight.)

Assuming the following samples:

1 My hovercraft is full of eels
2 Your account is suspended
3 This is it!

... you could extract the following dictionary:

001 My
002 hovercraft
003 is
 :
 :
009 suspended
010 This
011 it!

(The leading zeros are just to make the features look different from the other numbers in this exposition. Normally there should probably not be any leading zeros.)

The features for sample 1 are 001 through 006; for sample 3 they are 010, 003, and 011. The other features get the value 0. So the full representation of sample 3 would look like

3 001:0 002:0 003:1 004:0 005:0 ...

(though I don't think you need to specify the zero, i.e. absent, features).
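
A minimal sketch of this word-feature idea in Python (the samples are the toy ones above; the +1/-1 class labels are placeholders, since in a real SVM-light file the first number on each line is the <target> class label):

# Sketch: build a word dictionary and write SVM-light style lines,
# using occurrence counts as feature values.
samples = [
    (+1, "My hovercraft is full of eels"),   # placeholder class labels
    (-1, "Your account is suspended"),
    (-1, "This is it!"),
]

# Assign every unique word a 1-based feature number.
vocab = {}
for _, text in samples:
    for word in text.split():
        vocab.setdefault(word, len(vocab) + 1)

with open("train.dat", "w") as out:
    for label, text in samples:
        counts = {}
        for word in text.split():
            feat = vocab[word]
            counts[feat] = counts.get(feat, 0) + 1
        # SVM-light expects ascending feature indices; zero (absent) features are omitted.
        pairs = " ".join(f"{feat}:{value}" for feat, value in sorted(counts.items()))
        out.write(f"{label} {pairs}\n")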

However, given the small sample size (and the fact that you only have subject lines), it's unlikely that you will get very good results. Perhaps you'd be better off using e.g. character bigram or trigram features (split each word with a sliding window, so "trigram" becomes tri, rig, igr, gra, ram).
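
A quick sketch of such sliding-window character trigrams (char_trigrams is a hypothetical helper, just for illustration); each unique trigram would then get a feature number exactly like the words above:

# Sketch: character trigrams from each word via a sliding window.
def char_trigrams(text):
    grams = []
    for word in text.split():
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams

print(char_trigrams("trigram"))   # ['tri', 'rig', 'igr', 'gra', 'ram']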

I don't think it makes sense to try to mix tf-idf with SVM; they are different approaches to the same fundamental problem.