How to represent text for classification in Weka?

How should the attributes and the class be represented for text classification in Weka? Which attribute should I use for classification: word frequency, or just the presence of a word? What would the ARFF file look like? Could you give a few example lines of that structure?

Thank you very much in advance.

Tags: java machine-learning classification weka arff

2 Answers

何必那么认真

#2 · 2019-04-23 04:18

In Weka, you choose your own attributes. In the example below, there are only two classes, and every unique word is used as an attribute. If you choose word frequency as the attribute value, assign '0' if the word does not occur in the text, '1' if it occurs once, '2' if it occurs twice, and so on.

Here is an example in .arff format:

@RELATION anyrelation

@ATTRIBUTE word1 NUMERIC
@ATTRIBUTE word2 NUMERIC
...
@ATTRIBUTE wordn NUMERIC
@ATTRIBUTE class {class1, class2}

@DATA
1,2,....,0,class1
0,3,....,1,class2
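
As a rough illustration of how such a file could be produced, here is a minimal plain-Java sketch that counts words and writes the header and data rows. It does not use Weka at all; the class name WordCountArff, the tokenizing rule, and the "w_" prefix on attribute names (added so that a word such as "class" cannot collide with the class attribute) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Weka: builds a word-frequency ARFF file as text.
final class WordCountArff {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Each element of docs is {text, classLabel}.
    static String toArff(String relation, String[][] docs) {
        TreeSet<String> vocabulary = new TreeSet<>(); // sorted, unique words
        TreeSet<String> labels = new TreeSet<>();     // sorted, unique class labels
        for (String[] doc : docs) {
            vocabulary.addAll(tokenize(doc[0]));
            labels.add(doc[1]);
        }

        StringBuilder out = new StringBuilder();
        out.append("@RELATION ").append(relation).append("\n\n");
        for (String word : vocabulary) {
            // Assumed "w_" prefix keeps word attributes distinct from "class".
            out.append("@ATTRIBUTE w_").append(word).append(" NUMERIC\n");
        }
        out.append("@ATTRIBUTE class {").append(String.join(",", labels)).append("}\n\n");

        out.append("@DATA\n");
        for (String[] doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : tokenize(doc[0])) counts.merge(token, 1, Integer::sum);
            // One count per vocabulary word, then the class label in the last column.
            for (String word : vocabulary) {
                out.append(counts.getOrDefault(word, 0)).append(',');
            }
            out.append(doc[1]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"good good movie", "pos"},
            {"bad movie", "neg"},
        };
        System.out.print(toArff("reviews", docs));
    }
}
```

For the two sample texts, the vocabulary is bad, good, movie, so the data rows come out as 0,2,1,pos and 1,0,1,neg.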


做个烂人

#3 · 2019-04-23 04:27

One of the easiest alternatives is to start with an ARFF file for a two-class problem, like:

@relation corpus 

@attribute text string
@attribute class {pos,neg}

@data
'long text with words ... ',pos

The text is represented as a string attribute, and the class is a nominal attribute with two values.
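
Raw texts often contain quotes or line breaks that would break a data row, so here is a minimal plain-Java sketch for writing this kind of file. The class name StringArff and the escaping rules (backslash-escaping backslashes, single quotes, and line breaks) are assumptions for illustration; check them against the ARFF reader you actually use.

```java
// Hypothetical helper, not part of Weka: writes a string-attribute ARFF file as text.
final class StringArff {

    // Wrap a text in single quotes, escaping characters that would break the data row.
    static String quote(String text) {
        String escaped = text
            .replace("\\", "\\\\")  // backslashes first, so later escapes are not doubled
            .replace("'", "\\'")
            .replace("\n", "\\n")
            .replace("\r", "\\r");
        return "'" + escaped + "'";
    }

    // classes lists the nominal values; each element of docs is {text, classLabel}.
    static String toArff(String relation, String[] classes, String[][] docs) {
        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append("\n\n");
        out.append("@attribute text string\n");
        out.append("@attribute class {").append(String.join(",", classes)).append("}\n\n");
        out.append("@data\n");
        for (String[] doc : docs) {
            out.append(quote(doc[0])).append(',').append(doc[1]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"it's a long text with words", "pos"},
            {"a dull, slow film", "neg"},
        };
        System.out.print(toArff("corpus", new String[] {"pos", "neg"}, docs));
    }
}
```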

Then you could apply two filters:

StringToWordVector, which transforms the texts into a word-vector representation. The filter creates one attribute per word. You can tweak its parameters to choose a binary or frequency representation, stemming, or stopword removal. The best representation depends on the problem; if the texts are not long, a binary representation is usually enough.
Reorder, which moves the class attribute to the last position, where Weka assumes it is.
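
To make the binary versus frequency choice concrete, here is a small plain-Java sketch of the idea. It is not Weka's StringToWordVector implementation; the class name WordVectorSketch, the tokenizing rule, and the fixed vocabulary and stopword set passed in by the caller are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of word-vector representations, independent of Weka.
final class WordVectorSketch {

    // Returns one value per vocabulary word: 0/1 if binary is true, otherwise a count.
    static int[] vectorize(String text, List<String> vocabulary,
                           Set<String> stopwords, boolean binary) {
        int[] vector = new int[vocabulary.size()];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || stopwords.contains(token)) continue; // skip stopwords
            int index = vocabulary.indexOf(token);
            if (index < 0) continue;                                    // word not in vocabulary
            vector[index] = binary ? 1 : vector[index] + 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("good", "movie", "bad");
        Set<String> stopwords = Set.of("the");
        String text = "the good, the good movie";
        System.out.println(Arrays.toString(vectorize(text, vocabulary, stopwords, false)));
        System.out.println(Arrays.toString(vectorize(text, vocabulary, stopwords, true)));
    }
}
```

With the sample text, the frequency vector is [2, 1, 0] and the binary vector is [1, 1, 0]; for short texts most counts are 0 or 1 anyway, which is why the two representations often behave alike.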

You may find more info and other approaches to transform your data in this Weka wiki page: http://weka.wikispaces.com/Text+categorization+with+WEKA
