Load svmlight format error

Posted 2020-05-06 12:28

When I try to use the svmlight Python package with data I have already converted to svmlight format, I get an error. It should be pretty basic, but I don't understand what's happening. Here's the code:

import svmlight
training_data = open('thedata', "w")
model=svmlight.learn(training_data, type='classification', verbosity=0)

I've also tried:

training_data = numpy.load('thedata')

and

training_data = __import__('thedata')

1 answer
Deceive 欺骗
#2 · 2020-05-06 12:42

One obvious problem is that you are truncating your data file when you open it because you are specifying write mode "w". This means that there will be no data to read.
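A minimal sketch of the truncation, assuming a throwaway temporary file stands in for the real data file (the path and the file contents here are made up for illustration):

```python
import os
import tempfile

# Create a small throwaway file standing in for the real data file.
path = os.path.join(tempfile.mkdtemp(), "thedata")
with open(path, "w") as f:
    f.write("1 1:0.5 2:1.0\n")
size_before = os.path.getsize(path)

# Opening in write mode truncates the file immediately, before any write.
open(path, "w").close()
size_after = os.path.getsize(path)

print(size_before, size_after)  # the second number is 0
```

Opening with mode "r" (the default) leaves the contents intact.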

Anyway, you don't need to read the file like that. If your data file is like the one in this example, it is a Python file, so you need to import it. This should work:

import svmlight
from data import train0 as training_data    # assuming your data file is named data.py
# or you could use __import__()
#training_data = __import__('data').train0

model = svmlight.learn(training_data, type='classification', verbosity=0)

You might want to compare your data against that of the example.

Edit after data file format clarified

The input file needs to be parsed into a list of tuples like this:

[(target, [(feature_1, value_1), (feature_2, value_2), ... (feature_n, value_n)]),
 (target, [(feature_1, value_1), (feature_2, value_2), ... (feature_n, value_n)]),
 ...
]
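As a concrete illustration of that shape, here is a small sketch with made-up targets and feature values. The helper `has_expected_shape` is hypothetical (it is not part of the svmlight package) and only checks the nesting described above:

```python
# Hypothetical values; only the nesting matters.
training_data = [
    (1, [(1, 0.5), (3, 1.2)]),
    (-1, [(2, 0.7), (3, 0.1), (5, 2.0)]),
]


def has_expected_shape(data):
    """Return True if data is a list of (target, [(feature, value), ...]) tuples."""
    for example in data:
        if not (isinstance(example, tuple) and len(example) == 2):
            return False
        target, features = example
        if not isinstance(target, (int, float)) or not isinstance(features, list):
            return False
        for pair in features:
            if not (isinstance(pair, tuple) and len(pair) == 2):
                return False
            feature, value = pair
            if not isinstance(feature, int) or not isinstance(value, (int, float)):
                return False
    return True


print(has_expected_shape(training_data))
```

A check like this can catch the case where raw text lines, rather than parsed tuples, are passed along by mistake.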

The svmlight package does not appear to support reading from a file in the SVM file format, and it doesn't seem to provide any parsing functions, so the parsing will have to be implemented in Python. SVM files look like this:

<target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

so here is a parser that converts from the file format to that required by the svmlight package:

def svm_parse(filename):
    """Yield (target, [(feature, value), ...]) tuples from an SVM-format file."""

    def _convert(t):
        """Convert feature and value to appropriate types"""
        return (int(t[0]), float(t[1]))

    with open(filename) as f:
        for line in f:
            line = line.split('#')[0].strip()  # remove any comment
            if not line:
                continue  # skip blank and comment-only lines
            data = line.split()
            target = float(data[0])
            features = [_convert(feature.split(':')) for feature in data[1:]]
            yield (target, features)

And you can use it like this:

import svmlight

training_data = list(svm_parse('thedata'))
model=svmlight.learn(training_data, type='classification', verbosity=0)
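To see what the parser produces for a single line without touching any files, here is a self-contained sketch. `parse_svm_line` is a hypothetical helper (not part of svmlight) that mirrors the per-line logic of svm_parse, and the sample line is made up:

```python
def parse_svm_line(line):
    """Parse one SVM-format line into (target, [(feature, value), ...]).

    Returns None for blank lines and comment-only lines.
    """
    line = line.split('#')[0].strip()  # drop any comment
    if not line:
        return None
    fields = line.split()
    target = float(fields[0])
    features = []
    for field in fields[1:]:
        feature, value = field.split(':')
        features.append((int(feature), float(value)))
    return (target, features)


print(parse_svm_line("1 1:0.5 3:1.2 # first example"))
# prints (1.0, [(1, 0.5), (3, 1.2)])
```

Note that this sketch, like the parser above, assumes every feature index is an integer, so a line containing something like a qid: field would raise a ValueError.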