Query about SVM mapping of input vector, and SVM optimization

Posted 2019-07-23 16:28

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features, and based on how the "optimization function" evaluates this input vector (let's call it x), the text associated with x is classified into one of two pre-defined classes (let's say we're talking about text classification). This is the binary classification case.
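For context, here is roughly how I picture that setup with scikit-learn (just a rough sketch to fix ideas; the sample texts and the spam/not-spam labels are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training texts and binary labels (1 = spam, 0 = not spam).
texts = ["cheap pills buy now", "meeting moved to 3pm",
         "win a free prize", "see you at lunch"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # each text becomes a feature vector x

clf = LinearSVC()                     # linear SVM: learns a separating hyperplane (w, b)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["free prize pills"])))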

So my first question: all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Say the input vector x has 5 features; who decides which "higher dimension" x gets mapped to?

Second question is about the following optimization equation:

min_{w, ξ}  (1/2) w^T w + C Σ_{i=1..n} ξ_i

So I understand that w has something to do with the margin of the hyperplane from the support vectors, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?

A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.

2 answers
虎瘦雄心在
#2 · 2019-07-23 17:10

1) The mapping to the higher-dimensional space happens through the kernel mechanism. However, when evaluating a test sample, that higher-dimensional space never needs to be computed explicitly. (Clearly this must be the case, because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply an infinite-dimensional space, yet we never map into that space explicitly. We only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample.
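For example, a single RBF kernel evaluation is just one number computed directly from the two original vectors. A rough Python sketch (the gamma value and the two vectors below are made-up numbers, purely for illustration):

import numpy as np

def rbf_kernel(x_sv, x_test, gamma=0.5):
    # K(x_sv, x_test) = exp(-gamma * ||x_sv - x_test||^2).
    # This one number is all the SVM needs at test time; the
    # infinite-dimensional feature map is never built explicitly.
    diff = np.asarray(x_sv) - np.asarray(x_test)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))   # a single scalar similarity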

The specific higher-dimensional space is determined by the kernel and its parameters; the training procedure then chooses a set of support vectors and their corresponding weights.
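If it helps to see those pieces at work, here is a rough sketch using scikit-learn's SVC on made-up toy data: the decision value for a test point is just a weighted sum of kernel evaluations against the fitted support vectors, plus an intercept (dual_coef_, support_vectors_ and intercept_ are scikit-learn attribute names, used here only for illustration).

import numpy as np
from sklearn.svm import SVC

# Made-up 2-D training data with two classes.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Rebuild the decision value "by hand": a weighted sum of kernel
# evaluations against the support vectors, plus an intercept.
x_test = np.array([1.5, 1.5])
k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x_test) ** 2, axis=1))
by_hand = np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

print(by_hand, clf.decision_function([x_test])[0])   # the two values agree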

2) C is the weight on the cost of not being able to classify the training set perfectly. The optimization equation says to trade off between the two undesirable cases of imperfect classification and a small margin. The ξ_i variables represent by how much we fail to classify instance i of the training set correctly, i.e., the training error on instance i.
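As a rough illustration of that trade-off (made-up toy data), you can train with a few different values of C and watch the margin term and the training accuracy move in opposite directions:

import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data that is NOT linearly separable: the class-1 point at
# x = 0.5 sits among the class-0 points, so some xi_i must be nonzero.
X = np.array([[0.0], [1.0], [0.5], [2.0], [3.0]])
y = np.array([0, 0, 1, 1, 1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: ||w|| = {np.linalg.norm(clf.coef_[0]):.3f}, "
          f"training accuracy = {clf.score(X, y):.2f}")

# Small C typically keeps ||w|| small (wide margin) and tolerates training
# errors; large C pays a high price per unit of slack, so the fit chases
# the training points even though the margin shrinks.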

See Chris Burges' tutorial on SVMs for about the most intuitive explanation of this stuff you're going to get anywhere (IMO).

够拽才男人
#3 · 2019-07-23 17:12

When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally linear classifiers, but with a kernel they are linear in a space that's different from the original data space.

To be concrete, let's talk about the kernel

K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,

where x and y are each real numbers (one-dimensional). Note that

(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1

has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
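If you want to convince yourself numerically, here is a tiny sketch with arbitrary example values for x and y:

import numpy as np

def phi(a):
    # Explicit feature map for the degree-2 inhomogeneous polynomial kernel in 1-D.
    return np.array([a**2, np.sqrt(2) * a, 1.0])

def K(x, y):
    return (x * y + 1) ** 2

x, y = 3.0, -0.7     # arbitrary example values
print(K(x, y), np.dot(phi(x), phi(y)))   # the two values agree (up to floating point)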

The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.

You're the one who decides what feature space the data is mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly; otherwise infinite-dimensional points would be hard to represent. :)


The ξ_i are the "slack variables". Without them, SVMs would never be able to handle training sets that aren't linearly separable -- which describes most real-world datasets. The ξ_i are, in some sense, the amount by which you need to push data points on the wrong side of the margin over to the correct side. C is the parameter that determines how much it costs you to increase the ξ_i (that's why it multiplies them in the objective).
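If a concrete picture helps, here is a small sketch on made-up toy data: fit a linear SVM, then read each ξ_i off the fitted model with the standard hinge expression max(0, 1 - y_i (w · x_i + b)).

import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data that is not linearly separable; labels are +/-1
# so the slack formula below reads cleanly.
X = np.array([[0.0], [1.0], [0.5], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# xi_i = max(0, 1 - y_i * (w . x_i + b)): zero for points on the correct
# side of the margin, positive for points inside the margin or misclassified.
xi = np.maximum(0, 1 - y * (X @ w + b))
print(xi)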
