Classification of Electrical Signals using SVM

I am trying to map electrical signals (specifically EEG signals) to actions. I have the raw data from the EEG device, which has 14 channels, so for each training instance I end up with a 14x128 matrix (14 channels, 128 samples, i.e. a 1-second window). Currently I apply a Hamming window to each channel and then an FFT, so that I can classify by frequency. What I cannot wrap my head around is that an SVM (or any other classification algorithm) expects a matrix with one row per instance, a single value per feature, and a class column.

In the case of EEG, however, each channel is a feature, but instead of holding a single value each channel holds a vector of 128 values. What would be the best way to transform this matrix into a form that an SVM can understand? Do I just modify the 14x128 matrices by adding a new class column and append them one after the other, so that a 1-second record of the EEG signal ends up with 128 pos/neg class labels?

You almost certainly need some feature extraction prior to handing the raw data to the SVM. With temporal data like this, the important features are generally not represented well by individual point readings. Rather, they are captured by relationships over time.

I did some work about 10 years ago with SVMs on EEG data[1]. What we did at the time was split the data into windows and then fit an autoregressive (AR) model to each window. Our features for the classifiers were not the raw sensor readings, but the AR coefficients for each channel. This gives the classifier much more useful information to work with.
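To make the AR-coefficient idea concrete, here is a minimal sketch in Python with NumPy. It is not the code from the paper: the Yule-Walker estimator, the model order of 6, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def ar_coefficients(x, order=6):
    """Yule-Walker estimate of AR(order) coefficients for a 1-D signal.

    The order of 6 is an arbitrary illustrative default, not a recommendation.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocovariance estimates r[0], ..., r[order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R a = r[1:], where R[i, j] = r[|i - j|]
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:])

def ar_features(window, order=6):
    """window: (n_channels, n_samples) -> 1-D vector of length n_channels * order."""
    return np.concatenate([ar_coefficients(ch, order) for ch in window])

# Made-up example: one 14-channel, 128-sample window of noise.
rng = np.random.default_rng(0)
window = rng.standard_normal((14, 128))
features = ar_features(window)   # 14 channels * 6 coefficients = 84 values
```

Each window then becomes one row of 84 features plus a class label, instead of 1792 raw samples.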

I haven't kept working in that area, and I can't say for sure what people are doing now 10+ years later, but certainly I would expect the state of the art to still involve some sort of feature extraction.

[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1214704 (pdf available from my personal page http://www.ru.is/kennarar/deong/pubs/ieee_eeg_final.pdf)

Edit: In light of the discussion in the comments, I'm editing the answer to provide a bit more detail. Signal processing is not my strongest area, so if I'm completely mistaking your description of what it is you're doing, feel free to ignore.

Yes, the answer to the question you asked is that when you have multiple channels of data, so that each instance is a matrix, you just concatenate the rows into a single row vector. So if you're getting a 14x128 matrix for each training instance, you'd convert that into a 1x1792 vector and then stick the class label on the end, like this:

c1x1 | c1x2 | c1x3 | ... | c1x128 | c2x1 | c2x2 | ... | c14x127 | c14x128 | class

where cNxM = channel N, sample M. That would be the standard way to make a single feature vector out of a sort of feature matrix.
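As a minimal sketch of that flattening in Python with NumPy (the array names and the random data are made up for illustration):

```python
import numpy as np

# Hypothetical data: 10 one-second windows, 14 channels, 128 samples each.
rng = np.random.default_rng(0)
windows = rng.standard_normal((10, 14, 128))
labels = rng.integers(0, 2, size=10)   # made-up binary class labels

# A row-major reshape lays out channel 1's 128 samples first, then channel 2's,
# and so on, which matches the c1x1 | ... | c14x128 ordering above.
X = windows.reshape(len(windows), -1)  # shape (10, 1792)
y = labels                             # one class label per row of X
```

Most libraries take the feature matrix and the label vector as two separate arguments (X and y here) rather than as one table with the class in the last column.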

However...read on to see why I think this is not what you really want to do.

I'm still not clear what it is you're describing. In particular, where does the 128 come from? I see two possibilities here. (A) is that you sample each of the 14 electrodes 128 times for each item you want to classify. This is what I'm calling the raw data. (B) is that you've already run the DFT and you've ended up with 128 coefficients per channel. I think (A) is what you mean, and that's what I assume here, but it's not entirely clear.

For classification, you need meaningful features. Features are just whatever you decide to make them. You could take each of the 14 sensors, compute the mean and variance of the 128 points, and use those as your features. In that case, your training instances would look like

mean_ch1 | var_ch1 | mean_ch2 | var_ch2 | ... | mean_ch14 | var_ch14 | class
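A sketch of how such a row could be computed with NumPy (the function name is hypothetical):

```python
import numpy as np

def mean_var_features(window):
    """window: (n_channels, n_samples) -> [mean_ch1, var_ch1, mean_ch2, var_ch2, ...]"""
    means = window.mean(axis=1)       # one mean per channel
    variances = window.var(axis=1)    # one (population) variance per channel
    # Pair up (mean, var) per channel, then flatten channel by channel.
    return np.column_stack([means, variances]).ravel()

# Made-up example window: 14 channels, 128 samples.
rng = np.random.default_rng(0)
window = rng.standard_normal((14, 128))
features = mean_var_features(window)  # 28 values for one training instance
```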

For EEG classification, mean and variance aren't going to be very good though -- they're not likely to provide enough useful information to discriminate between the classes. That's what I mean by meaningful features. If you want to predict whether, for example, an invasive species will thrive in a lake, you might need to know the temperature. You could then pass the classifier the estimated velocity of every water molecule in the lake separately, but that's entirely the wrong level of detail, and it's really unlikely the classifier would learn anything. You need to give it the temperature already computed.

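So in your case, you could instead take an FFT of each window of 128 points. For a real-valued 128-sample window that gives 65 frequency bins per channel, and you would typically keep only a small number of them (or a few summaries of them) as features. Your training data would then look like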

dft_coeff1_ch1 | dft_coeff2_ch1 | dft_coeff3_ch1 | dft_coeff1_ch2 | dft_coeff2_ch2 | ... | class
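Here is one way this might look in Python with NumPy, using the Hamming window mentioned in the question. Summing spectral power into three bands is my own assumption about how to reduce the spectrum to a few numbers per channel, and the band edges are only illustrative; the 128 Hz sampling rate follows from 128 samples per 1-second window.

```python
import numpy as np

FS = 128  # Hz: 128 samples per 1-second window, as in the question
# Illustrative band edges in Hz (an assumption, adjust to your problem).
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_power_features(window, fs=FS, bands=BANDS):
    """window: (n_channels, n_samples) -> per-channel band powers, channel by channel."""
    n_samples = window.shape[1]
    tapered = window * np.hamming(n_samples)            # taper every channel
    power = np.abs(np.fft.rfft(tapered, axis=1)) ** 2   # (n_channels, n_samples // 2 + 1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)      # bin centre frequencies in Hz
    feats = []
    for lo, hi in bands.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(power[:, mask].sum(axis=1))        # one value per channel
    # (n_channels, n_bands), flattened so each channel's bands stay adjacent
    return np.column_stack(feats).ravel()

# Made-up example: the same 10 Hz sine wave on all 14 channels.
t = np.arange(128) / FS
window = np.tile(np.sin(2 * np.pi * 10 * t), (14, 1))
features = band_power_features(window)  # 14 channels * 3 bands = 42 values
```

For the 10 Hz test signal, almost all of the power lands in the 8 to 13 Hz band of every channel, which is the kind of compact, discriminative description the classifier can actually use.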

You could also just dump the 128 values per channel into the feature vector unmodified, giving you 14*128=1792 features per input, but those features are probably terribly unhelpful -- you're giving it the velocities of molecules rather than the temperature again. In principle, most learning algorithms would be capable of learning the target concept, but the requirements on the amount of training data and time needed may be vast.

Features should capture the level of detail the classifier can use. For most time series data, that usually means high-level conceptual things like "sloping upward", "V-shaped", "flat for a while, then decreasing", "oscillating at these frequencies", etc. Whatever you as a human think might be relevant. This is really the reason to use something like a Fourier transform: the frequency domain gives you a much higher-level, and probably more useful, description of the signal, with many fewer degrees of freedom than the time domain.