I'm very new to scikit-learn and machine learning in general.
I am currently designing an SVM to predict whether a specific amino acid sequence will be cut by a protease. So far the SVM method seems to be working quite well.
I'd like to visualize the distance between the two categories (cut and uncut), so I'm trying to use linear discriminant analysis (LDA), which is similar to principal component analysis (PCA), with the following code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
targs = np.array([1 if _ else 0 for _ in XOR_list])  # 1 = cleaved, 0 = not cleaved
DATA = np.array(data_list)
X_r2 = lda.fit(DATA, targs).transform(DATA)
plt.figure()
for c, i, target_name in zip("rg", [1, 0], ["Cleaved", "Not Cleaved"]):
    plt.scatter(X_r2[targs == i], X_r2[targs == i], c=c, label=target_name)
plt.legend()
plt.title('LDA of cleavage_site dataset')
However, the LDA only gives a 1D result:
In: print(X_r2[:5])
Out: [[ 6.74369996]
[ 4.14254941]
[ 5.19537896]
[ 7.00884032]
[ 3.54707676]]
PCA, on the other hand, gives 2 dimensions with the same input data:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_r = pca.fit(DATA).transform(DATA)
In: print(X_r[:5])
Out: [[ 0.05474151 0.38401203]
[ 0.39244191 0.74113729]
[-0.56785236 -0.30109694]
[-0.55633116 -0.30267444]
[ 0.41311866 -0.25501662]]
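To make this reproducible without my data, here is a minimal sketch on made-up data. The arrays X and y and the random features are placeholders I invented for illustration, not my real dataset. It shows the same behaviour: LDA returns one column, PCA returns two. I suspect this is related to having only two classes, since the scikit-learn documentation describes n_components for LDA as capped at min(n_classes - 1, n_features), but I'm not sure that is the whole story.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up stand-in for my data: 100 samples, 5 numeric features, 2 classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(1.5, 1.0, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

# n_components is left at its default here, so the sketch does not depend
# on how a given scikit-learn version handles n_components=2 with 2 classes.
X_lda = LinearDiscriminantAnalysis().fit(X, y).transform(X)
X_pca = PCA(n_components=2).fit(X).transform(X)

print(X_lda.shape)  # (100, 1)
print(X_pca.shape)  # (100, 2)
```

With three classes instead of two, the same LDA call should return two columns, which would at least be consistent with the n_classes - 1 limit.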
Edit: here are links to two Google Docs with the input data. I am not using the sequence information, only the numerical values that follow it. The files are split into positive and negative control data. Input data: file1 file2