The Scenario
I've read a CSV file (which is \t separated) into a DataFrame, which now needs to be converted to a NumPy array for clustering, without changing the column types.
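Roughly, the read step looks like this (a minimal sketch; the file name is taken from my edit below, and values is the array sliced in the attempt further down):
import pandas as pd

# tab-separated file
df = pd.read_csv('AllData.csv', sep='\t')
# the whole frame as a plain NumPy array
values = df.values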
The Problem
Following the references I've tried so far (listed below), I've failed to get the output in the required format. The two columns whose values I'm trying to fetch are int64 / float64, as shown below:
uid iid rat
0 196 242 3.000000
1 186 302 3.000000
2 22 377 1.000000
I'm interested in only iid and rat for the moment, and I want to pass them to the KMeans.fit() method, and not printed with the exponential ("EPSILON") notation either. I need it in the following format
Expected format
[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]
Unsuccessful Attempt
X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print(someArray)
and it doesn't fare well on execution:
[[[ 2.42000000e+02]
[ 3.02000000e+02]
[ 3.77000000e+02]
...,
[ 1.35200000e+03]
[ 1.62600000e+03]
[ 1.65900000e+03]]
[[ 3.00000000e+00]
[ 3.00000000e+00]
[ 1.00000000e+00]
...,
[ 1.00000000e+00]
[ 1.00000000e+00]
[ 1.00000000e+00]]]
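(Stacking the two slices column-wise, rather than wrapping them in np.array, would at least give a flat two-column shape; a minimal sketch with the same slices:)
# shape (n, 2): one row per rating, columns iid and rat
someArray = np.column_stack((X, Y))
print(someArray)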
Unhelpful references so far
- This one
- This two
- This three
- This four
EDIT 1
I tried
np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True)
and got this:
[[ nan 1.96000000e+02 1.86000000e+02 ..., 4.79000000e+02
4.79000000e+02 4.79000000e+02]
[ nan 2.42000000e+02 3.02000000e+02 ..., 1.36000000e+03
1.39400000e+03 1.65200000e+03]
[ nan 3.00000000e+00 3.00000000e+00 ..., 2.00000000e+00
1.92803605e+00 1.00000000e+00]]
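(The leading nan in every column is the header row, which genfromtxt cannot parse as numbers. Skipping the header and selecting only the two wanted columns would give the expected (n, 2) shape directly; a minimal sketch:)
np_df = np.genfromtxt('AllData.csv', delimiter='\t',
                      skip_header=1,   # drop the 'uid iid rat' header line
                      usecols=(1, 2))  # keep only iid and rat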
It seems you need read_csv to build the DataFrame, first filtering only the second and third columns with usecols, and then converting to a NumPy array with values:
import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO
temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
iid rat
0 1 0
1 2 4
2 3 3
3 4 1
X = df.values
print (X)
[[1 0]
[2 4]
[3 3]
[4 1]]
kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
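If you then want to inspect what the fitted model found, labels_ and cluster_centers_ are standard attributes of a fitted KMeans estimator; a small follow-up sketch:
print (a.labels_)           # cluster index assigned to each row of X
print (a.cluster_centers_)  # coordinates of the 2 cluster centres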
Use label-based selection and the .values attribute of the resulting pandas objects, which will be some sort of numpy array:
>>> df
uid iid rat
0 196 242 3.0
1 186 302 3.0
2 22 377 1.0
>>> df.loc[:,['iid','rat']]
iid rat
0 242 3.0
1 302 3.0
2 377 1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242., 3.],
[ 302., 3.],
[ 377., 1.]])
Note, your integer column will get promoted to float.
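A quick check of that promotion, using the same df (the mixed int64/float64 selection is upcast to float64, while an integer-only selection stays int64):
>>> df[['iid','rat']].values.dtype
dtype('float64')
>>> df[['iid']].values.dtype
dtype('int64')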
Also note, this particular selection could be approached in different ways:
>>> df.iloc[:, 1:] # integer-position based
iid rat
0 242 3.0
1 302 3.0
2 377 1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
iid rat
0 242 3.0
1 302 3.0
2 377 1.0
I like label-based because it is more explicit.
Edit
The reason you aren't seeing commas is an artifact of how numpy arrays are printed:
>>> df[['iid','rat']].values
array([[ 242., 3.],
[ 302., 3.],
[ 377., 1.]])
>>> print(df[['iid','rat']].values)
[[ 242. 3.]
[ 302. 3.]
[ 377. 1.]]
And actually, it is the difference between the str and repr results of the numpy array:
>>> print(repr(df[['iid','rat']].values))
array([[ 242., 3.],
[ 302., 3.],
[ 377., 1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242. 3.]
[ 302. 3.]
[ 377. 1.]]
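One more note: if the exponential notation in the question's output (2.42000000e+02 and so on) was the "EPSILON" concern, that too is only a display choice; NumPy's print options can force fixed-point formatting:
>>> import numpy as np
>>> np.set_printoptions(suppress=True)   # print 242. instead of 2.42e+02 from now on
This changes only how arrays are printed, not the values passed to KMeans.fit().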
Why don't you just import the CSV as a NumPy array?
import numpy as np
def read_file(fname):
    # read a tab-separated file straight into a NumPy array
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)
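A hedged usage sketch on top of that helper, assuming the AllData.csv file from the question (the header line still needs handling, e.g. with skip_header=1, otherwise it shows up as nan exactly as in the question's edit):
data = read_file('AllData.csv')   # with unpack=True, each file column becomes a row
X = data[1:3].T                   # iid and rat back as an (n, 2) array for KMeans.fit()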