I have the following:
from numpy import genfromtxt
seg_data1 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype="|S5")
seg_data2 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype=["|S5"] + ["float" for n in range(19)])
print seg_data1
print seg_data2
print seg_data1[:,0:1]
print seg_data2[:,0:1]
It turns out that seg_data1 and seg_data2 are not the same kind of structure. Here's what printed:
[['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']
['BRICK' '188.0' '133.0' ..., '8.444' '0.538' '-0.92']
['BRICK' '105.0' '139.0' ..., '7.555' '0.532' '-0.96']
...,
['CEMEN' '128.0' '161.0' ..., '10.88' '0.540' '-1.99']
['CEMEN' '150.0' '158.0' ..., '12.22' '0.503' '-1.94']
['CEMEN' '124.0' '162.0' ..., '14.55' '0.479' '-2.02']]
[ ('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)
('BRICK', 188.0, 133.0, 9.0, 0.0, 0.0, 0.33333334, 0.26666674, 0.5, 0.077777736, 6.6666665, 8.333334, 7.7777777, 3.8888888, 5.0, 3.3333333, -8.333333, 8.444445, 0.53858024, -0.92481726)
('BRICK', 105.0, 139.0, 9.0, 0.0, 0.0, 0.27777782, 0.107407436, 0.83333325, 0.52222216, 6.111111, 7.5555553, 7.2222223, 3.5555556, 4.3333335, 3.3333333, -7.6666665, 7.5555553, 0.5326279, -0.96594584)
...,
('CEMEN', 128.0, 161.0, 9.0, 0.0, 0.0, 0.55555534, 0.25185192, 0.77777785, 0.16296278, 7.148148, 5.5555553, 10.888889, 5.0, -4.7777777, 11.222222, -6.4444447, 10.888889, 0.5409177, -1.9963073)
('CEMEN', 150.0, 158.0, 9.0, 0.0, 0.0, 2.166667, 1.6333338, 1.388889, 0.41851807, 8.444445, 7.0, 12.222222, 6.111111, -4.3333335, 11.333333, -7.0, 12.222222, 0.50308645, -1.9434487)
('CEMEN', 124.0, 162.0, 9.0, 0.11111111, 0.0, 1.3888888, 1.1296295, 2.0, 0.8888891, 10.037037, 8.0, 14.555555, 7.5555553, -6.111111, 13.555555, -7.4444447, 14.555555, 0.4799313, -2.0293121)]
[['BRICK']
['BRICK']
['BRICK']
...,
['CEMEN']
['CEMEN']
['CEMEN']]
Traceback (most recent call last):
File "segmentationdata.py", line 14, in <module>
print seg_data2[:,0:1]
IndexError: too many indices for array
I'd rather have genfromtxt return data in the form of seg_data1, though I don't know of any built-in way to force seg_data2 to conform to that type. As far as I know there's no easy way to do:
seg_target1 = seg_data1[:,0:1]
seg_data1 = seg_data1[:,1:]
for seg_data2. Now I could do data.astype(float), but the point is, isn't that what genfromtxt should have done to begin with when I gave it that dtype array?
With dtype="|S5" you import all columns as strings, each truncated to 5 characters. The result is a 2d array with rows like
['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']
With dtype=["|S5"] + ["float" for n in range(19)] you specify the dtype for each column, and the result is a structured array. It is 1d with 20 fields, which is why seg_data2[:,0:1] raises IndexError. You access the fields by name (look at seg_data2.dtype), not by column number.
An element, or record, of this array is displayed as a tuple, and includes a string and 19 floats:
('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)
# the initial character column
print(seg_data2['f0'])
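If the goal is to get back something like seg_data1[:,1:] as real floats, one possible route is to pull the numeric fields out of the structured array and pack them into an ordinary 2d array. This is only a sketch: the three sample rows are made up and shortened to 3 numeric columns instead of 19, and it assumes NumPy 1.16 or later, where numpy.lib.recfunctions.structured_to_unstructured is available.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical sample: a shortened stand-in for segmentation.all
# (3 numeric columns here, not the 19 in the real file)
lines = [
    "BRICK,140.0,125.0,9.0",
    "BRICK,188.0,133.0,9.0",
    "CEMENT,128.0,161.0,9.0",
]

# One dtype per column gives a 1d structured array
seg_data2 = np.genfromtxt(lines, delimiter=",", dtype="S5,f8,f8,f8")

# Fields get the default names f0, f1, ...
print(seg_data2.dtype.names)

# The label column: a 1d array of byte strings
seg_target = seg_data2["f0"]

# Select the numeric fields and pack them into a plain 2d float array
numeric_names = list(seg_data2.dtype.names[1:])
seg_features = rfn.structured_to_unstructured(seg_data2[numeric_names])

print(seg_target)
print(seg_features.shape)
```

After this, seg_features can be sliced with the usual 2d indexing, while seg_target holds the labels.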
Specifying dtype=None should produce the same thing, possibly with some integer columns instead of all floats.
It is also possible to specify a dtype with 2 fields, one the string column and the other the 19 floats. I'd have to check the docs and run a few test cases to be sure of the format.
I think you read enough of the genfromtxt docs to see that you could specify a compound dtype, but not enough to understand the results.
=================
Example of importing csv with text and numbers:
In [139]: txt=b"""one 1 2 3
...: two 4 5 6
...: """
default: all floats
In [140]: np.genfromtxt(txt.splitlines())
Out[140]:
array([[ nan, 1., 2., 3.],
[ nan, 4., 5., 6.]])
automatic dtype selection - 4 fields
In [141]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[141]:
array([(b'one', 1, 2, 3), (b'two', 4, 5, 6)],
dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
user specified field dtypes
In [142]: np.genfromtxt(txt.splitlines(),dtype='str,int,float,int')
Out[142]:
array([('', 1, 2.0, 3), ('', 4, 5.0, 6)],
dtype=[('f0', '<U'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])
Compound dtype, with column count for the numeric field (and correction to string column)
In [145]: np.genfromtxt(txt.splitlines(),dtype='S5,(3)int')
Out[145]:
array([(b'one', [1, 2, 3]), (b'two', [4, 5, 6])],
dtype=[('f0', 'S5'), ('f1', '<i4', (3,))])
In [146]: _['f0']
Out[146]:
array([b'one', b'two'],
dtype='|S5')
In [149]: _['f1']
Out[149]:
array([[1, 2, 3],
[4, 5, 6]])
If you need to do math across the numeric fields, this last case (or something more elaborate) might be most convenient.
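Applied to a file shaped like the one in the question (one label column plus 19 numeric columns), the two-field dtype might look like the sketch below. The rows are generated placeholders, not real data, and I'm assuming the '(19)f8' repeat syntax behaves the same way as '(3)int' did in the session above.

```python
import numpy as np

# Hypothetical rows shaped like segmentation.all: one label plus 19 numbers
n_numeric = 19
rows = [
    ",".join(["BRICK"] + [str(float(i)) for i in range(n_numeric)]),
    ",".join(["CEMENT"] + [str(float(i) * 2) for i in range(n_numeric)]),
]

# Two fields: f0 is the 5-character label, f1 packs all 19 floats into a subarray
data = np.genfromtxt(rows, delimiter=",", dtype="S5,(19)f8")

seg_target = data["f0"]  # 1d array of labels, one per row
seg_data = data["f1"]    # ordinary 2d float array, one row per record

print(seg_target.shape, seg_data.shape)
print(seg_data.mean(axis=1))  # math across the numeric columns works directly
```

This gives the same split as seg_data1[:,0:1] and seg_data1[:,1:], except that the numeric part is already float rather than string.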
To generate something more complicated, it may be best to develop the dtype in a separate expression (dtype syntax can be tricky):
In [172]: dt=np.dtype([('f0','|S5'),('f1',[('f10',int),('f11',float,(2))])])
In [173]: np.genfromtxt(txt.splitlines(),dtype=dt)
Out[173]:
array([(b'one', (1, [2.0, 3.0])), (b'two', (4, [5.0, 6.0]))],
dtype=[('f0', 'S5'), ('f1', [('f10', '<i4'), ('f11', '<f8', (2,))])])