I have the following:
from numpy import genfromtxt
seg_data1 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype="|S5")
seg_data2 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype=["|S5"] + ["float" for n in range(19)])
print seg_data1
print seg_data2
print seg_data1[:,0:1]
print seg_data2[:,0:1]
It turns out that seg_data1 and seg_data2 are not the same kind of structure. Here's what printed:
[['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']
['BRICK' '188.0' '133.0' ..., '8.444' '0.538' '-0.92']
['BRICK' '105.0' '139.0' ..., '7.555' '0.532' '-0.96']
...,
['CEMEN' '128.0' '161.0' ..., '10.88' '0.540' '-1.99']
['CEMEN' '150.0' '158.0' ..., '12.22' '0.503' '-1.94']
['CEMEN' '124.0' '162.0' ..., '14.55' '0.479' '-2.02']]
[ ('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)
('BRICK', 188.0, 133.0, 9.0, 0.0, 0.0, 0.33333334, 0.26666674, 0.5, 0.077777736, 6.6666665, 8.333334, 7.7777777, 3.8888888, 5.0, 3.3333333, -8.333333, 8.444445, 0.53858024, -0.92481726)
('BRICK', 105.0, 139.0, 9.0, 0.0, 0.0, 0.27777782, 0.107407436, 0.83333325, 0.52222216, 6.111111, 7.5555553, 7.2222223, 3.5555556, 4.3333335, 3.3333333, -7.6666665, 7.5555553, 0.5326279, -0.96594584)
...,
('CEMEN', 128.0, 161.0, 9.0, 0.0, 0.0, 0.55555534, 0.25185192, 0.77777785, 0.16296278, 7.148148, 5.5555553, 10.888889, 5.0, -4.7777777, 11.222222, -6.4444447, 10.888889, 0.5409177, -1.9963073)
('CEMEN', 150.0, 158.0, 9.0, 0.0, 0.0, 2.166667, 1.6333338, 1.388889, 0.41851807, 8.444445, 7.0, 12.222222, 6.111111, -4.3333335, 11.333333, -7.0, 12.222222, 0.50308645, -1.9434487)
('CEMEN', 124.0, 162.0, 9.0, 0.11111111, 0.0, 1.3888888, 1.1296295, 2.0, 0.8888891, 10.037037, 8.0, 14.555555, 7.5555553, -6.111111, 13.555555, -7.4444447, 14.555555, 0.4799313, -2.0293121)]
[['BRICK']
['BRICK']
['BRICK']
...,
['CEMEN']
['CEMEN']
['CEMEN']]
Traceback (most recent call last):
File "segmentationdata.py", line 14, in <module>
print seg_data2[:,0:1]
IndexError: too many indices for array
I'd rather have genfromtxt return data in the form of seg_data1, though I don't know of any built-in way to force seg_data2 to conform to that type. As far as I know there's no easy way to do:
seg_target1 = seg_data1[:,0:1]
seg_data1 = seg_data1[:,1:]
for seg_data2. Now I could do data.astype(float), but the point is, isn't that what genfromtxt should have done to begin with when I gave it that dtype array?
With dtype="|S5" you import all columns as 5-character strings. The result is a 2d array in which every row is a list of strings.
With dtype=["|S5"] + ["float" for n in range(19)] you specify a dtype for each column, and the result is a structured array. It is 1d with 20 fields. You access the fields by name (look at seg_data2.dtype), not by column number.
An element, or record, of this array is displayed as a tuple, and includes a string and 19 floats. The first field is the initial character column.
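To make the difference concrete, here is a minimal sketch of the structured-array case. It assumes Python 3 and a reasonably recent NumPy, and it uses a made-up three-column sample (the `sample` string and the name `rec` are mine, not from the question) in place of the 20-column segmentation.all file. The field names f0, f1, f2 are the defaults genfromtxt assigns when the dtype carries no names.

```python
from io import StringIO

import numpy as np

# Hypothetical three-column stand-in for segmentation.all (the real file has 20 columns).
sample = "BRICK,140.0,125.0\nBRICK,188.0,133.0\nCEMENT,128.0,161.0\n"

# Same dtype pattern as in the question, shortened to two float columns.
rec = np.genfromtxt(
    StringIO(sample),
    delimiter=",",
    dtype=["S5"] + ["float" for n in range(2)],
)

print(rec.shape)        # one dimension: one record per row
print(rec.dtype.names)  # default field names assigned by genfromtxt
print(rec["f0"])        # the initial character column (bytes under Python 3)
print(rec[0])           # a single record, displayed as a tuple

try:
    rec[:, 0:1]         # column slicing needs a 2d array, so this fails
except IndexError as err:
    print("IndexError:", err)
```

Indexing with a field name returns an ordinary 1d array for that column, which is the structured-array counterpart of slicing a column out of the 2d string array.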
Specifying dtype=None should produce the same thing, possibly with some integer columns instead of all floats.
It is also possible to specify a dtype with 2 fields, one the string column, and the other the 19 floats. I'd have to check the docs and run a few test cases to be sure of the format.
I think you read enough of the genfromtxt docs to see that you could specify a compound dtype, but not enough to understand the results.
=================
Example of importing csv with text and numbers:
default: all floats
automatic dtype selection - 4 fields
user specified field dtypes
Compound dtype, with column count for the numeric field (and correction to string column)
If you need to do math across the numeric fields, this last case (or something more elaborate) might be most convenient.
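The four cases listed above can be sketched roughly as follows. This is an illustration under assumptions, not the exact session: the two-line sample `txt` and the variable names are made up, the code is written for Python 3, and `encoding="utf-8"` is passed in the dtype=None case only so that string handling is the same across NumPy versions.

```python
from io import StringIO

import numpy as np

# Hypothetical sample: one text column followed by three numeric columns.
txt = "one,1,2,3\ntwo,4,5,6\n"

# default: all floats, so the text column cannot be converted and becomes nan
default = np.genfromtxt(StringIO(txt), delimiter=",")

# automatic dtype selection: a structured array with 4 fields
auto = np.genfromtxt(StringIO(txt), delimiter=",", dtype=None, encoding="utf-8")

# user specified field dtypes, one per column
fields = np.genfromtxt(StringIO(txt), delimiter=",", dtype="S3,i4,f8,i4")

# compound dtype: a string field plus one field holding 3 integers per row
compound = np.genfromtxt(StringIO(txt), delimiter=",", dtype="S3,(3,)i4")

print(default)
print(auto.dtype)
print(fields.dtype)
print(compound["f1"])              # 2d integer array, one row per record
print(compound["f1"].sum(axis=1))  # math across the numeric columns
```

In the compound case the numeric field behaves like a normal 2d array, which is why it is convenient for arithmetic across columns.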
To generate something more complicated it may be best to develop the dtype in a separate expression (dtype syntax can be tricky).
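As a sketch of that approach applied to the shape of the question's data (one 5-character label followed by 19 floats), the dtype can be built and inspected on its own before it is handed to genfromtxt. The field names "label" and "values", the variable names, and the generated two-row sample are all hypothetical; the real segmentation.all file is not used here.

```python
from io import StringIO

import numpy as np

n_values = 19

# Build the compound dtype separately so it can be checked before loading.
dt = np.dtype([("label", "S5"), ("values", float, (n_values,))])
print(dt)

# Made-up rows: a label followed by 19 numbers.
rows = [
    ",".join(["BRICK"] + [str(float(i)) for i in range(n_values)]),
    ",".join(["CEMEN"] + [str(float(2 * i)) for i in range(n_values)]),
]

data = np.genfromtxt(StringIO("\n".join(rows)), delimiter=",", dtype=dt)

seg_target = data["label"]     # 1d array of labels
seg_features = data["values"]  # 2d float array, shape (rows, 19)

print(seg_target)
print(seg_features.shape)
print(seg_features.mean(axis=1))
```

This gives roughly the split the question was after (a target column and a numeric block), with the numeric block already stored as floats.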