numpy.genfromtxt with datetime.strptime converter

Posted 2019-02-22 01:23

Question:

I have data similar to that seen in this gist and I am trying to extract it with numpy. I am rather new to Python, so I tried the following code:

import numpy as np
from datetime import datetime

convertfunc = lambda x: datetime.strptime(x, '%H:%M:%S:.%f')
col_headers = ["Mass", "Thermocouple", "T O2 Sensor",\
               "Igniter", "Lamps", "O2", "Time"]
data = np.genfromtxt(files[1], skip_header=22,\
                     names=col_headers,\
                     converters={"Time": convertfunc})

As can be seen in the gist, there are 22 rows of header material. When I run the code above in IPython, I receive an error that ends with the following:

TypeError: float() argument must be a string or a number

The full ipython error trace can be seen here.

I am able to extract the six columns of numeric data just fine by passing genfromtxt an argument like usecols=range(0,6), but when I try to use a converter to handle the last column I'm stumped. Any and all comments would be appreciated!

Answer 1:

This happens because np.genfromtxt is trying to create a float array, which fails because convertfunc returns a datetime object, and a datetime cannot be cast to float. The easiest solution would be to pass dtype='object' to np.genfromtxt, which creates an object array and prevents the conversion to float. However, the other columns would then be saved as strings. To get them properly saved as floats, you need to specify the dtype of each column, which gives you a structured array. Here I'm setting them all to double except the last column, which will be an object dtype:

dd = [(a, 'd') for a in col_headers[:-1]] + [(col_headers[-1], 'object')]
data = np.genfromtxt(files[1], skip_header=22, dtype=dd, 
                     names=col_headers, converters={'Time': convertfunc})

This will give you a structured array which you can access with the names you gave:

In [74]: data['Mass']
Out[74]: array([ 0.262 ,  0.2618,  0.2616,  0.2614])
In [75]: data['Time']
Out[75]: array([1900-01-01 15:49:24.546000, 1900-01-01 15:49:25.171000,
                1900-01-01 15:49:25.405000, 1900-01-01 15:49:25.624000], 
                dtype=object)
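
For readers who want to try this without the original gist, here is a minimal self-contained sketch of the same approach. The sample rows, the reduced column list, and the '%H:%M:%S.%f' time format are assumptions made for illustration, not the asker's actual data. The bytes check is also an assumption about the environment: older NumPy versions pass bytes to converters under Python 3, while newer ones pass str.

```python
import io
from datetime import datetime

import numpy as np

# Hypothetical stand-in for the data file: two numeric columns and a time column.
sample = io.StringIO(
    "0.2620 21.5 15:49:24.546\n"
    "0.2618 21.6 15:49:25.171\n"
)

def parse_time(text):
    # Older NumPy versions hand bytes to converters; decode them first.
    if isinstance(text, bytes):
        text = text.decode()
    # Assumed time layout; adjust the format string to match the real file.
    return datetime.strptime(text, "%H:%M:%S.%f")

names = ["Mass", "Thermocouple", "Time"]
# Doubles for the numeric columns, object for the datetime column.
dtype = [("Mass", "d"), ("Thermocouple", "d"), ("Time", "object")]

data = np.genfromtxt(sample, dtype=dtype, names=names,
                     converters={"Time": parse_time}, encoding="utf-8")

print(data["Mass"])
print(data["Time"])
```

Because strptime is given no date, every value in the Time column carries the default date 1900-01-01, as in the output shown above.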


Answer 2:

You can use pandas read_table:

    import pandas as pd
    frame = pd.read_table('/tmp/gist', header=None, skiprows=22, delimiter=r'\s+')

This worked for me. You need to process the header separately, since its fields are separated by a variable number of spaces.
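
If the time column should end up as real timestamps rather than strings, one option is to convert it after loading. The following is only a sketch: the sample text, the two-line header, the column names, and the time format are all assumptions, and pd.read_csv with a whitespace separator is used here in place of read_table.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file: two header lines, then whitespace-separated data.
sample = io.StringIO(
    "header line one\n"
    "header line two\n"
    "0.2620 21.5 15:49:24.546\n"
    "0.2618   21.6   15:49:25.171\n"
)

frame = pd.read_csv(sample, header=None, skiprows=2, sep=r"\s+",
                    names=["Mass", "Thermocouple", "Time"])

# Parse the time strings; with no date in the format, pandas uses 1900-01-01.
frame["Time"] = pd.to_datetime(frame["Time"], format="%H:%M:%S.%f")

print(frame.dtypes)
```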