Coding variables with Pandas TimeSeries

2019-08-31 12:02发布

问题:

As a follow up to something I was struggling with in a previous question, I've been working for a long time on an analysis of some pretty complicated behavioural data from a mouse-tracking experiment in Pandas.

A relevant subset of my data looks like this:

data.iloc[0]

time_stamp                                     21/11/2013 13:06
subject                                                 1276270
trial                                                         0
stimuli                                                      14
resp                                                          2
rt                                                         1145
x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
Name: 0, dtype: object

where, x, y, and t are 1D numpy arrays of mouse coordinates and timestamps.

I wanted to use Pandas' considerable resources for time series data to transform and analyse these coordinates as TimeSeries objects. I have no problem converting them to TimeSeries objects (rx and ry, each with indexes generated by interpolating the timestamps into 20 msec intervals.

data.rx.iloc[0]

0     -0
20     0
40     0
60     0
80     0
100    0
120    0
140    0
160    0
180    0
200    0
220    0
240    0
260    0
280    0
...
2720    1
2740    1
2760    1
2780    1
2800    1
2820    1
2840    1
2860    1
2880    1
2900    1
2920    1
2940    1
2960    1
2980    1
3000    1
Length: 151, dtype: float64

However, this approach, with 2 TimeSeries nested on each row of the DataFrame, definitely isn't idiomatic (see this question); although I have been able to do quite a bit with it, I feel I'm going against Pandas, and making life difficult for myself.

The proper approach, I think, would be to either store rx and ry as independent data structures, or add 302 columns to my existing data, one for each time step in rx and ry.

The problem with the first approach is that I have no way of accessing my categorical data (i.e. the subject, stimuli, and resp columns, amongst others I've left out here), while the problem with the second is that I end up with a DataFrame thousands of columns wide (and wider again for each transformation I apply: velocity at each step, angle at each step, etc), and no useful way of accessing specific time serieses (i.e. what I've been currently calling as data.rx.mean().plot().

All of this is really just preamble to my question, which is this:

Does Pandas, or any other python library, provide a way of processing a large number of time series data, while preserving the coding data that accompanies them?

Thanks,

Eoin

回答1:

I've been asked via email if I ever found a solution to what I wanted to do here, so I'm sharing what I've been doing to date. This might not be the canonical way of using pandas, but it's sufficed for me.

In short, I've split my data into a couple of data frames. The first, data, is as above, but I only use the columns which correspond to single values, like trial, stimuli, resp, and rt.

For my time series data, I use two additional data frames, one for the x-coordinate data, and one for the y. Although there's probably a more elegant way of generating these, my code does the following.

data.iloc[0]

    time_stamp                                     21/11/2013 13:06
    subject                                                 1276270
    trial                                                         0
    stimuli                                                      14
    resp                                                          2
    rt                                                         1145
    x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
    y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
    t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
    Name: 0, dtype: object

data['nx'], data['ny'] = zip(*
     [even_time_steps(x, y, t)
     for x, y, t, in zip(data.x, data.y, data.t)])
     # Using function even_time_steps from package squeak
     # https://github.com/EoinTravers/Squeak 
     # Simpler applications could use
     # data['nx'] = [pd.TimeSeries(x) for y in data['x']]
     # data['ny'] = [pd.TimeSeries(x) for y in data['y']]

# Seperate DataFrames
nx = pd.concat(list(data.nx), axis=1).T
ny = pd.concat(list(data.ny), axis=1).T

# Remove redundant columns
redundant = ['nx', 'ny', 'x', 'y'] # etc...
data = data.drop(redundant, axis=1)

# Important - reindex data
data.index = range(len(data)) # 0, 1, 2, ..., len(data)

Now data contains all my coding information, nx all my x-coordinate information, and ny my y coordinate information.

nx.head()

       0    1    2    3    4    5    6    7    8        9     ...          91
    0    0    0    0    0    0    0    0    0    0  0.00000   ...     0.953960   
    1    0    0    0    0    0    0    0    0    0  0.00099   ...     1.000000   
    2    0    0    0    0    0    0    0    0    0  0.00000   ...     1.010000   
    3    0    0    0    0    0    0    0    0    0  0.00000   ...     0.870396   
    4    0    0    0    0    0    0    0    0    0  0.00000   ...     1.000000   

             92        93        94       95        96   97   98   99   100  
    0  0.993564  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    1  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    2  1.010000  1.008812  1.003960  1.00000  1.000000    1    1    1    1  
    3  0.906238  0.936931  0.973564  0.98604  0.993366    1    1    1    1  
    4  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  

    [5 rows x 101 columns]

Finally, to select specific subsets of the x and y data, according to coding variables stored in data, I just take the index of the relevant subset of data

subject1_index = data[data.subject==1].index
print subject1_index

    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    18, 19, 20,  21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
    36, 37, 38, 39], dtype='int64')

and select a matching subset of nx and ny using the iloc method.

sub1_x = nx.iloc[subject1_index]
sub1_y = ny.iloc[subject1_index]
for i in subject1_index:
    plt.plot(nx.iloc[i], ny.iloc[i], 'r', alpha=.3)
plt.plot(sub1_x.mean(), sub1_y.mean(), 'r', linewidth=2)


EDIT: For completeness, note that a lot of my analysis requires long format data (and is carried out in R). Again, there may be a more elegant way of doing this (so use at your own risk!), but my code goes (note, this is real code, from a different dataset, and I haven't bothered to change the variable names to match the original example):

# Long format data
wide_data = data.copy()
steps = nx.columns
for i in steps:
    wide_data['nx_%i' % i] = nx[i]
    wide_data['ny_%i' % i] = ny[i]

id_vars = ['subject_nr', 'condition', 'count_trial_sequence',
    'trial_id', 'choice', 'accuracy']

# Long data with 'nx' as the variable
long_data = pd.melt(wide_data, id_vars=id_vars, value_vars = ['nx_%i' % i for i in steps])
long_data['step'] = long_data.variable.map(lambda s: int(s[3:]))
long_data['nx'] = long_data.value

# Same with 'ny'
tmp_long = pd.melt(wide_data, id_vars=id_vars, value_vars = ['ny_%i' % i for i in steps])
# Combine in single data frame
long_data['ny'] = tmp_long['value']
del tmp_long

long_data = long_data.drop(['variable', 'value'], axis=1)
long_data.to_csv(os.path.join('data', 'long_data.csv'))

long_data.head()
Out[41]: 
       subject_nr      condition  count_trial_sequence  trial_id choice accuracy  
    0   505250022              A                     0        13   rsp1     True   
    1   505250022              A                     1        16   rsp1     True   
    2   505250022              B                     2         2   rsp2    False   
    3   505250022              B                     3         0   rsp1    False   
    4   505250022              C                     4        33   rsp2    False   

       step  nx  ny  
    0     0   0   0  
    1     0   0   0  
    2     0   0   0  
    3     0   0   0  
    4     0   0   0