Efficient timedelta calculator

Posted 2019-08-22 18:22

Question:

I have time-series data from a data logger that stamps every record with a timestamp of the form DD.MM.YYYY HH:MM:SS.fff (e.g. --[ 29.08.2018 16:26:31.406 ] --), precise down to milliseconds/microseconds. As you can imagine, a file recorded over just a few minutes can be very big (hundreds of megabytes). I need to plot a bunch of data from this file against time, ideally in milliseconds. The data looks like this:

So I need to parse these dates in Python and compute timedeltas to find the time elapsed between samples, and then generate the plots. For example, when I subtract the two time stamps --[ 29.08.2018 16:23:41.052 ] -- and --[ 29.08.2018 16:23:41.114 ] --, I want to get 62 milliseconds as the time elapsed between them.

Currently I am using dateparser (import dateparser as dp), which returns a datetime after parsing; I subtract two of those to get a timedelta and then convert it into milliseconds or seconds as needed. But this parsing is taking too long and is the bottleneck in my post-processing script.

Could anyone suggest a library that is more efficient at parsing dates and calculating timedeltas?

Here is the piece of code that is not so efficient:

import dateparser as dp

def timedelta_local(date1, date2):
    # Parse both timestamps and subtract to get a datetime.timedelta
    timedelta = dp.parse(date2) - dp.parse(date1)
    # Express the difference in several units
    timediff = {'us': timedelta.microseconds + timedelta.seconds*1000000 + timedelta.days*24*60*60*1000000,
                'ms': timedelta.microseconds/1000 + timedelta.seconds*1000 + timedelta.days*24*60*60*1000,
                'sec': timedelta.microseconds/1000000 + timedelta.seconds + timedelta.days*24*60*60,
                'minutes': timedelta.microseconds/1000000/60 + timedelta.seconds/60 + timedelta.days*24*60
               }
    return timediff
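
For example, with the bracket wrapper stripped from the two timestamps quoted above, a call would look like this and should report 62 ms:

# Hypothetical call with the two example timestamps (wrapper characters removed)
diff = timedelta_local('29.08.2018 16:23:41.052', '29.08.2018 16:23:41.114')
print(diff['ms'])   # expected: 62.0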

Thanks in advance

Answer 1:

@zvone is correct here. pandas is your best friend for this. Here is some sample code that will hopefully get you on the right track. It assumes your data is in a CSV file with a header line like the one you show in your example. I wasn't sure whether you wanted to keep the time difference as a timedelta object (easy for doing further math with) or just simplify it to a float. I did both.
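
For reference, a minimal test.csv consistent with the dataframe printed further down might look like this (the header line and spacing are my assumption, reconstructed from that output):

Time Stamp, Limit A, Value A, Limit B, Value B
--[ 29.08.2018 16:23:41.052 ] --, 15, 3.109, 30, 2.907
--[ 29.08.2018 16:23:41.114 ] --, 15, 3.020, 30, 8.242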

import pandas as pd

df = pd.read_csv("test.csv", parse_dates=[0])

# What are the data types after the initial import?

print(f'{df.dtypes}\n\n')

# What are the contents of the data frame?

print(f'{df}\n\n')

# Create a new column that strips away leading and trailing characters 
# that surround the data we want

df['Clean Time Stamp'] = df['Time Stamp'].apply(lambda x: x[3:-4])

# Convert to a pandas Timestamp. Use infer_datetime_format for speed.

df['Real Time Stamp'] = pd.to_datetime(df['Clean Time Stamp'], infer_datetime_format=True)

# Calculate time difference between successive rows

df['Delta T'] = df['Real Time Stamp'].diff()

# Convert pandas timedelta to a floating point value in milliseconds.

df['Delta T ms'] = df['Delta T'].dt.total_seconds() * 1000

print(f'{df.dtypes}\n\n')
print(df)

The output looks like this. Note that the dataframe's columns wrap around onto additional lines; that is just an artifact of printing it.

Time Stamp     object
 Limit A        int64
 Value A      float64
 Limit B        int64
 Value B      float64
dtype: object


                         Time Stamp   Limit A   Value A   Limit B   Value B
0  --[ 29.08.2018 16:23:41.052 ] --        15     3.109        30     2.907
1  --[ 29.08.2018 16:23:41.114 ] --        15     3.020        30     8.242


Time Stamp                   object
 Limit A                      int64
 Value A                    float64
 Limit B                      int64
 Value B                    float64
Clean Time Stamp             object
Real Time Stamp      datetime64[ns]
Delta T             timedelta64[ns]
Delta T ms                  float64
dtype: object


                         Time Stamp   Limit A   Value A   Limit B   Value B  \
0  --[ 29.08.2018 16:23:41.052 ] --        15     3.109        30     2.907   
1  --[ 29.08.2018 16:23:41.114 ] --        15     3.020        30     8.242   

            Clean Time Stamp         Real Time Stamp         Delta T  \
0   29.08.2018 16:23:41.052  2018-08-29 16:23:41.052             NaT   
1   29.08.2018 16:23:41.114  2018-08-29 16:23:41.114 00:00:00.062000   

   Delta T ms  
0         NaN  
1        62.0  

If your files are large you may gain some efficiency by editing columns in place rather than creating new ones like I did.
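
For instance, a minimal in-place variant might look like the sketch below (an untested sketch; it also passes an explicit format string to pd.to_datetime instead of infer_datetime_format, which usually speeds up parsing because pandas doesn't have to guess the format):

import pandas as pd

df = pd.read_csv("test.csv")

# Strip the leading '--[ ' and trailing ' ] --' and parse in place,
# supplying the exact timestamp format so pandas doesn't have to infer it
df['Time Stamp'] = pd.to_datetime(df['Time Stamp'].str[3:-4].str.strip(),
                                  format='%d.%m.%Y %H:%M:%S.%f')

# Time elapsed between successive rows, as a float in milliseconds
df['Delta T ms'] = df['Time Stamp'].diff().dt.total_seconds() * 1000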