Speedup conversion of 2 million rows of date strin

Posted 2020-02-05 03:19

Question:

I have a csv which includes about 2 million rows of date strings in the format:

2012/11/13 21:10:00 

Let's call that column csv$Date.and.Time.

I want to convert these dates (and their accompanying data) to xts as fast as possible.

I have written a script which performs the conversion just fine (see below), but it's terribly slow and I'd like to speed this up as much as possible.

Here is my current methodology. Does anyone have any suggestions on how to make this faster?

dt <- as.POSIXct(csv$Date.and.Time, tz = "UTC")
idx <- format(dt, tz = z, usetz = TRUE)

So the script converts these date strings to POSIXct. It then does a time zone conversion using format (z is a variable holding the time zone I am converting to). I then make a regular xts call to turn this into an xts series together with the rest of the data in the csv.
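For readers following along, here is a minimal base R sketch of the two steps described above, run on a tiny made-up sample. The sample strings, the value of z, and the explicit format string are assumptions for illustration and are not from the original post. It also shows that format() returns character strings, and that changing the "tzone" attribute is one possible way to get the same display zone while keeping a POSIXct vector.

```r
# Hypothetical sample standing in for csv$Date.and.Time
x <- c("2012/11/13 21:10:00", "2012/11/14 03:45:30")
z <- "America/New_York"  # assumed target time zone, for illustration only

# Step 1: parse the strings as UTC (explicit format string is an assumption)
dt <- as.POSIXct(x, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Step 2: the format() call from the question; the result is character
idx <- format(dt, tz = z, usetz = TRUE)
print(class(idx))

# Possible alternative: keep POSIXct and change only the display time zone
dt_z <- dt
attr(dt_z, "tzone") <- z
print(dt_z)
```

Whether skipping the string round trip helps in the full 2-million-row case would need to be timed with system.time on the real data.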

This works 100%. It's just very, very slow. I've tried running this in parallel (it doesn't do anything; if anything it makes it worse). What do I mean by 'slow'?

 user    system   elapsed 
155.246  16.430 171.650 

That's on a 3 GHz, 16 GB RAM 2012 MacBook Pro. I can get about half that time on a similar processor with 32 GB RAM on a Win7 machine.

I'm sure someone has a better idea. I'm open to suggestions via Rcpp etc. Ideally the solution works with the csv directly rather than through some other method, like setting up a database. Having said that, I'm open to whatever method gives the fastest conversion.

I'd be super appreciative of any help at all. Thanks in advance.

Answer 1:

You want the small and simple fasttime package by Simon Urbanek, which does this in the fastest possible way: it does not call time-parsing functions at all and just uses C-level string functions.

It does not support as many formats as strptime. In fact, it doesn't even have a format string. But well-formed ISO format variants, that is yyyy-mm-dd hh:mm:ss.fff, will work, and your / separator may just work too.
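A minimal sketch of what that could look like, assuming the fasttime package is installed. The sample strings are made up, and the comparison against the base parser is only there as a sanity check that the / separator is handled the same way on your data.

```r
library(fasttime)

# Hypothetical sample standing in for csv$Date.and.Time
x <- c("2012/11/13 21:10:00", "2012/11/14 03:45:30")

# fastPOSIXct() always reads the input as UTC/GMT;
# the tz argument only sets the time zone of the resulting object
dt_fast <- fastPOSIXct(x, tz = "UTC")

# Sanity check against the base parser on the same strings
dt_base <- as.POSIXct(x, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
stopifnot(all(as.numeric(dt_fast) == as.numeric(dt_base)))
```

It is worth running a check like this on a few thousand rows before trusting it on the full file, since fasttime does not validate the input the way strptime does.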



Answer 2:

Try using lubridate. It does all date-time parsing using regular expressions, so not only is it much faster, it's also much more flexible.
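A rough sketch of the lubridate route, assuming the package is installed. The sample strings and the target zone are made up for illustration. ymd_hms() accepts the / separator, and with_tz() changes the display time zone without altering the underlying instant.

```r
library(lubridate)

# Hypothetical sample standing in for csv$Date.and.Time
x <- c("2012/11/13 21:10:00", "2012/11/14 03:45:30")
z <- "America/New_York"  # assumed target time zone, for illustration only

# Parse as UTC; the result is a POSIXct vector
dt <- ymd_hms(x, tz = "UTC")

# Same instants, shown in the target time zone
dt_z <- with_tz(dt, tzone = z)
print(dt_z)
```

How it compares in speed to the other approaches on 2 million rows is something to measure with system.time rather than assume.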