Fastest way to extract hour from time (HH:MM)

Posted 2019-04-19 05:38

一夜七次

Question:

I wish fastPOSIXct worked here, but it does not in this case.

Here is my time data (it has no dates); I need to extract the hour part from each value.

times <- c("9:46","11:06", "14:17", "19:53", "0:03", "3:56")

Here is the wrong output from fastPOSIXct:

fastPOSIXct(times, "GMT")
[1] "1970-01-01 00:00:00 GMT" "1970-01-01 00:00:00 GMT"
[3] "1970-01-01 00:00:00 GMT" "1970-01-01 00:00:00 GMT"
[5] "1970-01-01 00:00:00 GMT" "1970-01-01 00:00:00 GMT"

It does not parse the times correctly when no date is present.

The hour function from data.table, combined with as.ITime, does the job, but it appears slow on large vectors of times.

library(data.table)
hour(as.ITime(times))
# [1]  9 11 14 19  0  3

I am wondering whether there is a faster way (something like fastPOSIXct, but one that works without a date).

fastPOSIXct is really fast, but its output here is simply wrong.

Answer 1:

You may also try substr: as.integer(substr(vals, start = 1, stop = nchar(vals) - 3))

In a benchmark on a vector with 1e6 elements, stringi::stri_sub is fastest and substr comes second.

vals <- sample(c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56"), 1e6, replace = TRUE)

fun_substr <- function(vals) as.integer(substr(vals, start = 1, stop = nchar(vals) - 3))

grab.hrs <- function(vals) as.integer(sub(pattern = ":.*", replacement = "", x = vals))

fun_strtrim <- function(vals) as.integer(strtrim(vals, nchar(vals) - 3))

library(chron)
fun_chron <- function(vals) hours(times(paste0(vals, ":00")))

fun_lt <- function(vals) as.POSIXlt(vals, format="%H:%M")$hour

library(stringi)
fun_stri_sub <- function(vals) as.integer(stri_sub(vals, from = 1, to = -4))

library(microbenchmark)
microbenchmark(fun_substr(vals),
               fun_stri_sub(vals),      
               grab.hrs(vals),
               fun_strtrim(vals),
               fun_lt(vals),
               fun_chron(vals),
               unit = "relative", times = 5)
# Unit: relative
#               expr       min        lq      mean    median        uq       max neval
#   fun_substr(vals)  2.186714  1.902074  2.015082  1.968542  1.945007  2.090236     5
# fun_stri_sub(vals)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     5
#     grab.hrs(vals)  2.656630  2.397918  2.687133  2.426223  2.446902  3.263962     5
#  fun_strtrim(vals) 31.177869 27.601380 26.009818 27.423562 17.902507 29.426989     5
#       fun_lt(vals) 47.296929 41.122287 42.266556 40.647465 30.539030 52.710992     5
#    fun_chron(vals)  5.594931  5.159192  5.961775  7.746242  5.286944  6.189742     5
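
Before comparing timings, it can be worth confirming that the candidates return the same hours. The following is a minimal sketch using only the base R variants from the benchmark above (the chron, stringi, and as.POSIXlt versions are left out so that it runs without extra packages); the names expected, results, and agree are introduced here for illustration only.

```r
# Sample strings taken from the question
vals <- c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56")

# Base R candidates, copied from the benchmark above
fun_substr  <- function(vals) as.integer(substr(vals, start = 1, stop = nchar(vals) - 3))
grab.hrs    <- function(vals) as.integer(sub(pattern = ":.*", replacement = "", x = vals))
fun_strtrim <- function(vals) as.integer(strtrim(vals, nchar(vals) - 3))

# Hours we expect for the sample strings
expected <- c(9L, 11L, 14L, 19L, 0L, 3L)

# Run each candidate and compare its result with the expected hours
results <- list(substr  = fun_substr(vals),
                sub     = grab.hrs(vals),
                strtrim = fun_strtrim(vals))
agree <- vapply(results, identical, logical(1), expected)
print(agree)
```

If any entry of agree were FALSE, the corresponding timing would be comparing a function that computes something different.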

Answer 2:

You can also do this with the times function from the chron package:

library(chron)
vals <- c("9:46","11:06", "14:17", "19:53", "0:03", "3:56")
dat <- times(paste0(vals, ":00"))
hours(dat)
# [1]  9 11 14 19  0  3

If speed is important, you could extract the hours more quickly with a string manipulation:

grab.hrs <- function(vals) as.numeric(sub(pattern = ":.*", replacement = "",
                                      x = vals))
grab.hrs(vals)
# [1]  9 11 14 19  0  3

times and as.POSIXlt (from @tonytonov's solution) seem to be somewhat quicker than as.ITime, and the string manipulation is much quicker:

library(microbenchmark)
library(data.table)
microbenchmark(hours(times(paste0(vals, ":00"))),
               hours(as.ITime(vals)),
               as.POSIXlt(vals, format="%H:%M")$hour,
               grab.hrs(vals))
# Unit: microseconds
#                                     expr     min       lq   median       uq      max neval
#        hours(times(paste0(vals, ":00"))) 174.544 184.9485 193.5630 204.6950 5047.195   100
#                    hours(as.ITime(vals)) 665.833 678.8790 705.6445 735.0525 3030.574   100
#  as.POSIXlt(vals, format = "%H:%M")$hour 158.264 169.8880 171.9670 180.1800  301.840   100
#                           grab.hrs(vals)  10.637  15.4540  20.0995  21.1285   55.985   100

Answer 3:

Is this an option? This is a base solution.

as.POSIXlt(times, format="%H:%M")$hour
#[1]  9 11 14 19  0  3
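
As a small extension of the same idea, the parsed POSIXlt object also exposes the minutes. This sketch adds tz = "UTC", which is an assumption not present in the answer above; it is there only so that parsing does not depend on the local time zone.

```r
times <- c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56")

# tz = "UTC" is an added assumption: it keeps the parsing independent of
# the local time zone and any daylight-saving transitions.
parsed <- as.POSIXlt(times, format = "%H:%M", tz = "UTC")

hrs  <- parsed$hour  # hour of the day (0-23)
mins <- parsed$min   # minute of the hour (0-59)

print(hrs)
print(mins)
```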

Answer 4:

To really speed things up, you can also just trim off the last 3 characters from the strings. It's faster than using a regex.

as.numeric(strtrim(times, nchar(times) - 3)) 
## [1]  9 11 14 19  0  3

Here are the benchmark results:

Unit: microseconds
                                         expr     min       lq   median       uq      max neval
            hours(times(paste0(vals, ":00"))) 200.670 212.9720 218.7960 221.8420  352.370   100
                        hours(as.ITime(vals)) 453.174 478.9680 487.3805 496.7885 1607.321   100
      as.POSIXlt(vals, format = "%H:%M")$hour  41.278  46.4945  49.7310  51.3115   56.453   100
                               grab.hrs(vals)  12.352  15.4295  18.3850  20.3390   31.349   100
    as.numeric(gsub("(.*):.*", "\\1", times))  14.528  17.7225  20.6390  23.4530   53.683   100
 as.numeric(strtrim(times, nchar(times) - 3))   9.621  11.6605  12.7435  13.2520  147.446   100
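
One trade-off to keep in mind: trimming a fixed number of characters assumes every string ends in exactly ":MM". The sketch below illustrates this with hypothetical "H:MM:SS" inputs (not part of the original data); the helper names trim_last3 and before_colon are made up for the example.

```r
# Hypothetical input with seconds, to show where fixed-width trimming breaks
with_seconds <- c("9:46:30", "11:06:05")

# Fixed-width approach: drop the last 3 characters
trim_last3 <- function(x) strtrim(x, nchar(x) - 3)

# Regex approach: drop everything from the first colon onwards
before_colon <- function(x) sub(":.*", "", x)

trimmed <- trim_last3(with_seconds)    # still contains the minutes
cut_at  <- before_colon(with_seconds)  # only the hour remains

print(trimmed)
print(cut_at)
```

For strictly "H:MM" or "HH:MM" input both approaches give the same hours, so the faster one is a reasonable choice when the format is guaranteed.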

Answer 5:

You can use the stri_sub function from the stringi package and trim the last 3 characters like this:

require(stringi)
times <- c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56")
stri_sub(times, from = 1, to = -4)
## [1] "9"  "11" "14" "19" "0"  "3"

If the from and/or to parameters are negative, positions are counted from the end of the string. So in this example the substring runs from the first character up to the fourth character counted from the end, which drops the trailing ":MM".
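
To make the counting concrete, here is a base R sketch of the negative "to" convention. The helper sub_to_from_end is hypothetical and only mimics this one aspect of stri_sub; it is not part of stringi.

```r
# Hypothetical helper: emulate a negative `to` with base substr.
# to = -1 refers to the last character, so the absolute end position
# is nchar(x) + to + 1.
sub_to_from_end <- function(x, to) {
  stopifnot(to < 0)
  substr(x, 1, nchar(x) + to + 1)
}

times <- c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56")

# For "14:17": nchar is 5, so to = -4 maps to position 5 - 4 + 1 = 2
hrs_chr <- sub_to_from_end(times, -4)
print(hrs_chr)
```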

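Answer 6:

str_sub or substr is handy in this situation. For example, the following code pads the times with str_pad from stringr and then uses substr:

library(stringr)  # provides str_pad
times <- c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56")

# Left-pad with zeros so that every string has the form "HH:MM"
times1 <- str_pad(times, 5, pad = "0")

times1
## [1] "09:46" "11:06" "14:17" "19:53" "00:03" "03:56"

# The first two characters are now always the hour
substr(times1, 1, 2)
## [1] "09" "11" "14" "19" "00" "03"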
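
If stringr is not available, the same padding idea can be sketched in base R, with a final conversion to integer hours. This assumes every input is either "H:MM" or "HH:MM"; the variable names padded and hrs are chosen for the example.

```r
times <- c("9:46", "11:06", "14:17", "19:53", "0:03", "3:56")

# Assumption: inputs are "H:MM" (4 characters) or "HH:MM" (5 characters),
# so only the 4-character strings need a leading zero.
padded <- ifelse(nchar(times) == 4, paste0("0", times), times)

# Take the two-character hour and convert it to an integer
hrs <- as.integer(substr(padded, 1, 2))
print(hrs)
```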