Using R with tidyquant and massive data

Posted 2019-04-13 17:51

While working with R I encountered a strange problem. I am processing data in the following manner: reading data from a database into a data frame, filling missing values, grouping and nesting the data by a combined primary key, creating a time series and forecasting it for every group, ungrouping and cleaning the data, and writing it back into the DB.

Something like this: https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html

For small data sets this works like a charm, but with larger ones (over about 100,000 entries) I get the "R Session Aborted" screen from RStudio, and the native R GUI just stops execution and implodes. There is no information in any log file that I've looked into. I suspect that it is some kind of memory issue, possibly a leak.

As a workaround I'm processing the data in chunks with a for-loop. But no matter how small the chunk size is, I still get the "R Session Aborted" screen, which looks a lot like leaking memory. The whole data set consists of about 5 million rows.
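A minimal sketch of what such chunked processing might look like, in base R only. The helper process_in_chunks, the column names id and value, and the group-mean placeholder are all hypothetical and stand in for the real database table and the forecasting step; this illustrates the chunking pattern and is not a fix for the crash itself.

```r
# Hypothetical helper (not from the original post): apply `fun` to the rows of
# `df` in chunks that each contain at most `chunk_size` distinct key values.
process_in_chunks <- function(df, key_col, chunk_size, fun) {
  keys <- unique(df[[key_col]])
  # Assign each key to a chunk number: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(keys) / chunk_size)
  chunks <- split(keys, chunk_id)
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    # Keep whole groups together so each group is processed in one piece
    part <- df[df[[key_col]] %in% chunks[[i]], , drop = FALSE]
    results[[i]] <- fun(part)
    rm(part)
    gc()  # trigger garbage collection so unreferenced memory can be reused
  }
  do.call(rbind, results)
}

# Toy data standing in for the database table (assumed column names)
df <- data.frame(id = rep(c("a", "b", "c"), each = 4), value = 1:12)

# Placeholder for the per-group forecasting step: here just a group mean
summarise_part <- function(part) {
  aggregate(value ~ id, data = part, FUN = mean)
}

out <- process_in_chunks(df, "id", chunk_size = 2, fun = summarise_part)
print(out)
```

Chunking by key rather than by row count keeps every group's rows in the same chunk, which matters when each group is turned into its own time series.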

I've looked a lot into packages like ff, the big* family, and matter, basically everything from https://cran.r-project.org/web/views/HighPerformanceComputing.html, but these do not seem to work well with tibbles and the tidyverse way of data processing.

So, how can I improve my script to work with massive amounts of data? How can I gather clues about why the R session is aborted?

2 Answers
走好不送
#2 · 2019-04-13 18:28

Check out the article at:

datascience.la/dplyr-and-a-very-basic-benchmark

There is a table that shows runtime comparisons for some of the data wrangling tasks you are performing. From the table, it looks as though dplyr with data.table behind it is likely going to do much better than dplyr with a dataframe behind it.

There’s a link to the benchmarking code used to make the table, too.

In short, try adding a key, and try using data.table over dataframe.

To make x the key of a data.table named dt, use setkey(dt, x).
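As a small illustration of the setkey call above, assuming the data.table package is installed. The table contents and the value column are made up for the example.

```r
library(data.table)

# Hypothetical example data; the column names are assumptions
dt <- data.table(x = c("b", "a", "b", "a"), value = c(1, 2, 3, 4))

setkey(dt, x)   # sorts dt by x in place and marks x as the key
key(dt)         # the current key column(s)

# Keyed subset: rows where x == "a", found via binary search on the key
rows_a <- dt["a"]

# Grouped aggregation by the key column
agg <- dt[, .(total = sum(value)), by = x]
print(agg)
```

setkey modifies the table by reference, so no reassignment is needed and no copy of the data is made.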

何必那么认真
#3 · 2019-04-13 18:41

While Pakes answer deals with the described problem, I found a solution to the underlying problem. For compatibility reasons I was using R version 3.4.3. Now I'm using the newer version 3.5.1, which works fine.
