how to avoid loops

HI All, I'm new to R.

I have two panel data files, with columns "id", "date" and "ret"

file A has a lot more data than file B, but i'm primarily working with file B data.

Combination of "id" and "date" is unqiue indentifier.

Is there an elegent way of looking up for each (id, date) in B, I need to get the past 10 days ret from file A, and store them back into B?

my naive way of doing it is to loop for all rows in B,

for i in 1:length(B) {
    B$past10d[i] <- prod(1+A$ret[which(A$id == B$id[i] & A$date > B$date[i]-10 & A$date < B$date[i])])-1
}

but the loops takes forever.

Really appreciate your thoughts.

Thank you very much.

标签： r loops vectorization

7条回答

虎瘦雄心在

2楼-- · 2019-06-20 00:38

Given that you're having memory issues perhaps paring down A first might help. First, get rid of extraneous ids.

A <- A[A$id %in% B$id,]

Reducing the A dataset completely still wants to grab more memory. It's not possible without storing some variables. Nevertheless, we can get rid of a bunch of it I'm hoping by lopping off every date below our absolute minimum and above our absolute maximum.

A <- A[A$date > (min(B$date) - 10) & A$date <= max(B$date),]

Of course, by not qualifying this by id we haven't get the smallest version of A possible but hopefully it's enough smaller.

Now run the code I first proposed and see if you still have a memory error

B$idDate <- factor(B$id):factor(B$date)
B$past10 <- sapply(B$idDate, function(x){with(B[B$idDate == x,], 
    prod(1+A$ret[A$id == id & A$date > date-10 & A$date < date])-1)})

0人赞添加讨论(0) 举报

萌系小妹纸

3楼-- · 2019-06-20 00:40

I think the key is to vectorize and use the %in% operator to subset data frame A. And, I know, prices are not random numbers, but I didn't want to code a random walk... I created a stock-date index using paste, but I'm sure you could use the index from pdata.frame in the plm library, which is the best I've found for panel data.

A  <- data.frame(stock=rep(1:10, each=100), date=rep(Sys.Date()-99:0, 10), price=rnorm(1000))
B <- A[seq(from=100, to=1000, by=100), ]
A <- cbind(paste(A$stock, A$date, sep="-"), A)
B <- cbind(paste(B$stock, B$date, sep="-"), B)
colnames(A) <- colnames(B) <- c("index", "stock", "date", "price")
index <- which(A[, 1] %in% B[, 1])
returns <- (A$price[index] - A$price[index-10]) / A$price[index-10]
B <- cbind(B, returns)

0人赞添加讨论(0) 举报

老娘就宠你

4楼-- · 2019-06-20 00:43

library(data.table)
#create data
A  <- data.table(id=rep(1:10, each=40000), date=rep(Sys.Date()-99:0,  4000), ret=rnorm(400000))
B  <- data.table(id=rep(1:5,  each=10), date=rep(Sys.Date()-99:0),  ret=rnorm(50))

#find dates to compare against
n <- NROW(B)
B_long <- B[,.(id = rep(id,each=10),date = rep(date,each=10))]
s <- rep(-10:-1,n)
B_long[,date:=date + s]

#information in one column
B_long$com <- as.numeric(paste0(B_long$id,as.numeric(B$date)))
A$com <- as.numeric(paste0(A$id,as.numeric(A$date)))

#compare
setkey(A,com)
X <- A[com %in% B_long$com,]

This answer builds on Richards answer but is more targeted to the question.

Key idea is to build one vector of id date combinations to compare against. This happens in the second code block.

My solution uses the data.table package but should with some syntax changes work with a data.frame. But using the data.table package has the advantage of keycolumns.

If you still have trouble you can pair this approach with john's second answer and first crop A.

0人赞添加讨论(0) 举报

手持菜刀，她持情操

5楼-- · 2019-06-20 00:51

Did you try ?merge ?

"Merge two data frames by common columns or row names, or do other versions of database join operations. "

Besides I suggest to use a little local MySQL / PostgreSQL (RMySQL / RPostgreSQL) database if you continously sport composite PKs or whatsoever as unique identifiers. To me SQL rearranging of data and afterwards using data.frames from view is a lot easier than looping.

0人赞添加讨论(0) 举报

对你真心纯属浪费

6楼-- · 2019-06-20 00:52

Is this any faster? (I am assuming the combination of B$id and B$date is a unique identifier not replicated anywhere - implied by your code)

B$idDate <- factor(B$id):factor(B$date)
B$past10 <- sapply(B$idDate, function(x){with(B[B$idDate == x,], 
    prod(1+A$ret[A$id == id & A$date > date-10 & A$date < date])-1)})

0人赞添加讨论(0) 举报

Melony?

7楼-- · 2019-06-20 00:52

If you haven't got data that is replicated in both A and B, then rbind is the simplest solution.

#Sample data
A <- data.frame(
  id = rep(letters[1:3], each = 13),
  date = Sys.Date() + -12:0,
  ret = runif(39)
)

B <- data.frame(
  id = rep(letters[5:6], each = 5),
  date = Sys.Date() + -4:0,
  ret = runif(10)
)

#Only take the last ten days from A
A_past_10_days <- A[A$date > Sys.Date() - 10,]

#Bind by rows
rbind(A_past_10_days, B)

0人赞添加讨论(0) 举报

1 2 下一页

how to avoid loops

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间