R: nested loop with non-numeric index

Posted 2019-09-11 03:19

I am a political science student learning R. I have a problem with a nested loop in which one of the indices is non-numeric. I have a data frame pwt containing, for each country in the world (column country) and each year from 1950 to 2011 (column year), a number of development indicators, one of which is GDP. I would like to add a column that contains the % change in GDP from one year to the next.

Here is the error I get:

Error in `[<-.factor`(`*tmp*`, iseq, value = numeric(0)):  replacement has length zero

GDPgrowth = rep("NA", length(pwt$country))
pwt <- cbind.data.frame(pwt, GDPgrowth)
countries <- unique(pwt$country)
for(i in countries)  # for each country
{
  for(j in 1951:2011) # for each year
  {
    pwt[pwt$country == i & pwt$year == j,"GDPgrowth"] = (pwt[pwt$country == i 
& pwt$year == j,"rdgpo"]/pwt[pwt$country == i & pwt$year == j-1,"rdgpo"] - 
1)*100
  }
}

What did I get wrong?

4 Answers
闹够了就滚
#2 · 2019-09-11 03:51

Welcome to Stack Overflow!

For this sort of rolling or lagged calculation (each value relative to the previous one), you can use zoo, dplyr, or data.table. I personally prefer the latter for its flexibility and its running speed on large datasets. Compared with a loop, these will generally be faster and more syntactically convenient.

Assuming your data looks something like this (numbers obviously made up):

country year rgdp
USA     1991 1000
USA     1992 1200
USA     1993 1500
SWE     1991 1000
SWE     1992 900
SWE     1993 2000

You can use data.table's shift to calculate values from leading/lagging values. In this case:

library(data.table)

pwt <- as.data.table(list(country=c("USA", "USA", "USA", "SWE", "SWE", "SWE"),
                          year=c(1991, 1992, 1993, 1991, 1992, 1993),
                          rgdp=c(1000, 1200, 1500, 1000, 900, 2000)))

pwt[, growth := rgdp/shift(rgdp, n=1, type="lag") - 1, by=c("country")]

Gives:

country year rgdp growth
USA     1991 1000 NA
USA     1992 1200 0.200000
USA     1993 1500 0.250000
SWE     1991 1000 NA
SWE     1992 900 -0.100000
SWE     1993 2000 1.222222
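
Since the question asks for percent change, here is a minimal sketch of the same idea scaled by 100. It assumes the toy data above and sorts by country and year first, because shift() simply takes the previous row within each group. The column name growth_pct is made up for illustration.

```r
library(data.table)

# Toy data from the answer above (numbers made up)
pwt <- data.table(country = c("USA", "USA", "USA", "SWE", "SWE", "SWE"),
                  year    = c(1991, 1992, 1993, 1991, 1992, 1993),
                  rgdp    = c(1000, 1200, 1500, 1000, 900, 2000))

# Sort so that the previous row within a country is the previous year
setorder(pwt, country, year)

# Percent change relative to the previous row, computed per country
pwt[, growth_pct := 100 * (rgdp / shift(rgdp, n = 1, type = "lag") - 1),
    by = country]

print(pwt)
```

The first year of each country stays NA because there is no earlier row to compare against.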
Ridiculous、
#3 · 2019-09-11 04:04

Another way would be to use diff from base R, which calculates the difference between consecutive values:

difference <- c(0, diff(pwt$gdp))

This gives you the difference between consecutive GDP values, which you can easily use to find the percentage growth.
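
One caveat: on a panel, a plain diff over the whole column also subtracts across country boundaries. A minimal base R sketch that applies diff within each country is shown here. It assumes made-up toy data and a column called rgdp, and growth_pct is a hypothetical column name.

```r
# Toy panel data (numbers made up)
pwt <- data.frame(country = c("USA", "USA", "USA", "SWE", "SWE", "SWE"),
                  year    = c(1991, 1992, 1993, 1991, 1992, 1993),
                  rgdp    = c(1000, 1200, 1500, 1000, 900, 2000),
                  stringsAsFactors = FALSE)

# Sort so that consecutive rows within a country are consecutive years
pwt <- pwt[order(pwt$country, pwt$year), ]

# ave() applies the function to each country's values separately;
# diff(x) / head(x, -1) is the change relative to the previous value
pwt$growth_pct <- ave(pwt$rgdp, pwt$country,
                      FUN = function(x) c(NA, 100 * diff(x) / head(x, -1)))

print(pwt)
```

Padding with NA rather than 0 keeps the first year of each country visibly undefined.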

PS: SO is meant to help people out, not to spoon-feed exact solutions. This answer therefore points you in a direction rather than giving you the exact solution.

成全新的幸福
#4 · 2019-09-11 04:12

You can also avoid the loop:

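# Copy the key columns and shift the year forward by one,
# so that each row carries the GDP of the previous year
p <- pwt[, c('country', 'year', 'rdgpo')]
p$year <- p$year + 1
colnames(p)[3] <- 'rdgpo_prev'

# Left join on country and year; years without a predecessor get NA
pwt <- merge(pwt, p, by = c('country', 'year'), all.x = TRUE)
pwt$GDPgrowth <- 100 * ((pwt$rdgpo / pwt$rdgpo_prev) - 1)
pwt$rdgpo_prev <- NULL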
三岁会撩人
#5 · 2019-09-11 04:15

Along the same lines, dplyr offers another convenient way to avoid the loop.

# Install and data download -----------------------------------------------

# World Bank Data pkg
install.packages('WDI')
library(WDI)

# Source data
# NY.GDP.MKTP.PP.CD corresponds to "GDP, PPP (current international $)"
# Check WDIsearch() for codes
pwt <- WDI(country = "all", indicator = "NY.GDP.MKTP.PP.CD",
           start = 1951, end = 2011, extra = FALSE, cache = NULL)

# Percentage change on panel data -----------------------------------------

library(dplyr)
pwt <- pwt %>%
    group_by(country) %>%
    arrange(year) %>%
    mutate(pct.chg = 100 * 
               ((NY.GDP.MKTP.PP.CD - lag(NY.GDP.MKTP.PP.CD))/lag(NY.GDP.MKTP.PP.CD)))

As a side point, I would suggest that, in line with the SO guidelines, you provide a reproducible example. For the major publicly available statistical repositories (Eurostat, OECD, World Bank, etc.) there are R packages and tutorials that make sourcing the desired data effortless. In the example above I'm using the WDI package to source the World Bank data.

Edit

Finally, if you insist on doing this in a loop, you can do it like this:

for (i in unique(pwt$country))  {
    # Assuming that years are incomplete
    for (j in unique(pwt$year[pwt$country == i])) {
        # As the DF is simple i simply used column numbers
        pwt[which(
            pwt$year == j & 
                pwt$country == i) +1 ,6] <- 100 * ((pwt[which(pwt$year == j & 
                                                                  pwt$country == i)  +1 ,3]
                                                    - pwt[which(pwt$year == j & 
                                                                    pwt$country == i),3]) 
                                                   / abs(pwt[which(pwt$year == j & 
                                                                       pwt$country == i),3]))
    }
}

The solution could be less explicit, but I wanted to emphasise the need to pick the right row for each combination of year and country, which is what the which statement does.
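
For comparison, here is a hedged sketch of a loop that looks up the previous year by value instead of by row position, so it does not depend on how the rows are sorted. It uses made-up toy data with a column called rgdp, not the real data set. It also guards against the likely cause of the error in the question: "replacement has length zero" means the right-hand side was empty, which happens when no row matches a given country and year, and the `[<-.factor` in the message suggests the column built from rep("NA", ...) had become a factor rather than a numeric column.

```r
# Toy panel data (numbers made up)
pwt <- data.frame(country = c("USA", "USA", "USA", "SWE", "SWE", "SWE"),
                  year    = c(1991, 1992, 1993, 1991, 1992, 1993),
                  rgdp    = c(1000, 1200, 1500, 1000, 900, 2000),
                  stringsAsFactors = FALSE)

# Numeric NA, not the string "NA", so the column is not character or factor
pwt$GDPgrowth <- NA_real_

for (i in unique(pwt$country)) {
  # Only loop over years that actually exist for this country
  for (j in pwt$year[pwt$country == i]) {
    cur  <- which(pwt$country == i & pwt$year == j)
    prev <- which(pwt$country == i & pwt$year == j - 1)
    # Assign only when both rows exist; otherwise the value stays NA
    if (length(cur) == 1 && length(prev) == 1) {
      pwt$GDPgrowth[cur] <- 100 * (pwt$rgdp[cur] / pwt$rgdp[prev] - 1)
    }
  }
}

print(pwt)
```

The length checks are what prevent the zero-length assignment; the vectorised solutions above remain the better choice for data of any real size.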

Benchmarking

The loop approach appears to be far slower (each function was run only once, so treat the figures as indicative):

require(microbenchmark)
microbenchmark(dpl_sol(), bse_sol(), times = 1)
Unit: milliseconds
      expr         min          lq        mean      median          uq         max neval
 dpl_sol()    21.26792    21.26792    21.26792    21.26792    21.26792    21.26792     1
 bse_sol() 94573.05671 94573.05671 94573.05671 94573.05671 94573.05671 94573.05671     1

Benchmarked functions from above

dpl_sol <- function() {
    pwt <- pwt %>%
        group_by(country) %>%
        arrange(year) %>%
        mutate(pct.chg = 100 * 
                   ((NY.GDP.MKTP.PP.CD - lag(NY.GDP.MKTP.PP.CD))/lag(NY.GDP.MKTP.PP.CD)))
}
bse_sol <- function() {
    pwt$pct.chg2 <- NA # Column 6
    for (i in unique(pwt$country))  {
        # Assuming that years are incomplete
        for (j in unique(pwt$year[pwt$country == i])) {
            # As the DF is simple i simply used column numbers
            pwt[which(
                pwt$year == j & 
                    pwt$country == i) +1 ,6] <- 100 * ((pwt[which(pwt$year == j & 
                                                                      pwt$country == i)  +1 ,3]
                                                        - pwt[which(pwt$year == j & 
                                                                        pwt$country == i),3]) 
                                                       / abs(pwt[which(pwt$year == j & 
                                                                           pwt$country == i),3]))
        }
    }

}