I am a political science student and learning R. I have a problem with a nested loop, one of my indices being non-numeric.
I have a data frame pwt containing, for each country in the world (column country) and each year from 1950 to 2011 (column year), a number of development indicators, among which is GDP.
I would like to add a column that contains the % change in GDP from a year to the next.
Here is the error I get, followed by my code:
Error in `[<-.factor`(`*tmp*`, iseq, value = numeric(0)): replacement has length zero
GDPgrowth = rep("NA", length(pwt$country))
pwt <- cbind.data.frame(pwt, GDPgrowth)
countries <- unique(pwt$country)
for(i in countries) # for each country
{
  for(j in 1951:2011) # for each year
  {
    pwt[pwt$country == i & pwt$year == j, "GDPgrowth"] =
      (pwt[pwt$country == i & pwt$year == j, "rdgpo"] /
       pwt[pwt$country == i & pwt$year == j-1, "rdgpo"] - 1) * 100
  }
}
What did I get wrong?
Welcome to Stack Overflow!
For this sort of rolling or lagged calculation (a value relative to the previous one) you can use zoo, dplyr, or data.table. I personally prefer the latter for its flexibility and its speed on large datasets. Compared with a loop, these will generally be faster and more syntactically convenient.
Assuming your data looks something like this (numbers obviously made up):
country year rgdp
USA 1991 1000
USA 1992 1200
USA 1993 1500
SWE 1991 1000
SWE 1992 900
SWE 1993 2000
You can use data.table's shift to calculate values from leading/lagging values. In this case:
library(data.table)
pwt <- as.data.table(list(country=c("USA", "USA", "USA", "SWE", "SWE", "SWE"),
                          year=c(1991, 1992, 1993, 1991, 1992, 1993),
                          rgdp=c(1000, 1200, 1500, 1000, 900, 2000)))
pwt[, growth := rgdp/shift(rgdp, n=1, type="lag") - 1, by=c("country")]
Gives:
country year rgdp growth
USA 1991 1000 NA
USA 1992 1200 0.200000
USA 1993 1500 0.250000
SWE 1991 1000 NA
SWE 1992 900 -0.100000
SWE 1993 2000 1.222222
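Two details may matter when applying this to the real data: shift works on row order, so the table should be sorted by year within each country, and the question asks for a percentage rather than a fraction. A minimal sketch under those assumptions, reusing the made-up numbers above (the column name growth_pct is only illustrative):

```r
library(data.table)

# Same made-up numbers as above, deliberately out of order
pwt <- data.table(country = c("USA", "SWE", "USA", "SWE", "USA", "SWE"),
                  year    = c(1993, 1991, 1991, 1993, 1992, 1992),
                  rgdp    = c(1500, 1000, 1000, 2000, 1200, 900))

# Sort in place so that each country's rows run in year order
setorder(pwt, country, year)

# Percentage change from the previous row within each country
pwt[, growth_pct := 100 * (rgdp / shift(rgdp, n = 1, type = "lag") - 1),
    by = country]
```

Note that shift compares adjacent rows, so if a year is missing for a country the growth is computed across the gap without any warning.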
Another way would be to use diff from base R, which calculates the difference between consecutive values:
difference <- c(0, diff(pwt$rdgpo))
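This gives you the difference between consecutive GDP values, which you can easily use to find the percentage growth. Note that diff runs over the whole column, so the first row of each country holds a difference against the previous country's last row unless you handle countries separately.
PS: SO is here to help people out, not to spoon-feed exact solutions, so this answer just points you in a direction rather than giving the exact solution.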
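One way to keep diff from crossing country boundaries is to apply it within each country using ave from base R. This is only a sketch on made-up numbers; the column names follow the question, and GDPgrowth is NA for each country's first year:

```r
pwt <- data.frame(country = c("USA", "USA", "USA", "SWE", "SWE", "SWE"),
                  year    = c(1991, 1992, 1993, 1991, 1992, 1993),
                  rdgpo   = c(1000, 1200, 1500, 1000, 900, 2000))

# diff() relies on row order, so sort by country and year first
pwt <- pwt[order(pwt$country, pwt$year), ]

# Within each country: change from the previous row, divided by the previous value
pwt$GDPgrowth <- ave(pwt$rdgpo, pwt$country,
                     FUN = function(x) 100 * c(NA, diff(x)) / c(NA, head(x, -1)))
```

Like diff itself, this compares adjacent rows, so it assumes there are no missing years within a country.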
You can also avoid the loop:
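# Copy the key columns and move each year forward by one, so that a row's
# GDP lines up with the following year after the merge
p <- pwt[, c('country', 'year', 'rdgpo')]
p$year <- p$year + 1
colnames(p)[3] <- 'rdgpo_prev'
# Left join on country and year; years with no predecessor get NA
pwt <- merge(pwt, p, all.x=TRUE)
pwt$GDPgrowth <- 100 * ((pwt$rdgpo/pwt$rdgpo_prev) - 1)
pwt$rdgpo_prev <- NULL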
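A useful property of the merge approach is that it matches on the year itself rather than on row position, so a gap in the years yields NA instead of a growth rate computed across the gap. A small sketch on made-up numbers (the data below is an assumption, with 1992 missing for SWE):

```r
pwt <- data.frame(country = c("USA", "USA", "USA", "SWE", "SWE"),
                  year    = c(1991, 1992, 1993, 1991, 1993),
                  rdgpo   = c(1000, 1200, 1500, 1000, 2000))

# Previous-year lookup table: same keys, year moved forward by one
p <- pwt[, c("country", "year", "rdgpo")]
p$year <- p$year + 1
colnames(p)[3] <- "rdgpo_prev"

# Left join on the shared columns country and year
pwt <- merge(pwt, p, all.x = TRUE)
pwt$GDPgrowth <- 100 * (pwt$rdgpo / pwt$rdgpo_prev - 1)
pwt$rdgpo_prev <- NULL
```

Here SWE 1993 gets NA because there is no SWE 1992 row to compare against, while the USA rows for 1992 and 1993 get their usual growth rates.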
Along the same lines, another convenient solution that avoids the loop uses dplyr.
# Install and data download -----------------------------------------------
# World Bank Data pkg
install.packages('WDI')
library(WDI)
#' Source data
#' NY.GDP.MKTP.PP.CD corresponds to "GDP, PPP (current international $)"
#' Check WDIsearch() for codes
pwt <- WDI(country = "all", indicator = "NY.GDP.MKTP.PP.CD",
           start = 1951, end = 2011, extra = FALSE, cache = NULL)
# Percentage change on panel data -----------------------------------------
library(dplyr)
pwt <- pwt %>%
  group_by(country) %>%
  arrange(year) %>%
  mutate(pct.chg = 100 *
           ((NY.GDP.MKTP.PP.CD - lag(NY.GDP.MKTP.PP.CD))/lag(NY.GDP.MKTP.PP.CD)))
As a side point, I would suggest that, in line with the SO guidelines, you provide a reproducible example. For the major publicly available statistical repositories (Eurostat, OECD, World Bank, etc.) there are R packages and tutorials that make sourcing the desired data effortless. In the example above I'm using the WDI package to source the World Bank data.
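For instance, a self-contained version of the dplyr approach that needs no download could look like the sketch below. The toy data frame and its gdp column are made up for illustration; sorting by country and year before lagging and calling ungroup() at the end are choices of this sketch, not part of the code above:

```r
library(dplyr)

# Made-up numbers, deliberately out of order
toy <- data.frame(country = c("USA", "SWE", "USA", "SWE", "USA", "SWE"),
                  year    = c(1993, 1991, 1991, 1993, 1992, 1992),
                  gdp     = c(1500, 1000, 1000, 2000, 1200, 900))

toy <- toy %>%
  arrange(country, year) %>%   # lag() depends on row order
  group_by(country) %>%        # so that lag() restarts for each country
  mutate(pct.chg = 100 * (gdp - lag(gdp)) / lag(gdp)) %>%
  ungroup()                    # drop the grouping once done
```

Anyone can paste this into a fresh session, which makes it easier to check an answer against the question.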
Edit
Finally, if you insist on doing this in a loop, you can do it like this:
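# Sort so that the row after a country-year is that country's next available year
pwt <- pwt[order(pwt$country, pwt$year), ]
pwt$pct.chg2 <- NA # Column 6
for (i in unique(pwt$country)) {
  # Assuming that years are incomplete
  for (j in unique(pwt$year[pwt$country == i])) {
    # Row for this country and year, and the row right after it
    cur <- which(pwt$year == j & pwt$country == i)
    nxt <- cur + 1
    # Only fill the next row if it exists and belongs to the same country
    if (isTRUE(pwt$country[nxt] == i)) {
      # As the DF is simple I simply used column numbers
      # (3 = GDP, 6 = pct.chg2 in this data frame)
      pwt[[6]][nxt] <- 100 * (pwt[[3]][nxt] - pwt[[3]][cur]) / abs(pwt[[3]][cur])
    }
  }
}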
The solution could be less explicit, but I wanted to emphasise the need to pick the right row for each combination of year and country, which is what the which statement does.
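As for the error in the question, a likely explanation (an assumption, since the data is not shown) is that some country has a row for year j but none for year j - 1. The right-hand side then has length zero and cannot be assigned to a non-empty selection. Initialising the column with the string "NA" rather than a numeric NA also gives it the wrong type. A guarded version of the original loop, sketched on made-up numbers in which country B has no 1950 row:

```r
pwt <- data.frame(country = c("A", "A", "A", "B", "B"),
                  year    = c(1950, 1951, 1952, 1951, 1952),
                  rdgpo   = c(100, 110, 121, 200, 190),
                  stringsAsFactors = FALSE)

pwt$GDPgrowth <- NA_real_   # numeric NA, not the string "NA"

for (i in unique(pwt$country)) {      # for each country
  for (j in 1951:1952) {              # for each year after the first
    cur  <- pwt$country == i & pwt$year == j
    prev <- pwt$country == i & pwt$year == j - 1
    # Skip country-years where either of the two rows is missing
    if (sum(cur) == 1 && sum(prev) == 1) {
      pwt$GDPgrowth[cur] <- (pwt$rdgpo[cur] / pwt$rdgpo[prev] - 1) * 100
    }
  }
}
```

Country-years without a usable previous year simply keep their NA.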
Benchmarking
The loop approach appears to be rather inefficient. In the single run below it took roughly 94.6 seconds, against roughly 21 milliseconds for dplyr:
require(microbenchmark)
microbenchmark(dpl_sol(), bse_sol(), times = 1)
Unit: milliseconds
expr min lq mean median uq max neval
dpl_sol() 21.26792 21.26792 21.26792 21.26792 21.26792 21.26792 1
bse_sol() 94573.05671 94573.05671 94573.05671 94573.05671 94573.05671 94573.05671 1
Benchmarked functions, wrapping the code from above (they need to be defined before running microbenchmark):
dpl_sol <- function() {
  pwt <- pwt %>%
    group_by(country) %>%
    arrange(year) %>%
    mutate(pct.chg = 100 *
             ((NY.GDP.MKTP.PP.CD - lag(NY.GDP.MKTP.PP.CD))/lag(NY.GDP.MKTP.PP.CD)))
}
bse_sol <- function() {
  pwt$pct.chg2 <- NA # Column 6
  for (i in unique(pwt$country)) {
    # Assuming that years are incomplete
    for (j in unique(pwt$year[pwt$country == i])) {
      # As the DF is simple I simply used column numbers
      pwt[which(pwt$year == j & pwt$country == i) + 1, 6] <- 100 *
        ((pwt[which(pwt$year == j & pwt$country == i) + 1, 3]
          - pwt[which(pwt$year == j & pwt$country == i), 3])
         / abs(pwt[which(pwt$year == j & pwt$country == i), 3]))
    }
  }
}