I have a 10 GB .dta Stata file and I am trying to read it into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD) and the .dta file is about 3 million rows and somewhere between 400 and 800 variables.
I know fread() from the data.table package is the fastest way to read in .txt and .csv files, but does anyone have a recommendation for reading largeish .dta files into R? Opening the .dta file in Stata takes about 20-30 seconds, although I need to set my working memory max prior to opening the file (I set the max at 100 GB).
I have not tried exporting to .csv in Stata, but I hope to avoid touching the file with Stata. A solution is given in "Using memisc to import stata .dta file into R", but it assumes RAM is scarce. In my case, I should have sufficient RAM to work with the file.
I recommend the haven R package. Unlike foreign, it can read the latest Stata formats. Not sure how fast it is compared to other options, but your choices for reading Stata files in R are rather limited. My understanding is that haven wraps a C library, so it's probably your fastest option.
The fastest way to load a large Stata dataset in R is using the readstata13 package. I have compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 is the fastest available package for reading Stata datasets in R.
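Since the two suggestions above disagree on which package is quicker, it may be worth timing both on your own machine. The following is only a minimal sketch, assuming the haven and readstata13 packages are installed. The temporary file and the `demo` data frame are hypothetical stand-ins for the real 10 GB file, so the timings it prints say nothing about how either reader behaves at that scale.

```r
# Assumes both packages are installed:
# install.packages(c("haven", "readstata13"))
library(haven)
library(readstata13)

# Hypothetical stand-in for the real file: a small data frame written to a
# temporary .dta file so that the sketch is self-contained.
path <- tempfile(fileext = ".dta")
demo <- data.frame(
  id  = 1:1000,
  x   = rnorm(1000),
  grp = sample(letters, 1000, replace = TRUE),
  stringsAsFactors = FALSE
)
haven::write_dta(demo, path)

# Time each reader on the same file.
t_haven <- system.time(df_haven <- haven::read_dta(path))
t_rs13  <- system.time(df_rs13  <- readstata13::read.dta13(path))

# Elapsed wall-clock seconds for each reader.
print(c(haven = t_haven[["elapsed"]], readstata13 = t_rs13[["elapsed"]]))
```

To try it on the real data, replace `path` with the location of your .dta file and drop the lines that build and write `demo`. With a file that large, a single run of each reader is probably more practical than repeated benchmarking.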