I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding into this project). So a direct solution on .dta files is preferred.
To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it differently for every project, but something like this:
data creation .do file
analysis .do file
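For anyone doing the analysis on the R side, the same test/real switch can be sketched in a few lines. This is only an illustration of the pattern described above, not a translation of the .do files: the function name project_paths, the flag USE_TEST_DATA, and the data/ and output/ folder layout are all hypothetical.

```r
# Hypothetical helper: returns the input/output folders for either the
# small "test" copy or the full "real" copy of the data.
project_paths <- function(use_test, root = ".") {
  subset_dir <- if (use_test) "test" else "real"
  list(
    input  = file.path(root, "data", subset_dir),    # assumed folder layout
    output = file.path(root, "output", subset_dir)   # assumed folder layout
  )
}

# Single switch at the top of the script: TRUE while debugging,
# FALSE for the full run.
USE_TEST_DATA <- TRUE
paths <- project_paths(USE_TEST_DATA)

print(paths$input)
print(paths$output)
```

Flipping the one flag redirects every later read and write, so the rest of the script never needs to know which copy of the data it is running on.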
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels, because they are written at the end of the file and getting them would require you to read the entire file, which wouldn't save any time.
That's going to be a difficult one, as the
do_readStata function under the hood is compiled code, only capable of taking in the whole file. I believe that in general binary files are hard to read line by line, and .dta is a binary format. Also, the native binary format of R doesn't allow selecting a number of lines from the dataset while reading it in.
In my humble opinion, you are better off just creating a set of test files from within Stata (e.g. the Stata code sample 1000, count will give you a sample of 1000 observations from the loaded dataset) and working with them. And if you have no access to Stata, someone else in the project should be able to do that for you.
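If nobody with Stata is at hand, a similar test file can be produced once from R itself: pay the full read.dta() cost a single time, keep the first N rows, and cache them so later debugging sessions load only the small file. The sketch below is one way to do that under stated assumptions, not a way to make read.dta() read partially. The demo data frame, the temporary file paths, and the n_test value are made up so that the example runs on its own; in practice full_path would point at the real .dta file.

```r
library(foreign)  # provides read.dta() and write.dta()

# Stand-in for the large Stata file (hypothetical data, for illustration only).
full_path <- tempfile(fileext = ".dta")
demo <- data.frame(
  id    = 1:5000,
  group = factor(rep(c("control", "treated"), length.out = 5000))
)
write.dta(demo, full_path)

# One-time cost: read the whole file, converting value labels to factors.
full <- read.dta(full_path, convert.factors = TRUE)

# Keep the first N rows and cache them in R's native serialized format.
n_test <- 1000
test_path <- tempfile(fileext = ".rds")
saveRDS(head(full, n_test), test_path)

# Later sessions read only the small cached file.
test_data <- readRDS(test_path)
str(test_data)
```

Because the subset is saved after the labels have been converted, the factor levels survive in the cached copy. Unlike Stata's sample command, head() takes the first rows rather than a random draw, which matches the "first N rows" request but may not be representative if the file is sorted.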