Suppose we have a folder containing multiple data.csv files, each with the same variables but from different times. Is there a way in R to import them all simultaneously, rather than having to import them individually?
My problem is that I have around 2000 data files to import, and having to import them individually using
read.delim(file="filename", header=TRUE, sep="\t")
is not very efficient.
This is part of my script.
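A minimal sketch of this kind of script, assuming the files sit in the working directory and reusing the question's `read.delim()` settings (the names `temp` and `myfiles` are illustrative):

```r
# List every .csv file in the working directory, then read each one
# into a list of data frames (tab-separated, as in the question)
temp <- list.files(pattern = "\\.csv$")
myfiles <- lapply(temp, read.delim, header = TRUE, sep = "\t")
```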
If you want to collect the different csv files into one data.frame, you can use the following; note that the "x" data.frame should be created in advance.
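A sketch of that pattern, with an empty data.frame created up front as the note requires (the names `x` and `f` are illustrative):

```r
# "x" must exist before the loop; start from an empty data.frame
x <- data.frame()
for (f in list.files(pattern = "\\.csv$")) {
  # append each file's rows to the accumulating data.frame
  x <- rbind(x, read.csv(f))
}
```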
Here is another option for converting the .csv files into one data.frame, using base R functions. This is an order of magnitude slower than the options below.
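A sketch of that base-R route, assuming all files share the same columns (the names `files` and `df` are illustrative):

```r
# Read every file, then stack the resulting data frames row-wise
files <- list.files(pattern = "\\.csv$")
df    <- do.call(rbind, lapply(files, read.csv))
```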
Edit: a few extra choices using `data.table` and `readr`; sketches of both follow.

* A `fread()` version, a function from the `data.table` package. This should be the fastest option.
* Using `readr`, a newer Hadley Wickham package for reading csv files. A bit slower than `fread`, but with different functionality.
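Minimal sketches of both, assuming the files share a layout (the object names `DT` and `tbl` are illustrative):

```r
library(data.table)
files <- list.files(pattern = "\\.csv$")

# data.table: fread() each file, then bind them with rbindlist()
DT <- rbindlist(lapply(files, fread))

# readr: read_csv() each file, then bind them with dplyr::bind_rows()
library(readr)
library(dplyr)
tbl <- bind_rows(lapply(files, read_csv))
```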
Using `plyr::ldply`, there is roughly a 50% speed increase from enabling the `.parallel` option while reading 400 csv files of roughly 30-40 MB each. The example below includes a text progress bar.
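A sketch of both variants; the parallel backend registration is an assumption (any registered foreach backend works):

```r
library(plyr)
files <- list.files(pattern = "\\.csv$")

# Serial read with a text progress bar
df <- ldply(files, read.csv, .progress = "text")

# Parallel read (~50% faster on many large files);
# requires a registered parallel backend
library(doParallel)
registerDoParallel(cores = parallel::detectCores())
df <- ldply(files, read.csv, .parallel = TRUE)
```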
A speedy and succinct `tidyverse` solution (more than twice as fast as Base R's `read.csv`), and `data.table`'s `fread()` can even cut those load times by half again (for roughly a quarter of the Base R times); sketches of both follow.
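Minimal sketches of the two, assuming the csvs sit in the working directory (the object names are illustrative; the `stringsAsFactors` argument is explained just below):

```r
library(tidyverse)

# purrr + readr: map over the file names and row-bind the results
tbl <- list.files(pattern = "\\.csv$") %>%
  map_df(~ read_csv(.))

# data.table's fread() as the reader, still bound via map_df()
library(data.table)
tbl_fread <- list.files(pattern = "\\.csv$") %>%
  map_df(~ fread(., stringsAsFactors = FALSE))
```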
The `stringsAsFactors = FALSE` argument keeps the data frame factor-free. If the typecasting is being cheeky, you can force all the columns to be read as characters with the `col_types` argument. If you want to dip into subdirectories to construct your list of files to eventually bind, then be sure to include the path name and register the files with their full names in your list. This allows the binding work to go on outside of the current directory. (Think of the full pathnames as operating like passports that allow movement back across directory 'borders'.) Sketches of both variants follow.
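Sketches of those two variants (the subdirectory name is illustrative):

```r
library(tidyverse)

# Force every column to character via read_csv()'s col_types argument
tbl <- list.files(pattern = "\\.csv$") %>%
  map_df(~ read_csv(., col_types = cols(.default = "c")))

# Reach into a subdirectory: full.names = TRUE keeps the full path on
# each file name, so the binding can work outside the current directory
tbl <- list.files(path = "./subdirectory/", pattern = "\\.csv$",
                  full.names = TRUE) %>%
  map_df(~ read_csv(.))
```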
As Hadley describes here (about halfway down), `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`, but under the hood it is much more efficient.
Bonus feature - adding filenames to the records, per Nik's feature request in the comments below:

* Add the original `filename` to each record.

Code explained: make a function that appends the filename to each record during the initial reading of the tables, then use that function instead of the plain `read_csv()` call; a sketch follows. (The typecasting and subdirectory handling approaches can also be handled inside the `read_plus()` function, in the same manner as illustrated in the second and third variants suggested above.)
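A sketch of that helper; `read_plus()` is the name used above, while the `filename` column and the object name are illustrative:

```r
library(tidyverse)

# Wrap read_csv() so each record carries its source file's name
read_plus <- function(flnm) {
  read_csv(flnm) %>%
    mutate(filename = flnm)
}

tbl_with_sources <- list.files(pattern = "\\.csv$", full.names = TRUE) %>%
  map_df(~ read_plus(.))
```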
[Benchmark plots: "Middling Use Case", "Larger Use Case", and "Variety of Use Cases"; rows are file counts (1000, 100, 10), columns are final dataframe sizes (5 MB, 50 MB, 500 MB).]
The base R results are better for the smallest use cases, where the overhead of bringing the C libraries of purrr and dplyr to bear outweighs the performance gains observed on larger-scale processing tasks.
If you want to run your own tests, you may find this bash script helpful.
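A minimal sketch of such a script, matching the behaviour described below (the copy count and the naming scheme are taken from that description):

```bash
#!/bin/bash
# Usage: bash what_you_name_this_script.sh "fileName_you_want_copied" 100
# Makes $2 copies of $1, each numbered after the first 8 characters
# of the file name plus an underscore
for ((i = 1; i <= $2; i++)); do
  cp "$1" "${1:0:8}_${i}.csv"
done
```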
`bash what_you_name_this_script.sh "fileName_you_want_copied" 100`

will create 100 copies of your file, sequentially numbered (after the initial 8 characters of the filename and an underscore).

Attributions and Appreciations
With special thanks to those who introduced me to `map_df()` here.

You can use the superb `sparklyr` package for this.
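A minimal sketch, assuming a local Spark installation and csvs under a data/ folder; the connection settings, folder, and table name are illustrative:

```r
library(sparklyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Read every csv in the folder at once via a wildcard path
df <- spark_read_csv(sc, name = "data", path = "data/*.csv",
                     header = TRUE, memory = FALSE)
```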