I have been asked to change a program that currently exports .Rdata files so that it exports in a 'platform independent binary format' such as HDF5 or netCDF. Two reasons were given:
- Rdata files can only be read by R
- binary information is stored differently depending on operating systems or architecture
I also found that the "R Data import export manual" does not discuss Rdata files although it does discuss HDF5 and netCDF.
A discussion on R-help suggests that .Rdata files are platform independent.
Questions:
- To what extent are these concerns valid?
- e.g. can Matlab read .Rdata without invoking R?
- Are other formats more useful in this respect than .Rdata files?
- Would it be possible to write a script that would create .hdf5 analogues of all .Rdata files, minimizing changes to the program itself?
Here are a variety of answers:
Abundance of options: First, the concern is valid, but your list of choices is a little narrower than it should be. HDF5 and netCDF4 are excellent options, and they work well with Python, Matlab, and many other systems. HDF5 is superior to Python's pickle storage in many ways: check out PyTables and you'll very likely see good speedups. Matlab used to have (and may still have) some issues with how large cell (or maybe struct) arrays are stored in HDF5. It's not that it can't do it, but that it was god-awful slow. That's Matlab's problem, not HDF5's. While these are great choices, you may also consider whether HDF5 is adequate: if you have some very large files, you could benefit from a proprietary encoding, either for speed of access or for compression. It's not too hard to do raw binary storage in any language, and you could easily design something like the file storage of bigmemory (i.e. for speed of access). In fact, you could even use bigmemory files in other languages; it's really a very simple format. HDF5 is certainly a good starting point, but there is no one universal solution for data storage and access, especially when one gets to very large data sets. (For smaller data sets, you might also take a look at Protocol Buffers or other serialization formats; Dirk did RProtoBuf for accessing these in R.) For compression, see the next suggestion.
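To make the "raw binary storage" remark concrete, here is a minimal sketch in base R. The file layout (two 4-byte integers for the dimensions, then the values as 8-byte doubles in column-major order, all big-endian) and the function names write_matrix_bin() and read_matrix_bin() are assumptions made for illustration; this is not bigmemory's actual format.

```r
# Hypothetical layout: int32 nrow, int32 ncol, then nrow * ncol float64 values
# in column-major order. Everything is written big-endian so that the bytes
# mean the same thing on any platform or in any language.
write_matrix_bin <- function(m, path) {
  stopifnot(is.matrix(m), is.numeric(m))
  con <- file(path, "wb")
  on.exit(close(con))
  writeBin(dim(m), con, size = 4L, endian = "big")        # header: dimensions
  writeBin(as.double(m), con, size = 8L, endian = "big")  # payload: values
  invisible(path)
}

read_matrix_bin <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  d <- readBin(con, what = "integer", n = 2L, size = 4L, endian = "big")
  x <- readBin(con, what = "double", n = prod(d), size = 8L, endian = "big")
  matrix(x, nrow = d[1], ncol = d[2])
}

m <- matrix(rnorm(12), nrow = 3)
path <- tempfile(fileext = ".bin")
write_matrix_bin(m, path)
m2 <- read_matrix_bin(path)
stopifnot(identical(m, m2))  # values survive the round trip exactly
```

A reader in another language only needs to know the same three facts: the header, the byte order, and the column-major ordering.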
Size: As Dirk mentioned, file formats can be described as application-neutral or application-dependent. Another axis is domain-independent (or domain-ignorant) versus domain-dependent (domain-smart ;-)) storage. If you have some knowledge of how your data will arise, especially any information that can be used in compression, you may be able to build a better format than anything standard compressors can achieve. This takes a bit of work. Compressors other than gzip and bzip2 also allow you to analyze large volumes of data and develop appropriate compression "dictionaries", so that you can get much better compression than you would with .Rdata files. For many kinds of datasets, storing the delta between successive rows in a table is a better option: it can lead to much greater compressibility (e.g. lots of 0s may appear), but only you know whether that will work for your data.
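The delta idea can be tried in a few lines of base R. This is only a sketch on made-up data (a slowly increasing integer series standing in for something like timestamps or counters); whether real data behaves this way depends entirely on the data.

```r
set.seed(1)
# Hypothetical data: an integer series that increases by 0 to 3 per row.
x <- cumsum(sample(0:3, 1e5, replace = TRUE)) + 1000000L

# Turn an integer vector into its raw 4-byte representation.
to_raw <- function(v) writeBin(as.integer(v), raw())

# Keep the first value, then store only row-to-row differences.
delta <- c(x[1], diff(x))

plain_size <- length(memCompress(to_raw(x), type = "gzip"))
delta_size <- length(memCompress(to_raw(delta), type = "gzip"))

# The original series is recovered exactly with a cumulative sum.
restored <- cumsum(delta)
stopifnot(identical(restored, x))

c(plain = plain_size, delta = delta_size)  # compressed sizes in bytes
```

On this kind of series the delta-encoded version compresses to far fewer bytes, because the differences take only a handful of distinct small values.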
Speed and access: .Rdata does not support random access. It does not have built-in support for parallel I/O (though you can serialize to a parallel I/O storage, if you wish). There are many things one could do here to improve matters, but gluing things onto .Rdata over and over is death by a thousand cuts compared with just switching to a different storage mechanism and blowing the speed and access issues away. (This isn't just an advantage of HDF5: I have frequently used multicore functions to parallelize other I/O methods, such as bigmemory.)
Update capabilities: R does not have a very nice way to add objects to a .Rdata file. It does not, to my knowledge, offer any "viewers" that allow users to visually inspect or search through a collection of .Rdata files. It does not, to my knowledge, offer any built-in versioning or record-keeping for objects in the file. (I do this via a separate object in the file, which records the versions of the scripts that generated the objects, but I will outsource that to SQLite in a future iteration.) HDF5 has all of these. (Random access also affects updating: with .Rdata files, you have to save the whole object.)
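To illustrate why adding objects is awkward, here is a sketch of what "appending" to a .Rdata file amounts to in base R. The helper name add_to_rdata() is hypothetical; the point is that every existing object has to be loaded and the whole file rewritten.

```r
# Hypothetical helper: add one object to an existing .Rdata file.
# Everything already stored in the file is loaded into memory and saved again.
add_to_rdata <- function(path, name, value) {
  e <- new.env()
  if (file.exists(path)) load(path, envir = e)   # read ALL existing objects
  assign(name, value, envir = e)                 # add (or overwrite) one
  save(list = ls(e, all.names = TRUE), envir = e, file = path)  # rewrite all
  invisible(ls(e, all.names = TRUE))
}

path <- tempfile(fileext = ".Rdata")
a <- 1:10
save(a, file = path)
add_to_rdata(path, "b", letters)

check <- new.env()
load(path, envir = check)
sort(ls(check))  # "a" "b"
```

For a file holding a few small objects this is harmless; for a multi-gigabyte file it means a full read and a full write for every addition.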
Communal support: Although I've advocated your own format, that is for extreme data sizes. Having libraries built for many languages is very helpful in reducing the friction of exchanging data. For most simple datasets (and simple still means "fairly complex" in most cases) or moderate to fairly large datasets, HDF5 is a good format. There are ways to beat it on specialized systems, certainly. Still, it is a nice standard, and using it means less organizational effort spent supporting either a proprietary or an application-specific format. I have seen organizations stick to a format for many years past the use of the application that generated the data, just because so much code was written to load and save in that application's format and GBs or TBs of data were already stored in it (this could be you & R someday, but this arose from a different statistical suite, one that begins with the letter "S" and ends with the letter "S" ;-)). That's a very serious friction for future work. If you use a widespread standard format, you can then port between it and other widespread standards with much greater ease: it's very likely someone else has decided to tackle the same problem, too. Give it a try: if you write the converter now but don't actually put it to use, at least you have created a tool that others can pick up if there comes a time when it's necessary to move to another data format.
Memory: With .Rdata files, you have to load or attach the file in order to access objects. Most of the time, people load the file. Well, if the file is very big, there goes a lot of RAM. So, either one is a bit smarter about using attach or one separates objects into multiple files. This is quite a nuisance for accessing small parts of an object. To that end, I use memory mapping. HDF5 allows for random access to parts of a file, so you need not load all of your data just to access a small part. It's just part of the way things work. So, even within R, there are better options than just .Rdata files.
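As a rough illustration of reading only part of a file, here is a base R sketch that pulls one column out of a matrix stored as raw doubles, using seek() rather than loading everything. It is neither memory mapping nor HDF5, just the same principle in miniature; the helper read_column() and the headerless file layout are assumptions for the example. Note that R's own documentation discourages relying on seek() on Windows.

```r
# Write a 1000 x 50 matrix as raw column-major doubles (no header).
n_row <- 1000L
n_col <- 50L
m <- matrix(as.double(seq_len(n_row * n_col)), nrow = n_row)
path <- tempfile(fileext = ".bin")
writeBin(as.double(m), path)  # writeBin() also accepts a file name

# Hypothetical helper: read only column j by jumping to its byte offset.
read_column <- function(path, j, n_row, bytes = 8L) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = (j - 1) * n_row * bytes, origin = "start")
  readBin(con, what = "double", n = n_row, size = bytes)
}

col7 <- read_column(path, 7L, n_row)
stopifnot(identical(col7, m[, 7]))  # only 8000 of the 400000 bytes were read
```

Nothing comparable is possible with load(), which always restores whole objects.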
Scripts for conversion: As for your question about writing a script, yes, you can write a script that loads objects and saves them into HDF5. However, it is not necessarily wise to do this on a huge set of heterogeneous files unless you have a good understanding of what's going to be created. I couldn't begin to design this for my own datasets: there are too many one-off objects in there, and creating a massive HDF5 file library would be ridiculous. It's better to think of it like starting a database: what will you want to store, how will you store it, and how will it be represented and accessed?
Once you get your data conversion plan in place, you can then use tools like Hadoop or even basic multicore functionality to unleash your conversion program and get this done as quickly as possible.
In short, even if you stay in R, you are well advised to look at other possible storage formats, especially for large, growing data sets. If you have to share data with others, or at least provide read or write access, then other formats are very much advised. There's no reason to spend your time maintaining readers/writers for other languages: it's just data, not code. :) Focus your code on how to manipulate data in sensible ways, rather than spending time working on storage; other people have done a very good job on that already.
(Binary) file formats come in two basic flavors:
- application-neutral ones, supported by public libraries and APIs (both netCDF and HDF5 fall into this camp), which facilitate the exchange of data among different programs and applications, provided these are extended with add-on packages using the APIs;
- application-specific ones, designed to work with only one program, albeit more efficiently: that is what .RData does.
Because R is open-source, you could re-create the RData format from your Matlab files: nothing stops you from writing a proper mex file that does so. Maybe someone has even done it already. There is no technical reason not to try, but the other route may be easier if both applications meant to share the data support the format equally well.
For what it is worth, back in the early/mid-1990s, I wrote my own C code to write simulation files in the binary format used by Octave (which I then used to slice the data). Being able to do this with open source software is a big plus.
I think I can answer some, but not all of these questions.
Well, anybody who puts their mind to it can probably read an .Rdata file directly, but it's hard work and not much benefit. So I doubt that Matlab has done that. As you might recall, R can read various other system formats precisely because someone put in a lot of effort to do so.
For text formats, CSV seems pretty "standard", but for binary formats I don't know. And CSV is not a good standard at that: how dates and quotes in particular are handled varies wildly (and of course it only works for data tables).
Of course!
Example:
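for (f in list.files(".", pattern = "\\.Rdata$")) {
  e <- new.env()
  load(f, envir = e)  # load all objects from the file into environment e
  x <- as.list(e)     # named list of every object that was in the file
  # saveInOtherFormat() is a placeholder for whatever writer you choose:
  # saveInOtherFormat(x, file = sub("\\.Rdata$", ".Other", f))
}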
Point 2 is wrong: binary .RData files are portable across hardware & OS platforms. To quote from the help page for ?save:
All R platforms use the XDR (bigendian) representation of C ints and doubles in binary save-d files, and these are portable across all R platforms.
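What a fixed byte order means can be seen directly in base R. This is only an illustration of big-endian versus little-endian output and of the xdr argument of serialize(); it does not inspect an actual .RData file.

```r
x <- 1L

# The same integer written with an explicit byte order:
as.integer(writeBin(x, raw(), endian = "big"))     # 0 0 0 1
as.integer(writeBin(x, raw(), endian = "little"))  # 1 0 0 0

# serialize() uses the XDR (big-endian) representation by default, so the
# byte order of the result does not depend on the machine that wrote it.
bytes <- serialize(c(1.5, 2.5), connection = NULL, xdr = TRUE)
stopifnot(identical(unserialize(bytes), c(1.5, 2.5)))
```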
Point 1 is a function of what the data are, and what other programs might usefully be applied to the data. If your code base uses save() to write specified objects which are data frames or matrices, you could easily write a small function save2hdf() to write them out as HDF or netCDF binary files, then use sed to change all occurrences of save( to save2hdf( in your codebase. netCDF at least will have a performance hit on the reads, but not too bad of a hit. If your code saves objects such as lists of heterogeneous objects, you probably can't use netCDF or HDF without a great deal of recoding to write out separate component objects.
Also note that netCDF 4 is still problematic in R.
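The save2hdf() idea mentioned above could look roughly like the following. This is a sketch under assumptions, not a tested drop-in: it assumes the Bioconductor package rhdf5 is installed, it mirrors only the save(..., file = ) calling style, and it deliberately handles nothing beyond numeric or character vectors, matrices and data frames.

```r
# Hypothetical wrapper with the same calling style as save(obj1, obj2, file = ...).
# Assumes the Bioconductor package 'rhdf5' is available.
save2hdf <- function(..., file) {
  if (!requireNamespace("rhdf5", quietly = TRUE)) {
    stop("package 'rhdf5' is required for this sketch")
  }
  objs <- list(...)
  # Recover the variable names used in the call, as save() does.
  names(objs) <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  if (file.exists(file)) unlink(file)  # start from a fresh file
  rhdf5::h5createFile(file)
  for (nm in names(objs)) {
    obj <- objs[[nm]]
    if (is.data.frame(obj) || is.numeric(obj) || is.character(obj)) {
      rhdf5::h5write(obj, file, nm)    # one dataset per object
    } else {
      warning("skipping '", nm, "': type not handled in this sketch")
    }
  }
  invisible(file)
}

# Usage, run only if rhdf5 is installed:
if (requireNamespace("rhdf5", quietly = TRUE)) {
  m <- matrix(1:6, nrow = 2)
  out <- tempfile(fileext = ".h5")
  save2hdf(m, file = out)
  stopifnot(all(rhdf5::h5read(out, "m") == m))
}
```

Lists of heterogeneous objects fall into the "skipped" branch here, which is exactly the case where more design work is needed.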