What determines the size of a saved object in R?

Posted 2019-06-20 18:40

Question:

When I save an object from R using save(), what determines the size of the saved file? It is clearly not the same as (or even close to) the size of the object reported by object.size().

Example: I read a data frame and saved it using

snpmat <- read.table("Heart.txt.gz", header = TRUE)
save(snpmat, file = "datamat.RData")

The size of the file datamat.RData is 360MB.

> object.size(snpmat)
4998850664 bytes        #Much larger

Then I performed some regression analysis and obtained another data frame, adj.snpmat, of the same dimensions (6,820,000 rows and 80 columns).

> object.size(adj.snpmat)
4971567760 bytes       

I saved it using

> save(adj.snpmat, file = "adj.datamat.RData")

Now the size of the file adj.datamat.RData is 3.3 GB. I'm confused as to why the two files differ so much in size when object.size() reports similar sizes for the two objects. Any insight into what determines the size of a saved object is welcome.
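For reference, save() compresses its output by default (gzip), so the on-disk size depends on how compressible the data is rather than on object.size() alone. A minimal sketch of how one could check this, assuming snpmat is still in the workspace (the file names here are just placeholders):

save(snpmat, file = "datamat_uncompressed.RData", compress = FALSE)  # no compression
save(snpmat, file = "datamat_compressed.RData")                      # gzip by default
file.size("datamat_uncompressed.RData") / 1024^2   # size on disk in MB
file.size("datamat_compressed.RData") / 1024^2
object.size(snpmat)                                 # in-memory size in bytes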

Some more information:

> typeof(snpmat)
[1] "list"

> class(snpmat)
[1] "data.frame"

> typeof(snpmat[,1])
[1] "integer"

> typeof(snpmat[,2])
[1] "double"         #This is true for all columns except column 1

> typeof(adj.snpmat)
[1] "list"

> class(adj.snpmat)
[1] "data.frame"

> typeof(adj.snpmat[,1])
[1] "character"

> typeof(adj.snpmat[,2])
[1] "double"         #This is true for all columns except column 1

Answer 1:

Your two data sets are very different and therefore compress very differently.

SNP data contains only a few distinct values (e.g., 1 or 0) and is also very sparse. This makes it very easy to compress. For example, if you had a matrix of all zeros, you could compress it down to just a single value (0) plus the dimensions.
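To make that concrete, here is a small sketch of the all-zeros case (sizes are approximate and the file name is just a placeholder):

zeros <- matrix(0, nrow = 1e6, ncol = 10)   # ~80 MB of identical values in memory
object.size(zeros)
save(zeros, file = "zeros.RData")
file.size("zeros.RData")                    # only a tiny fraction of that on disk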

Your regression output contains many distinct values, and they are real numbers (I'm assuming p-values, coefficients, etc.). This makes it much less compressible.
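By contrast, a sketch along the same lines with random real numbers (again with a placeholder file name) shows almost no benefit from compression:

reals <- matrix(rnorm(1e7), nrow = 1e6, ncol = 10)   # ~80 MB of distinct doubles
object.size(reals)
save(reals, file = "reals.RData")
file.size("reals.RData")                             # stays close to the in-memory size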