After getting help from two kind gentlemen, I managed to switch from data.frame + plyr over to data.table.
The Situation and My Questions
As I worked on it, I noticed that peak memory usage nearly doubled, from 3.5 GB to 6.8 GB (according to Windows Task Manager), when I added one new column using :=
to my data set of ~200K rows by 2.5K columns.
I then tried 200M rows by 25 columns; there the increase was only from 6 GB to 7.6 GB, dropping back to 7.25 GB after a gc().
Specifically regarding the addition of new columns, Matt Dowle himself mentioned here that:
With its := operator you can:
- Add columns by reference
- Modify subsets of existing columns by reference, and by group by reference
- Delete columns by reference
None of these operations copy the (potentially large) data.table at all, not even once.
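For reference, here is a minimal sketch of those by-reference operations on a tiny throwaway table; the table name and column names below are invented purely for illustration:

library(data.table)
toy <- data.table(id = 1:5, grp = c("a", "a", "b", "b", "b"), x = runif(5))
toy[, flag := NA]                    # add a column by reference
toy[grp == "b", x := x * 10]         # modify a subset of an existing column by reference
toy[, grp_mean := mean(x), by = grp] # add/modify by group, by reference
toy[, flag := NULL]                  # delete a column by reference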
Question 1: why would adding a single column of 'NAs' for a DT with 2.5K columns double the peak memory used if the data.table is not copied at all?
Question 2: Why does the doubling not occur when the DT is 200M x 25? I didn't include the screenshot for this, but feel free to change my code and try it yourself.
Screenshots of Memory Usage Using the Test Code
Clean reboot, with only RStudio & MS Word open - 103 MB used
After running the DT creation code but before adding the column - 3.5 GB used
After adding 1 column filled with NA, but before gc() - 6.8 GB used
After running gc() - 3.5 GB used
Test Code
To investigate, I put together the following test code, which closely mimics my data set:
library(data.table)
set.seed(1)
# Credit: Dirk Eddelbuettel's answer in
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
# Generate N random dates between st and et, returned as "YYYY-MM-DD" strings
RandDate <- function(N, st = "2000/01/01", et = "2014/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))
  ev <- runif(N, 0, dt)
  as.character(strptime(st + ev, "%Y-%m-%d"))
}
# Create sample data: ~200K rows x 2.5K columns
# (3 character, 1 date, 600 integer, and the rest numeric columns)
TotalNoCol   <- 2500
TotalCharCol <- 3
TotalDateCol <- 1
TotalIntCol  <- 600
TotalNumCol  <- TotalNoCol - TotalCharCol - TotalDateCol - TotalIntCol
nrow         <- 200000
ColNames <- paste0("C", 1:TotalNoCol)
dt <- as.data.table( setNames( c(
  replicate( TotalCharCol, sample( state.name, nrow, replace = TRUE ), simplify = FALSE ),
  replicate( TotalDateCol, RandDate( nrow ), simplify = FALSE ),
  replicate( TotalNumCol,  round( runif( nrow, 1, 30 ), 2 ), simplify = FALSE ),
  replicate( TotalIntCol,  sample( 1:10, nrow, replace = TRUE ), simplify = FALSE ) ),
  ColNames ) )
gc()
# Add new column, to be run separately
dt[, New_Col := NA ]  # additional column; uses excessive memory?
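As a side check that the := assignment itself isn't copying the whole table, something like the snippet below could be run right after the code above. This is only a rough sketch: address() is exported by data.table, tracemem() is base R, and New_Col2 is just a throwaway column name I made up for the test.

tracemem(dt)                     # base R: prints a message if dt gets copied
before <- address(dt)            # data.table::address(): memory address of dt
dt[, New_Col2 := NA ]            # add another throwaway column by reference
identical(before, address(dt))   # TRUE suggests dt itself was not copied
untracemem(dt)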
Research Done
I didn't find much discussion on memory usage for DTs with many columns, only this, and even then it's not specifically about memory.
Most discussions on large data sets + memory usage involve DTs with very large row counts but relatively few columns.
My System
Intel i7-4700 with 4 cores/8 threads; 16 GB DDR3-12800 RAM; Windows 8.1 64-bit; 500 GB 7200 rpm HDD; 64-bit R; data.table ver. 1.9.4
Disclaimers
Please pardon me for using a non-R method (i.e. Task Manager) to measure the memory used. Memory measurement/profiling in R is something I still haven't figured out.
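If anyone wants to cross-check the Task Manager numbers from inside R, I believe something along these lines should work (gc() and object.size() are base/utils, tables() is from data.table, and memory.size() is Windows-only), though I haven't relied on it myself:

gc(reset = TRUE)                      # reset the "max used" counters
dt[, New_Col := NA ]
gc()                                  # "max used" column = peak since the reset
print(object.size(dt), units = "Mb")  # size of the data.table object itself
tables()                              # data.table's own summary, incl. size in MB
memory.size(max = TRUE)               # Windows only: max MB obtained from the OS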
Edit 1: After updating to data.table ver. 1.9.5 and re-running, the issue persisted, unfortunately.