R - Why adding 1 column to a data.table nearly doubles peak memory used


Question:

After getting help from 2 kind gentlemen, I managed to switch over to data.table from data.frame + plyr.

The Situation and My Questions

As I worked on it, I noticed that peak memory usage nearly doubled, from 3.5GB to 6.8GB (according to Windows Task Manager), when I added 1 new column using := to my data set of ~200K rows by 2.5K columns.

I then tried a 200M row by 25 column DT; the increase was from 6GB to 7.6GB, dropping to 7.25GB after a gc().
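For reference, one way to watch this peak from inside R rather than Task Manager (a sketch of my own, not how the numbers above were obtained) is gc()'s "max used" counters:

gc(reset = TRUE)        # reset the "max used" columns of gc()'s report
dt[, New_Col := NA ]    # the step being measured, e.g. the column addition from the test code below
gc()                    # "max used" now shows the peak memory since the reset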

Specifically regarding adding of new columns, Matt Dowle himself mentioned here that:

With its := operator you can :

Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once.
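For concreteness, a minimal sketch (on a small toy table, not my actual data) of the three by-reference operations listed above:

library(data.table)
toy <- data.table( grp = c("a", "a", "b"), x = 1:3 )

toy[, y := x * 2 ]               # add a column by reference
toy[grp == "a", y := 0 ]         # modify a subset of an existing column by reference
toy[, z := sum(x), by = grp ]    # modify/add by group, by reference
toy[, y := NULL ]                # delete a column by reference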

Question 1: Why would adding a single column of NAs to a DT with 2.5K columns double the peak memory used if the data.table is not copied at all?

Question 2: Why does the doubling not occur when the DT is 200M x 25? I didn't include a screenshot for this, but feel free to change my code and try it.

Screenshots of Memory Usage Using the Test Code

  1. Clean reboot, with RStudio & MS Word open - 103MB used

  2. After running the DT creation code but before adding the column - 3.5GB used

  3. After adding 1 column filled with NA, but before gc() - 6.8GB used

  4. After running gc() - 3.5GB used

Test Code

To investigate, I wrote the following test code that closely mimics my data set:

library(data.table)
set.seed(1)

# Credit: Dirk Eddelbuettel's answer in 
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
RandDate <- function(N, st = "2000/01/01", et = "2014/12/31") { 
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))   # window length in seconds
  ev <- runif(N, 0, dt)                                # random offsets within the window
  as.character( strptime(st + ev, "%Y-%m-%d") )        # return dates as character
}

# Create sample data: 200K rows by 2.5K columns of mixed types
TotalNoCol   <- 2500
TotalCharCol <- 3
TotalDateCol <- 1
TotalIntCol  <- 600
TotalNumCol  <- TotalNoCol - TotalCharCol - TotalDateCol - TotalIntCol
nrow         <- 200000

ColNames <- paste0("C", 1:TotalNoCol)

dt <- as.data.table( setNames( c(
  replicate( TotalCharCol, sample( state.name, nrow, replace = TRUE ), simplify = FALSE ),  # character cols
  replicate( TotalDateCol, RandDate( nrow ), simplify = FALSE ),                            # date (as character) col
  replicate( TotalNumCol,  round( runif( nrow, 1, 30 ), 2 ), simplify = FALSE ),            # numeric cols
  replicate( TotalIntCol,  sample( 1:10, nrow, replace = TRUE ), simplify = FALSE ) ),      # integer cols
  ColNames ) )

gc()

# Add New columns, to be run separately
dt[, New_Col := NA ]  # Additional col; uses excessive memory?
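To double-check that the data.table itself isn't copied by the assignment, one possible diagnostic (my own addition, not part of the original test) is to compare its memory address before and after:

data.table::address(dt)   # address of dt before adding the column
tracemem(dt)              # base R; prints a message if dt is duplicated (enabled in CRAN Windows builds)
dt[, New_Col := NA ]
data.table::address(dt)   # same address as before => dt itself was not copied
untracemem(dt)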

Research Done

I didn't find much discussion on memory usage for DTs with many columns, only this, but even then it's not specifically about memory.

Most discussions of large datasets + memory usage involve DTs with very large row counts but relatively few columns.

My System

Intel i7-4700 (4 cores/8 threads); 16GB DDR3-12800 RAM; Windows 8.1 64-bit; 500GB 7200rpm HDD; 64-bit R; data.table ver 1.9.4

Disclaimers

Please pardon me for using a 'non-R' method (i.e. Task Manager) to measure memory used. Memory measurement/profiling in R is something I still haven't figured out.
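For anyone in the same boat, a rough sketch of in-R alternatives (assuming the dt from the test code above is in memory; these are suggestions, not what produced the numbers above):

gc()                                        # current and max memory used by the R session
print( object.size(dt), units = "auto" )    # approximate size of the data.table itself
data.table::tables()                        # lists all data.tables in memory with their sizes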


Edit 1: After updating to data.table ver 1.9.5 and re-running, the issue persisted, unfortunately.

Answer 1:

(I can take no credit, as the great DT minds (Arun) have been working on this and found it was related to print.data.table. I'm just closing the loop here for other SO users.)

It seems this data.table memory issue with := was solved in R version 3.2, as noted in https://github.com/Rdatatable/data.table/issues/1062

[Quoting @Arun from GitHub issue 1062...]

fixed in R v3.2, IIUC, with this item from NEWS:

Auto-printing no longer duplicates objects when printing is dispatched to a method.

So others with this problem should look at upgrading to R 3.2.
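A quick sanity check (my suggestion, not from Arun) that your session already includes the fix:

if ( getRversion() < "3.2.0" ) {
  message( "R < 3.2: auto-printing can duplicate large objects; consider upgrading." )
}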