I wanted to bring this up, just because it's crazy weird. Maybe Wes has some idea. The file is pretty regular: 1100 rows x ~3M columns of tab-separated data, consisting solely of the integers 0, 1, and 2. The memory consumption below is clearly not what I expected.
If I prepopulate a dataframe as below, it consumes ~26GB of RAM.
import pandas as pd

# read only the header line to get the ~3M column names
with open("ms.txt") as h:
    header = h.readline().rstrip("\n").split("\t")

rows = 1100
# pre-allocating the empty 1100 x ~3M frame consumes ~26GB of RAM
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
System info:
- python 2.7.9
- ipython 2.3.1
- numpy 1.9.1
- pandas 0.15.2
Any ideas welcome.
Problem of your example.
Trying your code at a small scale, I notice that even if you set dtype=int, you actually end up with dtype=object in the resulting dataframe. This is because even though you tell the pd.DataFrame constructor that the columns are dtype=int, it cannot override the dtype that is ultimately determined by the data in the columns, because pandas is tightly coupled to numpy and numpy dtypes.
The problem is that there is no data in your created dataframe, so numpy defaults the data to np.NaN, which does not fit in an integer. This means numpy gets confused and falls back to the object dtype.
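A small-scale check (not from the original post) illustrating this fallback; exact behavior can vary across pandas versions, but on the 0.15-era pandas discussed here the columns come back as object:

import pandas as pd

# no data is supplied, so the frame is filled with np.NaN
# even though dtype=int was requested
df = pd.DataFrame(columns=["a", "b", "c"], index=range(3), dtype=int)
print(df.dtypes)  # on the pandas version in question, these report object, not int64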
Problem of the object dtype.
Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
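A rough way to see that overhead; memory_usage(deep=True) is an assumption here and may not exist on very old pandas versions:

import numpy as np
import pandas as pd

n = 10**6
as_object = pd.Series([1] * n, dtype=object)      # each cell holds a boxed Python int
as_int64 = pd.Series(np.ones(n, dtype=np.int64))  # 8 bytes per cell

# deep=True also counts the Python objects behind the pointers, so the
# object column reports several times the footprint of the int64 one
print(as_object.memory_usage(deep=True))
print(as_int64.memory_usage(deep=True))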
Workaround for your example.
Allocate the dataframe with dtype=float instead. This works just fine, since np.NaN can live in a float; the columns come out as float64, and the frame should take less memory.
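A sketch of that allocation, with a small stand-in for the header and rows built from ms.txt in the question:

import pandas as pd

# small stand-in for the real header; the actual file has ~3M column names
header = ["c%d" % i for i in range(5)]
rows = 1100

# np.NaN fits in a float, so the requested dtype is preserved
df = pd.DataFrame(columns=header, index=range(rows), dtype=float)
print(df.dtypes.value_counts())  # all columns are float64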
More on how to relate to dtypes
See this related post for details on dtype: Pandas read_csv low_memory and dtype options
I faced a similar problem with 3 GB of data today. I just made a small change to my coding style: instead of the file.read() and file.readline() methods, I used the code below, which loads only one line at a time into RAM.
Here is code to convert your data into a pandas dataframe.
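The code block itself did not survive in this copy; below is a minimal sketch of the line-at-a-time idea described above, assuming the tab-separated 0/1/2 file from the question (ms.txt). Note the parsed rows still accumulate in a Python list before the dataframe is built:

import pandas as pd

parsed_rows = []
with open("ms.txt") as f:
    header = f.readline().rstrip("\n").split("\t")
    # iterating the file object reads one line at a time
    # instead of pulling the whole file into RAM with file.read()
    for line in f:
        parsed_rows.append([int(x) for x in line.rstrip("\n").split("\t")])

df = pd.DataFrame(parsed_rows, columns=header)
print(df.shape)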