I wanted to bring this up, just because it's crazy weird. Maybe Wes has some idea. The file is pretty regular: 1100 rows x ~3M columns of tab-separated data, consisting solely of the integers 0, 1, and 2. The memory consumption below is clearly not what I expected.
If I prepopulate a dataframe as below, it consumes ~26GB of RAM.
import pandas as pd

# read only the header line to get the ~3M column names
with open("ms.txt") as h:
    header = h.readline().rstrip("\n").split("\t")

rows = 1100
# pre-allocating the empty 1100 x ~3M frame consumes ~26GB of RAM
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
System info:
- python 2.7.9
- ipython 2.3.1
- numpy 1.9.1
- pandas 0.15.2
Any ideas welcome.
Problem of your example.
Trying your code at a small scale, I notice that even if you set dtype=int, you actually end up with dtype=object in the resulting dataframe. This is because even though you tell the pd.DataFrame constructor that the columns are dtype=int, it cannot override the dtype that is ultimately determined by the data in the columns, because pandas is tightly coupled to numpy and numpy dtypes.
The problem is that there is no data in your created dataframe, so numpy defaults the data to np.NaN, which does not fit in an integer. This means numpy gets confused and falls back to the object dtype.
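A small-scale check (not from the original post) illustrating this fallback; exact behavior can vary across pandas versions, but on the 0.15-era pandas discussed here the columns come back as object:

import pandas as pd

# no data is supplied, so the frame is filled with np.NaN
# even though dtype=int was requested
df = pd.DataFrame(columns=["a", "b", "c"], index=range(3), dtype=int)
print(df.dtypes)  # on the pandas version in question, these report object, not int64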
Problem of the object dtype.
Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
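A rough way to see that overhead; memory_usage(deep=True) is an assumption here and may not exist on very old pandas versions:

import numpy as np
import pandas as pd

n = 10**6
as_object = pd.Series([1] * n, dtype=object)      # each cell holds a boxed Python int
as_int64 = pd.Series(np.ones(n, dtype=np.int64))  # 8 bytes per cell

# deep=True also counts the Python objects behind the pointers, so the
# object column reports several times the footprint of the int64 one
print(as_object.memory_usage(deep=True))
print(as_int64.memory_usage(deep=True))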
Workaround for your example.
Allocate the dataframe with dtype=float instead. This works just fine, since np.NaN can live in a float; the columns come out as float64, and the frame should take less memory.
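A sketch of that allocation, with a small stand-in for the header and rows built from ms.txt in the question:

import pandas as pd

# small stand-in for the real header; the actual file has ~3M column names
header = ["c%d" % i for i in range(5)]
rows = 1100

# np.NaN fits in a float, so the requested dtype is preserved
df = pd.DataFrame(columns=header, index=range(rows), dtype=float)
print(df.dtypes.value_counts())  # all columns are float64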
More on how to relate to dtypes
See this related post for details on dtype: Pandas read_csv low_memory and dtype options
I faced a similar problem with 3 GB of data today. I just made a small change to my coding style: instead of the file.read() and file.readline() methods, I used the code below, which loads only one line at a time into RAM.
Here is code to convert your data into a pandas dataframe.
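The code block itself did not survive in this copy; below is a minimal sketch of the line-at-a-time idea described above, assuming the tab-separated 0/1/2 file from the question (ms.txt). Note the parsed rows still accumulate in a Python list before the dataframe is built:

import pandas as pd

parsed_rows = []
with open("ms.txt") as f:
    header = f.readline().rstrip("\n").split("\t")
    # iterating the file object reads one line at a time
    # instead of pulling the whole file into RAM with file.read()
    for line in f:
        parsed_rows.append([int(x) for x in line.rstrip("\n").split("\t")])

df = pd.DataFrame(parsed_rows, columns=header)
print(df.shape)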