I am trying to read a 1.2 GB CSV file that contains 25K records, each consisting of an id and a large string.
However, at around 10K rows I get this error:
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
This seems weird, since the VM has 140 GB of RAM, and at 10K rows the memory usage is only around 1%.
This is the command I use:
import pandas as pd

df = pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'])
I also ran the following dummy program, which successfully filled up my memory to close to 100%.
# Dummy test: keep appending ever-longer strings until memory fills up
strings = []
strings.append("hello")
while True:
    strings.append("hello" + strings[-1])
You can use df.info(memory_usage="deep") to find out how much memory the data loaded into the data frame actually uses. A few things to reduce memory: pass usecols to load only the columns you need, and check table.dtypes and set dtype="category" for columns with few distinct values. In my experience this reduced the memory usage drastically.
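A minimal sketch combining these suggestions, reusing the question's column names; loading only 'id' and 'code' is just for illustration, and treating 'code' as a category assumes that column has relatively few distinct values:

import pandas as pd

# Load only the columns that are needed, and store repeated strings once
# by using the category dtype for the (assumed) low-cardinality 'code' column.
df = pd.read_csv('file.csv', header=None,
                 names=['id', 'text', 'code'],
                 usecols=['id', 'code'],
                 dtype={'code': 'category'})

# Report the real memory footprint, including object/string columns.
df.info(memory_usage="deep")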
This sounds like a job for chunksize. It splits the input into multiple chunks, reducing the memory required for reading.
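For instance, a sketch using the question's column names; the chunk size of 1,000 rows is an arbitrary choice, and the per-chunk work here is just counting rows:

import pandas as pd

# Read 1,000 rows at a time so the parser never has to hold the whole file;
# each chunk is only counted here, but any per-chunk processing would go in
# the loop body instead.
total_rows = 0
for chunk in pd.read_csv('file.csv', header=None,
                         names=['id', 'text', 'code'],
                         chunksize=1000):
    total_rows += len(chunk)
print(total_rows)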
This is weird. I actually ran into the same situation.
But after I tried a lot of stuff to solve this error, it started working with a couple of modified versions of the call. And then, suddenly, the original version worked fine as well!
It feels like I did some useless work, and I still have no idea what really went wrong.
I don't know what to say.
This error can also be caused by an invalid CSV file, rather than by an actual memory problem.
I got this error with a file that was much smaller than my available RAM and it turned out that there was an opening double quote on one line without a closing double quote.
In this case, you can check the data, or change the quoting behavior of the parser, for example by passing quoting=3 to pd.read_csv.
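For example, with the same file and columns as in the question:

import csv
import pandas as pd

# quoting=3 is csv.QUOTE_NONE: double quotes are treated as ordinary
# characters, so an unbalanced quote can no longer swallow the rest of the file.
df = pd.read_csv('file.csv', header=None,
                 names=['id', 'text', 'code'],
                 quoting=csv.QUOTE_NONE)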