When calling
df = pd.read_csv('somefile.csv')
I get:
/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.
Why is the dtype option related to low_memory, and why would making it False help with this problem?
The deprecated low_memory option
The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].
The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding: pandas tries to determine what dtype to set by analyzing the data in each column.
Dtype Guessing (very bad)
Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.
Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id values are always numbers. Since pandas cannot know that the column holds only numbers, it will probably keep the original strings until it has read the whole file.
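To make this concrete, here is a minimal sketch using made-up in-memory data instead of a real file:

import io
import pandas as pd

# Mostly numbers, but one string near the end: pandas cannot commit to an
# integer dtype, so the whole column ends up as the generic object dtype.
csvdata = "user_id\n1\n2\n3\nfoobar\n"
df = pd.read_csv(io.StringIO(csvdata))
print(df["user_id"].dtype)  # object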
Specifying dtypes (should always be done)
Adding dtype={'user_id': int} to the pd.read_csv() call tells pandas, as soon as it starts reading the file, that this column contains only integers.
Also worth noting: if the last line in the file had "foobar" written in the user_id column, the loading would crash when the above dtype is specified.
Example of broken data that breaks when dtypes are defined
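A minimal sketch of such a failure, again with made-up in-memory data:

import io
import pandas as pd

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""

# The last user_id cannot be parsed as an integer, so the read crashes, e.g.:
# ValueError: invalid literal for int() with base 10: 'foobar'
pd.read_csv(io.StringIO(csvdata), dtype={"user_id": int, "username": str})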
dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
What dtypes exist?
Numpy dtypes (for example int64, float64, bool, object and datetime64) are also accepted in pandas.
Pandas also adds two dtypes, categorical and datetime64[ns, tz], that are not available in numpy.
Pandas dtype reference
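As a small illustration of both kinds of dtype (the column names and data are made up):

import io
import pandas as pd

csvdata = "user_id,city,signup\n1,Paris,2015-01-01\n2,Oslo,2015-01-02\n"
df = pd.read_csv(
    io.StringIO(csvdata),
    dtype={"user_id": "int64", "city": "category"},  # numpy dtype plus pandas-only categorical
    parse_dates=["signup"],                           # parsed into datetime64[ns]
)
df["signup"] = df["signup"].dt.tz_localize("UTC")     # pandas-only datetime64[ns, UTC]
print(df.dtypes)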
Gotchas, caveats, notes
Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.
Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
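For example, with made-up data:

import io
import pandas as pd

df = pd.read_csv(io.StringIO("user_id\n1\n2\n"), dtype={"user_id": str})
print(df["user_id"].dtype)  # object — numpy stores text columns as generic objects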
Usage of converters
@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort, because the read_csv parser runs in a single process.
CSV files can be processed line by line, so they could be handled more efficiently by multiple converters in parallel, simply by cutting the file into segments and running multiple processes; pandas does not support this. But that is a different story.
As mentioned earlier by firelynx, if a dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like the one sketched below as a workaround, to change the values with an incompatible data type so that the data could still be loaded.
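A minimal sketch of that kind of converter, assuming numeric columns named COL_A and COL_B (the column names, file name and fallback value are illustrative):

import numpy as np
import pandas as pd

def conv(val):
    # Empty or unparsable values fall back to 0.0 instead of crashing the load.
    if not val:
        return np.float64(0)
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

df = pd.read_csv('somefile.csv', converters={'COL_A': conv, 'COL_B': conv})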
I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc...
This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.
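For example (the file name is illustrative):

import pandas as pd

# low_memory=False makes pandas read the whole file before deciding dtypes,
# at the cost of using more memory while parsing.
df = pd.read_csv('filename.csv', low_memory=False)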
Try:
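For example, something along these lines (the file name and separator are placeholders; the point is giving pandas an explicit dtype rather than letting it guess):

import pandas as pd

# Read every column as text so no type guessing happens during parsing.
df = pd.read_csv('somefile.csv', sep=',', dtype=str)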
According to the pandas documentation, dtype takes a type name or a dict of column -> type. As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though; the error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.