I am trying to read a large csv file (aprox. 6 GB) in pandas and i am getting the following memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
450 infer_datetime_format=infer_datetime_format)
451
--> 452 return _read(filepath_or_buffer, kwds)
453
454 parser_f.__name__ = name
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
242 return parser
243
--> 244 return parser.read()
245
246 _parser_defaults = {
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
693 raise ValueError('skip_footer not supported for iteration')
694
--> 695 ret = self._engine.read(nrows)
696
697 if self.options.get('as_recarray'):
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
1137
1138 try:
-> 1139 data = self._reader.read(nrows)
1140 except StopIteration:
1141 if nrows is None:
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader.read (pandas\parser.c:7145)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7369)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_rows (pandas\parser.c:8194)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_column_data (pandas\parser.c:9402)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_tokens (pandas\parser.c:10057)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10361)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser._try_int64 (pandas\parser.c:17806)()
MemoryError:
Any help on this??
The function read_csv and read_table is almost the same. But you must assign the delimiter “,” when you use the function read_table in your program.
solulation1:
Using pandas with large data
solulation2:
For large data l recommend you use the library "dask"
e.g:
Chunking shouldn't always be the first port of call for this problem.
1. Is the file large due to repeated non-numeric data or unwanted columns?
If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via pd.read_csv
usecols
parameter.2. Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API.
3. If all else fails, read line by line via chunks.
Chunk via pandas or via csv library as a last resort.
In addition to the answers above, for those who want to process CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files and it deals with data schema changes (added/removed columns). Chunked out of core support is already built in.
If you use pandas read large file into chunk and then yield row by row, here is what I have done