Python Pandas Error tokenizing data

Question:

I'm trying to use pandas to manipulate a .csv file, but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing.

My code is simple:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language?

The file is from Morningstar.

Answer 1:

You could also try:

data = pd.read_csv('file1.csv', error_bad_lines=False)
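
Note that error_bad_lines has been deprecated since pandas 1.3 (and removed in 2.0). On newer versions, assuming you simply want to drop the malformed rows, the equivalent is:

data = pd.read_csv('file1.csv', on_bad_lines='skip')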


Answer 2:

It might be an issue with

  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.

According to the docs, the delimiter should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I, however, have not had good luck with this, including instances with obvious delimiters.
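
For what it's worth, you can invoke that automatic detection explicitly by passing sep=None together with the python engine (the C engine cannot sniff delimiters). A minimal sketch, reusing the fileName variable from above:

df = pandas.read_csv(fileName, sep=None, engine='python', header=None)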



Answer 3:

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.

Try it with data = pd.read_csv(path, skiprows=2)



Answer 4:

Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

1) Change the CSV file to have a dummy first line with max number of columns (and specify header=[0])

2) Or use names = list(range(0,N)) where N is the max number of columns, as sketched below.
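
A minimal sketch of option 2, assuming a file named file1.csv and a maximum of 12 columns (adjust N for your data); rows with fewer fields are padded with NaN:

import pandas as pd

N = 12  # assumed maximum number of columns in any row
df = pd.read_csv('file1.csv', header=None, names=list(range(N)))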



Answer 5:

I had this problem as well, but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. The following works, but it simply ignores the bad lines:

data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep the lines, an ugly hack for handling the errors is to do something like the following:

import pandas as pd

line     = []   # 0-based indices of the bad rows, to be skipped on retry
expected = []   # number of fields the parser expected
saw      = []   # number of fields it actually saw
cont     = True

while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        errortype = str(e).split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror = str(e).split(':')[1].strip().replace(',', '')
            nums   = [n for n in cerror.split(' ') if n.isdigit()]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')

if line:
    # Handle the errors however you want
    pass

I proceeded to write a script to reinsert the lines into the DataFrame, since the bad lines are given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
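
On pandas 1.4+ the same idea takes far less machinery: on_bad_lines accepts a callable (python engine only) that receives each bad row as a list of fields, so you can collect the offenders instead of looping. A minimal sketch, assuming the same file1.csv:

import pandas as pd

bad_lines = []  # each entry is the list of fields pandas saw on a bad row

def collect(fields):
    bad_lines.append(fields)  # keep the row for later inspection/repair
    return None               # returning None drops the row from the result

data = pd.read_csv('file1.csv', engine='python', on_bad_lines=collect)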



Answer 6:

This can be a delimiter issue: many files with a .csv extension are actually created tab-separated, so try read_csv with the tab character (\t) as the separator:

data = pd.read_csv("File_path", sep='\t')


Answer 7:

I've had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by "properly", I mean each row had the same number of separators or columns.

Typically it happened because I had opened the CSV in Excel and then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.

Any file saved with pandas to_csv will be properly formatted and shouldn't have that issue. But if you open it with another program, it may change the structure.
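
If you suspect inconsistent row lengths, a quick hedged diagnostic (using the filename from the question; note that commas inside quoted fields will skew the tally) is to count delimiters per line; a well-formed CSV should show essentially a single count:

from collections import Counter

# count commas per line; ragged rows show up as multiple distinct counts
with open('GOOG Key Ratios.csv') as f:
    counts = Counter(line.count(',') for line in f)
print(counts)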

Hope that helps.



Answer 8:

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.

Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (more than one line, so skip_header doesn't work) which is not separated by the same number of commas as your actual data (when using read_csv). Using read_table uses a tab as the delimiter, which could circumvent the user's current error but introduce others.

I usually get around this by reading the extra data into a file and then using the read_csv() method.

The exact solution might differ depending on your actual file, but this approach has worked for me in several cases.
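
If the stray text is a fixed number of leading or trailing lines, a hedged alternative to splitting the file is pandas' own skiprows/skipfooter (skipfooter requires the python engine); the counts here are illustrative:

df = pd.read_csv(path, skiprows=3, skipfooter=2, engine='python')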



Answer 9:

I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine='c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with the C parsing engine (which is the default one). Maybe switching to the python engine will change something:

counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

Now that is a different error.
If we go ahead and try to remove spaces from the table, the error from the python engine changes once again:

1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

It becomes clear that pandas was having problems parsing our rows. To parse a table with the python engine, I needed to remove all spaces and quotes from the table beforehand. Meanwhile, the C engine kept crashing even with commas in the rows.

To avoid creating a new file with replacements I did this, as my tables are small:

from io import StringIO

with open(path_counts) as f:
    # strip the problematic quotes, spaces and NUL bytes before parsing
    data = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0', ''))
    counts = pd.read_table(data, sep='\t', index_col=2, header=None, engine='python')

tl;dr
Change the parsing engine, and try to avoid any non-delimiting quotes/commas/spaces in your data.



Answer 10:

Although not the case for this question, this error may also appear with compressed data. Explicitly setting the value of the compression kwarg resolved my problem.

result = pandas.read_csv(data_source, compression='gzip')
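
By default pandas uses compression='infer', which guesses the compression from the file extension; passing the kwarg explicitly, as above, matters mainly when the extension is missing or misleading.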


Answer 11:

The following sequence of commands works (I lose the first line of the data, since no header=None is present, but at least it loads):

df = pd.read_csv(filename, usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

The following does NOT work:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following does NOT work either:

df = pd.read_csv(filename, header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, for your problem you have to pass usecols=range(0, 2).



Answer 12:

Sometimes the problem is not how you use Python, but the raw data itself.
I got this error message:

Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.

It turned out that the description column sometimes contained commas. This means that the CSV file needs to be cleaned up or another separator used.
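
Note that embedded commas are harmless as long as the field is quoted. A minimal self-contained sketch (the sample data is illustrative):

import io
import pandas as pd

sample = 'id,description\n1,"a description, with a comma"\n'
df = pd.read_csv(io.StringIO(sample))  # parses cleanly into 2 columns
print(df)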



Answer 13:

Use pandas.read_csv('CSVFILENAME', header=None, sep=', ')

when trying to read csv data from the link

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my csv file. It had extra spaces, so I used sep=', ' and it worked :)
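
A hedged alternative that keeps the fast C engine (a multi-character sep is treated as a regex and silently falls back to the slower python engine) is skipinitialspace, which ignores the space after each comma; 'adult.data' here refers to the file from the link above:

import pandas as pd

data = pd.read_csv('adult.data', header=None, skipinitialspace=True)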



Answer 14:

An alternative that I have found to be useful in dealing with similar parsing errors uses the CSV module to re-route data into a pandas df. For example:

import csv
import pandas as pd

path = 'C:/FileLocation/'
file = 'filename.csv'

# once contents are available, put them in a list
csv_list = []
with open(path + file, 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        csv_list.append(row)

# now pandas has no problem getting them into a df
df = pd.DataFrame(csv_list)

I find the csv module to be a bit more robust with poorly formatted comma-separated files, and so have had success with this route to address issues like these.



Answer 15:

I had a dataset with preexisting row numbers, so I used index_col:

pd.read_csv('train.csv', index_col=0)


Answer 16:

This is what I did.

sep='::' solved my issue:

data = pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt', engine='python', header=None, sep='::')


Answer 17:

I had a similar case, and setting

train = pd.read_csv('input.csv', encoding='latin1', engine='python')

worked.



Answer 18:

Use the delimiter parameter:

pd.read_csv(filename, delimiter=",", encoding='utf-8')

It will read the file.



Answer 19:

I had the same problem with read_csv: ParserError: Error tokenizing data. I just saved the old csv file as a new csv file, and the problem was solved!



Answer 20:

I had a similar error and the issue was that I had some escaped quotes in my csv file and needed to set the escapechar parameter appropriately.
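
A minimal sketch of what that can look like, assuming backslash-escaped quotes inside a quoted field (the sample data is illustrative):

import io
import pandas as pd

sample = '1,"he said \\"hello\\""\n'  # raw field contains \" escapes
df = pd.read_csv(io.StringIO(sample), header=None, escapechar='\\')
# the second field parses as: he said "hello"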



Answer 21:

You can do this to avoid the problem:

train = pd.read_csv('/home/Project/output.csv', header=None)

Just add header=None.

Hope this helps!!



Answer 22:

The issue could be with the file itself. In my case, the problem was solved after renaming the file; I have yet to figure out the reason.



Answer 23:

I had received a .csv from a coworker, and when I tried to read it using pd.read_csv(), I received a similar error. It was apparently attempting to use the first row to generate the columns for the dataframe, but there were many rows which contained more columns than the first row would imply. I ended up fixing the problem by simply opening and re-saving the file as .csv and using pd.read_csv() again.



Answer 24:

Try: pandas.read_csv(path, sep=',', header=None)