Python Pandas Error tokenizing data

2019-01-01 00:15发布

I'm trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing.

My code is simple:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language ?

File is from Morningstar

24条回答
千与千寻千般痛.
2楼-- · 2019-01-01 00:57

It might be an issue with

  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.

According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.

查看更多
公子世无双
3楼-- · 2019-01-01 00:58

I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:

data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep the lines an ugly kind of hack for handling the errors is to do something like the following:

line     = []
expected = []
saw      = []     
cont     = True 

while cont == True:     
    try:
        data = pd.read_csv('file1.csv',skiprows=line)
        cont = False
    except Exception as e:    
        errortype = e.message.split('.')[0].strip()                                
        if errortype == 'Error tokenizing data':                        
           cerror      = e.message.split(':')[1].strip().replace(',','')
           nums        = [n for n in cerror.split(' ') if str.isdigit(n)]
           expected.append(int(nums[0]))
           saw.append(int(nums[2]))
           line.append(int(nums[1])-1)
         else:
           cerror      = 'Unknown'
           print 'Unknown Error - 222'

if line != []:
    # Handle the errors however you want

I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.

查看更多
看风景的人
4楼-- · 2019-01-01 00:59

following sequence of commands works (I lose the first line of the data -no header=None present-, but at least it loads):

df = pd.read_csv(filename, usecols=range(0, 42)) df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

Following does NOT work:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54 Following does NOT work:

df = pd.read_csv(filename, header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, in your problem you have to pass usecols=range(0, 2)

查看更多
余欢
5楼-- · 2019-01-01 00:59

Use delimiter in parameter

pd.read_csv(filename, delimiter=",", encoding='utf-8')

It will read.

查看更多
高级女魔头
6楼-- · 2019-01-01 01:01

I had a similar error and the issue was that I had some escaped quotes in my csv file and needed to set the escapechar parameter appropriately.

查看更多
流年柔荑漫光年
7楼-- · 2019-01-01 01:01

Issue could be with file Issues, In my case, Issue was solved after renaming the file. yet to figure out the reason..

查看更多
登录 后发表回答