Reading bad csv files with garbage values

2019-01-27 03:10发布

问题:

I wish to read a csv file which has the following format using pandas:

    atrrth
    sfkjbgksjg
    airuqghlerig
    Name         Roll
    airuqgorqowi
    awlrkgjabgwl
    AAA          67
    BBB          55
    CCC          07

As you can see, if I use pd.read_csv, I get the fairly obvious error:

 ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2

But I wish to get the entire data into a dataframe. Using error_bad_lines = False will remove the important stuff and leave only the garbage values

These are the 2 of the possible column names as given below :

Name : [Name , NAME , Name of student] 
Roll : [Rollno , Roll , ROLL]

How to achieve this?

回答1:

Open the csv file and find a row from where the column name starts:

with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name','NAME')),
        enumerate(fp)
    ))[0]

The value will be stored in skip parameter

import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)

Works in Python 3.X



回答2:

I would like to suggest a slight modification/simplification to @RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:

with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())

This has an advantage for files with large numbers of trash lines.

A more detailed check could be something like this:

def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')

This function will handle all your possibilities, in any order, but also currently skip trash lines with spaces in them. You would use it as a filter:

names = next(line for line in fp if isheader(line))


回答3:

If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:

import pandas as pd

df = pd.read_csv('your.csv', skiprows=3)

Mind that skiprows can do much more. Check the docs.