CSV to Python Dictionary with all column names?

2019-01-26 17:21发布

问题:

I'm still pretty new to using python to program from scratch so as an exercise I though I'd take a file that I process using SQL an try to duplicate the functionality using Python. It seems that I want to take my (compressed, zip) csv file and create a Dict of it (or maybe a dict of dicts?). When I use dict reader I get the 1st row as a key rather than each column as its own key? E.g.

import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file    = zipfile.ZipFile(sys.argv[0])
items_file  = zip_file.open('AllListing1RES.txt', 'rU')

for row in csv.DictReader(items_file,dialect='excel'):
    pass

Yields:

>>> for key in row:
        print 'key=%s, value=%s' % (key, row[key])

key=MLS_ACCT    PARCEL_ID   AREA    COUNTY  STREET_NUM  STREET_NAME CITY        ZIP STATUS  PROP_TYPE   LIST_PRICE  LIST_DATE   DOM DATE_MODIFIED   BATHS_HALF  BATHS_FULL  BEDROOMS    ACREAGE YEAR_BUILT  YEAR_BUILT_DESC OWNER_NAME  SOLD_DATE   WITHDRAWN_DATE  STATUS_DATE SUBDIVISION PENDING_DATE    SOLD_PRICE,  
value=492859    28-15-3-009-001.0000    200 JEFF    3828    ORLEANS RD  MOUNTAIN BROOK  35243   A   SFR 324900  3/3/2011    2   3/4/2011 12:04:11 AM    0   2   3   0   1968    EXIST   SPARKS          3/3/2011 11:54:56 PM    KNOLLWOOD

So what I'm looking for is a column for MLS_ACCT and a separate one for PARCEL_ID etc so I can then do things like average prices by all items that contain KNOLLWOOD in the SUBDIVISION field With a further sub section by date range, date sold etc.

I know well how to do it with SQL but As I said I'm tying to gain some Python skills here. I have been reading for the last few days but have yet to find any very simple illustrations on this sort of use case. Pointers to said docs would be appreciated. I realize I could use memory resident SQL-lite but again my desire is to get the Python approach learned.I've read some on Numpy and Scipy and have sage loaded but still can't find some useful illustrations since those tools seem focussed on arrays with only numbers as elements and I have a lot of string matching I need to do as well as date range calculations and comparisons.

Eventually I'll need to substitute values in the table (since I have dirty data), I do this now by having a "translate table" which contains all dirty variants and provides a "clean" answer for final use.

回答1:

Are you sure that this is a file with comma-separated values? It seems like the lines are being delimited by tabs.

If this is correct, specify a tab delimiter in the DictReader constructor.

for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    for key in row:
        print 'key=%s, value=%s' % (key, row[key])

Source: http://docs.python.org/library/csv.html



回答2:

Writing the operation in pure Python is certainly possible, but you'll have to choose your algorithms then. The row output you've posted above looks a whole lot like the parsing has gone wrong; in fact, it seems not to be a CSV at all, is it a TSV? Try passing delimiter='\t' or dialect=csv.excel_tab to DictReader.

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching requires data that is certainly not present in a CSV, like how dates are represented and which columns are dates.

An example of getting a column-oriented data structure (however, involving loading the whole file):

import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])

The intermediate steps of going to list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.

Then extracting column two from all rows with a certain criterion in column one:

matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2]
print map(columns['two'].__getitem__, matchingrows)

When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.



回答3:

At first glance it seems like your input might not actually be CSV, but maybe is tab delimited instead. Check out the docs at python.org, you can create a Dialect and use that to change the delimiter.

import csv
csv.register_dialect('exceltab', delimiter='\t')
for row in csv.DictReader(items_file,dialect='exceltab'):
    pass