Sort CSV using a key computed from two columns, grab the top 10 rows

Posted 2020-04-01 04:55

Python amateur here... let's say I have a snippet of an example CSV file:

Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2005,23132412,1345234
Country4,,2312421,12412

I need to sort the file by descending GDP per capita (GDP/Population) in a certain year, say, 2002, then grab the first 10 rows with the largest GDP per capita values.

So far, after I read the CSV into a 'data' variable, I grab all the 2002 rows without missing fields using:

data_2 = []
for row in data:
    if row[1] == '2002' and row[2] != '' and row[3] != '':
        data_2.append(row)

I need to find some way to sort data_2 by row[2]/row[3] in descending order, preferably without using a class, and then grab each entire row tied to the 10 largest values so I can write them to another CSV. If someone could point me in the right direction I would be forever grateful, as I've tried countless Google searches...

3 Answers
Anthone
2020-04-01 05:11

This is an approach that will enable you to do one scan of the file and get the top 10 for each year...

It is possible to do this without pandas by using the heapq module. The following is untested, but it should give you a base to adapt for your purposes (refer to the relevant documentation):

import csv
import heapq
from itertools import islice

freqs = {}
with open('yourfile') as fin:
    csvin = csv.reader(fin)
    # Skip the header row, drop rows with a missing GDP or population,
    # and prepend the computed GDP per capita to each remaining row.
    rows_with_gdp = ([float(row[2]) / float(row[3])] + row
                     for row in islice(csvin, 1, None)
                     if row[2] and row[3])
    for row in rows_with_gdp:
        # After prepending, row[2] is the year; keep a 10-slot heap per year.
        # The empty-list placeholders compare smaller than any real row.
        heap = freqs.setdefault(row[2], [[]] * 10)
        heapq.heappushpop(heap, row)

for year, vals in freqs.items():
    # Drop unused placeholders and print the rows, largest per-capita first.
    print(year, [row[1:] for row in sorted(filter(None, vals), reverse=True)])
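If only one year matters (2002 in the question), a rough, untested follow-on could pull just that entry out of freqs:

# '2002' is the year from the question; row[1:] drops the prepended per-capita value.
top_2002 = [row[1:] for row in sorted(filter(None, freqs.get('2002', [])), reverse=True)]
print(top_2002)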
叛逆
2020-04-01 05:11

The relevant modules would be:

  • csv for parsing the input
  • collections.namedtuple to name the fields
  • the filter() function to extract the specified year
  • heapq.nlargest() to find the largest values
  • pprint.pprint() for nice output

Here's a little bit to get you started (I would do it all, but where's the fun in having someone write your whole program and depriving you of the joy of finishing it?):

from __future__ import division
import csv, collections, heapq, pprint

filecontents = '''\
Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2004,23132412,1345234
Country4,2004,2312421,12412
'''

CountryStats = collections.namedtuple('CountryStats', ['country', 'year', 'gdp', 'population'])
dialect = csv.Sniffer().sniff(filecontents)

data = []
for country, year, gdp, pop in csv.reader(filecontents.splitlines()[1:], dialect):
    row = CountryStats(country, int(year), int(gdp), int(pop))
    if row.year == 2004:
        data.append(row)

data.sort(key=lambda s: s.gdp / s.population)
pprint.pprint(data)
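One untested way the finishing step could look, using heapq.nlargest from the list above with the same kind of key function:

# Ten rows with the largest GDP per capita, largest first (rough sketch).
top10 = heapq.nlargest(10, data, key=lambda s: s.gdp / s.population)
pprint.pprint(top10)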
放我归山
2020-04-01 05:14

Use the optional key argument to the sort function:

array.sort(key=lambda x: x[2])

will sort array using the third field of each element as the key. The value of the key argument should be a function (a lambda expression works well) that takes a single argument (an element of the list being sorted) and returns the value to sort by.

For your GDP example, the lambda function to use would be:

lambda x: float(x[2])/float(x[3]) # x[2] is GDP, x[3] is population

The float function converts the CSV fields from strings into floating-point numbers. Since there is no guarantee that this will succeed (improper formatting, bad data, etc.), I'd typically do the conversion before sorting, when inserting rows into the array. You should also use floating-point division explicitly: in Python 2, dividing two integers floors the result, which won't give you the values you expect. If you find yourself doing this often, changing the behavior of the division operator is an option (http://www.python.org/dev/peps/pep-0238/ and related links).
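To tie it to the question: pass reverse=True to sort in descending order, then slice off the first 10 rows. A minimal, untested sketch putting the pieces together (the file names are placeholders):

import csv

# 'input.csv' / 'top10.csv' are placeholder names for this sketch.
with open('input.csv', newline='') as fin:
    rows = list(csv.reader(fin))

header, body = rows[0], rows[1:]

# Keep only the 2002 rows that have both GDP and population filled in.
rows_2002 = [r for r in body if r[1] == '2002' and r[2] and r[3]]

# Sort by GDP per capita, largest first, and keep the first 10 rows.
rows_2002.sort(key=lambda x: float(x[2]) / float(x[3]), reverse=True)
top10 = rows_2002[:10]

with open('top10.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(header)
    writer.writerows(top10)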
