How do I slice a single CSV file into several smaller files?

Posted 2020-03-24 05:02

I have a large data set from the World Bank Millennium Development Goals as a CSV. The data is laid out like this:

Country Code   Country Name   Indicator
ABW            Aruba          % Forest coverage
ADO            Andorra        % Forest coverage
AFG            Afghanistan    % Forest coverage
...
ABW            Aruba          % Literacy rate
ADO            Andorra        % Literacy rate
AFG            Afghanistan    % Literacy rate
...
ABW            Aruba          % Another indicator
ADO            Andorra        % Another indicator
AFG            Afghanistan    % Another indicator

The file is currently 8.2 MB. I'm going to build a web interface for this data, and I'd like to slice the data by country so that an Ajax request can load an individual CSV for each country.

I'm lost on how to do this programmatically or with any tool. I don't necessarily need Python, but it's what I know best. I don't necessarily need a complete script; a general pointer on how to approach this problem is appreciated.

The original data source I'm working with is located here:

http://duopixel.com/stack/data.csv

Tags: python csv

4 Answers
家丑人穷心不美
#2 · 2020-03-24 05:15

In shell scripting:

First, awk -F',' 'NR>1 {print $1}' yourfilename.csv | sort -u > code.lst will give you a list of country codes. Then you can iterate through the codes and use grep to select all the lines of yourfilename.csv that match each one.

for c in $(cat code.lst); do
   grep "^$c," yourfilename.csv > "yourfilename_$c.csv"
done
姐就是有狂的资本
#3 · 2020-03-24 05:20

One-liner:

awk -F "," 'NF>1 && NR>1 {print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

This creates new files named data_ABW.csv, etc., containing the appropriate rows. The NR>1 part skips the header line. Then, for each line, it appends that entire line ($0) to the file named data_$1.csv, where $1 is replaced with the text in the first column of that line. Finally, the close statement makes sure there aren't too many files open at once. If you didn't have so many countries, you could drop the close and significantly speed up the command.

In answer to @Lenwood's comment below, to include the header in each output file, you can do this:

awk -F "," 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$1]) {print header >> ("data_" $1 ".csv"); files[$1]=1}; print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

(You may have to escape the exclamation point...) The first new part, NR==1 {header=$0};, just stores the first line of the input file in the variable header. The other new part, if(! files[$1]) ... files[$1]=1};, uses the associative array files to keep track of whether or not the header has already been written to a given file, and if not, it puts it in there.

Note that this appends to the files, so if they already exist, new rows will simply be added to them. Therefore, if you get new data in your main file, you'll probably want to delete the output files before you run this command again.

(In case it's not obvious, if you want the files to be named like data_Aruba.csv you can change $1 to $2.)
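For comparison, the same one-pass, header-once logic can be sketched in Python. The data below is a made-up sample in the question's layout; a real script would read your CSV with the csv module and write each group to its own data_<code>.csv file.

```python
from collections import defaultdict

# Made-up sample in the question's layout (header row + data rows)
sample = [
    ["Country Code", "Country Name", "Indicator"],
    ["ABW", "Aruba", "% Forest coverage"],
    ["ADO", "Andorra", "% Forest coverage"],
    ["ABW", "Aruba", "% Literacy rate"],
]

header, body = sample[0], sample[1:]
outputs = defaultdict(list)          # country code -> list of rows
for row in body:
    code = row[0]
    if not outputs[code]:            # first time this code appears: emit the header
        outputs[code].append(header)
    outputs[code].append(row)        # append the whole row, like awk's print $0
# a real script would write each outputs[code] to "data_<code>.csv" via csv.writer
```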

爱情/是我丢掉的垃圾
#4 · 2020-03-24 05:23

You can use Python's csv module and itertools.groupby.
The following example was tested on Python 2.7.1
Edit: updated answer to account for new information added to question.

import csv, itertools as it, operator as op

csv_contents = []
with open('yourfile.csv', 'rb') as fin:
  dict_reader = csv.DictReader(fin)   # default delimiter is comma
  fieldnames = dict_reader.fieldnames # save for writing
  for line in dict_reader:            # read in all of your data
    csv_contents.append(line)         # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value 
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('Country Name'))

for groupkey, groupdata in it.groupby(sorted_csv_contents, 
                                      key=op.itemgetter('Country Name')):
  with open('slice_{:s}.csv'.format(groupkey), 'wb') as fou:
    dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
    dict_writer.writeheader()         # new method in 2.7; use writerow() in 2.6-
    dict_writer.writerows(groupdata)

Other notes:

  • You could use a regular csv reader and writer, but the DictReader and DictWriter are nice because you can reference columns by name.
  • Always use the 'b' flag when reading or writing .csv files because on Windows that makes a difference in how line-endings are handled.
  • If anything isn't clear let me know and I'll explain further!
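For anyone on Python 3: the 'b' flag no longer applies; open CSV files in text mode with newline='' instead. Here is a minimal sketch of the same approach — the sample data is made up, and the per-country output is kept in an in-memory StringIO here rather than written to slice_*.csv files.

```python
import csv
import io
import itertools as it
import operator as op

# Made-up sample in the question's layout
sample = (
    "Country Code,Country Name,Indicator\n"
    "ABW,Aruba,% Forest coverage\n"
    "ADO,Andorra,% Forest coverage\n"
    "ABW,Aruba,% Literacy rate\n"
)

reader = csv.DictReader(io.StringIO(sample))  # real code: open('yourfile.csv', newline='')
fieldnames = reader.fieldnames
# input to itertools.groupby must be sorted by the grouping value
rows = sorted(reader, key=op.itemgetter('Country Name'))

slices = {}
for country, group in it.groupby(rows, key=op.itemgetter('Country Name')):
    buf = io.StringIO()  # stands in for open('slice_%s.csv' % country, 'w', newline='')
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(group)
    slices[country] = buf.getvalue()
```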
啃猪蹄的小仙女
#5 · 2020-03-24 05:27

It is very simple with pandas, the Python data analysis library:

from pandas.io.parsers import read_csv

df = read_csv(input_file, header=1, sep='\t', index_col=[0,1,2])
for (country_code, country_name), group in df.groupby(level=[0,1]):
    group.to_csv(country_code+'.csv')

Result

$ for f in *.csv ; do echo $f; cat $f; echo; done

ABW.csv
Country Code,Country Name,Indicator
ABW,Aruba,% Forest coverage
ABW,Aruba,% Literacy rate
ABW,Aruba,% Another indicator

ADO.csv
Country Code,Country Name,Indicator
ADO,Andorra,% Forest coverage
ADO,Andorra,% Literacy rate
ADO,Andorra,% Another indicator

AFG.csv
Country Code,Country Name,Indicator
AFG,Afghanistan,% Forest coverage
AFG,Afghanistan,% Literacy rate
AFG,Afghanistan,% Another indicator
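Note for later readers: the pandas.io.parsers import path has since been deprecated; read_csv is available at the top level. A minimal sketch with current pandas, using a made-up comma-separated sample with the header on the first line (the answer above assumed a tab-separated file with header=1, matching the original download):

```python
import io
import pandas as pd

# Made-up sample in the question's layout
data = io.StringIO(
    "Country Code,Country Name,Indicator\n"
    "ABW,Aruba,% Forest coverage\n"
    "ABW,Aruba,% Literacy rate\n"
    "ADO,Andorra,% Forest coverage\n"
)

df = pd.read_csv(data)
# one CSV string per country code; a real script would call group.to_csv(code + '.csv', index=False)
csv_by_code = {code: group.to_csv(index=False)
               for code, group in df.groupby('Country Code')}
```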