I have a large data set from the World Bank Millennium Development Goals as a CSV. The data is displayed like this:
Country Code Country Name Indicator
ABW Aruba % Forest coverage
ADO Andorra % Forest coverage
AFG Afghanistan % Forest coverage
...
ABW Aruba % Literacy rate
ADO Andorra % Literacy rate
AFG Afghanistan % Literacy rate
...
ABW Aruba % Another indicator
ADO Andorra % Another indicator
AFG Afghanistan % Another indicator
The file is currently 8.2 MB. I'm going to program a web interface for this data, and I'd like to slice the data by country so I can make an AJAX request to load an individual CSV for each country.
I'm lost on how to do this programmatically or with any tool. I don't necessarily need Python, but it's what I know best. I don't necessarily need a complete script; a general pointer on how to approach this problem is appreciated.
The original data source I'm working with is located here:
In shell scripting.
First,
awk -F',' '{print $1}' yourfilename.csv | sort | uniq > code.lst
will give you a list of country codes. Then you can iterate through the country codes and select all the lines of yourfilename.csv that match each code using grep. One-liner:
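The one-liner referred to here does not appear in the text; based on the explanation that follows, it presumably looked something like this (the comma field separator and the filename yourfilename.csv are assumptions, and a tiny sample file stands in for the real data):

```shell
# Tiny stand-in for the real World Bank file (path and layout assumed)
printf '%s\n' 'Country Code,Country Name,Indicator' \
    'ABW,Aruba,% Forest coverage' \
    'ADO,Andorra,% Forest coverage' > yourfilename.csv

# The command appends, so remove stale outputs first
rm -f data_ABW data_ADO

# NR>1 skips the header line; each remaining line ($0) is appended to a
# file named after its country code ($1); close() keeps the number of
# simultaneously open files small
awk -F',' 'NR>1 {print >> ("data_"$1); close("data_"$1)}' yourfilename.csv
```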
This creates new files named data_ABW, etc., containing the appropriate information. The NR>1 part skips the header line. Then, for each line, it appends that entire line ($0) to the file named "data_"$1, where $1 is replaced with the text in the first column of that line. Finally, the close statement makes sure there aren't too many open files. If you didn't have so many countries, you could get rid of it and significantly increase the speed of the command.
In answer to @Lenwood's comment below, to include the header in each output file, you can do this:
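The command itself is missing here; reconstructed from the description that follows, it would be something along these lines (separator and filename are assumptions; a small sample file stands in for the real data):

```shell
# Tiny stand-in for the real file (path and layout assumed)
printf '%s\n' 'Country Code,Country Name,Indicator' \
    'ABW,Aruba,% Forest coverage' \
    'ABW,Aruba,% Literacy rate' > yourfilename.csv

# The command appends, so remove stale outputs first
rm -f data_ABW

# NR==1 stores the header line; the files array records whether the header
# has already been written to a given output file, writing it once before
# any data lines are appended
awk -F',' 'NR==1 {header=$0}; NR>1 {if (!files[$1]) {print header >> ("data_"$1); files[$1]=1}; print >> ("data_"$1); close("data_"$1)}' yourfilename.csv
```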
(You may have to escape the exclamation point...) The first new part, NR==1 {header=$0};, just stores the first line of the input file as the variable header. The other new part, if (!files[$1]) ... files[$1]=1};, uses the associative array files to keep track of whether or not it has already put the header into a given file, and if not, puts it there.
Note that this appends to the files, so if those files already exist, they'll just get added to. Therefore, if you get new data in your main file, you'll probably want to delete the per-country files before you run this command again.
(In case it's not obvious, if you want the files to be named like data_Aruba, you can change $1 to $2.)
You can use Python's csv module and itertools.groupby.
The following example was tested on Python 2.7.1
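That example appears to have been lost; a comparable sketch of the csv + itertools.groupby approach (written here for Python 3, with the filename and a small stand-in input as assumptions):

```python
import csv
from itertools import groupby
from operator import itemgetter

# Tiny stand-in for the real file; the path and layout are assumptions
with open("yourfilename.csv", "w", newline="") as f:
    f.write("Country Code,Country Name,Indicator\n"
            "ABW,Aruba,% Forest coverage\n"
            "ADO,Andorra,% Forest coverage\n"
            "ABW,Aruba,% Literacy rate\n")

with open("yourfilename.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    # groupby only merges adjacent rows, so sort by country code first
    rows = sorted(reader, key=itemgetter(0))
    for code, group in groupby(rows, key=itemgetter(0)):
        with open("data_%s.csv" % code, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)   # header repeated in every output file
            writer.writerows(group)
```

This writes data_ABW.csv, data_ADO.csv, etc., each containing the header plus that country's rows.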
Edit: updated answer to account for new information added to question.
Other notes:
It is very simple with the pandas Python data analysis library: