I'm trying to parse through a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
I'm trying to capture only specific columns, say ID
, Name
, Zip
and Phone
.
Code I've looked at has led me to believe I can call the specific column by its corresponding number, so ie: Name
would correspond to 2
and iterating through each row using row[2]
would produce all the items in column 2. Only it doesn't.
Here's what I've done so far:
import sys, argparse, csv
from settings import *
# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file
# open csv file
with open(csv_file, 'rb') as csvfile:
# get number of columns
for line in csvfile.readlines():
array = line.split(',')
first_item = array[0]
num_columns = len(array)
csvfile.seek(0)
reader = csv.reader(csvfile, delimiter=' ')
included_cols = [1, 2, 6, 7]
for row in reader:
content = list(row[i] for i in included_cols)
print content
and I'm expecting that this will print out only the specific columns I want for each row except it doesn't, I get the last column only.
The only way you would be getting the last column from this code is if you don't include your print statement in your
for
loop.This is most likely the end of your code:
You want it to be this:
Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
so if you wanted to save all of the info in your column
Names
into a variable, this is all you need to do:It's a great module and I suggest you look into it. If for some reason your print statement was in
for
loop and it was still only printing out the last column, which shouldn't happen, but let me know if my assumption was wrong. Your posted code has a lot of indentation errors so it was hard to know what was supposed to be where. Hope this was helpful!With pandas you can use
read_csv
withusecols
parameter:Example:
To fetch column name, instead of using readlines() better use readline() to avoid loop & reading the complete file & storing it in the array.
You can use
numpy.loadtext(filename)
. For example if this is your database.csv
:And you want the
Name
column:More easily you can use
genfromtext
:Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:
A few things to consider:
The snippet above will produce a pandas
Series
and notdataframe
. The suggestion from ayhan withusecols
will also be faster if speed is an issue. Testing the two different approaches using%timeit
on a 2122 KB sized csv file yields22.8 ms
for the usecols approach and53 ms
for my suggested approach.And don't forget
import pandas as pd
With a file like
Will output
Or alternatively if you want numerical indexing for the columns:
To change the deliminator add
delimiter=" "
to the appropriate instantiation, i.ereader = csv.reader(f,delimiter=" ")