Loop for Parsing complex tab delimited/csv files i

Posted 2019-04-11 01:33

Just to be clear, I'm very new to programming and I'm using Python 3.3! Right now I have a lot of files in the same basic layout. Each file has 9 tab-delimited columns and a variable number of header lines (most have five, though). There are NO headings for the rows or columns!

Looks something like this:

#header1
#header2
#header3
#header4
#header5
ID1    asdf    asdk    asdfk    asdfkl    adsfkln    askdlfn   safsda    asdf    Notes1..
ID2    asdf    asdk    asdfk    asdfkl    adsfkln    askdlfn   safsda    asdf    Notes2..
ID3    asdf    asdk    asdfk    asdfkl    adsfkln    askdlfn   safsda    asdf    Notes3..
ID4    asdf    asdk    asdfk    asdfkl    adsfkln    askdlfn   safsda    asdf    Notes4..

The only information that I want is the first column, which contains the IDs, and the last column which contains notes about each ID. I'm shooting for a dictionary something like this

{'ID1': [notes1...],
 'ID2': [notes2...], ...
 'ID1234': [notes1234...]}

But I would be happy with a list of dictionaries as well or something like that.

So I started by turning the text into a list of lists so that I can look up entries by index:

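import csv

# Python 3's csv module needs a text-mode file; newline='' is what the csv docs recommend
list_all = list(csv.reader(open(r'complex_tabbed_file.gff', newline=''), delimiter='\t'))

d = dict()
ID = list_all[5][0]     # starting at 5 to skip the header lines
notes = list_all[5][8]
d[ID] = notes

print(d)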

This gives me the info I am looking for, but it only reads one entry at a time. I need to create a loop that will read through the entire file, which contains hundreds of entries. Any suggestions on a starting point?

I researched and found this: Read specific columns from a csv file with csv module?

which describes a similar situation but the coding is a little over my head. As I'm a NEWBIE, I'm having a hard time applying this example to my particular case =(

Here's what I have tried as far as iteration:

i=0

if i < 4:
    i= i+1

if i >= 5:
    ID = list_all[i][0]
    notes = list_all[i][8] 
    i= i+1

print (d)

This returns an empty dictionary ( d={ } ) No good.

Also tried

d = dict()  
i=5
for line in list_all: 
    ID = list_all[i][0]
    notes = list_all[i][8] 
    i = i+1

print (d)

which gives the oh so lovely "list index out of range" error message. I would really appreciate any suggestions, thanks!

3 Answers
闹够了就滚
Reply #2 · 2019-04-11 01:58

You can solve it by iterating over each row and discarding those that have only one field (the headers):

import csv
import sys

d = dict()

with open(sys.argv[1], newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    for row in csvreader:
        # Header lines parse as a single field; blank lines parse as none.
        if len(row) <= 1:
            continue
        d[row[0]] = [row[-1]]

print(d)

Run it like:

python3 script.py infile

That yields:

{
    'ID4': ['Notes4..'], 
    'ID1': ['Notes1..'], 
    'ID2': ['Notes2..'], 
    'ID3': ['Notes3..']
}
疯言疯语
Reply #3 · 2019-04-11 02:04

Reading your code makes me wonder whether you read the docs. The first, tiny example there loops over all the rows: http://docs.python.org/2/library/csv.html
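Applied to the question's `list_all`, looping directly over the rows might look like the sketch below. The first attempt in the question never loops and never stores anything in `d`; the second starts `i` at 5 but still iterates once per row, so `i` runs past the end of the list. The sketch assumes exactly five header lines and nine columns, and uses an in-memory sample (made up here) in place of the real file:

```python
import csv
import io

# Stand-in for the real file; with the question's data you would use
# open('complex_tabbed_file.gff', newline='') instead of this sample.
sample = io.StringIO(
    "#header1\n#header2\n#header3\n#header4\n#header5\n"
    "ID1\ta\tb\tc\td\te\tf\tg\tNotes1..\n"
    "ID2\ta\tb\tc\td\te\tf\tg\tNotes2..\n"
)
list_all = list(csv.reader(sample, delimiter='\t'))

d = {}
for row in list_all[5:]:      # slice off the five header lines (assumed count)
    d[row[0]] = [row[8]]      # first column is the ID, ninth column the notes

print(d)  # {'ID1': ['Notes1..'], 'ID2': ['Notes2..']}
```

Iterating over the rows themselves removes the need for a separate counter, which is where both attempts went wrong.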

Anyway, having looked into it: the csv module has no built-in way to filter out comment lines, but you can use Python's own filter:

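import csv

d = dict()
# file() does not exist in Python 3; open() in a with-block also closes the file
with open('data.csv', newline='') as f:
    # Drop comment lines (those starting with '#') before csv parses them
    data = csv.reader(filter(lambda line: not line.startswith('#'), f), delimiter='\t')
    for row in data:
        d.update({row[0]: row[1:]})  # maps each ID to all of the remaining columns
print(d)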

You could possibly look into using DictReader instead of reader too...
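A minimal sketch of the DictReader variant follows. Because the files have no column headings, the names in `FIELDNAMES` are invented for illustration, as is the helper `read_notes`; the sketch also assumes every data line has exactly nine tab-separated fields and every header line starts with `#`:

```python
import csv
import io

# Hypothetical column names: the files have no heading row, so these are
# made up. Only 'id' and 'notes' are actually used below.
FIELDNAMES = ['id', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'notes']

def read_notes(lines):
    """Map each ID (first column) to a one-element list with its notes (ninth column)."""
    # Skip blank lines and '#' header lines before csv sees them.
    data_lines = (line for line in lines
                  if line.strip() and not line.startswith('#'))
    reader = csv.DictReader(data_lines, fieldnames=FIELDNAMES, delimiter='\t')
    return {row['id']: [row['notes']] for row in reader}

sample = io.StringIO(
    "#header1\n"
    "#header2\n"
    "ID1\tasdf\tasdk\tasdfk\tasdfkl\tadsfkln\taskdlfn\tsafsda\tNotes1..\n"
    "ID2\tasdf\tasdk\tasdfk\tasdfkl\tadsfkln\taskdlfn\tsafsda\tNotes2..\n"
)
print(read_notes(sample))  # {'ID1': ['Notes1..'], 'ID2': ['Notes2..']}
```

Naming the columns makes the lookup read as `row['notes']` rather than `row[8]`, at the cost of having to spell out all nine names.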

ゆ 、 Hurt°
Reply #4 · 2019-04-11 02:06

Sometimes it is easier to skip the csv module entirely:

from pprint import pprint
d = dict()
with open('complex_tabbed_file.gff') as input_file:
  for line in input_file:
    line = line.split('\t')
    if len(line) > 1:
      d[line[0]] = [line[-1].strip()]

pprint(d)
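Note that the assignment above keeps only the last note if the same ID shows up on more than one line. The question does not say whether that can happen; if it can, a sketch along these lines would collect every note per ID instead (`collect_notes` and the sample lines are made up for illustration):

```python
from pprint import pprint

def collect_notes(lines):
    """Map each ID (first field) to a list of every note (last field) seen for it."""
    notes_by_id = {}
    for line in lines:
        fields = line.split('\t')
        if len(fields) > 1:  # header and blank lines contain no tab
            # setdefault creates the list on first sight of an ID, then we append
            notes_by_id.setdefault(fields[0], []).append(fields[-1].strip())
    return notes_by_id

# Made-up sample; with a real file, pass the open file object instead.
sample = [
    "#header1\n",
    "ID1\tx\tNotes1..\n",
    "ID1\ty\tMore notes\n",
    "ID2\tz\tNotes2..\n",
]
pprint(collect_notes(sample))
# {'ID1': ['Notes1..', 'More notes'], 'ID2': ['Notes2..']}
```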