I have downloaded this csv file, which creates a spreadsheet of gene information. What is important is that in the HLA-*
columns, there is gene information. If the gene is too low of a resolution e.g. DQB1*03
then the row should be deleted. If the data is too high resoltuion e.g. DQB1*03:02:01
, then the :01
tag at the end needs to be removed. So, ideally I want to proteins to be in the format DQB1*03:02
, so that it has two levels of resolution after DQB1*
. How can I tell python to look for these formats, and ignore the data stored in them.
e.g.
if (csvCell is of format DQB1*03:02:01):
delete the :01 # but do this in a general format
elif (csvCell is of format DQB1*03):
delete row
else:
goto next line
UPDATE: Edited code I referenced
import csv
import re
import sys
csvdictreader = csv.DictReader(open('mhc.csv','r+b'), delimiter=',')
csvdictwriter = csv.DictWriter(file('mhc_fixed.csv','r+b'), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
for rowfields in csvdictreader:
keep = True
for field in targets:
value = rowfields[field]
if re.match(r'^\w+\*\d\d$', value):
keep = False
break # quit processing target fields
elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
else: # reduce gene resolution if too high
# by only keeping first two alles if three are present
rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
if keep:
csvdictwriter.writerow(rowfields)
Here's something that I think will do what you want. It's not as simple as Peter's answer because it uses Python's
csv
module to process the file. It could probably be rewritten and simplified to just treat the file as a plain text as his does, but that should be easy.The hardest part for me was determining what you wanted to do.
Here's an ultra-simple filter:
Or, if you want to use regular expressions
Run it as