Filtering a CSV file in python

I have downloaded this csv file, which creates a spreadsheet of gene information. What is important is that in the HLA-* columns, there is gene information. If the gene is too low of a resolution e.g. DQB1*03 then the row should be deleted. If the data is too high resoltuion e.g. DQB1*03:02:01, then the :01 tag at the end needs to be removed. So, ideally I want to proteins to be in the format DQB1*03:02, so that it has two levels of resolution after DQB1*. How can I tell python to look for these formats, and ignore the data stored in them. e.g.

if (csvCell is of format DQB1*03:02:01):
   delete the :01 # but do this in a general format
elif (csvCell is of format DQB1*03):
   delete row
else:
   goto next line

UPDATE: Edited code I referenced

import csv
import re
import sys

csvdictreader = csv.DictReader(open('mhc.csv','r+b'), delimiter=',')
csvdictwriter = csv.DictWriter(file('mhc_fixed.csv','r+b'), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]

for rowfields in csvdictreader:
  keep = True
  for field in targets:
    value = rowfields[field]
    if re.match(r'^\w+\*\d\d$', value):
      keep = False
      break # quit processing target fields
    elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
      rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
    else: # reduce gene resolution if too high
              # by only keeping first two alles if three are present
      rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
  if keep:
     csvdictwriter.writerow(rowfields)

标签： python csv formatting bioinformatics

2条回答

女痞

2楼-- · 2020-03-31 03:26

Here's something that I think will do what you want. It's not as simple as Peter's answer because it uses Python's csv module to process the file. It could probably be rewritten and simplified to just treat the file as a plain text as his does, but that should be easy.

import csv
import re
import sys

csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]

for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^DQB1\*\d\d$', value): # gene resolution too low?
            keep = False
            break # quit processing target fields
        else: # reduce gene resolution if too high
              # by only keeping first two alles if three are present
            rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                      r'DQB1*\1:\2', value)
    if keep:
        csvdictwriter.writerow(rowfields)

The hardest part for me was determining what you wanted to do.

0人赞添加讨论(0) 举报

迷人小祖宗

3楼-- · 2020-03-31 03:38

Here's an ultra-simple filter:

import sys

for line in sys.stdin:
  line = line.replace( ',DQB1*03:02:01,', ',DQB1*03:02,' )

  if line.find( ',DQB1*03,' ) == -1:
    sys.stdout.write( line )

Or, if you want to use regular expressions

import re
import sys

for line in sys.stdin:
  line = re.sub( ',DQB1\\*03:02:01,', ',DQB1*03:02,', line )
  if re.search( ',DQB1\\*03,', line ) == None:
    sys.stdout.write( line )

Run it as

python script.py < data.csv

0人赞添加讨论(0) 举报

Filtering a CSV file in python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间