Efficient means of mass converting >600,000 different replacements

Published 2019-06-13 08:10

I am trying to convert a map file for some SNP data I want to use from Affy ids to dbSNP rs ids (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/, specifically ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz).

I am trying to find an efficient way to do this. I have the annotation file for the Axiom Human Origins array from which the data comes, so I know the proper ids.

I was wondering if anyone could suggest a good bash/Python/Perl-based method to do this. It amounts to >600,000 different replacements. The idea I had in mind was the

sed -i 's/Affy#/rs#/g' filename

method, but I figure this would not be the most efficient. Any suggestions? Thanks!

Tags: replace
1 Answer
Juvenile、少年°
#2 · 2019-06-13 08:55

Python code, assuming your substitutions are stored as tab-separated Affy-id/rs-id pairs in subs.csv:

import csv

# Load the Affy-id -> rs-id substitution table (tab-separated pairs).
with open('subs.csv', newline='') as f:
    subs = dict(csv.reader(f, delimiter='\t'))

with open('all_snp.map', newline='') as src, \
        open('all_snp_out.map', 'w', newline='') as dst:
    source = csv.reader(src, delimiter='\t')
    dest = csv.writer(dst, delimiter='\t')
    for row in source:
        # Swap in the rs id if we have one; keep the Affy id otherwise.
        row[1] = subs.get(row[1], row[1])
        dest.writerow(row)

The key line is row[1] = subs.get(row[1], row[1]): row[1] is the Affx id column, and the dictionary lookup replaces it with its rsNumber equivalent if there is one, or falls back to the original Affx id if there isn't.
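If you still need to build subs.csv from the array's annotation file, something along these lines should work. This is only a sketch: the annotation file name and the column names ('Probe Set ID', 'dbSNP RS ID') are assumptions, so check them against the header of your actual Axiom Human Origins annotation CSV (the '#'-prefixed lines at the top are metadata, not data):

import csv

# Assumed file and column names -- verify against your annotation file.
with open('Axiom_HumanOrigins.annot.csv', newline='') as src, \
        open('subs.csv', 'w', newline='') as out:
    # Skip the '#'-prefixed metadata lines above the header row.
    reader = csv.DictReader(line for line in src if not line.startswith('#'))
    writer = csv.writer(out, delimiter='\t')
    for row in reader:
        # Keep only probes that actually have an rs id assigned.
        if row['dbSNP RS ID'].startswith('rs'):
            writer.writerow([row['Probe Set ID'], row['dbSNP RS ID']])

Either way, the dict lookup in the main loop is effectively constant-time per row, which is why this scales to >600,000 substitutions where 600,000 separate sed passes over the file would not.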
