I have data like the below, where SNP names (an rs number or c#_pos#) are listed under gene names (e.g. ABCB9). For SNPs named like c#_pos000000, the # after "c" ranges from 1 to 22 (the chromosome number).
ABCB9
rs11057374
rs7138100
c22_pos41422393
rs12309481
END
ABCC10
rs1214748
END
HDAC9
rs928578
rs10883039
END
HCN2
rs12428035
rs9561933
c2_pos102345
rs3848077
rs3099362
END
Using this data, I want to produce output like the below:
rs11057374 ABCB9
rs7138100 ABCB9
c22_pos41422393 ABCB9
rs12309481 ABCB9
rs1214748 ABCC10
rs928578 HDAC9
rs10883039 HDAC9
rs12428035 HCN2
rs9561933 HCN2
c2_pos102345 HCN2
rs3848077 HCN2
rs3099362 HCN2
The blank lines and "END" markers do not need to be kept in the output.
How can I produce this output in R or Linux?
Rather than working from processed files, use the raw files to get the SNP-to-gene mapping. As you mentioned, this data is the output of the plink command below:
So we already have the gene.list and mydata.map files. Using those two files we can do the below:
Also, see this post for more merge-by-overlap examples/functions.
We can do this slightly differently. After reading the file with readLines and removing the leading/trailing spaces (trimws), split 'lines1' by a grouping vector created from the blank values (""), remove the "" or "END" strings from the list elements, then set the names of the list to the first observation of each list element (sapply(lst1, `[`, 1)) while extracting all other elements except the first one, and stack it.
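For the Linux side of the question, here is a minimal awk sketch of the same idea: the first non-blank line of each block is taken as the gene name, every following line up to "END" is a SNP, and each SNP is printed next to its gene. The file names snps.txt and snp_gene.txt are assumptions for illustration only, and the sketch assumes every block is closed by an "END" line as in the sample.

```shell
# Hypothetical input file holding the sample data from the question.
cat > snps.txt <<'EOF'
ABCB9
rs11057374
rs7138100
c22_pos41422393
rs12309481
END
ABCC10
rs1214748
END
HDAC9
rs928578
rs10883039
END
HCN2
rs12428035
rs9561933
c2_pos102345
rs3848077
rs3099362
END
EOF

awk '
  { gsub(/^[ \t]+|[ \t]+$/, "") }   # trim leading/trailing whitespace
  $0 == ""    { next }              # skip blank lines
  $0 == "END" { gene = ""; next }   # end of block: forget the current gene
  gene == ""  { gene = $0; next }   # first line of a block is the gene name
  { print $0, gene }                # any other line is a SNP of that gene
' snps.txt > snp_gene.txt           # hypothetical output file name

cat snp_gene.txt
```

Because the gene is only reset on "END", blank lines between blocks are harmless but not required.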