I have data like below with SNP names (rs number or c#_pos#) included in gene names (e.g. ABCB9). In SNPs named as c#_pos000000, range of # is 1 to 22 (chromosome number)
ABCB9
rs11057374
rs7138100
c22_pos41422393
rs12309481
END
ABCC10
rs1214748
END
HDAC9
rs928578
rs10883039
END
HCN2
rs12428035
rs9561933
c2_pos102345
rs3848077
rs3099362
END
by using this data, I want to make the output like below
rs11057374 ABCB9
rs7138100 ABCB9
c22_pos41422393 ABCB9
rs12309481 ABCB9
rs1214748 ABCC10
rs928578 HDAC9
rs10883039 HDAC9
rs12428035 HCN2
rs9561933 HCN2
c2_pos102345 HCN2
rs3848077 HCN2
rs3099362 HCN2
It is not necessary whether there are blank and "END"
How make the this output in R or linux?
We can do this slightly differently. After reading the file with readLines
and removing the leading/lagging spaces (trimws
), split
the 'lines1' based on the grouping vector creating based on blank values (""
), remove the ""
or "END"
strings from the list
elements, then set the names
of the list
with the first observation of each list
element (sapply(lst1,
[, 1)
) while extracting all other elements except the first one and stack
it.
lines1 <- trimws(lines)
lst1 <- lapply(split(lines1, cumsum(lines1=="")),
function(x) x[!x %in% c("", "END")])
stack(setNames(lapply(lst1,`[`,-1), sapply(lst1, `[`,1)))
# values ind
#1 rs11057374 ABCB9
#2 rs7138100 ABCB9
#3 c22_pos41422393 ABCB9
#4 rs12309481 ABCB9
#5 rs1214748 ABCC10
#6 rs928578 HDAC9
#7 rs10883039 HDAC9
#8 rs12428035 HCN2
#9 rs9561933 HCN2
#10 c2_pos102345 HCN2
#11 rs3848077 HCN2
#12 rs3099362 HCN2
data
lines <- readLines("yourdata.txt")
Rather than working from processed files, use raw files to get SNP Gene mapping. As you mentioned this data is output of plink command below:
plink --file mydata --make-set gene.list --write-set
So we already have gene.list and mydata.map files. Using those 2 files we can do below:
library(data.table)
# gene list file
geneList <- data.table(
chr = 1:2,
start = c(10, 40),
end = c(13, 45),
gene = paste0("gene_",1:2))
# chr start end gene
# 1: 1 10 13 gene_1
# 2: 2 40 45 gene_2
# map file
map <- data.table(
chr = c(1,1,1,2,2,2,3),
snp = paste0("snp_",1:7),
cm = 0,
bp = c(10,11,15,40,41,49,100))
# prepare for merging, rename colnames to match gene list colnames
map <- map[, list(chr, start = bp, end = bp, snp)]
# chr start end snp
# 1: 1 10 10 snp_1
# 2: 1 11 11 snp_2
# 3: 1 15 15 snp_3
# 4: 2 40 40 snp_4
# 5: 2 41 41 snp_5
# 6: 2 49 49 snp_6
# 7: 3 100 100 snp_7
# set key for merging
setkey(map, chr, start, end)
# merge and susbset snp and gene columns
foverlaps(geneList, map)[, list(snp, gene)]
# snp gene
# 1: snp_1 gene_1
# 2: snp_2 gene_1
# 3: snp_4 gene_2
# 4: snp_5 gene_2
Also, see this post for more merge by overlap examples/functions.