How to make a variable by extracting specific line

2019-07-14 01:35发布

问题:

I have data like below with SNP names (rs number or c#_pos#) included in gene names (e.g. ABCB9). In SNPs named as c#_pos000000, range of # is 1 to 22 (chromosome number)

ABCB9  
rs11057374  
rs7138100  
c22_pos41422393  
rs12309481  
END  

ABCC10  
rs1214748  
END  

HDAC9  
rs928578  
rs10883039  
END  

HCN2
rs12428035  
rs9561933  
c2_pos102345
rs3848077  
rs3099362    
END 

by using this data, I want to make the output like below

rs11057374        ABCB9  
rs7138100         ABCB9  
c22_pos41422393   ABCB9  
rs12309481        ABCB9  

rs1214748         ABCC10   

rs928578          HDAC9    
rs10883039        HDAC9    

rs12428035        HCN2     
rs9561933         HCN2      
c2_pos102345      HCN2      
rs3848077         HCN2      
rs3099362         HCN2  

It is not necessary whether there are blank and "END"

How make the this output in R or linux?

回答1:

We can do this slightly differently. After reading the file with readLines and removing the leading/lagging spaces (trimws), split the 'lines1' based on the grouping vector creating based on blank values (""), remove the "" or "END" strings from the list elements, then set the names of the list with the first observation of each list element (sapply(lst1,[, 1)) while extracting all other elements except the first one and stack it.

lines1 <- trimws(lines)
lst1 <- lapply(split(lines1, cumsum(lines1=="")), 
                function(x) x[!x %in% c("", "END")])

stack(setNames(lapply(lst1,`[`,-1), sapply(lst1, `[`,1)))
#            values    ind
#1       rs11057374  ABCB9
#2        rs7138100  ABCB9
#3  c22_pos41422393  ABCB9
#4       rs12309481  ABCB9
#5        rs1214748 ABCC10
#6         rs928578  HDAC9
#7       rs10883039  HDAC9
#8       rs12428035   HCN2
#9        rs9561933   HCN2
#10    c2_pos102345   HCN2
#11       rs3848077   HCN2
#12       rs3099362   HCN2

data

lines <- readLines("yourdata.txt") 


回答2:

Rather than working from processed files, use raw files to get SNP Gene mapping. As you mentioned this data is output of plink command below:

plink --file mydata --make-set gene.list --write-set

So we already have gene.list and mydata.map files. Using those 2 files we can do below:

library(data.table)
# gene list file
geneList <- data.table(
  chr = 1:2,
  start = c(10, 40),
  end = c(13, 45),
  gene = paste0("gene_",1:2))
#    chr start end   gene
# 1:   1    10  13 gene_1
# 2:   2    40  45 gene_2

# map file
map <- data.table(
  chr = c(1,1,1,2,2,2,3),
  snp = paste0("snp_",1:7),
  cm = 0,
  bp = c(10,11,15,40,41,49,100))

# prepare for merging, rename colnames to match gene list colnames
map <- map[, list(chr, start = bp, end = bp, snp)]
#    chr start end   snp
# 1:   1    10  10 snp_1
# 2:   1    11  11 snp_2
# 3:   1    15  15 snp_3
# 4:   2    40  40 snp_4
# 5:   2    41  41 snp_5
# 6:   2    49  49 snp_6
# 7:   3   100 100 snp_7

# set key for merging
setkey(map, chr, start, end)

# merge and susbset snp and gene columns
foverlaps(geneList, map)[, list(snp, gene)]
#      snp   gene
# 1: snp_1 gene_1
# 2: snp_2 gene_1
# 3: snp_4 gene_2
# 4: snp_5 gene_2

Also, see this post for more merge by overlap examples/functions.