自合并功能中的R(Custom Merge Function in R)

2019-10-16 19:36发布

我有一个大的数据集,我想编写自定义合并功能与适用于使用,但我解决不了一定的问题。 我不能用一个循环,因为它会花费太长的时间。 数据大致是这样的;

#     [ Name, Strand, Start, End ]

R1 = c( 'GeneA', '+', 1000, 1500 )
R2 = c( 'GeneA', '+', 1510, 2000 )
R3 = c( 'GeneA', '+', 2001, 2500 )
R4 = c( 'GeneB', '-', 3100, 4000 )

中的数据是与行R1一个data.frame:R4

到目前为止,我能得到哪些比较Ri和Rj的功能(J = +1),并合并他们,如果名称是一样的,钢绞线是一样的,他们之间的差距是小于100。

GAP = Ri[End] - Rj[Start]

如果我申请的功能,每一行,建立输出。 该函数然后输出应该创建

R1 = c( 'GeneA', '+', 1000, 2500 )
R4 = c( 'GeneB', '-', 3100, 4000 )

我能得到哪些可以使用适用于合并两个连续的元素融入其中工作的功能,但我无法弄清楚如何合并连续的三个要素。 一个丑陋的解决方案是,直到没有额外的合并是由运行连续两个元素合并功能,但这不是有效的。 我坚持的想法有点和任何见解将不胜感激。

编辑:澄清,该数据是通过在连续的染色体起始位置排序(未示出),并且每个基因名称在数据中设置在不同的位置出现多次(即基因A可以是1000-2500,然后再在4000-5000。我不希望合并这两个基因,只有连续的)。

编辑2:我已经使用以下添P溶液。 一个问题出现在合并的效率。 也就是有没有办法上传文本文件堆栈溢出,所以我可以显示什么数据真的看起来像,然后张贴我的脚本这么远?

 # Let ValidMerge be a logical vector of MergeDown corresponding to the rows in the data
 # as defined by TimP below
 # The data.frame is called RMDB 
 # TeMerge function is previously defined to merge two rows Ri and Rj into one entry
 # which spans the start of Ri to the end of Rj (with same name and strand)

 COUNT = 0
 RMDB.OUT = 0

 while (COUNT < nrow(RMDB)) { # Cycle through all rows of RMDB
  COUNT = COUNT + 1

  # Is this position a merger start?
    # If yes, then returns position in startpt
    # which will be the end position in endpt
    # If no, returns 0

  Merge = match(COUNT,startpt,nomatch=0) 

  if ( Merge == 0 ){
    # No merge starts at this position
    RMDB.OUT = rbind(RMDB.OUT,RMDB[COUNT,]) # Append COUNT Row to output

  }else{
    # Merge COUNT row in RMDB to its endpoint 

      RMDB.OUT = rbind(RMDB.OUT,
                     TeMerge(RMDB[COUNT,],RMDB[endpt[Merge],]))

    #print(paste('Merging Row',COUNT,' and Row',endpt[Merge]))

    COUNT = endpt[Merge] # Move Count to the endpoint      
  }
}

这个脚本正好给我感兴趣的结果,唯一的问题是,我的data.frame有5 000 000项,我想用几个参数,间隙大小,比较的结果运行分析。 有没有我可以重写代码的这一部分更有效的方法吗? 一切直到这一点在合理的时间(〜几分钟)中运行。 这部分已经进行了3小时以上的数据的700 000子集。

编辑3:其具有所有的情况下在顶部(MIR3)的加标数据。 忽略列1:4,8,11:15。

dput(RMDB)结构(列表(V1 = C(3612L,318L,318L,318L,318L,318L,318L,318L,318L,318L,318L,318L,318L,318L,741L,444L,741L,407L,2059L, 407L,656L,2058L,257L,4051L,456L,351L,850L,335L,1000L,1566L,236L,588L,3877L,750L,2292L,783L,747L,666L,260L,1118L,341L,7010L,320L,7010L, 249L,458L,24L,631L,631L,875L),V2 = C(11.4,23,23,23,23,23,23,23,23,23,23,23,23,23,18.7,11.6,24.9 ,28.8,14.1,28.8,23.6,18.3,25,如图7所示,8.2,23.6,24.9,29.5,13.5,19.4,34.8,17.4,22.9,27.6,12,26.6,30.4,12.9,38.5,35.4,27.8,19.2 ,0,17.3,21.2,19.3,0,3.9,26.6,22.6),V3 = C(27,3.8,3.8,3.8,3.8,3.8,3.8,3.8,3.8,3.8,3.8,3.8,3.8,3.8, 4.5,0,0,7.3,0.3,7.3,12.2,4.9,0,0.4,0,1.8,15.1,12.7,0,3.6,3,7.5,14,9.4,0,14.1,4.1,4,1.4, 9.4,5.9,2.7,0,9.3,1,0,0,0,1.8,9.5),V4 = C(1.3,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,3.1,1.1,3,0.3%,3,3,0,0,0,0,3.6,0.8,2.7,0,2.8,0.4,0.8,1.6,6.4,0,3.8,3.7 ,0,0,2.6 ,1.5,2.5,2.4,1.4,2,5,0,0,0.6,0.5),V5 =结构(C(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L, 1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L, 1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L),.Label = “CHR1” 的class = “因子”),V6 = C(10469L,20001L ,20210L,21000L,21201L,22000L,22201L,23000L,23201L,20000L,20001L,24001L,24205L,24405L,0L,30855L,30953L,31293L,31436L,31734L,32841L,33048L,33466L,33529L,34048L,34451L,34565L ,35217L,35367L,37045L,37733L,38059L,38256L,39465L,39624L,39925L,40333L,40629L,40736L,41380L,42370L,43243L,44836L,44877L,45887L,46079L,46217L,46416L,46553L,46893L),V7 = C(11447L,20200L,20400L,21200L,21400L,22200L,22400L,23200L,23400L,20001L,20200L,24200L,24400L,24600L,0L,30952L,31131L,31435L,31733L,31754L,33037L,33456L,33509L,34041L, 34108L,34560L,34921L,35366L,35499L,37431L,37861L,38191L,39464L,39623L,39924L,40294L,40626L ,40729L,40878L,42285L,42504L,44835L,44876L,45753L,45987L,46198L,46240L,46493L,46722L,47092L),V8 =结构(C(38L,37L,37L,37L,37L,37L,37L,37L, 37L,37L,37L,37L,37L,37L,36L,35L,34L,33L,32L,31L,30L,29L,28L,27L,26L,25L,24L,23L,22L,21L,20L,19L,18L, 17L,16L,15L,14L,13L,12L,11L,10L,9L,8L,7L,6L,5L,4L,3L,2L,1L),.Label = C( “(249203529)”,“(249203899) ”, “(249204128)”, “(249204381​​)”, “(249204423)”, “(249204634)”, “(249204868)”, “(249205745)”, “(249205786)”, “(249208117)”, “(249208336)”, “(249209743)”, “(249209892)”, “(249209995)”, “(249210327)”, “(249210697)”, “(249210998)”, “(249211157)”,“( 249212430) “ ”(249212760)“, ”(249213190)“, ”(249215122)“, ”(249215255)“, ”(249215700)“, ”(249216061)“, ”(249216513)“,”(249216580) ”, “(249217112)”, “(249217165)”, “(249217584)”, “(249218867)”, “(249218888)”, “(249219186)”, “(249219490)”, “(249219669)”, “(249219773)”, “(249235266)”, “(249239174)”),类= “因子”),V9 =结构(C(2L,1L,1L,2L, 2L,2L,2L,2L,1L,2L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,2L,1L,2L,2L,2L,1L, 1L,1L,1L,1L,1L,1L,1L,1L,2L,2L,2L,1L,1L,1L,1L,1L,1L,1L,2L,1L,2L),.Label = C(“+ ”, “C”)中,class = “因素”),V10 =结构(C(32L,23L,23L,23L,23L,23L,24L,23L,23L,23L,23L,23L,23L,23L,34L, 33L,27L,26L,1L,26L,22L,12L,5L,14L,13L,16L,30L,25L,2L,7L,16L,15L,29L,28L,3L,28L,15L,4L,18L,8L, 19L,10L,31L,10L,9L,11L,6L,17L,20L,21L),.Label = C( “AluJo”, “AluJr”, “AluSx”, “AluSz6”, “AluYc”, “AT_rich”, “Charlie5”, “ERVL-E-INT”, “L1M5”, “L1MA8”, “L1MA9”, “L1MB5”, “L1P1”, “L1PA6”, “L2a而”, “L2c的”, “LTR12F”,“LTR16C “ ”MamRep1527“, ”MER45A“, ”MER58A“, ”MIR“, ”MIR3“, ”MIR3a“, ”MIRb“, ”MIRC“, ”MLT1A“, ”MLT1E1A“, ”MLT1E1A-INT“,” MLT1J2 ”, “(TAAA)n” 个, “TAR1”, “(TC)N”, “XXXXX”),类= “因子”),V11 =结构(C(10L,13L,13L,13L,13L,13L, 13L,13L,13L,13L,13L,13L,13L,13L,14L,11L,9L,13L,12L,13L,13L,3L,12L,3L,3L,4L,9L,13L,12L, 1L,4L,4L,9L,9L,12L,9L,4L,12L,8L,8L,6L,3L,11L,3L,3L,3L,5L,7L,2L,1L),.Label = C(“DNA / HAT-查理 “ ”DNA /帽Tip100“, ”LINE / L1“, ”LINE / L2“, ”Low_complexity“, ”LTR“, ”LTR / ERV1“, ”LTR / ERVL“,” LTR / ERVL -MaLR”, “卫星/ TELO”, “Simple_repeat”, “SINE /铝”, “SINE / MIR”, “XXXXXXXX”),类= “因子”),V12 =结构(C(19L,5L,5L, 5L,5L,5L,5L,5L,5L,5L,5L,5L,5L,5L,2L,9L,8L,25L,2L,10L,9L,23L,2L,1L,15L,12L,1L,6L, 2L,11L,16L,13L,7L,2L,2L,8L,14L,1L,3L,26L,17L,18L,9L,21L,22L,24L,2L,4L,27L,20L),.Label = C( “(0)”, “1”, “(113)”, “(115)”, “(119)”, “(13)”, “131”, “173”, “2”, “218”, “2234”, “(231)”, “2705”, “2923”, “2970”, “3242”, “359”, “3715”, “(399)”, “(4)”, “5306”, “5334”, “5746”, “6167”, “67”, “(685)”, “7”),类= “因子”),V13 = C(1712L,143L,143L,143L,143L,143L, 143L,143L,143L,143L,143L,143L,143L,143L,162L,96L,349L,217L,298L,238L,216L,6174L,44L,6154L,3030L,3156L,448L,255L,133L,2623L,3375L, 2846L,1489L, 172L,301L,666L,3217L,312L,376L,4982L,499L,5305L,41L,6290L,5433L,6280L,24L,404L,178L,220L),V14 =结构(C(23L,24L,24L,24L,24L ,24L,24L,24L,24L,24L,24L,24L,24L,24L,8L,1L,10L,25L,4L,13L,21L,1L,11L,26L,15L,14L,20L,30L,5L,2L ,1L,27L,1L,18L,3L,4L,7L,6L,9L,19L,22L,29L,1L,2L,28L,16L,1L,17L,1L,12L),.Label = C(“(0 ) “ ”(1)“, ”(11)“, ”(14)“, ”(179)“, ”208“, ”(209)“, ”(212)“, ”232“,”(25 ) “ ”(255)“, ”3“, ”(30)“, ”3049“, ”(3116)“, ”(32)“, ”327“, ”(388)“, ”4016“,” 41" , “(46)”, “(470)”, “483”, “49”, “(51)”, “5640”, “(573)”, “(691)”, “(838)” , “91”),类= “因子”),V15 = C(2L,5L,5L,5L,5L,5L,5L,5L,5L,5L,5L,5L,5L,5L,22L,23L,22L ,24L,25L,24L,26L,27L,28L,29L,30L,31L,32L,33L,34L,35L,36L,37L,38L,38L,39L,38L,37L,40L,41L,42L,43L,44L ,45L,44L,46L,47L,48L,49L,50L,51L)),.Names = C( “V1”, “V2”, “V3”, “V4”, “V5”, “V6”,“V7 “ ”V8“, ”V9“, ”V10“, ”V11“, ”V12“, ”V13“, ”V14“, ”V15“),类=” data.frame ”,row.names = C(NA,-50L))

这里是应用ValidMerge后的输出。

dput(RMDB.OUT) structure(list(V1 = c(3612L, NA, NA, 318L, 318L, 318L, 318L, 318L, 318L, NA, 741L, 444L, 741L, 407L, 2059L, 407L, 656L, 2058L, 257L, 4051L, 456L, 351L, 850L, 335L, 1000L, 1566L, 236L, 588L, 3877L, 750L, 2292L, 783L, 747L, 666L, 260L, 1118L, 341L, 7010L, 320L, 7010L, 249L, 458L, 24L, 631L, 631L, 875L), V2 = c(11.4, NA, NA, 23, 23, 23, 23, 23, 23, NA, 18.7, 11.6, 24.9, 28.8, 14.1, 28.8, 23.6, 18.3, 25, 7, 8.2, 23.6, 24.9, 29.5, 13.5, 19.4, 34.8, 17.4, 22.9, 27.6, 12, 26.6, 30.4, 12.9, 38.5, 35.4, 27.8, 19.2, 0, 17.3, 21.2, 19.3, 0, 3.9, 26.6, 22.6), V3 = c(27, NA, NA, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, NA, 4.5, 0, 0, 7.3, 0.3, 7.3, 12.2, 4.9, 0, 0.4, 0, 1.8, 15.1, 12.7, 0, 3.6, 3, 7.5, 14, 9.4, 0, 14.1, 4.1, 4, 1.4, 9.4, 5.9, 2.7, 0, 9.3, 1, 0, 0, 0, 1.8, 9.5 ), V4 = c(1.3, NA, NA, 0, 0, 0, 0, 0, 0, NA, 0, 3.1, 1.1, 3, 0.3, 3, 3, 0, 0, 0, 0, 3.6, 0.8, 2.7, 0, 2.8, 0.4, 0.8, 1.6, 6.4, 0, 3.8, 3.7, 0, 0, 2.6, 1.5, 2.5, 2.4, 1.4, 2, 5, 0, 0, 0.6, 0.5), V5 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "chr1", class = "factor"), V6 = c(10469L, 20001L, 21000L, 22000L, 22201L, 23000L, 23201L, 20000L, 20001L, 24001L, 0L, 30855L, 30953L, 31293L, 31436L, 31734L, 32841L, 33048L, 33466L, 33529L, 34048L, 34451L, 34565L, 35217L, 35367L, 37045L, 37733L, 38059L, 38256L, 39465L, 39624L, 39925L, 40333L, 40629L, 40736L, 41380L, 42370L, 43243L, 44836L, 44877L, 45887L, 46079L, 46217L, 46416L, 46553L, 46893L), V7 = c(11447L, 20400L, 21400L, 22200L, 22400L, 23200L, 23400L, 20001L, 20200L, 24600L, 0L, 30952L, 31131L, 31435L, 31733L, 31754L, 33037L, 33456L, 33509L, 34041L, 34108L, 34560L, 34921L, 35366L, 35499L, 37431L, 37861L, 38191L, 39464L, 39623L, 39924L, 40294L, 40626L, 40729L, 40878L, 42285L, 42504L, 44835L, 44876L, 45753L, 45987L, 46198L, 46240L, 46493L, 46722L, 47092L), V8 = structure(c(38L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 36L, 35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L, 22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("(249203529)", "(249203899)", "(249204128)", "(249204381)", "(249204423)", "(249204634)", "(249204868)", "(249205745)", "(249205786)", "(249208117)", "(249208336)", "(249209743)", "(249209892)", "(249209995)", "(249210327)", "(249210697)", "(249210998)", "(249211157)", "(249212430)", "(249212760)", "(249213190)", "(249215122)", "(249215255)", "(249215700)", "(249216061)", "(249216513)", "(249216580)", "(249217112)", "(249217165)", "(249217584)", "(249218867)", "(249218888)", "(249219186)", "(249219490)", "(249219669)", "(249219773)", "(249235266)", "(249239174)"), class = "factor"), V9 = structure(c(2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("+", "C"), class = "factor"), V10 = structure(c(32L, 23L, 23L, 23L, 24L, 23L, 23L, 23L, 23L, 23L, 34L, 33L, 27L, 26L, 1L, 26L, 22L, 12L, 5L, 14L, 13L, 16L, 30L, 25L, 2L, 7L, 16L, 15L, 29L, 28L, 3L, 28L, 15L, 4L, 18L, 8L, 19L, 10L, 31L, 10L, 9L, 11L, 6L, 17L, 20L, 21L), .Label = c("AluJo", "AluJr", "AluSx", "AluSz6", "AluYc", "AT_rich", "Charlie5", "ERVL-E-int", "L1M5", "L1MA8", "L1MA9", "L1MB5", "L1P1", "L1PA6", "L2a", "L2c", "LTR12F", "LTR16C", "MamRep1527", "MER45A", "MER58A", "MIR", "MIR3", "MIR3a", "MIRb", "MIRc", "MLT1A", "MLT1E1A", "MLT1E1A-int", "MLT1J2", "(TAAA)n", "TAR1", "(TC)n", "XXXXX"), class = "factor"), V11 = structure(c(10L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 14L, 11L, 9L, 13L, 12L, 13L, 13L, 3L, 12L, 3L, 3L, 4L, 9L, 13L, 12L, 1L, 4L, 4L, 9L, 9L, 12L, 9L, 4L, 12L, 8L, 8L, 6L, 3L, 11L, 3L, 3L, 3L, 5L, 7L, 2L, 1L), .Label = c("DNA/hAT-Charlie", "DNA/hAT-Tip100", "LINE/L1", "LINE/L2", "Low_complexity", "LTR", "LTR/ERV1", "LTR/ERVL", "LTR/ERVL-MaLR", "Satellite/telo", "Simple_repeat", "SINE/Alu", "SINE/MIR", "XXXXXXXX"), class = "factor"), V12 = structure(c(19L, NA, NA, 5L, 5L, 5L, 5L, 5L, 5L, NA, 2L, 9L, 8L, 25L, 2L, 10L, 9L, 23L, 2L, 1L, 15L, 12L, 1L, 6L, 2L, 11L, 16L, 13L, 7L, 2L, 2L, 8L, 14L, 1L, 3L, 26L, 17L, 18L, 9L, 21L, 22L, 24L, 2L, 4L, 27L, 20L), .Label = c("(0)", "1", "(113)", "(115)", "(119)", "(13)", "131", "173", "2", "218", "2234", "(231)", "2705", "2923", "2970", "3242", "359", "3715", "(399)", "(4)", "5306", "5334", "5746", "6167", "67", "(685)", "7"), class = "factor"), V13 = c(1712L, NA, NA, 143L, 143L, 143L, 143L, 143L, 143L, NA, 162L, 96L, 349L, 217L, 298L, 238L, 216L, 6174L, 44L, 6154L, 3030L, 3156L, 448L, 255L, 133L, 2623L, 3375L, 2846L, 1489L, 172L, 301L, 666L, 3217L, 312L, 376L, 4982L, 499L, 5305L, 41L, 6290L, 5433L, 6280L, 24L, 404L, 178L, 220L), V14 = structure(c(23L, NA, NA, 24L, 24L, 24L, 24L, 24L, 24L, NA, 8L, 1L, 10L, 25L, 4L, 13L, 21L, 1L, 11L, 26L, 15L, 14L, 20L, 30L, 5L, 2L, 1L, 27L, 1L, 18L, 3L, 4L, 7L, 6L, 9L, 19L, 22L, 29L, 1L, 2L, 28L, 16L, 1L, 17L, 1L, 12L), .Label = c("(0)", "(1)", "(11)", "(14)", "(179)", "208", "(209)", "(212)", "232", "(25)", "(255)", "3", "(30)", "3049", "(3116)", "(32)", "327", "(388)", "4016", "41", "(46)", "(470)", "483", "49", "(51)", "5640", "(573)", "(691)", "(838)", "91"), class = "factor"), V15 = c(2L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 22L, 23L, 22L, 24L, 25L, 24L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 38L, 39L, 38L, 37L, 40L, 41L, 42L, 43L, 44L, 45L, 44L, 46L, 47L, 48L, 49L, 50L, 51L)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15"), class = "data.frame", row.names = c(NA, -46L))

编辑4:对不起,这里是简化版本:初始Data.frame

dput(RMDB.cut)

结构(列表(染色体=结构(C(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L) ,.Label = “CHR1” 的class = “因子”),起始= C(10469L,20001L,20210L,21000L,21201L,22000L,22201L,23000L,23201L,20000L,20001L,24001L,24205L,24405L,0L,30855L ,30953L,31293L,31436L,31734L),结束= C(11447L,20200L,20400L,21200L,21400L,22200L,22400L,23200L,23400L,20001L,20200L,24200L,24400L,24600L,0L,30952L,31131L,31435L, 31733L,31754L), (Left) =结构(C(38L,37L,37L,37L,37L,37L,37L,37L,37L,37L,37L,37L,37L,37L,36L,35L,34L,33L,32L ,31L),.Label = C( “(249203529)”, “(249203899)”, “(249204128)”, “(249204381​​)”, “(249204423)”, “(249204634)”, “(249204868)” “(249205745)”, “(249205786)”, “(249208117)”, “(249208336)”, “(249209743)”, “(249209892)”, “(249209995)”, “(249210327)”,“ (249210697) “ ”(249210998)“, ”(249211157)“, ”(249212430)“, ”(249212760)“, ”(249213190)“, ”(249215122)“, ”(249215255)“,”(249215700 ) “ ”(249216061)“,”(2 49216513) “ ”(249216580)“, ”(249217112)“, ”(249217165)“, ”(249217584)“, ”(249218867)“, ”(249218888)“, ”(249219186)“,”(249219490) ”, “(249219669)”, “(249219773)”, “(249235266)”, “(249239174)”),类= “因子”),链=结构(C(2L,1L,1L,2L,2L, 2L,2L,2L,1L,2L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L),.Label = C( “+”, “C”)中,class = “因子” ),repName =结构(C(32L,23L,23L,23L,23L,23L,24L,23L,23L,23L,23L,23L,23L,23L,34L,33L,27L,26L,1L,26L)。标签= C( “AluJo”, “AluJr”, “AluSx”, “AluSz6”, “AluYc”, “AT_rich”, “Charlie5”, “ERVL-E-INT”, “L1M5”, “L1MA8”,“L1MA9 ”, “L1MB5”, “L1P1”, “L1PA6”, “L2a而”, “L2c的”, “LTR12F”, “LTR16C”, “MamRep1527”, “MER45A”, “MER58A”, “MIR”, “MIR3”, “MIR3a”, “MIRb”, “MIRC”, “MLT1A”, “MLT1E1A”, “MLT1E1A-INT”, “MLT1J2”, “(TAAA)n” 个, “TAR1”, “(TC)N”,“XXXXX “),类= “因子”),ValidMerge = C(FALSE,TRUE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE, FALSE,FALSE)),.Names = C(“CH romosome”, “开始”, “结束”, “(左)”, “海滩”, “repName”, “ValidMerge”),row.names = C(NA,20L)中,class = “data.frame”)

而合并后的输出

dput(RMDB.out.cut)

结构(列表(=染色体结构(C(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L),.Label = “CHR1”,类= “因子”),起始= C( “10469”, “20001”, “21000”, “22000”, “22201”, “23000”, “23201”, “20000”, “20001”, “24001” , “0”, “30855”, “30953”, “31293”, “31436”, “31734”),结束= C( “11447”, “20400”, “21400”, “22200”, “22400”, “23200”, “23400”, “20001”, “20200”, “24600”, “0”, “30952”, “31131”, “31435”, “31733”, “31754”), (Left) =结构(C(38L,37L,37L,37L,37L,37L,37L,37L,37L,37L,36L,35L,34L,33L,32L,31L),.Label = C( “(249203529)”,“(249203899 )”, “(249204128)”, “(249204381​​)”, “(249204423)”, “(249204634)”, “(249204868)”, “(249205745)”, “(249205786)”, “(249208117)” “(249208336)”, “(249209743)”, “(249209892)”, “(249209995)”, “(249210327)”, “(249210697)”, “(249210998)”, “(249211157)”,“ (249212430) “ ”(249212760)“, ”(249213190)“, ”(249215122)“, ”(249215255)“, ”(249215700)“, ”(249216061)“, ”(249216513)“,”(249216580 ) “ ”(249217112)“, ”(249217165)“,”(249217584 )”, “(249218867)”, “(249218888)”, “(249219186)”, “(249219490)”, “(249219669)”, “(249219773)”, “(249235266)”, “(249239174)” ),类= “因子”),链=结构(C(2L,1L,2L,2L,2L,2L,1L,2L,1L,1L,1L,1L,1L,1L,1L,1L),.Label = C( “+”, “C”)中,class = “因素”),repName =结构(C(32L,23L,23L,23L,24L,23L,23L,23L,23L,23L,34L,33L,27L ,26L,1L,26L),.Label = C( “AluJo”, “AluJr”, “AluSx”, “AluSz6”, “AluYc”, “AT_rich”, “Charlie5”, “ERVL-E-INT”,“ L1M5" , “L1MA8”, “L1MA9”, “L1MB5”, “L1P1”, “L1PA6”, “L2a而”, “L2c的”, “LTR12F”, “LTR16C”, “MamRep1527”, “MER45A”, “MER58A” , “MIR”, “MIR3”, “MIR3a”, “MIRb”, “MIRC”, “MLT1A”, “MLT1E1A”, “MLT1E1A-INT”, “MLT1J2”, “(TAAA)n” 个, “TAR1”, “(TC)N”, “XXXXX”),类= “因子”),ValidMerge = C(FALSE,TRUE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE ,FALSE,FALSE)),.Names = C( “染色体”, “开始”, “结束”, “(左)”, “海滩”, “repName”, “ValidMerge”),row.names = C(” 2" , “3”, “4”, “6”, “7”, “8”, “9”, “10”,“1 1" , “111”, “15”, “16”, “17”, “18”, “19”, “20”)中,class = “data.frame”)

Answer 1:

我认为该战略应是产生所谓DoMerge另一列 - 和对每一行R_I(其中,i取值范围为1到n-1),DoMerge如果名称和绞线R_I和R_之间的匹配是TRUE {I + 1},和结束对R_I足够接近开始为R_ {I + 1}(内100中,如果这是正确的值)。 DoMerge为行n是按照惯例FALSE。 直观地说,DoMerge为TRUE意味着它是有效的合并该行与下一行。

然后,我们合并所有行,其中有TRUEs的连续串起来。 如果我们同意这是最好的策略,我可以敲起来了,一些简单的代码! :)

更新:

下面是任务代码,假设是myDF是与列名,钢绞线,开始和结束信息的数据帧...的下面的输出是在你需要合并的起点和终点 - 虽然一旦你知道了什么需要合并应该是不在话下:)

distanceThresh = 100
isSameName=(mydf$Name==c(mydf$Name[-1],"void"))
isSameStrand=(mydf$Strand==c(mydf$Strand[-1],"void"))
isWithinDistance=(c(mydf$Start[-1],max(mydf$End)+2*distanceThresh)
                  -mydf$End) <= distanceThresh
validMerge = isSameName & isSameStrand & isWithinDistance

fthent=which(!validMerge & c(validMerge[-1],FALSE))
tthenf=which(validMerge & !c(validMerge[-1],TRUE))
startpt = fthent+1; if (validMerge[1]) {startpt=c(1,startpt)}
endpt = tthenf+1

instructions=NULL
for (kk in seq_along(startpt)) {
instructions = c(instructions,
               paste("Merge",startpt[kk],"to",endpt[kk],"inclusive",sep=" "))
}

让我知道,如果这一切是有道理的! :)

归并方法(6月8日):

如何像这样(有过一些测试,但不是在真实数据)...

doMerge = unlist(mapply(function(x,y) {seq(x,y,1)},startpt,endpt))
doNotMerge = setdiff(seq_along(validMerge),doMerge)
dataMerged=data.frame(Name=RMDB$Name[startpt], Strand=RMDB$Strand[startpt],
                      Start=RMDB$Start[startpt], End=RMDB$End[endpt],
                      stringsAsFactors=FALSE)
dataUnmerged=RMDB[doNotMerge,]

基本上doMerge告诉你,需要合并(刚刚成立,从每一个运行的程序行startpt到相应endpt )。 和doNotMerge是所有其他线(假定的长度validMerge在数据的长度)。

然后dataMerged只需直接构建合并的数据应该是什么样子-显然NameStrandStart从行继承startptEnd从排继承endpt 。 (如果有其他感兴趣的栏目,你就必须决定他们来自哪里,很明显......)中的行数dataMerged匹配的长度startptendpt当然。 最后, dataUnmerged是没有资格对合并的所有行。

希望上面的都有道理,而且很明显,如果你把dataMergeddataUnmerged和重新排列,以得到一切回到原来的顺序(大概有可用于该索引列),那么你有希望的结果。

我希望上面会运行得非常,非常快确实:)



Answer 2:

这里的GenomicRanges解决方案。 第一步骤中,只有一次性,是安装的软件包和它的依赖

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

然后加载包并创建一个GRanges从您的数据,我已经把所谓的数据帧实例df

library(GenomicRanges)
gr = with(df, GRanges(Chromosome, IRanges(Start, End), Strand, repName=repName))

您的数据实际上是一个有点复杂,一个,其中每个列表元素是由一个基因定义的“农庄的名单”,所以

grl = split(gr, values(gr)$repName)

reduce这个元件逐元素的基础上,允许减小时相邻元件之间的最小间隙宽度为100。所以

merged = reduce(grl, min.gapwidth=100L)

你可以迫使这回data.frameas(merged, "data.frame") 之前强制貌似结果

> merged
GRangesList of length 8:
$AluJo
GRanges with 1 range and 0 elementMetadata cols:
      seqnames         ranges strand
         <Rle>      <IRanges>  <Rle>
  [1]     chr1 [31436, 31733]      +

$MIR3
GRanges with 7 ranges and 0 elementMetadata cols:
      seqnames         ranges strand
  [1]     chr1 [20001, 20400]      +
  [2]     chr1 [23201, 23400]      +
  [3]     chr1 [24001, 24600]      +
  [4]     chr1 [20000, 20001]      -
  [5]     chr1 [21000, 21400]      -
  [6]     chr1 [22000, 22200]      -
  [7]     chr1 [23000, 23200]      -

作为一个data.frame

> as(merged, "data.frame")
   element seqnames start   end width strand
1    AluJo     chr1 31436 31733   298      +
2     MIR3     chr1 20001 20400   400      +
3     MIR3     chr1 23201 23400   200      +
4     MIR3     chr1 24001 24600   600      +
5     MIR3     chr1 20000 20001     2      -
6     MIR3     chr1 21000 21400   401      -
7     MIR3     chr1 22000 22200   201      -
8     MIR3     chr1 23000 23200   201      -

对于一百万行,安排到10万个基因,我们有

> length(grl)
[1] 100000
> table(elementLengths(grl))
    10
100000
> system.time(reduce(grl, min.gapwidth=100))
   user  system elapsed
  9.468   0.064   9.553


Answer 3:

我首先从数据帧(再矢量化不同列df )。 做必要的操作如下,然后把它们放在里面data-frame

name_strand=paste(df$Name,df$Strand,sep=",")  #for grouping
st=df$Start
en=df$End
unq=unique(name_strand) #get the unique Name+Strand tags
ls1=list()
for(i in 1:length(unq)){
  index=which(name_strand %in% unq[i]) #get the index ids or group by unique Name+Strand 
  un_ls=unlist(strsplit(unq[i],split=","))
  ls1[[i]]=list(Name=un_ls[1],Strand=un_ls[2],Start=min(st[index]),End=max(en[index]))
}
df=as.data.frame(do.call(rbind, ls1))

因此,这也将解决这种情况时,你有NameStrand发生在其它地方的data-frame

编辑

因为有你的问题的疑虑,我如果你想修改的方案gap_distance <100,也考虑到了data-frame行可能无法订购。 所以我的解决方案采用这两种考虑(你可以修改它,如果这些行从高到低的顺序排列已经StartEnd

for(i in 1:length(unq)){
  index=which(name_strand %in% unq[i]) #get the index ids or group by unique Name+Strand 
  un_ls=unlist(strsplit(unq[i],split=","))
  #previous sol: ls1[[i]]=list(Name=un_ls[1],Strand=un_ls[2],Start=min(st[index]),End=max(en[index]))

  new_st=st[index] #new Start vector from the "st" vector for this group by name_strand
  new_en=en[index]
  new_st=new_st[order(new_st)] #when data-frame rows are not ordered:
  new_en=new_en[order(new_en)]  #"order" method gives the index in ascending order of START

  #using embed method to lag and taking the diff for checking with gap distance considerations
  gap_diff=embed(new_st,2)[,1]-new_en[-length(new_en)]  #subtract from End vector except the last element of End  
  a_ind_st=1 #index of new_st (Start) vector
  b_ind_en=1 #index of new_en (End) vector     
  ls_ind=1  #index for list
  for (j in 1:(length(new_en)-1)){  #also acts as the index for gap_diff
      if (gap_diff[j] < gap_distance){ #where gap_distance = 100 (consideration #2) < or <=
          b_ind_en=j+1
           if (j + 1 > length(new_en)-1){  #special case for last element in vector
              ls1[[ls_ind]]=list(Name=un_ls[1],Strand=un_ls[2],Start=new_st[a_ind_st],End=new_en[b_ind_en]))
              ls_ind=ls_ind+1 #increment list index
           }
       }else{
          ls1[[ls_ind]]=list(Name=un_ls[1],Strand=un_ls[2],Start=new_st[a_ind_st],End=new_en[b_ind_en]))
          ls_ind=ls_ind+1 #increment list index
          a_ind_st=b_ind_en+1
          b_ind_en=b_ind_en+1
      }      
   }
}
df=as.data.frame(do.call(rbind, ls1))

此修改将采取一切都要考虑进去。 如果你不需要任何限制,你可以修改的解决方案。



文章来源: Custom Merge Function in R