R code runs too slowly: how can I rewrite it?

Posted 2019-07-28 06:23

Question:

The file input.txt contains 8,000,000 rows and 4 columns. The first 2 columns are text and the last 2 columns are numbers. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. Columns 3 and 4 hold the number of unique symbols in columns 1 and 2, respectively, after splitting on "]". Each row of input.txt looks like this:


c33]c21]c5]c7]c8]c9  TPS2]MIC17]ERG3]NNF1]CIS3]CWP2  6  6


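**The desired result:

row[i, k] denotes the set of symbols in column k of row i, such as "c33 c21 c5 c7 c8 c9" or "TPS2 MIC17 ERG3 NNF1 CIS3 CWP2", and |.| denotes the number of symbols in a set, e.g. |c33 c21 c5 c7 c8 c9| = 6. The overlap of rows i and j, as the code in this question computes it, is |row[i,1] ∩ row[j,1]| * |row[i,2] ∩ row[j,2]| / min(|row[i,1]| * |row[i,2]|, |row[j,1]| * |row[j,2]|).

If the overlap of two rows is >= 0.6, the numbers of these two rows are written to a file.**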
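
For a single pair of rows, the score described above can be sketched as follows. This is only an illustration of the arithmetic, separate from the full program: the helper names `split_syms` and `overlap` are hypothetical, and the second row is made up.

```r
# Hypothetical helper: split one "]"-separated cell into its unique symbols.
split_syms <- function(x) unique(strsplit(x, "]", fixed = TRUE)[[1]])

# Hypothetical helper: overlap score of row a (cells a1, a2) and row b (cells b1, b2).
overlap <- function(a1, a2, b1, b2) {
  a1 <- split_syms(a1); a2 <- split_syms(a2)
  b1 <- split_syms(b1); b2 <- split_syms(b2)
  up   <- length(intersect(a1, b1)) * length(intersect(a2, b2))
  down <- min(length(a1) * length(a2), length(b1) * length(b2))
  up / down
}

# Example row from the question against a made-up second row:
# shared symbols 3 and 5, set sizes 6*6 = 36 and 4*5 = 20, so 15 / 20.
score <- overlap("c33]c21]c5]c7]c8]c9", "TPS2]MIC17]ERG3]NNF1]CIS3]CWP2",
                 "c33]c21]c5]x1",       "TPS2]MIC17]ERG3]NNF1]CIS3")
score          # 0.75
score >= 0.6   # TRUE, so this pair would be reported
```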

My current code is as follows, but it runs too slowly.

The code:

    library(compiler)
    enableJIT(3)
    data <- read.table("input.txt", header=FALSE)
    row <- nrow(data)   # 8000000 for the full file
    for (i in 1:(row-1)) {
        # symbols of row i
        row11 <- unlist(strsplit(as.character(data[i,1]), "]", fixed=TRUE))
        row12 <- unlist(strsplit(as.character(data[i,2]), "]", fixed=TRUE))
        s1 <- data[i,3]*data[i,4]
        # one output file per row i
        zz <- file(paste("output", i, ".txt", sep=""), "w")
        for (j in (i+1):row) {
            # symbols of row j (re-split on every iteration)
            row21 <- unlist(strsplit(as.character(data[j,1]), "]", fixed=TRUE))
            row22 <- unlist(strsplit(as.character(data[j,2]), "]", fixed=TRUE))
            up <- length(intersect(row11, row21))*length(intersect(row12, row22))
            s2 <- data[j,3]*data[j,4]
            down <- min(s1, s2)
            if ((up/down) >= 0.6) cat(i, "\t", j, "\n", file=zz)
        }
        close(zz)
    }

The result: each row produces one file, which looks like this:

1 23
1 67
1 562
1 78
...

To make it run faster, I rewrote the code as follows.

The file input.txt now contains 16,000,000 rows. The number of columns is not fixed, and neither is the number of unique symbols (e.g., "c33") in each row. Each pair of consecutive rows (an odd row and the even row after it) belongs together, like this:

The 1st row  (odd row1):  c33 c21 c5 c7 c8
The 2nd row (even row1): TPS2 MIC17 ERG3 NNF1 CIS3 CWP2 MCM6
The 3rd row  (odd row2): c33 c21 c5 c21 c18 c4 c58
The 4th row (even row2): TPS12 MIC3 ERG2 NNF1 CIS4

**The desired result:

If a pair of rows overlaps (>= 0.6) with another pair of rows, the numbers of these two pairs are written to a file.**

The code:

    library(compiler)
    enableJIT(3)
    con <- file("input.txt", "r")
    zz <- file("output.txt", "w")
    oddrow1 <- readLines(con, n=1)
    j <- 0
    while (length(oddrow1) != 0) {
        # pair j: the odd line and the even line that follows it
        oddrow1 <- unlist(strsplit(oddrow1, " "))
        evenrow1 <- unlist(strsplit(readLines(con, n=1), " "))
        j <- j + 1
        # second connection: skip the first j pairs, then scan the remaining ones
        con2 <- file("input.txt", "r")
        readLines(con2, n=(j*2))
        oddrow2 <- readLines(con2, n=1)
        i <- j
        while (length(oddrow2) != 0) {
            i <- i + 1
            oddrow2 <- unlist(strsplit(oddrow2, " "))
            evenrow2 <- unlist(strsplit(readLines(con2, n=1), " "))
            up <- length(intersect(oddrow1, oddrow2))*length(intersect(evenrow1, evenrow2))
            down <- min(length(oddrow1)*length(evenrow1), length(oddrow2)*length(evenrow2))
            if ((up/down) >= 0.6) cat(j, "\t", i, "\n", file=zz)
            oddrow2 <- readLines(con2, n=1)
        }
        close(con2)
        oddrow1 <- readLines(con, n=1)
    }
    close(con)
    close(zz)

The result: a single file is produced, which looks like this:

1 23
1 67
1 562
1 78
2 25
2 89
3 56
3 79
 ...

Both of the methods above are too slow. How can I rewrite this code so that it runs fast? Thank you!

Answer 1:

Well, I suspect this uses too much memory for data of your size, but perhaps it will provoke some ideas.

Make up some data, with 20 total unique values and 5 to 10 in each cell.

set.seed(5)
n <- 1000L
ng <- 20
g1 <- paste(sample(10000:99999, ng))
g2 <- paste(sample(10000:99999, ng))
n1 <- sample(5:10, n, replace=TRUE)
n2 <- sample(5:10, n, replace=TRUE)
x1 <- sapply(n1, function(i) paste(g1[sample(ng, i)], collapse="|"))
x2 <- sapply(n2, function(i) paste(g2[sample(ng, i)], collapse="|"))
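
The made-up x1 and x2 stand in for columns 1 and 2 of input.txt. As a rough sketch of how the question's own format could be brought into the same shape, each column can be split once up front instead of inside the loops. The two sample rows below are invented, and `read.table(text = ...)` is used only so the snippet runs without a file.

```r
# Two invented rows in the question's format: two "]"-separated text
# columns followed by the two symbol counts.
txt <- c("c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6",
         "c33]c21]c4 TPS2]ERG3 3 2")
data <- read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)

# Split every cell of a column in a single vectorised call;
# fixed = TRUE treats "]" as a literal character, not a regex.
s1 <- strsplit(data[[1]], "]", fixed = TRUE)
s2 <- strsplit(data[[2]], "]", fixed = TRUE)

lengths(s1)   # number of symbols per row in column 1
lengths(s2)   # number of symbols per row in column 2
```

With a real file, `read.table("input.txt", ...)` would replace the `text =` argument. The lists s1 and s2 have the same form as the result of `strsplit(x1, "|", fixed=TRUE)` in this answer.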

Load the Matrix library and define a helper function that takes a list of string vectors and converts it to a sparse matrix with one row per list element, one column per unique string, and 1's where a string is present.

library(Matrix)
str2mat <- function(s) {
  n <- length(s)
  ni <- sapply(s, length)
  s <- unlist(s)
  u <- unique(s)
  spMatrix(nrow=n, ncol=length(u), i=rep(1L:n, ni), j=match(s, u), x=rep(1, length(s)))
}
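
To see what the helper produces, here is a small sketch on three made-up rows. The definition of str2mat is repeated only so the snippet runs on its own; the list `toy` is an assumption for illustration.

```r
library(Matrix)

# Copy of the helper from this answer, repeated so the snippet is self-contained.
str2mat <- function(s) {
  n <- length(s)
  ni <- sapply(s, length)
  s <- unlist(s)
  u <- unique(s)
  spMatrix(nrow=n, ncol=length(u), i=rep(1L:n, ni), j=match(s, u), x=rep(1, length(s)))
}

# Three made-up rows of symbols.
toy <- list(c("a", "b", "c"), c("b", "c"), c("d"))
m <- str2mat(toy)

# Columns follow the order of first appearance: a, b, c, d.
dense <- as.matrix(m)
dense

# Entry [i, j] of m %*% t(m) counts the symbols shared by rows i and j.
shared <- as.matrix(tcrossprod(m))
shared[1, 2]   # rows 1 and 2 share "b" and "c"
```

Note that this relies on each row listing a symbol at most once. If a row can repeat a symbol, applying `unique()` to each list element first would probably be needed, since repeated entries in the triplet form are added together.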

OK, now we can actually do something. First create the matrices and get the total number present in each row.

m1 <- str2mat(strsplit(x1, "|", fixed=TRUE))
m2 <- str2mat(strsplit(x2, "|", fixed=TRUE))
n1 <- rowSums(m1)
n2 <- rowSums(m2)

Now we can use crossproducts of these matrices to get the numerator, and outer with pmin to get the denominator. We can then compute the overlap and test if it is > 0.6 (use >= to match the question's threshold exactly). Since we have the whole matrix, we're not interested in the diagonal or the lower half. (There are ways of storing this kind of matrix more efficiently with the Matrix library, but I'm not sure how.) We then get the row pairs that have enough overlap with which.

num <- tcrossprod(m1)*tcrossprod(m2)
n12 <- n1*n2
den <- outer(n12, n12, pmin)
use <- num/den > 0.6
diag(use) <- FALSE
use[lower.tri(use)] <- FALSE
out <- which(use, arr.ind=TRUE)

> head(out)
     [,1] [,2]
[1,]   64   65
[2,]   27   69
[3,]   34   81
[4,]   26   82
[5,]    5   85
[6,]   21  115
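
Since the full num and den matrices are n by n, one idea for the memory concern is to process a block of rows at a time. The following is only a sketch under assumptions: the helper name `overlap_pairs`, the block size, and the made-up data are not from the question, and the 0/1 matrices are built directly with `sparseMatrix` to keep the snippet self-contained. It bounds memory to roughly block by n, but the number of comparisons is still quadratic.

```r
library(Matrix)

# Made-up data: n rows, each holding 5 to 10 of ng symbols, as 0/1 sparse matrices.
set.seed(1)
n <- 200L; ng <- 20L
rand_mat <- function() {
  k <- sample(5:10, n, replace = TRUE)
  j <- unlist(lapply(k, function(ki) sample(ng, ki)))
  sparseMatrix(i = rep(seq_len(n), k), j = j, x = rep(1, length(j)), dims = c(n, ng))
}
m1 <- rand_mat(); m2 <- rand_mat()
n12 <- rowSums(m1) * rowSums(m2)

# Hypothetical helper: same test as above, one block of rows at a time.
overlap_pairs <- function(m1, m2, n12, block = 50L, cutoff = 0.6) {
  n <- nrow(m1)
  starts <- seq(1L, n, by = block)
  res <- lapply(starts, function(s) {
    idx <- s:min(s + block - 1L, n)
    # numerator for rows idx against all rows (block x n)
    num <- as.matrix(tcrossprod(m1[idx, , drop = FALSE], m1) *
                     tcrossprod(m2[idx, , drop = FALSE], m2))
    den <- outer(n12[idx], n12, pmin)
    hit <- which(num / den > cutoff, arr.ind = TRUE)
    hit[, 1] <- idx[hit[, 1]]                   # block-local row -> global row
    hit[hit[, 1] < hit[, 2], , drop = FALSE]    # keep the upper triangle only
  })
  do.call(rbind, res)
}

out_blocked <- overlap_pairs(m1, m2, n12)
head(out_blocked)
```

Each block's hits could also be written to the output file right away instead of being collected, so that only one block is held in memory at any time.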


Tags: r loops bigdata