The input.txt contains 8000000 rows and 4 columns. The first 2 columns is text.The last 2 columns is number. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. The value of columns 3 and 4 is the number of unique symbols of columns 1 and 2 after splitting by "]" respectively. Each row of input.txt file is like this:
c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6
**The desired result:
row[ , ] represents characters like "c33 c21 c5 c7 c8 c9" or "TPS2 MIC17 ERG3 NNF1 CIS3 CWP2", | .| represents the number of characters, |c33 c21 c5 c7 c8 c9|=6
If two rows are overlapped (>=0.6), it outputs the NO. of these two rows to a file.**
This code is as follows, but it runs too slow.
The code:
library(compiler)
enableJIT(3)
data<-read.table("input.txt",header=FALSE)
row<-8000000
for (i in 1:(row-1)){
row11<-unlist(strsplit(as.character(data[i,1]),"]"))
row12<-unlist(strsplit(as.character(data[i,2]),"]"))
s1<-data[i,3]*data[i,4]
zz<-file(paste("output",i,".txt",sep=""),"w")
for (j in (i+1):row)
{ row21<-unlist(strsplit(as.character(data[j,1]),"]"))
row22<-unlist(strsplit(as.character(data[j,2]),"]"))
up<-length(intersect(row11,row21))*length(intersect(row12,row22))
s2<-data[j,3]*data[j,4]
down<-min(s1,s2)
if ((up/down)>=0.6) cat(i,"\t",j,"\n",file=zz,append=TRUE)
}
close(zz)
}
The running result: each row can produce a file, it is like this:
1 23
1 67
1 562
1 78
...
In order to run fast, I rewrite the code.The code is as follows
The input.txt contains 16000000 rows. The number of columns is not fixed. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. Each two rows of input.txt file is like this:
The 1st row (odd row1): c33 c21 c5 c7 c8
The 2nd row (even row1): TPS2 MIC17 ERG3 NNF1 CIS3 CWP2 MCM6
The 3rd row (odd row2): c33 c21 c5 c21 c18 c4 c58
The 4th row (even row2): TPS12 MIC3 ERG2 NNF1 CIS4
**The desired result:
If two rows are overlapped (>=0.6) with other two rows, it outputs the NO. of these two rows to a file.**
The code:
library(compiler)
enableJIT(3)
con <- file("input.txt", "r")
zz<-file("output.txt","w")
oddrow1<-readLines(con,n=1)
j<-0
i<-0
while( length(oddrow1) != 0 ){
oddrow1<-strsplit(oddrow1," ")
evenrow1<-readLines(con,n=1)
evenrow1<-strsplit(evenrow1," ")
j<-j+1
con2 <- file("input.txt", "r")
readLines(con2,n=(j*2))
oddrow2<-readLines(con2,n=1)
i<-j
while( length(oddrow2) != 0 ){
i<-i+1
oddrow2<-strsplit(oddrow2," ")
evenrow2<-readLines(con2,n=1)
evenrow2<-strsplit(evenrow2," ")
oddrow1<-unlist(oddrow1)
oddrow2<-unlist(oddrow2)
evenrow1<-unlist(evenrow1)
evenrow2<-unlist(evenrow2)
up<-length(intersect(oddrow1,oddrow2))*length(intersect(evenrow1,evenrow2))
down<-min(length(oddrow1)*length(evenrow1),length(oddrow2)*length(evenrow2))
if ((up/down)>=0.6) {cat(j,"\t",i,"\n",file=zz,append=TRUE) }
oddrow2<-readLines(con2,n=1)
}
close(con2)
oddrow1<-readLines(con,n=1)
}
close(con)
close(zz)
The running result: it can produce a file, it is like this:
1 23
1 67
1 562
1 78
2 25
2 89
3 56
3 79
...
Both the above two methods are too slow, In order to run fast,how to rewrite this code. Thank you!