Fast computation of Tomek links in R

Posted 2019-09-23 06:53

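I want to implement Tomek links to deal with imbalanced data. The code is for a binary classification problem in which class 0 is the majority class and class 1 is the minority class. X is the input and Y is the output. I wrote the code below, but I am looking for a way to speed up the computation.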

How can I improve my code?

#########################
#remove overlapping observation using tomek links
#given observations i and j belonging to different classes
#(i,j) is a Tomek link if there is NO example z, such that d(i, z) < d(i, j) or d(j , z) < d(i, j)
#find tomek links and remove only the observations of the tomek links belonging to majority class (0 class).
#########################
tomekLink<-function(X,Y,distType="euclidean"){
i.1<-which(Y==1)
i.0<-which(Y==0)
X.1<-X[i.1,]
X.0<-X[i.0,]
i.tomekLink=NULL
j.tomekLink=NULL
#i and j belong to different classes
timeTomek<-system.time({
for(i in i.1){
    for(j in i.0){
        d<-dst(X,i,j,distType)
        obsleft<-setdiff(1:nrow(X),c(i,j))
        for(z in obsleft){
            if ( dst(X,i,z,distType)<d | dst(X,j,z,distType)<d ){
                break #(i,j) is not a Tomek link, get next pair (i,j)
                } 
            #if z is the last obs and d(i, z) > d(i, j) and d(j , z) > d(i, j),then (i,j) is a Tomek link
            if(z==obsleft[length(obsleft)]){
                if ( dst(X,i,z,distType)>d & dst(X,j,z,distType)>d ){
                    #(i,j) is a Tomek link
                    #cat("\n tomeklink obs",i,"and",j)
                    i.tomekLink=c(i.tomekLink,i)
                    j.tomekLink=c(j.tomekLink,j)
                    #since we want to eliminate only majority class observations
                    #remove j from i.0 to speed up the loop
                    i.0<-setdiff(i.0,j)
                    }
                }
            }
        }
    }
})  
print(paste("Time to find tomek links:",round(timeTomek[3],digits=2))) 
#id2keep<-setdiff(1:nrow(X),c(i.tomekLink,j.tomekLink))
id2keep<-setdiff(1:nrow(X),j.tomekLink)
cat("number of obs removed using Tomek links:",nrow(X)-length(id2keep),"\n",
    (nrow(X)-length(id2keep))/nrow(X)*100,"% of training ;",
    (length(j.tomekLink))/length(which(Y==0))*100,"% of 0 class")
X<-X[id2keep,]
Y<-Y[id2keep]
cat("\n prop of 1 after TomekLink:",(length(which(Y==1))/length(Y))*100,"% \n")
return(list(X=X,Y=Y))
}


#distance measure used in tomekLink function
dst<-function(X,i,j,distType="euclidean"){
d<-dist(rbind(X[i,],X[j,]), method= distType)
return(d)
}

Answer 1:

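I have not tested your code, but at first glance preallocation should help. Instead of growing the vector with i.tomekLink=c(i.tomekLink,i), try allocating the memory for storing the Tomek links up front.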
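
To illustrate the preallocation idea, here is a minimal sketch. The helper name collect_preallocated and its arguments are hypothetical and not part of the question's code. It assumes an upper bound on the number of results is known; for Tomek links, the size of the smaller class is a reasonable bound when distances have no ties, since each observation can then belong to at most one link.

```r
# Hypothetical helper: collect the candidates that pass a test without
# growing the result vector inside the loop.
collect_preallocated <- function(candidates, is_link, n_max = length(candidates)) {
  found <- integer(n_max)   # allocate the full buffer once
  n_found <- 0L             # number of slots actually used
  for (k in candidates) {
    if (is_link(k)) {
      n_found <- n_found + 1L
      found[n_found] <- k   # write into the buffer instead of c(found, k)
    }
  }
  found[seq_len(n_found)]   # trim the unused slots
}

# Stand-in predicate for "this pair is a Tomek link"
collect_preallocated(1:20, function(k) k %% 7L == 0L)
```

In the question's function the same pattern would replace the two c(...) calls for i.tomekLink and j.tomekLink. The gain is likely small compared with removing the triple loop, because the distance computations dominate the running time.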

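Another idea is to compute the distance matrix between all samples and look only at the nearest neighbor of each sample. If the nearest neighbor is from a different class, and the two samples are each other's nearest neighbor, then you have a Tomek link.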
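
The distance-matrix idea could be sketched roughly as follows. The function name tomek_links_nn and the majority argument are assumptions made for illustration; only dist, as.matrix, apply and which.min from base R are used. Two caveats: the full n x n matrix needs O(n^2) memory, and ties are broken by taking the first nearest neighbor, so results can differ from the strict-inequality check in the question when distances tie.

```r
# Sketch: find Tomek links as mutual nearest neighbors of different classes,
# then drop the majority-class member of each link.
tomek_links_nn <- function(X, Y, distType = "euclidean", majority = 0) {
  D <- as.matrix(dist(X, method = distType))  # all pairwise distances, computed once
  diag(D) <- Inf                              # a point is not its own neighbor
  nn <- unname(apply(D, 1, which.min))        # nearest neighbor index for each row
  idx <- seq_len(nrow(D))
  # i and nn[i] form a link if they are mutual nearest neighbors with different labels
  is_link <- (nn[nn] == idx) & (Y[nn] != Y)
  removed <- idx[is_link & Y == majority]     # only majority-class observations are removed
  keep <- setdiff(idx, removed)
  list(X = X[keep, , drop = FALSE], Y = Y[keep], removed = removed)
}

# Toy data: observations 1 (class 0) and 2 (class 1) are mutual nearest neighbors
X <- matrix(c(0, 1, 5, 5.5, 10, 10.2), ncol = 1)
Y <- c(0, 1, 0, 0, 1, 1)
res <- tomek_links_nn(X, Y)
res$removed
```

This replaces the three nested loops and the repeated dist calls on two-row matrices with a single dist call, which is where most of the time in the original function is likely spent.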



Source: fast computation of Tomek link in R