R subset functions, including '[', not working

Posted 2019-07-24 17:52

Question:

I'm having a strange issue where I am looping over a large data frame to create a 3D barplot from the data in 2 columns, where the Z axis is the frequency. The original data frame looks like this (please excuse excess columns):

> head(MergedBH)
                   Row.names           V1.x            V2.x V3.x  V4.x V5.x
RFL_Contig1       RFL_Contig1    RFL_Contig1 Scaffold3494078 1.00 1.000  470
RFL_Contig100   RFL_Contig100  RFL_Contig100 Scaffold2661063 0.61 0.975  236
RFL_Contig1000 RFL_Contig1000 RFL_Contig1000  Scaffold861300 0.96 0.995  451
RFL_Contig1001 RFL_Contig1001 RFL_Contig1001 Scaffold4753307 0.67 0.982  568
RFL_Contig1002 RFL_Contig1002 RFL_Contig1002  Scaffold317096 1.00 0.996 1513
RFL_Contig1003 RFL_Contig1003 RFL_Contig1003   Scaffold60619 0.90 1.000  698
                     V1.y                  V2.y V3.y  V4.y V5.y
RFL_Contig1       RFL_Contig1 ta_contig_5DS_2768763 1.00 1.000  572
RFL_Contig100   RFL_Contig100  ta_contig_4DS_482537 0.56 0.966  737
RFL_Contig1000 RFL_Contig1000 ta_contig_2AL_5829507 0.83 0.944 1573
RFL_Contig1001 RFL_Contig1001 ta_contig_7BS_3161139 1.00 0.999  910
RFL_Contig1002 RFL_Contig1002 ta_contig_3B_10401908 1.00 0.997 2681
RFL_Contig1003 RFL_Contig1003 ta_contig_2AL_6424276 0.70 1.000 1004

I want to create a 3D barplot where the x axis is $V4.x and the y axis is $V4.y. I don't use the typical hist2d function because so much weight is at the (1, 1) position, and we want to visualize the weight at that position against the others as well. To do this I created a 3-column matrix: columns 1 and 2 contain all pairwise combinations over the range of V4.x and V4.y respectively (0.8 to 1 in steps of 0.001), and the final column is the frequency. I do this with the lines below:

> for3d.mat <- matrix(ncol=3,nrow=0)
> for(i in seq(.8,1,by=.001)){for(j in seq(.8,1,by=.001)){iter.mat <- matrix(ncol=3,c(i,j,length(subset(MergedBH,MergedBH$V4.x==i & MergedBH$V4.y==j)$V4.x)));for3d.mat <- rbind(for3d.mat,iter.mat)}}
> subset(for3d.mat,for3d.mat[,1] == .975 & for3d.mat[,2] == .966)
 [,1] [,2] [,3]
> for3d.mat[35350:35325,]
   [,1]  [,2] [,3]
 [1,] 0.975 0.974    0
 [2,] 0.975 0.973    0
 [3,] 0.975 0.972    0
 [4,] 0.975 0.971    0
 [5,] 0.975 0.970    0
 [6,] 0.975 0.969    0
 [7,] 0.975 0.968    0
 [8,] 0.975 0.967    0
 [9,] 0.975 0.966    0
[10,] 0.975 0.965    0
[11,] 0.975 0.964    0
[12,] 0.975 0.963    0
[13,] 0.975 0.962    0
[14,] 0.975 0.961    0
[15,] 0.975 0.960    0
[16,] 0.975 0.959    0
[17,] 0.975 0.958    0
[18,] 0.975 0.957    0

Somehow the pair for RFL_Contig100 (0.975, 0.966) is not picked up by subset when working on the large matrix, and when I locate the corresponding row by index it has a frequency of 0. But if I take that one line out of the for loop and run it on its own, it produces the correct entry:

> matrix(ncol=3,c(i,j,length(subset(MergedBH,MergedBH$V4.x==i & MergedBH$V4.y==j)$V4.x)))
     [,1]  [,2] [,3]
[1,] 0.975 0.966    1

Any suggestions on what the issue is? I've tried a few different ways of doing this but can't get around the subset function. Is there another way to compute the depth for each bin, so that I can use a 3D barplot to visualize all points at once?

Thanks in advance

Update:

I get the same problem with '[': a large part of the matrix, between 0.92 and 0.98, is not being processed:

> for3d.mat <- matrix(ncol=3,nrow=0)
> for(i in seq(.8,1,by=.001)){for(j in seq(.8,1,by=.001)){iter.mat <- matrix(ncol=3,c(i,j,length(MergedBH[MergedBH$V4.x ==i & MergedBH$V4.y ==j,]$V4.x)));for3d.mat <- rbind(for3d.mat,iter.mat)}}
> for3d.mat[for3d.mat[,1] == .975 & for3d.mat[,2] == .966,]
 [,1] [,2] [,3]

I am able to use '[' or subset on most of the matrix, but there is a specific range, in both the original data frame and for3d.mat, that neither subsetting method can reach. Example below:

> for3d.mat[for3d.mat[,1] == .976 & for3d.mat[,2] == .937,]
[1] 0.976 0.937    NA
> for3d.mat[for3d.mat[,1] == .975 & for3d.mat[,2] == .937,]
 [,1] [,2] [,3]
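
The empty results above appear with both subset and '[', which suggests (this is an assumption, nothing in the thread confirms it) that the exact == comparison on decimal values may be involved: values produced by seq() can differ from a typed literal in the last bits even when they print identically. A minimal sketch of one way to compute the depth for every bin without == on doubles, using a made-up stand-in for MergedBH and integer thousandths as bin keys:

```r
# Made-up stand-in for MergedBH with only the two columns of interest.
df <- data.frame(V4.x = c(1.000, 0.975, 0.995, 1.000),
                 V4.y = c(1.000, 0.966, 0.944, 1.000))

# Exact == on doubles can fail for values that print identically.
(0.1 + 0.2) == 0.3                  # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE (comparison with a tolerance)

# Convert to integer thousandths so bin membership is an exact integer match.
kx <- as.integer(round(df$V4.x * 1000))
ky <- as.integer(round(df$V4.y * 1000))

# Count every (x, y) bin in one pass; factor levels keep the empty bins too.
bins <- 800:1000
counts <- table(factor(kx, levels = bins), factor(ky, levels = bins))

counts["975", "966"]     # depth of the (0.975, 0.966) bin
counts["1000", "1000"]   # depth of the (1, 1) bin

# Long form with columns Var1, Var2, Freq, similar in shape to for3d.mat.
long <- as.data.frame(counts)
```

Values outside 0.8 to 1 become NA in the factors and are dropped by table(), which matches the range used in the original loop.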

Answer 1:

From ?subset:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

In other words, use [ directly when inside a loop or apply-style function.
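
As a rough illustration of that advice, here is a small sketch that counts matching rows with [ inside a loop. The data frame is a made-up stand-in for MergedBH, and count_pair is a hypothetical helper name, not something from the original post:

```r
# Toy stand-in for MergedBH; the values are made up for illustration.
df <- data.frame(V4.x = c(1.000, 0.975, 0.995),
                 V4.y = c(1.000, 0.966, 0.944))

# Hypothetical helper: count rows matching one (x, y) pair using `[` only.
count_pair <- function(d, x, y) {
  keep <- d$V4.x == x & d$V4.y == y   # plain logical vector, no non-standard evaluation
  nrow(d[keep, , drop = FALSE])
}

pairs <- rbind(c(0.975, 0.966), c(0.995, 0.944), c(0.900, 0.900))
freq <- numeric(nrow(pairs))          # preallocate instead of growing with rbind()
for (k in seq_len(nrow(pairs))) {
  freq[k] <- count_pair(df, pairs[k, 1], pairs[k, 2])
}
cbind(pairs, freq)
```

The pairs here are typed literals, so == matches them exactly; values generated arithmetically may need a tolerance or rounding before comparison.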

I think there's a convenience function somewhat like subset in the new dplyr package that you might want to look into if [ becomes too onerous, but [ in conjunction with with() usually works fine.
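
A minimal sketch of [ combined with with(), again on a made-up stand-in for MergedBH:

```r
# Toy stand-in for MergedBH; the values are made up for illustration.
df <- data.frame(V4.x = c(1.000, 0.975, 0.995),
                 V4.y = c(1.000, 0.966, 0.944))

# with() evaluates the condition inside df, so the columns need no df$ prefix.
hits <- df[with(df, V4.x == 0.975 & V4.y == 0.966), , drop = FALSE]
nrow(hits)
```

The drop = FALSE argument keeps the result a data frame even when a single row or column is selected.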