Transform data frame into matrix with counts

Posted 2019-07-20 11:53

Question:

I have data files structured like this:

OTU1    PIA0    1120
OTU2    PIA1    2
OTU2    PIA3    6
OTU2    PIA4    10
OTU2    PIA5    1078
OTU2    PIN1    24
OTU2    PIN2    45
OTU2    PIN3    261
OTU2    PIN4    102
OTU3    PIA0    16
OTU3    PIA1    59
OTU3    PIA2    27
OTU3    PIA3    180
OTU3    PIA4    200
OTU3    PIA5    251
OTU3    PIN0    36
OTU3    PIN1    61
OTU3    PIN2    156
OTU3    PIN3    590
OTU3    PIN4    277
OTU4    PIA0    401
OTU4    PIN0    2

I want to build a matrix that counts, for every pair of values in the second column, how many OTUs (the first column: OTU1, OTU2, OTU3, OTU4) contain both of them. It needs to look like this:

      PIA0  PIA1  PIA2  PIA3  PIA4  PIA5  PIN0  PIN1  PIN2  PIN3  PIN4
PIA0  1     1     1     1     1     1     2     1     1     1     1
PIA1  1     0     1     2     2     2     1     2     2     2     2
PIA2  1     1     0     1     1     1     1     1     1     1     1
PIA3  1     2     1     0     2     2     1     2     2     2     2
PIA4  1     2     1     2     0     2     1     2     2     2     2
PIA5  1     2     1     2     2     0     1     2     2     2     2
PIN0  2     1     1     1     1     1     0     1     1     1     1
PIN1  1     2     1     2     2     2     1     0     2     2     2
PIN2  1     2     1     2     2     2     1     2     0     2     2
PIN3  1     2     1     2     2     2     1     2     2     0     2
PIN4  1     2     1     2     2     2     1     2     2     2     0

The diagonal (a row and a column with the same name) holds the number of OTUs in which that value appears alone.

Any ideas?

I have read about the R package 'reshape2' and its 'acast' function, but with those I can only reshape a matrix that already holds all the data, not compute the combination counts I need. I have also considered writing a Biopython script, but I think it would be too long and difficult for me to write with my limited programming knowledge.

The goal is to build a matrix like the one in the example so I can run the online CIRCOS program on these data.

Answer 1:

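You can use dcast to create a binary matrix A indicating the presence of each PI inside each OTU, and then multiply its transpose by it, t(A) %*% A, to get the counts. Entry (i, j) of that product sums, over all OTUs, the product of the two presence indicators, so it equals the number of OTUs that contain both PI i and PI j.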

# Read the fixed-width sample data; the columns become V1 (OTU), V2 (PI) and V3 (abundance)
d <- read.fwf( textConnection("
OTU1    PIA0    1120
OTU2    PIA1    2
OTU2    PIA3    6
OTU2    PIA4    10
OTU2    PIA5    1078
OTU2    PIN1    24
OTU2    PIN2    45
OTU2    PIN3    261
OTU2    PIN4    102
OTU3    PIA0    16
OTU3    PIA1    59
OTU3    PIA2    27
OTU3    PIA3    180
OTU3    PIA4    200
OTU3    PIA5    251
OTU3    PIN0    36
OTU3    PIN1    61
OTU3    PIN2    156
OTU3    PIN3    590
OTU3    PIN4    277
OTU4    PIA0    401
OTU4    PIN0    2"), widths=c(8,8,10), header=FALSE, skip=1 )

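library(reshape2)
# Presence/absence matrix: one row per OTU, one column per PI, TRUE if the PI occurs in that OTU
A <- as.matrix( dcast( d, V1 ~ V2, fun.aggregate=length, value.var="V3" )[,-1]>0 )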
#          PIA0     PIA1     PIA2     PIA3     PIA4     PIA5     PIN0     PIN1     PIN2     PIN3     PIN4    
# [1,]     TRUE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE
# [2,]    FALSE     TRUE    FALSE     TRUE     TRUE     TRUE    FALSE     TRUE     TRUE     TRUE     TRUE
# [3,]     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE
# [4,]     TRUE    FALSE    FALSE    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE    FALSE
# Co-occurrence counts: entry (i, j) is the number of OTUs containing both PI i and PI j
t(A) %*% A
#              PIA0     PIA1     PIA2     PIA3     PIA4     PIA5     PIN0     PIN1     PIN2     PIN3     PIN4    
# PIA0            3        1        1        1        1        1        2        1        1        1        1
# PIA1            1        2        1        2        2        2        1        2        2        2        2
# PIA2            1        1        1        1        1        1        1        1        1        1        1
# PIA3            1        2        1        2        2        2        1        2        2        2        2
# PIA4            1        2        1        2        2        2        1        2        2        2        2
# PIA5            1        2        1        2        2        2        1        2        2        2        2
# PIN0            2        1        1        1        1        1        2        1        1        1        1
# PIN1            1        2        1        2        2        2        1        2        2        2        2
# PIN2            1        2        1        2        2        2        1        2        2        2        2
# PIN3            1        2        1        2        2        2        1        2        2        2        2
# PIN4            1        2        1        2        2        2        1        2        2        2        2
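
Note that the diagonal of t(A) %*% A holds the number of OTUs containing each PI (3 for PIA0), whereas the matrix requested in the question has the number of OTUs in which the PI appears alone (1 for PIA0, 0 elsewhere). The following is a minimal sketch of one way to adjust for that, not part of the original answer. It rebuilds the presence/absence matrix with base R's table() so that it runs on its own; the names otu, pi_id, co, single and alone are introduced here purely for illustration.

```r
# Sample data from the question as (OTU, PI) pairs; the abundances are not needed here.
otu <- c("OTU1", rep("OTU2", 8), rep("OTU3", 11), "OTU4", "OTU4")
pi_id <- c("PIA0",
           "PIA1", "PIA3", "PIA4", "PIA5", "PIN1", "PIN2", "PIN3", "PIN4",
           "PIA0", "PIA1", "PIA2", "PIA3", "PIA4", "PIA5",
           "PIN0", "PIN1", "PIN2", "PIN3", "PIN4",
           "PIA0", "PIN0")

# 0/1 presence matrix: OTUs in rows, PIs in columns
A <- (unclass(table(otu, pi_id)) > 0) * 1L

# Co-occurrence counts; crossprod(A) computes t(A) %*% A
co <- crossprod(A)

# OTUs that contain exactly one PI, and how often each PI is that lone member
single <- rowSums(A) == 1
alone <- colSums(A[single, , drop = FALSE])

# Replace the diagonal with the "appears alone" counts
diag(co) <- alone
co
```

With the sample data, only OTU1 contains a single PI, so the diagonal becomes 1 for PIA0 and 0 for every other PI, while the off-diagonal counts are unchanged.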