How can I obtain the largest set of rows that shar

Posted 2019-04-13 12:16

Question:

I have a logical matrix with gene names as rows and sample numbers as columns. Each row is a logical vector indicating the samples in which that gene was detected. A gene must appear in at least 4 of the 8 samples to still be in the matrix, i.e., all genes in this matrix appear in 4 or more samples.

       Sample1  Sample2  Sample3  Sample4 Sample5 Sample6  Sample7  Sample8 
gene1  TRUE     FALSE    TRUE     TRUE    TRUE    FALSE    FALSE    FALSE
gene2  FALSE    TRUE     FALSE    TRUE    FALSE   TRUE     TRUE     FALSE
gene3  TRUE     TRUE     FALSE    TRUE    FALSE   TRUE     TRUE     FALSE
gene4  FALSE    FALSE    TRUE     FALSE   TRUE    FALSE    FALSE    TRUE
gene5  TRUE     TRUE     TRUE     TRUE    TRUE    FALSE    TRUE     TRUE
gene6  FALSE    FALSE    TRUE     FALSE   FALSE   TRUE     TRUE     TRUE
gene7  TRUE     TRUE     FALSE    FALSE   TRUE    TRUE     FALSE    FALSE
gene8  TRUE     TRUE     TRUE     TRUE    FALSE   FALSE    FALSE    FALSE

Equivalently, for each gene I have the list of samples in which it was expressed, such as:

> gene1
[1] "Sample1"  "Sample3"  "Sample4"  "Sample5"

How can I obtain the largest set of genes (rows) that belong to a common set of 4 samples (columns)?

Edit: This question stems from trying to recreate this:

Outlier analysis is based on the assumption that samples (cells) of the same type also have a set of commonly-expressed genes.

The outlier algorithm iteratively trims the low-expressing genes in an expression file until 95% of the genes that remain are expressed above the Limit of Detection (LoD) value that you set for half of the samples.

The assumption is that the set of samples contains less than 50% outliers. This means that subsequent calculations will only include the half of the samples that have the highest expression for the trimmed gene list.

The trimmed gene list represents genes that are present above the LoD in at least half the samples or the most evenly expressed genes—though they might not be the highest or lowest in their expression value.

For the 50% of the samples that remain, a distribution is calculated that represents their combined expression values for the gene list defined above. For this distribution, the median represents the 50th percentile expression value for the set of data.
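
One possible reading of the quoted description, as a rough single-pass sketch. The toy matrix "expr", the threshold "lod", and the rule of ranking samples by their summed expression are all assumptions made for illustration; the iterative trimming and the 95% stopping criterion are not reproduced here.

```r
# Hypothetical toy expression matrix: 4 genes x 6 samples (made-up values).
expr <- matrix(c(
  5, 6, 7, 1, 6, 4,   # geneA
  1, 0, 2, 0, 1, 9,   # geneB
  8, 7, 9, 2, 8, 7,   # geneC
  4, 5, 0, 0, 6, 4    # geneD
), nrow = 4, byrow = TRUE,
dimnames = list(paste0("gene", LETTERS[1:4]), paste0("S", 1:6)))
lod <- 3  # assumed Limit of Detection

# Keep genes expressed above the LoD in at least half of the samples.
keep_genes <- rowSums(expr > lod) >= ncol(expr) / 2
trimmed <- expr[keep_genes, , drop = FALSE]

# Rank samples by summed expression over the kept genes (an assumed rule)
# and keep the top half.
score <- colSums(trimmed)
keep_samples <- names(sort(score, decreasing = TRUE))[seq_len(ncol(expr) %/% 2)]

# Median of the combined expression values for the remaining samples.
med <- median(as.vector(trimmed[, keep_samples]))
```

With these made-up values, geneB is trimmed, samples S5, S2 and S1 are kept, and the median is 6.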

Answer 1:

I'm guessing you want to find the genes that co-exist in any 4 of the samples. You could try something like:

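n = 4  # number of samples each set of genes must share
# all combinations of n column indices
combs = combn(seq_along(colnames(mat)), n, simplify = FALSE)
# for each combination, find the genes that are TRUE in all n samples,
# name each result after its samples, and keep combinations with more than one gene
Filter(function(x) length(x) > 1, 
       setNames(lapply(combs, function(i) names(which(rowSums(mat[, i]) == n))), 
                lapply(combs, function(x) paste0(colnames(mat)[x], collapse = "; "))))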
#$`Sample1; Sample2; Sample3; Sample4`
#[1] "gene5" "gene8"
#
#$`Sample1; Sample2; Sample4; Sample7`
#[1] "gene3" "gene5"
#
#$`Sample1; Sample3; Sample4; Sample5`
#[1] "gene1" "gene5"
#
#$`Sample2; Sample4; Sample6; Sample7`
#[1] "gene2" "gene3"
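
Since the question asks for the largest set, the same idea can be extended to keep only the combination(s) shared by the most genes. This is a minimal sketch, not a definitive solution: the helper names "idx", "shared", "sizes" and "largest" are made up here, and the matrix is rebuilt from per-gene sample indices so the block runs on its own.

```r
# Rebuild the example matrix from per-gene sample indices (same values as "mat").
idx <- list(gene1 = c(1, 3, 4, 5), gene2 = c(2, 4, 6, 7),
            gene3 = c(1, 2, 4, 6, 7), gene4 = c(3, 5, 8),
            gene5 = c(1, 2, 3, 4, 5, 7, 8), gene6 = c(3, 6, 7, 8),
            gene7 = c(1, 2, 5, 6), gene8 = c(1, 2, 3, 4))
mat <- t(sapply(idx, function(i) seq_len(8) %in% i))
colnames(mat) <- paste0("Sample", 1:8)

n <- 4
combs <- combn(ncol(mat), n, simplify = FALSE)

# Genes that are TRUE in all n samples of each combination.
shared <- lapply(combs, function(i) {
  rownames(mat)[rowSums(mat[, i, drop = FALSE]) == n]
})
names(shared) <- vapply(combs, function(i) {
  paste(colnames(mat)[i], collapse = "; ")
}, character(1))

# Keep only the combination(s) with the most genes; ties are all returned.
sizes <- lengths(shared)
largest <- shared[sizes == max(sizes)]
largest
```

For the example data, the maximum is two genes, so "largest" holds the same four combinations shown above. Note that the number of combinations is choose(ncol(mat), n), which grows quickly as the number of samples increases.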

Here, "mat" is the example matrix from the question:

mat = structure(c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, 
FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, 
FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, 
TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, 
TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, 
FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, 
FALSE, TRUE, TRUE, TRUE, FALSE, FALSE), .Dim = c(8L, 8L), .Dimnames = list(
    c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7", 
    "gene8"), c("Sample1", "Sample2", "Sample3", "Sample4", "Sample5", 
    "Sample6", "Sample7", "Sample8")))


Answer 2:

It is not very clear what the expected result would be. If "m1" is the initial logical matrix, create a subset of the matrix ("m2") that keeps only the rows with at least 4 TRUE values. If you need the column names of the TRUE elements in each row, loop over the rows using apply with "MARGIN=1":

m2 <- m1[rowSums(m1)>=4,]
apply(m2, 1, function(x) colnames(m2)[x])
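
One caveat: apply() returns a list only when the rows have different numbers of TRUE values; if every row had the same count, the result would be simplified to a character matrix. A sketch that always returns a named list, assuming "m1" is the matrix from the question (rebuilt here from per-gene sample indices; "idx" and "samples_by_gene" are made-up names):

```r
# Rebuild the question's matrix from per-gene sample indices.
idx <- list(gene1 = c(1, 3, 4, 5), gene2 = c(2, 4, 6, 7),
            gene3 = c(1, 2, 4, 6, 7), gene4 = c(3, 5, 8),
            gene5 = c(1, 2, 3, 4, 5, 7, 8), gene6 = c(3, 6, 7, 8),
            gene7 = c(1, 2, 5, 6), gene8 = c(1, 2, 3, 4))
m1 <- t(sapply(idx, function(i) seq_len(8) %in% i))
colnames(m1) <- paste0("Sample", 1:8)

# Keep rows with at least 4 TRUE values; drop = FALSE keeps the matrix
# shape even if only one row survives.
m2 <- m1[rowSums(m1) >= 4, , drop = FALSE]

# One character vector of sample names per gene, always returned as a list.
samples_by_gene <- lapply(
  setNames(rownames(m2), rownames(m2)),
  function(g) colnames(m2)[m2[g, ]]
)
samples_by_gene$gene1
```

In the example data, gene4 has only three TRUE values, so it is dropped by the row filter.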