R distribution plot with NA data and thresholds

I have a large data file in the form:

Input_SNP   Set_1    Set_2     Set_3     Set_4     Set_5     Set_6
1.09        0.162    NA        2.312     1.876     0.12      0.812
0.687       NA       0.987     1.32      1.11      1.04      NA
NA          1.890    0.923     1.43      0.900     2.02      2.7
2.801       0.642    0.791     0.812     NA        0.31      1.60
1.33        1.33     NA        1.22      0.23      0.18      1.77
2.91        1.00     1.651     NA        1.55      3.20      0.99
2.00        2.31     0.89      1.13      1.25      0.12      1.55

I would like to plot the distribution of the number of values in each column that are over 2.0. For example, Set_1 > 2 = 1, Set_2 > 2 = 0, Set_3 > 2 = 1. The issue is that each column has a "random" amount of missing data (NA), which distorts the raw counts. It seems my only option is to work with proportions of the non-missing values instead. For example: Set_1 > 2 = 1/6, Set_2 > 2 = 0/5, Set_3 > 2 = 1/6. I would like to plot these proportions as a bell curve or binned histogram. Despite my example, the percentage of values over 2 in each column of the real data should be between 0.00% and 3.00%, so bins of size 0.05 would be nice. I would then like to place my Input_SNP percentage on that distribution to get a p-value. Do you guys know how to do this in R? The data is currently available both as a data.frame and as a .csv file.

I tried hist(colSums(as.matrix(df) > 2)), but that did not work (I think because of the NAs). How can I account for them?

My desired output is a histogram of the percentage of values in each column that are over 2. The bin width can be 0.05.

Tags: r dataframe distribution

2 Answers

Root（大扎）

#2 · 2019-02-28 01:54

Perhaps you could try this, assuming your data is in a data.frame called df:

result <- unlist(lapply(sapply(df, function(x) which(x>2)), function(x) length(x)))
result
#Input_SNP     Set_1     Set_2     Set_3     Set_4     Set_5     Set_6 
#    2         1         0         1         0         2         1

In reality this is a three-step process. First, result <- sapply(df, function(x) which(x>2)) gives you the row indices of the values over 2 in each column, with the following structure:

#List of 7
#$ Input_SNP: int [1:2] 4 6
#$ Set_1    : int 7
#$ Set_2    : int(0) 
#$ Set_3    : int 1
#$ Set_4    : int(0) 
#$ Set_5    : int [1:2] 3 6
#$ Set_6    : int 3

This list is then passed to lapply(), which takes the length of each element:

lapply(result, function(x) length(x))

This gives the following structure:

#List of 7
#$ Input_SNP: int 2
#$ Set_1    : int 1
#$ Set_2    : int 0
#$ Set_3    : int 1
#$ Set_4    : int 0
#$ Set_5    : int 2
#$ Set_6    : int 1

Finally this is unlisted for the final form.
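As a possible shortcut, the same counts can be obtained in one call with colSums() and na.rm = TRUE. The sketch below rebuilds the example data from the question (the data.frame construction is my own, so that the snippet runs on its own) and assumes every column is numeric. It also sidesteps a quirk of the sapply() route: if every column happened to have the same number of matches, sapply() would simplify its result to a vector or matrix instead of a list, and the later lapply() step would no longer see one element per column.

```r
# Example data copied from the question; NA marks missing values.
df <- data.frame(
  Input_SNP = c(1.09, 0.687, NA, 2.801, 1.33, 2.91, 2.00),
  Set_1     = c(0.162, NA, 1.890, 0.642, 1.33, 1.00, 2.31),
  Set_2     = c(NA, 0.987, 0.923, 0.791, NA, 1.651, 0.89),
  Set_3     = c(2.312, 1.32, 1.43, 0.812, 1.22, NA, 1.13),
  Set_4     = c(1.876, 1.11, 0.900, NA, 0.23, 1.55, 1.25),
  Set_5     = c(0.12, 1.04, 2.02, 0.31, 0.18, 3.20, 0.12),
  Set_6     = c(0.812, NA, 2.7, 1.60, 1.77, 0.99, 1.55)
)

# df > 2 is a logical matrix in which missing values stay NA;
# na.rm = TRUE drops them, so each column sum counts the values over 2.
counts <- colSums(df > 2, na.rm = TRUE)

# Cross-check against the which()/length() approach, column by column.
counts_which <- sapply(df, function(x) length(which(x > 2)))

counts
```

Both counts and counts_which should agree with the result shown at the top of this answer.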

If Input_SNP should not be part of the desired result, remove it from the df inside the sapply() and store the counts again, like so:

result <- unlist(lapply(sapply(df[,-1], function(x) which(x>2)), function(x) length(x)))
result
#Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 
#    1     0     1     0     2     1

Finally, for the proportions, divide these counts by the number of non-missing values in each column (result must be the six-column version above, so that its length matches):

result/colSums(!is.na(df[,-1]))
#    Set_1     Set_2     Set_3     Set_4     Set_5     Set_6 
#0.1666667 0.0000000 0.1666667 0.0000000 0.2857143 0.1666667


我只想做你的唯一

#3 · 2019-02-28 02:09

If you just want a histogram of the proportion of non-missing values > 2 in each column, you can do:

hist(colMeans(as.matrix(df[,-1]) > 2, na.rm=TRUE))

The df[,-1] removes the Input_SNP column, and colMeans() on the logical values gives the proportions; na.rm=TRUE means each proportion is taken over the non-missing values only.
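To go one step further toward what the question asks for (fixed-width bins and a p-value for Input_SNP), here is a minimal sketch. Several things in it are my assumptions rather than anything stated in the thread: the data.frame is rebuilt from the example in the question, the bins are 0.05 wide on the proportion scale from 0 to 1 (the real data, which is said to sit between 0% and 3%, would need breaks on that scale instead), and the "p-value" is simply the fraction of Set columns whose proportion is at least as large as that of Input_SNP. Whether that empirical definition is appropriate depends on the null model you have in mind.

```r
# Example data copied from the question; NA marks missing values.
df <- data.frame(
  Input_SNP = c(1.09, 0.687, NA, 2.801, 1.33, 2.91, 2.00),
  Set_1     = c(0.162, NA, 1.890, 0.642, 1.33, 1.00, 2.31),
  Set_2     = c(NA, 0.987, 0.923, 0.791, NA, 1.651, 0.89),
  Set_3     = c(2.312, 1.32, 1.43, 0.812, 1.22, NA, 1.13),
  Set_4     = c(1.876, 1.11, 0.900, NA, 0.23, 1.55, 1.25),
  Set_5     = c(0.12, 1.04, 2.02, 0.31, 0.18, 3.20, 0.12),
  Set_6     = c(0.812, NA, 2.7, 1.60, 1.77, 0.99, 1.55)
)

# Proportion of non-missing values over 2 in every column.
props <- colMeans(df > 2, na.rm = TRUE)

# Split the observed column from the reference columns.
input_prop <- props[["Input_SNP"]]
set_props  <- props[names(props) != "Input_SNP"]

# Histogram of the reference proportions with 0.05-wide bins (assumed scale).
hist(set_props,
     breaks = seq(0, 1, by = 0.05),
     main = "Proportion of values > 2 per Set column",
     xlab = "Proportion of non-missing values > 2")

# Mark where Input_SNP falls on that distribution.
abline(v = input_prop, col = "red", lty = 2)

# Assumed empirical p-value: share of Set columns at least as extreme.
p_empirical <- mean(set_props >= input_prop)
p_empirical
```

With only six reference columns the resulting value is very coarse; it becomes more meaningful with the large number of sets described in the question.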

