I have a large data file in the form:
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
I would like to make a distribution of the number of values in each column that are over 2.0. For example, Set_1 > 2 = 1, Set_2 > 2 = 0, Set_3 > 2 = 1. The issue is that each column has a "random" amount of missing data (NA), which distorts the distribution. It seems my only option is a distribution of percentages instead. For example: Set_1 > 2 = 1/6, Set_2 > 2 = 0/5, Set_3 > 2 = 1/6.

I would like to turn these percentages into a bell curve or binned histogram. Despite my example, the percentage of each column over 2 should be between 0.00% and 3.00%, so bins of size 0.05 would be nice. I would then like to plot my Input_SNP percentage on that distribution to get a p-value. Do you know how to do this in R? The data is currently available both as a data.frame and as a .csv file.
I tried:
hist(colSums(as.matrix(df) > 2))
but it did not work (I think because of the NAs). How can I take them into account?
My desired output is a histogram of the percentage of each column that is over 2. The histogram bins can be 0.05 wide.
Perhaps you could try this, assuming your data is in a data.frame called df.
In reality this is a three-step process. First,
result <- sapply(df, function(x) which(x > 2))
returns, for each column, the row indices of the values greater than 2. This is then inserted in an lapply() call, and finally the result is unlisted for the final form.
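A minimal sketch of the three steps, run on the example data from the question. It assumes that the lapply() step applies length() to each set of indices, which the text above does not state explicitly. It also uses lapply() rather than sapply() for the first step, so that the intermediate result stays a list even when every column happens to have the same number of hits; that swap is a choice made here, not part of the answer. The names idx, n_over and counts are illustrative.

```r
# Example data from the question; "NA" is read as a missing value.
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
")

# Step 1: row indices of the values above 2 in each column.
# which() silently drops NA comparisons, so missing values are ignored.
idx <- lapply(df, function(x) which(x > 2))

# Step 2 (assumed): count the indices found in each column.
n_over <- lapply(idx, length)

# Step 3: flatten the list into a named integer vector.
counts <- unlist(n_over)
print(counts)
```

On the example data this gives one count per column, including Input_SNP.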
If Input_SNP should not be part of the desired result, remove it from the df inside the sapply() call. The final step is then to turn the counts into proportions.
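One way this could look on the example data, as a sketch rather than the exact code of the answer. It assumes that "proportion" means the count of values above 2 divided by the number of non-missing values in the same column, as in the 1/6, 0/5, 1/6 example in the question. The names sets, n_obs and props are illustrative.

```r
# Example data from the question; "NA" is read as a missing value.
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
")

# Drop the first column (Input_SNP) before counting.
sets <- df[-1]

# Number of values above 2 in each remaining column.
counts <- unlist(lapply(lapply(sets, function(x) which(x > 2)), length))

# Number of non-missing values in each column (assumed denominator).
n_obs <- colSums(!is.na(sets))

# Proportion of non-missing values above 2, per column.
props <- counts / n_obs
print(round(props, 3))
```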
If you just want a histogram of the proportion of non-missing values > 2, you can compute the proportions with colMeans on df[,-1] > 2 and pass them to hist.
The df[,-1] removes the Input_SNP column, and we use colMeans on the boolean values to get proportions.
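A sketch of that approach on the example data. The na.rm = TRUE argument is an assumption here: it is what makes colMeans divide by the number of non-missing values only. The breaks (bins of width 0.05 on a 0 to 1 proportion scale) and the dashed line marking the Input_SNP proportion are also illustrative choices, added because the question asks for them; adjust the breaks to the range of the real data.

```r
# Example data from the question; "NA" is read as a missing value.
df <- read.table(header = TRUE, text = "
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6
1.09 0.162 NA 2.312 1.876 0.12 0.812
0.687 NA 0.987 1.32 1.11 1.04 NA
NA 1.890 0.923 1.43 0.900 2.02 2.7
2.801 0.642 0.791 0.812 NA 0.31 1.60
1.33 1.33 NA 1.22 0.23 0.18 1.77
2.91 1.00 1.651 NA 1.55 3.20 0.99
2.00 2.31 0.89 1.13 1.25 0.12 1.55
")

# df[, -1] > 2 is a logical matrix; the mean of TRUE/FALSE is a proportion.
# na.rm = TRUE drops the NAs so each column is divided by its non-missing count.
props <- colMeans(df[, -1] > 2, na.rm = TRUE)

# Histogram of the per-column proportions with bins of width 0.05.
h <- hist(props, breaks = seq(0, 1, by = 0.05),
          main = "Proportion of values > 2 per set",
          xlab = "Proportion of non-missing values > 2")

# Mark where the Input_SNP column falls on the same scale.
input_prop <- mean(df$Input_SNP > 2, na.rm = TRUE)
abline(v = input_prop, col = "red", lty = 2)
```

The reason the original colSums attempt fails is that colSums returns NA for any column containing an NA unless na.rm = TRUE is set, and raw counts would in any case not be comparable across columns with different amounts of missing data.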