Using R to cluster based on euclidean distance and

Posted 2020-04-21 06:06

Question:

I am trying to read a counts matrix into R and then cluster it using Euclidean distance and complete linkage. The original matrix has 56,000 rows (genes) and 7 columns (treatments). I want to see whether there is a clustering relationship between the treatments. However, every time I try this, I get the error "Error: cannot allocate vector of size 544.4 Gb". Since I'm trying to reproduce work that has been published by someone else, I am wondering whether I am making a mistake in my initial data entry.

Second, if I try the same clustering with just 20 of the 56,000 genes, I can produce a dendrogram, but its branches are not the experimental samples. The paper I am trying to replicate shows a dendrogram whose leaves are the samples.

Here is the code I am trying to run:

exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(matrix(exprs),method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)

And here is a sample of my data table:

    AGS KATOIII MKN45   N87 SNU1    SNU5    SNU16
1_DDR1  11.18467721 11.91358171 11.81568242 11.08565284 8.054326631 12.46899188 10.54972491
2_RFC2  9.19869822  9.609015734 8.925772678 8.3641799   8.550993726 10.32160527 9.421779056
3_HSPA6 6.455324139 6.088320986 7.949175048 6.128573129 6.113793411 6.317460116 7.726657567
4_PAX8  8.511225092 8.719103196 8.706242048 8.705618546 8.696547633 9.292782564 8.710369119
5_GUCA1A    3.773404228 3.797729793 3.574286779 3.848753216 3.684193193 3.66065606  3.88239872
6_UBA7  6.477543321 6.631538303 6.506133756 6.433793116 6.145507918 6.92197071  6.479113995
7_THRA  6.263090367 6.507397854 6.896879084 6.696356125 6.243160864 6.936051147 6.444444498
8_PTPN21    6.88050894  6.342007735 6.55408163  6.099950167 5.836763044 5.904301086 6.097067306
9_CCL5  6.197989448 4.00619542  4.445053893 7.350765625 3.892650264 7.140038596 4.123639647
10_CYP2E1   4.379433632 4.867741561 4.719912827 4.547433566 6.530890968 4.187701905 4.453267508
11_EPHB3    6.655231606 7.984278173 7.025962652 7.111129175 6.246989328 6.169529157 6.546374446
12_ESRRA    8.675023046 9.270153715 8.948209029 9.412638347 9.4470612   9.98312055  9.534236722
13_CYP2A6   6.834018146 7.18386746  6.826740822 7.244411918 6.744588768 6.715122111 7.302922762
14_SCARB1   8.856802264 8.962211232 8.975200168 9.710291176 9.120002571 10.29588004 10.55749325
15_TTLL12   8.659539601 9.93935462  8.309244963 9.21145716  9.792647852 10.46958091 10.51879844
16_LINC00152    5.108632654 4.906321384 4.958158343 5.315532543 5.456138001 5.242577092 5.180295902
17_WFDC2    5.595843025 5.590991341 5.776102664 5.622086284 5.273603946 5.304240608 5.573746302
18_MAPK1    6.970036434 5.739881305 4.927993642 5.807358161 7.368137365 6.17697538  5.985006279
19_MAPK1    8.333269232 8.758733916 7.855324572 9.03596893  7.808283302 7.675434022 7.450262521
20_ADAM32   4.075355477 4.216259982 4.653654879 4.250333684 4.648194266 4.250333684 4.114286071

The rows are genes (e.g., 1_DDR1, 2_RFC2) and the columns are experimental samples (e.g., AGS, KATOIII). I want to see how the samples relate to one another in the clustering.

Here is the dendrogram that my code produces. I expected it to show only 7 branches, one for each of my 7 samples:

The paper's dendrogram (including these 8 samples and many more as well) is below:

Thanks for any help you can provide!

Answer 1:

You're running out of RAM. That's it. You can't allocate a vector that exceeds your available memory. Move to a computer with more memory, or maybe try the bigmemory package (I've never tried it).

https://support.bioconductor.org/p/53848/
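
For a sense of scale: dist() stores one double (8 bytes) for every pair of rows, so n rows need roughly n(n-1)/2 × 8 bytes. The following is only a rough sketch, using the approximate 56,000 × 7 shape from the question; the helper name dist_gib is made up for illustration.

```r
# Approximate size, in GiB, of the object dist() returns for n rows:
# one 8-byte double per unordered pair of rows.
dist_gib <- function(n) {
  n * (n - 1) / 2 * 8 / 2^30
}

n_genes   <- 56000  # approximate row count from the question
n_samples <- 7

dist_gib(n_genes)              # distances between genes (rows as read in)
dist_gib(n_genes * n_samples)  # every value treated as its own row
dist_gib(n_samples)            # distances between the 7 samples only
```

The second figure comes out in the hundreds of GiB, the same order of magnitude as the error message, while the third is negligible.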



Answer 2:

In case anybody was wondering, the answer to my second question is below. I was calling matrix() on an object that was already a matrix, which flattened it into a single column and scrambled the data. The following code works now!

exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
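
A small illustration of what went wrong, using a made-up 3 × 2 matrix rather than the real data: calling matrix() on an existing matrix without nrow or ncol discards its shape and names and returns a single column, so dist() then treats every individual value as a separate observation.

```r
# Toy data standing in for the expression table (values are invented).
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

dim(m)          # 3 rows, 2 columns
dim(matrix(m))  # 6 rows, 1 column: shape and dimnames are gone

# The number of observations dist() sees is stored in the "Size" attribute.
attr(dist(m), "Size")          # one per row of m
attr(dist(matrix(m)), "Size")  # one per individual value
```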


Answer 3:

Do you want to cluster on columns (detect similarities between treatments) or on rows (detect similarities between genes)? It sounds like you want the former, given that you're expecting 7 dendrogram branches for 7 treatments.

If so, you need to transpose your dataset. dist computes distances between rows, not columns, so without the transpose you are comparing genes rather than treatments.

Once you've done the transpose, your clustering should take no time at all, and minimal memory.
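
A minimal sketch of that suggestion, assuming the same layout as in the question (genes in rows, samples in columns). A small random matrix stands in for the real table so the snippet runs on its own; the sample names are taken from the question, and the gene names and values are invented.

```r
set.seed(1)
samples <- c("AGS", "KATOIII", "MKN45", "N87", "SNU1", "SNU5", "SNU16")

# Stand-in for the real data: 20 genes (rows) x 7 samples (columns).
exprs <- matrix(rnorm(20 * length(samples), mean = 7), nrow = 20,
                dimnames = list(paste0("gene", 1:20), samples))

# t() puts samples in rows, so dist() compares samples rather than genes.
eucl_dist <- dist(t(exprs), method = "euclidean")
hie_clust <- hclust(eucl_dist, method = "complete")

hie_clust$labels  # the 7 sample names
plot(hie_clust)   # dendrogram with one leaf per sample
```

With the real file, exprs would come from the read.table call in the question, and only the t() in the dist call changes.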