Writing a loop to create ggplot figures with diffe

2020-06-17 04:16发布

问题:

I do not have experience with loops but it looks like I will need to create some of them to analyze my data properly. Could you show how to create a simple loop on the code which I already created ? Let's use looping to get some plots:

pdf(file = sprintf("complex I analysis", tbl_comp_abu1), paper='A4r')

ggplot(df_tbl_data1_comp1, aes(Size_Range, Abundance, group=factor(Gene_Name))) +
  theme(legend.title=element_blank()) +
  geom_line(aes(color=factor(Gene_Name))) +
  ggtitle("Data1 - complex I")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(df_tbl_data2_comp1, aes(Size_Range, Abundance, group=factor(Gene_Name))) +
  theme(legend.title=element_blank()) +
  geom_line(aes(color=factor(Gene_Name))) +
  ggtitle("Data2 - complex I")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


ggplot(df_tbl_data3_comp1, aes(Size_Range, Abundance, group=factor(Gene_Name))) +
  theme(legend.title=element_blank()) +
  geom_line(aes(color=factor(Gene_Name))) +
  ggtitle("Datas3 - complex I")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

dev.off()

The question now is what I would like to achieve. So first of all I have like 10 complexes to analyze so that means 10 pdf files should be created and the example shows plots from three different data sets for the complex one. To make it properly the number in variable comp1 (from df_tbl_dataX_comp1) has to be changed from 1 to 10 - depends which complex we want to plot. The next thing which has to be changed through the loop is the name of pdf file and each of graphs... Is it possible to write such loop ?

Data:

structure(list(Size_Range = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 
8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 
13L, 13L, 13L, 14L, 14L, 14L, 15L, 15L, 15L, 16L, 16L, 16L, 17L, 
17L, 17L, 18L, 18L, 18L, 19L, 19L, 19L, 20L, 20L, 20L), .Label = c("10", 
"34", "59", "84", "110", "134", "165", "199", "234", "257", "362", 
"433", "506", "581", "652", "733", "818", "896", "972", "1039"
), class = "factor"), Abundance = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 142733.475, 108263.525, 98261.11, 649286.165, 
3320759.803, 3708515.148, 6691260.945, 30946562.92, 180974.3725, 
4530005.805, 21499827.89, 0, 15032198.54, 4058060.583, 0, 3842964.97, 
2544030.857, 0, 1640476.977, 286249.1775, 0, 217388.5675, 1252965.433, 
0, 1314666.05, 167467.8825, 0, 253798.15, 107244.9925, 0, 207341.1925, 
15755.485, 0, 71015.85, 14828.5075, 0, 25966.2325, 0, 0, 0, 0, 
0, 0), Gene_Name = c("AT1G01080", "AT1G01090", "AT1G01320", "AT1G01420", 
"AT1G01470", "AT1G01560", "AT1G01800", "AT1G02150", "AT1G02500", 
"AT1G02560", "AT1G02780", "AT1G02880", "AT1G02920", "AT1G02930", 
"AT1G03030", "AT1G03090", "AT1G03110", "AT1G03130", "AT1G03220", 
"AT1G03230", "AT1G03330", "AT1G03475", "AT1G03630", "AT1G03680", 
"AT1G03870", "ATCG00420", "ATCG00470", "ATCG00480", "ATCG00490", 
"ATCG00500", "ATCG00650", "ATCG00660", "ATCG00670", "ATCG00740", 
"ATCG00750", "ATCG00842", "ATCG01100", "ATCG01030", "ATCG01114", 
"ATCG01665", "ATCG00770", "ATCG00780", "ATCG00800", "ATCG00810", 
"ATCG00820", "ATCG00722", "ATCG00744", "ATCG00855", "ATCG00853", 
"ATCG00888", "ATCG00733", "ATCG00766", "ATCG00812", "ATCG00821", 
"ATCG00856", "ATCG00830", "ATCG00900", "ATCG01060", "ATCG01110", 
"ATCG01120")), .Names = c("Size_Range", "Abundance", "Gene_Name"
), row.names = c(NA, -60L), class = "data.frame")

回答1:

This might do the trick: Initiate two loops, one for the complex iteration and a second for the dataset iteration. Then use paste0() or paste() to generate the correct filenames and headings.

PS.: I didn't test the code, since I dont have your data. But it should give you an idea.

#loop over complex    
for (c in 1:10) {

    #create pdf for every complex 
    pdf(file = paste0("complex", c, "analysis.pdf"), paper='A4r')

    #loop over datasets
    for(d in 1:3) {

    #plot
    ggplot(get(paste0("df_tbl_data",d,"_comp",c)), aes(Size_Range, Abundance, group=factor(Gene_Name))) +
      theme(legend.title=element_blank()) +
      geom_line(aes(color=factor(Gene_Name))) +
      ggtitle(paste0("Data",d," - complex ",c))+
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
    }   
    dev.off()

}


回答2:

So after making my answer, I realized it doesn't address the actual question about loops. However, I hope it shows you a different way of approaching your root problem (a.k.a I didn't want the work to go to waste).

I couldn't get your plot to work with the data you posted. There are 60 unique gene names in a 60-row data frame. When you try to make a geom_line and group by gene (aes(group=Gene_name)), you only have one point for each line. You need two points to make a line.

I made up some data and did an analysis.

# Function to generate random data
generate_data = function() {
  require(truncnorm)
  require(dplyr)

  gene_names = LETTERS[1:20]
  n_genes = length(gene_names)
  size_ranges = c(10, 34, 59, 84, 110, 134, 165, 199, 
                  234, 257, 362, 433, 506, 581, 652, 
                  733, 818, 896, 972, 1039)
  gene_size_means = rtruncnorm(n_genes, 10, 1000, 550, 300)
  genes_in_complex = rbinom(n_genes, 1, 0.3)
  true_variance = 50
  gene_size_variances = rchisq(n_genes, n_genes-1) * (true_variance/(n_genes-1))
  df = data.frame(gene_name=gene_names, 
                  gene_mean=gene_size_means, 
                  gene_var=gene_size_variances,
                  in_complex=genes_in_complex)
  df = df %>% group_by(gene_name) %>% 
    do(data.frame(size_ranges, 
                  abundance=dnorm(size_ranges, .$gene_mean, .$gene_var)*.$in_complex))
  return(df)
}

# Generate a list of tables. Each table is for one data set for one complex
data_tables = list()
n_comps = 3
for( complex_i in 1:2 ) {
  for( comp_j in 1:n_comps ) {
    loop_df = generate_data()
    loop_df$comp = comp_j
    loop_df$complex = complex_i
    data_tables = c(data_tables, list(loop_df))
  }
}

# Concatenate the tables into a larger data frame
dat = do.call(rbind, data_tables)

# Make a plots for each data set for complex 1
dat_complex1 = subset(dat, complex==1)
p = ggplot(dat_complex1, aes(x=size_ranges, y=abundance, color=gene_name, group=gene_name)) +
  geom_line() + 
  facet_wrap(~comp, ncol=1)
print(p)

# Make a plot with many subpanels for all complexes and data sets
p %+% dat + facet_grid(comp~complex) # screenshot shown below

So you're studying protein complexes in Arabidopsis? In case someone is familiar with your domain, a sentence of background might help them answering your question. Alternatively, a picture of the desired output could help. Also, some more complete example data and/or screenshots might generate more interest in your future posts.



回答3:

Have a look at this approach. It depends on a data.frame (dat) that contains the names of your datasets, the graph titles, as well as the file names.

First I create a function that creates the plot and saves it, then I call the function in a for-loop and also in an apply-loop (use apply where possible, its faster).

The code looks like this:

# create a custom function for ggplot, 
# which creates the plot and then saves it as a pdf
custom_ggplot_function <- function(input.data.name, graph.title, f.name){
  # get(input.data.name) gets you the variable which is stored as a string in
  # input.data.name

  p <- ggplot(get(input.data.name), aes(Size_Range, Abundance, group=factor(Gene_Name))) +
    theme(legend.title=element_blank()) +
    geom_line(aes(color=factor(Gene_Name))) +
    ggtitle(graph.title)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  ggsave(filename = paste0(f.name, ".pdf"), plot = p)
  NULL
}

# dat contains the names of your datasets, the titles of the graphs and filenames
dat <- data.frame(df.names = c("df_tbl_data1_comp1",
                              "df_tbl_data2_comp1"),
                  graph.titles = c("Data1 - Complex I",
                                   "Data2 - Complex II"),
                  file.names = c("file1", "file2"))
# If you create your data.frame dat, you can also say 
# df.names  = paste0("df_tbl_data", 1:10, "_comp1") and
# graph.titles = paste0("Data", 1:10, " - Complex ", 1:10)     


# loop through the rows of dat
for (i in 1:nrow(dat)) {
  custom_ggplot_function(input.data.name = dat[i, "df.names"],
                         graph.title = dat[i, "graph.titles"], 
                         f.name = dat[i, "file.names"])
}

# or using the apply function
apply(dat, 1, function(row.el) {
  custom_ggplot_function(input.data.name = row.el["df.names"], 
                         graph.title = row.el["graph.titles"], 
                         f.name = row.el["file.names"])
})


标签: r ggplot2