Plot line and bar graph (with secondary axis for l

2020-08-01 06:50发布

问题:

Problem

I have just started R two days back. I have gone through some basic R tutorials and I am able to plot two dimensional data. I pull data from an Oracle database. Now, I am having problems when I try to merge two graph types (Line and Bar) using secondary axis.

I have no problem, plotting this data on Excel. Following is the plot:

I am unable to plot it on R. I searched some related examples but I am unable to tweak it as per my requirements (Combining Bar and Line chart (double axis) in ggplot2)

Code

Following is the code I am using to plot bar and line graphs separately:

Bar:

p <- ggplot(data = df, aes(x = MONTHS, y = BASE)) + 
    geom_bar(stat="identity") + 
    theme_minimal() +
    geom_text(aes(label = BASE), vjust = 1.6, color = "White", size = 2.5)

Line:

p1 <- ggplot(data = df, aes(x = MONTHS, y = df$INTERNETPERCENTAGE, group = 1)) + 
    geom_line() + 
    geom_point()

Data

Update: I have the following data (raw data cleansed of "," and "%" signs):

> dput(head(df,20))
structure(list(MONTHS = structure(c(11L, 10L, 3L, 5L, 4L, 8L, 
1L, 9L, 7L, 6L, 2L, 13L, 12L), .Label = c("Apr-18", "Aug-18", 
"Dec-17", "Feb-18", "Jan-18", "Jul-18", "Jun-18", "Mar-18", "May-18", 
"Nov-17", "Oct-17", "Oct-18", "Sep-18"), class = "factor"), BASE = c(40756228L, 
41088219L, 41642601L, 42017111L, 42439446L, 42847468L, 43375319L, 
43440484L, 43464735L, 43326823L, 43190949L, 43015301L, 42780071L
), INTERNETUSERGREATERTHAN0KB = c(13380576L, 13224502L, 14044105L, 
14239169L, 14011423L, 14736043L, 14487827L, 14460410L, 14632695L, 
14896654L, 15019329L, 14141766L, 14209288L), INTERNETPERCENTAGE = c(33L, 
32L, 34L, 34L, 33L, 34L, 33L, 33L, 34L, 34L, 35L, 33L, 33L), 
    SMARTPHONE = c(11610216L, 11875033L, 12225965L, 12412010L, 
    12760251L, 12781082L, 13142400L, 13295826L, 13422476L, 13408216L, 
    13504339L, 13413596L, 13586438L), SMARTPHONEPERCENTAGE = c(28L, 
    29L, 29L, 30L, 30L, 30L, 30L, 31L, 31L, 31L, 31L, 31L, 32L
    ), INTERNETUSAGEGREATERTHAN0KB4G = c(829095L, 969531L, 1181411L, 
    1339620L, 1474300L, 1733027L, 1871816L, 1967129L, 2117418L, 
    2288215L, 2453243L, 2624865L, 2817199L)), row.names = c(NA, 
13L), class = "data.frame")

回答1:

ggplot is a "high-level" plotting library, meaning that it is built to express clear relationships in data, rather than a simple system for drawing shapes. One of its fundamental assumptions is that secondary or dual data axes are usually a bad idea; such figures plot more than one relationship into the same space, with no guarantee that the two axes actually share a meaningful relationship (see for example spurious correlations).

All that said, ggplot does indeed have the ability to define secondary axes, although using it for the purpose you describe is intentionally difficult. One way to accomplish your goal would be to split your data set into two separate ones, then plot these in the same ggplot object. It's certainly possible, but take note of how much extra code is required to produce the effect you're after.

library(tidyverse)
library(scales)

df.base <- df[c('MONTHS', 'BASE')] %>% 
  mutate(MONTHS = factor(MONTHS, MONTHS, ordered = T))

df.percent <- df[c('MONTHS', 'INTERNETPERCENTAGE', 'SMARTPHONEPERCENTAGE')] %>% 
  gather(variable, value, -MONTHS)

g <- ggplot(data = df.base, aes(x = MONTHS, y = BASE)) +
  geom_col(aes(fill = 'BASE')) +
  geom_line(data = df.percent, aes(x = MONTHS, y = value / 40 * 12500000 + 33500000, color = variable, group = variable)) +
  geom_point(data = df.percent, aes(x = MONTHS, y = value / 40 * 12500000 + 33500000, color = variable)) +
  geom_label(data = df.percent, aes(x = MONTHS, y = value / 40 * 12500000 + 33500000, fill = variable, label = sprintf('%i%%', value)), color = 'white', vjust = 1.6, size = 4) +
  scale_y_continuous(sec.axis = sec_axis(~(. - 33500000) / 12500000 * 40, name = 'PERCENT'), labels = comma) +
  scale_fill_manual(values = c('lightblue', 'red', 'darkgreen')) +
  scale_color_manual(values = c('red', 'darkgreen')) +
  coord_cartesian(ylim = c(33500000, 45500000)) +
  labs(fill = NULL, color = NULL) +
  theme_minimal()
print(g)



回答2:

Note that my answer is based on your original "un-cleaned" data (which I attach at the bottom of my post).

The key here is to transform percentage values such that they use the same range as BASE. We then apply the inverse of the transformation to show the original percentage values as a second y-axis.

A (personal) word of caution: Secondary axes are often not a good idea. Personally, I would use facets or two separate graphs to avoid confusion and overloading of the graph. Also note that Hadley himself is not a fan of dual y axes, so the ggplot2 support of dual axes is (sensibly) limited.

That aside, here is an option:

  1. First off, let's clean-up the data (remove the thousands separator, percentage sign etc.).

    library(tidyverse)
    df.clean <- df %>%
        mutate_if(is.factor, as.character) %>%
        gather(USAGE, PERCENTAGE, INTERNETPERCENTAGE, SMARTPHONEPERCENTAGE) %>%
        mutate(
            MONTHS = factor(MONTHS, levels = df$MONTHS),
            BASE = as.numeric(str_replace_all(BASE, ",", "")),
            PERCENTAGE = as.numeric(str_replace(PERCENTAGE, "%", "")))
    
  2. We now calculate the transformation coefficients:

    y1 <- min(df.clean$BASE)
    y2 <- max(df.clean$BASE)
    x1 <- min(df.clean$PERCENTAGE)
    x2 <- max(df.clean$PERCENTAGE)
    b <- (y2 - y1) / (x2 - x1)
    a <- y1 - b * x1
    
  3. Now for the plotting:

    df.clean %>%
        mutate(perc.scaled = a + b * PERCENTAGE) %>%
        ggplot(aes(MONTHS, BASE)) +
        geom_col(
            data = df.clean %>% distinct(MONTHS, .keep_all = TRUE),
            aes(MONTHS, BASE), fill = "dodgerblue4") +
        geom_point(aes(MONTHS, perc.scaled, colour = USAGE, group = USAGE)) +
        geom_line(aes(MONTHS, perc.scaled, colour = USAGE, group = USAGE)) +
        geom_label(
            aes(MONTHS, perc.scaled, label = PERCENTAGE, fill = USAGE),
            vjust = 1.4,
            show.legend = F) +
        scale_y_continuous(
                name =  "BASE",
                sec.axis = sec_axis(~ (. - a) / b, name = "Percentage")) +
        coord_cartesian(ylim = c(0.99 * min(df.clean$BASE), max(df.clean$BASE))) +
        theme_minimal() +
        theme(legend.position = "bottom")
    


Sample data

df <- structure(list(MONTHS = structure(c(11L, 10L, 3L, 5L, 4L, 8L,
1L, 9L, 7L, 6L, 2L, 13L, 12L), .Label = c("APR-18", "AUG-18",
"DEC-17", "FEB-18", "JAN-18", "JUL-18", "JUN-18", "MAR-18", "MAY-18",
"NOV-17", "OCT-17", "OCT-18", "SEP-18"), class = "factor"), BASE = structure(c(1L,
2L, 3L, 4L, 5L, 7L, 11L, 12L, 13L, 10L, 9L, 8L, 6L), .Label = c("40,756,228",
"41,088,219", "41,642,601", "42,017,111", "42,439,446", "42,780,071",
"42,847,468", "43,015,301", "43,190,949", "43,326,823", "43,375,319",
"43,440,484", "43,464,735"), class = "factor"), INTERNETUSERGREATERTHAN0KB = structure(c(2L,
1L, 4L, 7L, 3L, 11L, 9L, 8L, 10L, 12L, 13L, 5L, 6L), .Label = c("13,224,502",
"13,380,576", "14,011,423", "14,044,105", "14,141,766", "14,209,288",
"14,239,169", "14,460,410", "14,487,827", "14,632,695", "14,736,043",
"14,896,654", "15,019,329"), class = "factor"), INTERNETPERCENTAGE = structure(c(2L,
1L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 4L, 2L, 2L), .Label = c("32%",
"33%", "34%", "35%"), class = "factor"), SMARTPHONE = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 11L, 9L, 12L, 10L, 13L), .Label = c("11,610,216",
"11,875,033", "12,225,965", "12,412,010", "12,760,251", "12,781,082",
"13,142,400", "13,295,826", "13,408,216", "13,413,596", "13,422,476",
"13,504,339", "13,586,438"), class = "factor"), SMARTPHONEPERCENTAGE = structure(c(1L,
2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L), .Label = c("28%",
"29%", "30%", "31%", "32%"), class = "factor"), INTERNETUSAGEGREATERTHAN0KB4G = structure(c(12L,
13L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("1,181,411 ",
"1,339,620 ", "1,474,300 ", "1,733,027 ", "1,871,816 ", "1,967,129 ",
"2,117,418 ", "2,288,215 ", "2,453,243 ", "2,624,865 ", "2,817,199 ",
"829,095 ", "969,531 "), class = "factor")), row.names = c(NA,
13L), class = "data.frame")


回答3:

You need to have a transformation factor that resembles the ratio of maximum y-axis 1 vs. maximum y-axis 2. Here, the secondary y-axis should be rougly 100,000 times smaller than the primary y-axis. Therefore:

Code

ggplot(df) + 
    geom_col(aes(x = MONTHS, y = BASE)) +
    # apply transformation factor to line plot
    geom_line(aes(x = MONTHS, y = INTERNETPERCENTAGE/0.000001, group = 1), 
              color = "red", size = 1) +
    theme_minimal() +
    geom_text(aes(x = MONTHS, y = BASE, label=BASE), 
              vjust=1.6, color="White", size=2.5) +
    # add secondary y-axis that is 100,000 times smaller
    scale_y_continuous(sec.axis = sec_axis(~.*0.000001, name = "Internet Percentage in %")) +
    labs(y = "Base", x = "Months")

Data

df <- structure(list(MONTHS = structure(c(17440, 17471, 17501, 17532, 17563, 17591, 17622, 17652, 17683, 17713, 17744, 17775, 17805), class = "Date"), BASE = c(40756228L, 41088219L, 41642601L, 42017111L, 42439446L, 42847468L, 43375319L, 43440484L, 43464735L, 43326823L, 43190949L, 43015301L, 42780071L), INTERNETUSERGREATERTHAN0KB = c(13380576L, 13224502L, 14044105L, 14239169L, 14011423L, 14736043L, 14487827L, 14460410L, 14632695L, 14896654L, 15019329L, 14141766L, 14209288L), INTERNETPERCENTAGE = c(33L, 32L, 34L, 34L, 33L, 34L, 33L, 33L, 34L, 34L, 35L, 33L, 33L), SMARTPHONE = c(11610216L, 11875033L, 12225965L, 12412010L, 12760251L, 12781082L, 13142400L, 13295826L, 13422476L, 13408216L, 13504339L, 13413596L, 13586438L), SMARTPHONEPERCENTAGE = c(28L, 29L, 29L, 30L, 30L, 30L, 30L, 31L, 31L, 31L, 31L, 31L, 32L), INTERNETUSAGEGREATERTHAN0KB4G = c(829095L, 969531L, 1181411L, 1339620L, 1474300L, 1733027L, 1871816L, 1967129L, 2117418L, 2288215L, 2453243L, 2624865L, 2817199L)), row.names = c(NA, 13L), class = "data.frame")

Explanation

The secondary y-axis is just visual. ggplot plots our geom_line() on the first y-axis (with values roughly around 33,000,000). The secondary y-axis is added later. You can see that if you check out

> ggplot_build(p)[[1]][[2]]
          y     x group PANEL colour size linetype alpha
1  33000000 17440     1     1    red    1        1    NA
2  32000000 17471     1     1    red    1        1    NA
3  34000000 17501     1     1    red    1        1    NA
4  34000000 17532     1     1    red    1        1    NA
5  33000000 17563     1     1    red    1        1    NA
6  34000000 17591     1     1    red    1        1    NA
7  33000000 17622     1     1    red    1        1    NA
8  33000000 17652     1     1    red    1        1    NA
9  34000000 17683     1     1    red    1        1    NA
10 34000000 17713     1     1    red    1        1    NA
11 35000000 17744     1     1    red    1        1    NA
12 33000000 17775     1     1    red    1        1    NA
13 33000000 17805     1     1    red    1        1    NA


标签: r ggplot2