ggplot: how to limit output in bar plot so only mo

2019-05-09 13:16发布

问题:

I have been searching for this simple thing for hours now, but to no avail. I have a dataframe with one of the columns the variable "country". I want two things the following:

  • Plot the most frequent countries, most frequent on top (partial solution found EDIT full solution found >> focus question on limiting output in bar plot based on frequency);
  • Only show the top x "most frequent" countries, moving the rest into 'Other' variable.

I tried to ggplot table() or summary() but that does not work. Is it even possible within ggplot, or should I use barchart (I managed to do this using barchart, just using summary(df$something) and adding max = x). I also wanted to stack the output (different questions about country).

Most frequent countries on top:

ggplot(aDDs,aes(x=
                  factor(answer,
                         levels=names(sort(table(answer),increasing=TRUE))
                         ),fill=question
                )
      ) + geom_bar() + coord_flip()

Suggestions are very very welcome.

====== EDIT3: I continued working on the code based on the suggestion by @CMichael, but now encountered another, quite strange, thing. Because this 'ifelse' problem concerns a slightly one question than my original one, I have posted a separate question for this matter. Please check it here: R: ifelse function returns vector position instead of value (string)

====== EDIT:

The aDDs example is reproduced below - aDDs dataset can be downloaded here:

temp <- structure(list(student = c(2270285L, 2321254L, 75338L, 2071594L,1682771L, 1770356L, 2155693L, 3154864L, 3136979L, 2082311L),answer = structure(c(181L, 87L, 183L, 89L, 115L, 183L, 172L,180L, 175L, 125L), .Label = c("Congo", "Guinea-Bissau", "Solomon Islands","Central African Rep", "Comoros", "Equatorial Guinea", "Liechtenstein","Nauru", "Brunei", "Djibouti", "Kiribati", "Papua New Guinea","Samoa", "South Sudan", "Tajikistan", "Tonga", "Bhutan","Gabon", "Laos", "Lesotho", "Maldives", "Micronesia", "St Kitts and Nevis","Mozambique", "Niger", "Andorra", "Cape Verde", "Mauritania","Antigua and Deps", "Chad", "Guinea", "Malta", "Burundi","Eritrea", "Iceland", "Kyrgyzstan", "Turkmenistan", "Azerbaijan","Dominica", "Belize", "Malawi", "Mali", "Moldova", "Benin","Cuba", "Gambia", "Luxembourg", "St Lucia", "Angola", "Cambodia","Georgia", "Madagascar", "Oman", "Kosovo", "Kuwait", "Namibia","Bahrain", "Congo - Democratic Rep", "Montenegro", "Senegal","Sierra Leone", "Togo", "Botswana", "Fiji", "Libya", "Uzbekistan","Guyana", "Mongolia", "Somalia", "Zambia", "Estonia", "Ivory Coast","Myanmar", "Grenada", "Qatar", "Saint Vincent and the Grenadines","Tanzania", "Armenia", "Bahamas", "Belarus", "Burkina", "Liberia","Afghanistan", "Latvia", "Yemen", "Mauritius", "Albania","Barbados", "Iraq", "Macedonia", "Nicaragua", "Panama", "Slovenia","Lebanon", "Slovakia", "Kazakhstan", "Paraguay", "Korea South","Suriname", "Czech Republic", "Rwanda", "Haiti", "Lithuania","Israel", "Zimbabwe", "Cyprus", "Honduras", "Uruguay", "Syria","Finland", "Tunisia", "Taiwan", "Uganda", "Denmark", "Austria","Sri Lanka", "Vietnam", "Bosnia Herzegovina", "Thailand","Norway", "Trinidad and Tobago", "Switzerland", "Nepal","Sudan", "Jamaica", "Japan", "United Arab Emirates", "Bolivia","New Zealand", "Ethiopia", "Jordan", "Cameroon", "Croatia","Sweden", "Kenya", "Singapore", "Guatemala", "Ireland Republic","Saudi Arabia", "Bulgaria", "Malaysia", "Belgium", "Dominican Republic","Algeria", "El Salvador", "Bangladesh", "Serbia", "Ghana","Costa Rica", "Indonesia", "Hungary", "Venezuela", "Ecuador","Ukraine", "Romania", "Turkey", "China", "Morocco", "Russian Federation","Peru", "South Africa", "Argentina", "Portugal", "Iran","Poland", "Italy", "Chile", "France", "Germany", "Australia","Philippines", "Egypt", "Greece", "Nigeria", "Canada", "Pakistan","United Kingdom", "Mexico", "Colombia", "Brazil", "Netherlands","Spain", "India", "United States"), class = "factor"), question = c("C1-pres","C1-pres", "C1-pres", "C1-pres", "C1-pres", "C1-pres", "C1-pres","B1-pres", "B1-pres", "B1-pres")), .Names = c("student","answer", "question"), row.names = c("156", "203", "280", "347","412", "478", "534", "1649651", "1649691", "1649763"), class = "data.frame")

回答1:

For the filtering question you should introduce a new column:

data$filteredCountry = ifelse(data$value > threshold, data$country, "other")

Now you can use filteredCountry as your x in the aesthetics.

The data ordering question pops up every now and then (e.g., ggplot2: sorting a plot). You need to order your country factor levels by the underlying values. Your reorder command seems to sort by country name again, I would expect something like reorder(country,frequency) but sample data would help.

UPDATE: With the now provided data it becomes obvious that you need to create summary dataset:

data <- read.table("aDDs.csv",sep=",",header=T)
require(plyr)
summary <- ddply(data,.(answer),summarise,freq=length(answer))

This yields the data frame summary with one entry for each country (181 in total). Now you can do the filtering and the reordering:

threshold = quantile(summary$freq,0.9)
summary $filteredCountry = ifelse(summary$freq > threshold, summary$answer, "other")
summary$filteredCountry = reorder(summary$filteredCountry,-summary$freq)

Now you can plot:

require(ggplot2)
p=ggplot(data=summary,aes(x=filteredCountry,y=freq))
p = p+geom_bar(aes(fill=filteredCountry),stat="identity")
p


回答2:

Thanks to suggestions from @CMichael and answers to another - related - post here on SO. I managed to create a stacked and ordered bar plot using ggplot:

create a list with most frequent country names

temp <- row.names(as.data.frame(summary(aDDs$answer, max=12))) # create a df or something else with the summary output.
aDDs$answer <- as.character(aDDs$answer) # IMPORTANT! Here was the problem: turn into character values

create new column that filters top results

aDDs$top <- ifelse(
        aDDs$answer %in% temp, ## condition: match aDDs$answer with row.names in summary df 
        aDDs$answer, ## then it should be named as aDDs$answer
        "Other" ## else it should be named "Other"
      )
aDDs$top <- as.factor(aDDs$top) # factorize the output again

plot

ggplot(aDDs,aes(x=
                  factor(top,
                         levels=names(sort(table(top),increasing=TRUE))
                  ),fill=question
                )
                ) + geom_bar() + coord_flip()

And here the output (still needs some tweaking, but it is what I wanted):



标签: r ggplot2