Subsetting DataFrame in R by duplicate values for

2019-07-13 01:51发布

问题:

I have a data frame which looks like this

> fitchRatings
               Country Month Year FitchLongTerm LongTermTransformed
1            Abu Dhabi     7 2007            AA                  22
2               Angola     5 2012           BB-                  12
3               Angola     5 2011           BB-                  12
4               Angola     5 2010            B+                  11
5            Argentina     7 2010             B                  10
6            Argentina    12 2008            RD                   3
7            Argentina     8 2006            RD                   3
8            Argentina    12 2005            RD                   3
9            Argentina     6 2005           DDD                   2
10           Argentina     1 2005             D                   0

As you can see, for some Countries, there are multiple observations for a single year. I want to subset the DF so that I keep only one observation for each country-year and the observation I want to keep is the one that has the smallest value for "LongTermTransformed".

In this data set Country and LongTermTransformed are factors and Year is an integer.

Thanks in advance.

回答1:

There are many ways to subset the rows based on the minimum value in a column with grouping columns. One option is to get the index of the 'min' value with which.min after converting the 'LongTermTransformed' to 'numeric'. We can use slice to subset the rows identified by the index.

library(dplyr)
fitchRatings %>%
          group_by(Country, Year) %>%
          slice(which.min(as.numeric(as.character(LongTermTransformed))))

Or we can use a similar option with data.table. The difference is we convert the 'data.frame' to 'data.table' (setDT) and for subsetting .SD was used.

library(data.table)#v1.9.5+
setDT(fitchRatings)[, 
 .SD[which.min(as.numeric(levels(LongTermTransformed))[LongTermTransformed])],
              by = .(Country, Year)]


回答2:

Another data.table options is just to order by LongTermTransformed and take the first value per Country and Year using unique

library(data.table)
unique(setDT(fitchRatings)[order(as.numeric(as.character(LongTermTransformed)))], by = c("Country", "Year"))

Or using dplyr

library(dplyr)
fitchRatings %>%
  arrange(as.numeric(as.character(LongTermTransformed))) %>%
  distinct(Country, Year)