How to retrieve the most repeated value in a column present in a data frame

Posted 2019-07-31 06:14

I am trying to retrieve the most repeated value in a particular column of a data frame. My sample data and code are below.

data("Forbes2000", package = "HSAUR")
head(Forbes2000)


  rank                name        country             category  sales profits  assets marketvalue
1    1           Citigroup  United States              Banking  94.71   17.85 1264.03      255.30
2    2    General Electric  United States        Conglomerates 134.19   15.59  626.93      328.54
3    3 American Intl Group  United States            Insurance  76.66    6.46  647.66      194.87
4    4          ExxonMobil  United States Oil & gas operations 222.88   20.96  166.99      277.02
5    5                  BP United Kingdom Oil & gas operations 232.57   10.27  177.57      173.54
6    6     Bank of America  United States              Banking  49.01   10.81  736.45      117.55

As per my sample data I need to return the most repeated category which is Insurance.

subset(Forbes2000, country == "Bermuda")

Answer 1:

tail(names(sort(table(Forbes2000$category))), 1)
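A minimal sketch of how this one-liner works, using a small made-up vector in place of Forbes2000$category (the values are hypothetical). table() counts each distinct value, sort() orders the counts ascending, and tail(..., 1) picks the name with the largest count. which.max() is a base R alternative; like the one-liner, it returns a single value even when there is a tie.

```r
# Hypothetical toy data standing in for Forbes2000$category
category <- c("Banking", "Insurance", "Insurance",
              "Banking", "Insurance", "Retailing")

counts <- table(category)        # frequency of each distinct value
sorted_counts <- sort(counts)    # ascending by count
most_repeated <- tail(names(sorted_counts), 1)
print(most_repeated)

# Alternative: which.max() gives the position of the first maximum count
alt <- names(which.max(counts))
print(alt)
```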


Answer 2:

If two or more categories may be tied for the most frequent, use something like this:

x <- c("Insurance", "Insurance", "Capital Goods", "Food markets", "Food markets")
tt <- table(x)
names(tt[tt==max(tt)])
[1] "Food markets" "Insurance" 


Answer 3:

Another way is with the data.table package, which is faster for large data sets:

set.seed(1)
x=sample(seq(1,100), 5000000, replace = TRUE)

Method 1 (the solution proposed above)

start.time <- Sys.time()
tt <- table(x)
names(tt[tt==max(tt)])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 4.883488 secs

Method 2 (data.table)

library(data.table)  # provides data.table(), setkey() and .N

start.time <- Sys.time()
ds <- data.table( x )
setkey(ds, x)
sorted <- ds[,.N,by=list(x)]

most_repeated_value <- sorted[order(-N)]$x[1]
most_repeated_value

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Time difference of 0.328033 secs
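Note that the data.table code above returns only the first of the most frequent values, while the table() approach returns every tied value. A minimal sketch of a tie-aware data.table variant, assuming the data.table package is installed; the toy vector and the names dt, counts and top are illustrative only:

```r
library(data.table)

# Hypothetical toy vector in which 3 and 1 are tied for most frequent
x <- c(3, 1, 3, 2, 1)

dt <- data.table(x = x)
counts <- dt[, .N, by = x]        # one row per distinct value, N = its count
top <- counts[N == max(N), x]     # every value whose count equals the maximum
print(sort(top))
```

Filtering on N == max(N) instead of taking the first row after ordering keeps all tied values, at the cost of returning a vector rather than a single value.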



Answer 4:

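You can use table(Forbes2000$category, useNA="ifany"). This gives you a list of all the values that occur in the chosen column, together with the number of times each value appears in that data frame.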
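A small sketch of what useNA = "ifany" changes, using a made-up vector with one missing value (the data is hypothetical). By default table() silently drops NA; with useNA = "ifany" the missing values get their own count.

```r
# Hypothetical toy vector with one missing value
category <- c("Banking", "Insurance", NA, "Insurance", "Banking", "Insurance")

counts_default <- table(category)                   # NA is dropped
counts_with_na <- table(category, useNA = "ifany")  # NA is counted as well
print(counts_with_na)

# Name of the most frequent entry in the full table
most_frequent <- names(counts_with_na)[which.max(counts_with_na)]
print(most_frequent)
```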



Answer 5:

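I know my answer is coming a little late, but I built the following function, which does the job in less than a second on my data frame containing more than 50,000 rows: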

print_count_of_unique_values <- function(df, column_name, remove_items_with_freq_equal_or_lower_than = 0, return_df = F, 
                                         sort_desc = T, return_most_frequent_value = F)
{
  temp <- df[column_name]
  output <- as.data.frame(table(temp))
  names(output) <- c("Item","Frequency")
  output_df <- output[  output[[2]] > remove_items_with_freq_equal_or_lower_than,  ]

  if (sort_desc){
    output_df <- output_df[order(output_df[[2]], decreasing = T), ]
  }

  cat("\nThis is the (head) count of the unique values in dataframe column '", column_name,"':\n")
  print(head(output_df))

  if (return_df){
    return(output_df)
  }

  if (return_most_frequent_value){
      output_df$Item <- as.character(output_df$Item)
      output_df$Frequency <- as.numeric(output_df$Frequency)
      most_freq_item <- output_df[1, "Item"]
      cat("\nReturning most frequent item: ", most_freq_item)
      return(most_freq_item)
  }
}

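So if you have a data frame called "df" with a column called "name", and you want to know the most common value in the "name" column, you could run: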

most_common_name <- print_count_of_unique_values(df=df, column_name = "name", return_most_frequent_value = T)    


Source: How to retrieve the most repeated value in a column present in a data frame
Tags: r dataframe max