大熊猫的GroupBy和数据框中选择最常用的值大熊猫的GroupBy和数据框中选择最常用的值(Gro

2019-05-06 10:54发布

站内文章 / 移动开发

57 0

傲

女 | 书童

私信

我有三个字符串列的数据帧。我知道，在第3列中的只有一个值适用于前两种的每个组合。为了清洁由前两列我有对数据进行分组由数据帧和选择每个组合中的第三列的最常见的值。

我的代码：

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

的代码不能正常工作的最后一行，它说：“主要错误‘短名称’”如果我尝试组只由城市，然后我得到了一个AssertionError。我能做些什么解决？

Answer 1:

您可以使用value_counts()得到一个数列，并获得第一行：

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

Answer 2:

2019答案， `pd.Series.mode`可用。

使用groupby ， GroupBy.agg ，并应用pd.Series.mode功能，每个组：

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

如果这是需要为数据帧，使用

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

有关有用的东西Series.mode是，它总是返回一个系列，使其具有非常兼容的agg和apply ，重建GROUPBY输出时尤其如此。它也更快。

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Series.mode也做时有多种模式的工作：

source2 = source.append(
    pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

或者，如果你想为每种模式单独一行，您可以使用GroupBy.apply ：

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

如果你不在乎 ，只要它是他们中的任何一个返回的模式，那么你就需要调用一个lambda mode ，并提取第一个结果。

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

您还可以使用statistics.mode从Python，但...

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

......不必处理多种模式时，它不能很好地工作; 一个StatisticsError提高。这是在文档中提到：

如果数据是空的，或者如果不是正好有一个最常用的值，StatisticsError提高。

但是你可以看到自己...

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values

Answer 3:

对于agg的lambba功能得到一个Series ，它不具有'Short name'属性。

stats.mode返回两个阵列的元组，那么你必须采取在这个元组中的第一阵列的第一个元素。

有了这两个简单的changements：

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

回报

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

Answer 4:

有点晚了这里的比赛，但我也陷入了与HYRY的解决方案的一些性能问题，所以我不得不想出一个又一个。

它的工作原理通过找出每个键值的频率，然后，针对每一个琴键，只保持与它最常出现的值。

还有一个支持多种模式的附加解决方案。

在规模测试是代表我的工作数据，这降低了运行时从37.4s到0.5秒！

下面是解决方案，一些示例使用，且规模测试的代码：

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the mode.                                                                                                                                                                                                                                                                                                         

    The output is a DataFrame with a record per group that has at least one mode                                                                                                                                                                                                                                                                                     
    (null values are not counted). The `key_cols` are included as columns, `value_col`                                                                                                                                                                                                                                                                               
    contains a mode (ties are broken arbitrarily and deterministically) for each                                                                                                                                                                                                                                                                                     
    group, and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                 
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the modes.                                                                                                                                                                                                                                                                                                        

    The output is a DataFrame with a record per group that has at least                                                                                                                                                                                                                                                                                              
    one mode (null values are not counted). The `key_cols` are included as                                                                                                                                                                                                                                                                                           
    columns, `value_col` contains lists indicating the modes for each group,                                                                                                                                                                                                                                                                                         
    and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                        
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start

运行这段代码将打印类似：

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

希望这可以帮助！

Answer 5:

从形式上看，正确的答案是@eumiro解决方案。的@HYRY解决的问题是，当你有一个数字序列，例如[1,2,3,4]的解决方案是错误的，也就是说，你没有模式。例：

import pandas as pd
df = pd.DataFrame({'client' : ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E','E','E','A'], 'total' : [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 'bla':[10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]})

如果你喜欢计算你@HYRY获得：

df.groupby(['socio']).agg(lambda x: x.value_counts().index[0])

你获得：

这显然是错误的（见应该是1，而不是4 A值），因为它不能具有唯一值处理。

因此，其它的解决方案是正确的：

import scipy.stats
df3.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0])

得到：

Answer 6:

对于更大的数据集稍微笨拙但速度更快的方法涉及获取计数的利益列，计数排序最高到最低，然后去复制一个子集只保留最大的案件。

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short name' : ['NY','New','Spb','NY']})

grouped_df = source.groupby(['Country','City','Short name']
                   )[['Short name']].count().rename(columns={ 
                   'Short name':'count'}).reset_index()
grouped_df = grouped_df.sort_values('count',ascending=False)
grouped_df = grouped_df.drop_duplicates(subset=['Country','City']).drop('count', axis=1)
grouped_df

Answer 7:

这个问题在这里是性能，如果你有很多的行会是一个问题。

如果是你的话，请用这个尝试：

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby['Country','City']).first()

Answer 8:

如果你想解决它是不依赖于另一种方法value_counts或scipy.stats可以使用Counter集合

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

这可以应用到这样上面的例子

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)

文章来源: GroupBy pandas DataFrame and select most common value