我有三个字符串列的数据帧。 我知道,在第3列中的只有一个值适用于前两种的每个组合。 为了清洁由前两列我有对数据进行分组由数据帧和选择每个组合中的第三列的最常见的值。
我的代码:
import pandas as pd
from scipy import stats
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name' : ['NY','New','Spb','NY']})
print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])
的代码不能正常工作的最后一行,它说:“主要错误‘短名称’”如果我尝试组只由城市,然后我得到了一个AssertionError。 我能做些什么解决?
Answer 1:
您可以使用value_counts()
得到一个数列,并获得第一行:
import pandas as pd
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name' : ['NY','New','Spb','NY']})
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
Answer 2:
2019答案, pd.Series.mode
可用。
使用groupby
, GroupBy.agg
,并应用pd.Series.mode
功能,每个组:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
如果这是需要为数据帧,使用
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()
Short name
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
有关有用的东西Series.mode
是,它总是返回一个系列,使其具有非常兼容的agg
和apply
,重建GROUPBY输出时尤其如此。 它也更快。
# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Series.mode
也做时有多种模式的工作:
source2 = source.append(
pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
ignore_index=True)
# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2
Country City Short name
0 USA New-York NY
1 USA New-York New
2 Russia Sankt-Petersburg Spb
3 USA New-York NY
4 USA New-York New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
Country City
Russia Sankt-Petersburg Spb
USA New-York [NY, New]
Name: Short name, dtype: object
或者,如果你想为每种模式单独一行,您可以使用GroupBy.apply
:
source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)
Country City
Russia Sankt-Petersburg 0 Spb
USA New-York 0 NY
1 New
Name: Short name, dtype: object
如果你不在乎 ,只要它是他们中的任何一个返回的模式,那么你就需要调用一个lambda mode
,并提取第一个结果。
source2.groupby(['Country','City'])['Short name'].agg(
lambda x: pd.Series.mode(x)[0])
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
您还可以使用statistics.mode
从Python,但...
source.groupby(['Country','City'])['Short name'].apply(statistics.mode)
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
......不必处理多种模式时,它不能很好地工作; 一个StatisticsError
提高。 这是在文档中提到:
如果数据是空的,或者如果不是正好有一个最常用的值,StatisticsError提高。
但是你可以看到自己...
statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
Answer 3:
对于agg
的lambba功能得到一个Series
,它不具有'Short name'
属性。
stats.mode
返回两个阵列的元组,那么你必须采取在这个元组中的第一阵列的第一个元素。
有了这两个简单的changements:
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])
回报
Short name
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Answer 4:
有点晚了这里的比赛,但我也陷入了与HYRY的解决方案的一些性能问题,所以我不得不想出一个又一个。
它的工作原理通过找出每个键值的频率,然后,针对每一个琴键,只保持与它最常出现的值。
还有一个支持多种模式的附加解决方案。
在规模测试是代表我的工作数据,这降低了运行时从37.4s到0.5秒!
下面是解决方案,一些示例使用,且规模测试的代码:
import numpy as np
import pandas as pd
import random
import time
test_input = pd.DataFrame(columns=[ 'key', 'value'],
data= [[ 1, 'A' ],
[ 1, 'B' ],
[ 1, 'B' ],
[ 1, np.nan ],
[ 2, np.nan ],
[ 3, 'C' ],
[ 3, 'C' ],
[ 3, 'D' ],
[ 3, 'D' ]])
def mode(df, key_cols, value_col, count_col):
'''
Pandas does not provide a `mode` aggregation function
for its `GroupBy` objects. This function is meant to fill
that gap, though the semantics are not exactly the same.
The input is a DataFrame with the columns `key_cols`
that you would like to group on, and the column
`value_col` for which you would like to obtain the mode.
The output is a DataFrame with a record per group that has at least one mode
(null values are not counted). The `key_cols` are included as columns, `value_col`
contains a mode (ties are broken arbitrarily and deterministically) for each
group, and `count_col` indicates how many times each mode appeared in its group.
'''
return df.groupby(key_cols + [value_col]).size() \
.to_frame(count_col).reset_index() \
.sort_values(count_col, ascending=False) \
.drop_duplicates(subset=key_cols)
def modes(df, key_cols, value_col, count_col):
'''
Pandas does not provide a `mode` aggregation function
for its `GroupBy` objects. This function is meant to fill
that gap, though the semantics are not exactly the same.
The input is a DataFrame with the columns `key_cols`
that you would like to group on, and the column
`value_col` for which you would like to obtain the modes.
The output is a DataFrame with a record per group that has at least
one mode (null values are not counted). The `key_cols` are included as
columns, `value_col` contains lists indicating the modes for each group,
and `count_col` indicates how many times each mode appeared in its group.
'''
return df.groupby(key_cols + [value_col]).size() \
.to_frame(count_col).reset_index() \
.groupby(key_cols + [count_col])[value_col].unique() \
.to_frame().reset_index() \
.sort_values(count_col, ascending=False) \
.drop_duplicates(subset=key_cols)
print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')
scale_test_data = [[random.randint(1, 100000),
str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
data=scale_test_data)
start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start
start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start
start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start
运行这段代码将打印类似:
key value
0 1 A
1 1 B
2 1 B
3 1 NaN
4 2 NaN
5 3 C
6 3 C
7 3 D
8 3 D
key value count
1 1 B 2
2 3 C 2
key count value
1 1 2 [B]
2 3 2 [C, D]
0.489614009857
9.19386196136
37.4375009537
希望这可以帮助!
Answer 5:
从形式上看,正确的答案是@eumiro解决方案。 的@HYRY解决的问题是,当你有一个数字序列,例如[1,2,3,4]的解决方案是错误的,也就是说,你没有模式 。 例:
import pandas as pd
df = pd.DataFrame({'client' : ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E','E','E','A'], 'total' : [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 'bla':[10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]})
如果你喜欢计算你@HYRY获得:
df.groupby(['socio']).agg(lambda x: x.value_counts().index[0])
你获得:
这显然是错误的(见应该是1,而不是4 A值),因为它不能具有唯一值处理。
因此,其它的解决方案是正确的:
import scipy.stats
df3.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0])
得到:
Answer 6:
对于更大的数据集稍微笨拙但速度更快的方法涉及获取计数的利益列,计数排序最高到最低,然后去复制一个子集只保留最大的案件。
import pandas as pd
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name' : ['NY','New','Spb','NY']})
grouped_df = source.groupby(['Country','City','Short name']
)[['Short name']].count().rename(columns={
'Short name':'count'}).reset_index()
grouped_df = grouped_df.sort_values('count',ascending=False)
grouped_df = grouped_df.drop_duplicates(subset=['Country','City']).drop('count', axis=1)
grouped_df
Answer 7:
这个问题在这里是性能,如果你有很多的行会是一个问题。
如果是你的话,请用这个尝试:
import pandas as pd
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short_name' : ['NY','New','Spb','NY']})
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
source.groupby(['Country','City']).Short_name.value_counts().groupby['Country','City']).first()
Answer 8:
如果你想解决它是不依赖于另一种方法value_counts
或scipy.stats
可以使用Counter
集合
from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]
这可以应用到这样上面的例子
src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short_name' : ['NY','New','Spb','NY']})
src.groupby(['Country','City']).agg(get_most_common)
文章来源: GroupBy pandas DataFrame and select most common value