Question:
I have the following problem: I want to append columns to a dataframe. The new columns are the unique values found in another column of this dataframe, and each cell holds the number of occurrences of that value in that row. It looks like this:
df:
Column1 Column2
0 1 a,b,c
1 2 a,e
2 3 a
3 4 c,f
4 5 c,f
What I am trying to get is:
Column1 Column2 a b c e f
0 1 a,b,c 1 1 1
1 2 a,e 1 1
2 3 a 1
3 4 c,f 1 1
4 5 c,f 1 1
(the empty cells can be NaN or 0; it doesn't matter.)
I have now written some code to achieve this, but instead of appending columns, it appends rows, so that my output looks like this:
Column1 Column2
0 1 a,b,c
1 2 a,e
2 3 a
3 4 c,f
4 5 c,f
a 1 1
b 1 1
c 1 1
e 1 1
f 1 1
The code looks like this:
def NewCols(x):
    for i, value in df['Column2'].items():
        listi = value.split(',')
        for value in listi:
            string = value
            x[string] = listi.count(string)
    return x

df1 = df.apply(NewCols)
What I am trying to do here is to iterate through each row of the dataframe and split the string (a,b,c) contained in Column2 at the commas, so the variable listi is then a list containing the separated string values. For each of these values I then want to make a new column and fill it with the number of occurrences of that value in listi. I am confused why the code appends rows instead of columns. Does somebody know why, and how I can correct that?
Answer 1:
While we could do this using get_dummies, we can also cheat and use pd.value_counts directly:
>>> df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Column2': {0: 'a,b,c', 1: 'a,e', 2: 'a', 3: 'c,f', 4: 'c,f'}})
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
Column1 Column2 a b c e f
0 1 a,b,c 1 1 1 0 0
1 2 a,e 1 0 0 1 0
2 3 a 1 0 0 0 0
3 4 c,f 0 0 1 0 1
4 5 c,f 0 0 1 0 1
Step-by-step, we have
>>> df.Column2.str.split(",")
0 [a, b, c]
1 [a, e]
2 [a]
3 [c, f]
4 [c, f]
dtype: object
>>> df.Column2.str.split(",").apply(pd.value_counts)
a b c e f
0 1 1 1 NaN NaN
1 1 NaN NaN 1 NaN
2 1 NaN NaN NaN NaN
3 NaN NaN 1 NaN 1
4 NaN NaN 1 NaN 1
>>> df.Column2.str.split(",").apply(pd.value_counts).fillna(0)
a b c e f
0 1 1 1 0 0
1 1 0 0 1 0
2 1 0 0 0 0
3 0 0 1 0 1
4 0 0 1 0 1
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
Column1 Column2 a b c e f
0 1 a,b,c 1 1 1 0 0
1 2 a,e 1 0 0 1 0
2 3 a 1 0 0 0 0
3 4 c,f 0 0 1 0 1
4 5 c,f 0 0 1 0 1
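The get_dummies route mentioned above also works in one line: pandas' string accessor has Series.str.get_dummies, which splits on a separator and builds one 0/1 indicator column per token. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': ['a,b,c', 'a,e', 'a', 'c,f', 'c,f']})

# str.get_dummies splits each cell on ',' and makes an indicator column per token
result = df.join(df['Column2'].str.get_dummies(sep=','))
print(result)
```

Since every token appears at most once per row here, the 0/1 indicators coincide with the counts the question asks for.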
Answer 2:
When you use apply on a DataFrame, it calls your function once for each column, with that column as the argument. So x in your NewCols is set to a whole column, not a row. When you then assign to x[string], you are adding new labels (and hence new rows) to that column. Since apply is called for each column, you wind up appending the counts to both columns this way.
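This per-column behavior is easy to verify with a tiny probe: a function that just returns the name of whatever apply hands it receives column labels, not row labels. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2], 'Column2': ['a,b', 'a']})

# Each call receives a whole column (a Series); its .name is the column label
received = df.apply(lambda col: col.name)
print(received)
```

The result is one entry per column, confirming that the function in the question was invoked with 'Column1' and 'Column2' rather than with individual rows.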
apply is not the right choice when your computation depends only on the values of a single column. Instead, use map. In this case, what you need to do is write a NewCols function that accepts a single Column2 value and returns the data for a single row. You can return this as a dict or, handily, a dict-like object such as a collections.Counter. Then you wrap this new row data into a DataFrame and attach it column-wise to your existing data using concat. Here is an example:
import collections

def NewCols(val):
    return collections.Counter(val.split(','))

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.map(NewCols))], axis=1)
Column1 Column2 a b c e f
0 1 a,b,c 1 1 1 NaN NaN
1 2 a,e 1 NaN NaN 1 NaN
2 3 a 1 NaN NaN NaN NaN
3 4 c,f NaN NaN 1 NaN 1
4 5 c,f NaN NaN 1 NaN 1
For this particular computation, you actually don't need to write your own function at all, because pandas has split built in as an operation under the .str accessor. So you can do this:
>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.str.split(',').map(collections.Counter))], axis=1)
Column1 Column2 a b c e f
0 1 a,b,c 1 1 1 NaN NaN
1 2 a,e 1 NaN NaN 1 NaN
2 3 a 1 NaN NaN NaN NaN
3 4 c,f NaN NaN 1 NaN 1
4 5 c,f NaN NaN 1 NaN 1
Answer 3:
You could use something like this:
import pandas as pd
import sklearn.feature_extraction.text
vect = sklearn.feature_extraction.text.CountVectorizer(binary=True, token_pattern=u'(?u)\\b\\w+\\b')
df = ...
v = [a for a in df['Column2']]
new_df = df.combine_first( pd.DataFrame(vect.fit_transform(v).todense(), columns=vect.get_feature_names()) )
print(new_df)
Cheers!