Question:
If memory serves me, in R there is a data type called factor which, when used within a DataFrame, can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:
1 0 0
0 1 0
or
0 0 1
for the purpose of using lower-level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within pandas, but since I have my own customised regression routines, I am really interested in the construction of the design matrix (a 2D numpy array or matrix) from heterogeneous data, with support for mapping back and forth between columns of the numpy object and the pandas DataFrame from which it is derived.
Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):
>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
       a  b         c
0    one  x  0.000343
1    one  y -0.055651
2    two  y  0.249194
3  three  x -1.486462
4    two  y -0.406930
5    one  x -0.223973
6    six  x -0.189001
>>>
The 'a' column should be converted into 4 floating point columns (in spite of the meaning, there are only four unique atoms), the 'b' column can be converted to a single floating point column, and the 'c' column should be an unmodified final column in the design matrix.
Thanks,
SetJmp
Answer 1:
There is a new module called patsy that solves this problem. The patsy quickstart solves exactly the problem described above in a couple of lines of code.
Here is an example usage:
import pandas
import patsy

# salary2.txt is a re-formatted data set from the textbook
# Introductory Econometrics: A Modern Approach
# by Jeffrey Wooldridge
dataFrame = pandas.read_csv("salary2.txt")
y, X = patsy.dmatrices("sl ~ 1+sx+rk+yr+dg+yd", dataFrame)
# X.design_info provides the metadata behind the X columns
print(X.design_info)
generates:
> DesignInfo(['Intercept',
> 'sx[T.male]',
> 'rk[T.associate]',
> 'rk[T.full]',
> 'dg[T.masters]',
> 'yr',
> 'yd'],
> term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
> (Term([EvalFactor('rk')]), slice(2, 4, None)),
> (Term([EvalFactor('dg')]), slice(4, 5, None)),
> (Term([EvalFactor('yr')]), slice(5, 6, None)),
> (Term([EvalFactor('yd')]), slice(6, 7, None))]),
> builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
Answer 2:
import pandas
import numpy as np

num_rows = 7
df2 = pandas.DataFrame(
    {
        'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
        'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
        'c': np.random.randn(num_rows),
    }
)

# Or use list(set(df2['a'].values)), but that doesn't guarantee ordering.
a_attribute_list = ['one', 'two', 'three', 'six']
b_attribute_list = ['x', 'y']

# Build one (num_rows, 1) indicator column per level.
a_membership = [np.reshape((df2['a'].values == elem).astype(np.float64), (num_rows, 1)) for elem in a_attribute_list]
b_membership = [np.reshape((df2['b'].values == elem).astype(np.float64), (num_rows, 1)) for elem in b_attribute_list]
c_column = np.reshape(df2['c'].values, (num_rows, 1))

design_matrix_a = np.hstack(tuple(a_membership))
design_matrix_b = np.hstack(tuple(b_membership))
design_matrix = np.hstack((design_matrix_a, design_matrix_b, c_column))

# Print out the design matrix to see that it's what you want.
for row in design_matrix:
    print(row)
I get this output:
[ 1. 0. 0. 0. 1. 0. 0.36444463]
[ 1. 0. 0. 0. 0. 1. -0.63610264]
[ 0. 1. 0. 0. 0. 1. 1.27876991]
[ 0. 0. 1. 0. 1. 0. 0.69048607]
[ 0. 1. 0. 0. 0. 1. 0.34243241]
[ 1. 0. 0. 0. 1. 0. -1.17370649]
[ 0. 0. 0. 1. 1. 0. -0.52271636]
So, the first column is an indicator for the DataFrame locations that were 'one', the second column is an indicator for the DataFrame locations that were 'two', and so on through 'three' and 'six'. The fifth and sixth columns (indices 4 and 5) are indicators of DataFrame locations that were 'x' and 'y', respectively, and the final column is just the random data.
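The same indicator block can also be built in one step with NumPy broadcasting, and argmax gives a way back from indicator columns to the original labels, which is the "mapping back" the question asks about. This is only a minimal sketch, not part of the answer above: the names levels, indicators, and recovered are made up for illustration, and it assumes every value appears in levels.

```python
import numpy as np

values = np.array(['one', 'one', 'two', 'three', 'two', 'one', 'six'])
levels = np.array(['one', 'two', 'three', 'six'])  # fixed, explicit column order

# Compare every row value against every level: (7, 1) == (1, 4) -> (7, 4).
indicators = (values[:, np.newaxis] == levels[np.newaxis, :]).astype(np.float64)

# Going back: the position of the 1 in each row indexes into `levels`.
recovered = levels[indicators.argmax(axis=1)]

print(indicators.shape)
print(recovered)
```

Keeping levels as an explicit array is what makes the round trip possible, since the column order then does not depend on set ordering.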
Answer 3:
Since pandas 0.13.1 (released February 3, 2014), there is a method for this:
>>> pd.Series(['one', 'one', 'two', 'three', 'two', 'one', 'six']).str.get_dummies()
   one  six  three  two
0    1    0      0    0
1    1    0      0    0
2    0    0      0    1
3    0    0      1    0
4    0    0      0    1
5    1    0      0    0
6    0    1      0    0
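For the whole DataFrame from the question, the top-level pandas.get_dummies function can encode each string column, and recording column slices gives a simple way to map between source columns and design-matrix columns. The following is a sketch under some assumptions: it needs a reasonably recent pandas (the dtype argument and to_numpy are missing from very old releases), it replaces the random 'c' column with fixed numbers so the result is reproducible, and blocks and column_slices are illustrative names rather than anything pandas provides.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
    'c': np.arange(7, dtype=float),  # stand-in for np.random.randn(7)
})

# Encode each categorical column separately so the column order is explicit.
blocks = {
    'a': pd.get_dummies(df2['a'], prefix='a', dtype=float),
    'b': pd.get_dummies(df2['b'], prefix='b', dtype=float),
    'c': df2[['c']],
}
design_df = pd.concat(list(blocks.values()), axis=1)
design_matrix = design_df.to_numpy()

# Map each source column to the slice of design-matrix columns it produced.
column_slices = {}
start = 0
for name, block in blocks.items():
    column_slices[name] = slice(start, start + block.shape[1])
    start += block.shape[1]

print(list(design_df.columns))
print(column_slices)
```

Passing drop_first=True to pandas.get_dummies drops one indicator per factor, which is closer to what the question describes for the 'b' column.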
Answer 4:
patsy.dmatrices may work well in many cases. If you just have a vector (a pandas.Series), then the code below may work; it produces a degenerate design matrix, with one indicator column per level and no intercept column.
import numpy as np
import pandas as pd


def factor(series):
    """Convert a pandas.Series to pandas.DataFrame design matrix.

    Parameters
    ----------
    series : pandas.Series
        Vector with categorical values

    Returns
    -------
    pandas.DataFrame
        Design matrix with ones and zeroes.

    See Also
    --------
    patsy.dmatrices : Converts categorical columns to numerical

    Examples
    --------
    >>> import pandas as pd
    >>> design = factor(pd.Series(['a', 'b', 'a']))
    >>> float(design.loc[0, '[a]'])
    1.0
    >>> list(design.columns)
    ['[a]', '[b]']

    """
    # Sort the levels so the column order is deterministic.
    levels = sorted(set(series))
    design_matrix = np.zeros((len(series), len(levels)))
    for row_index, elem in enumerate(series):
        design_matrix[row_index, levels.index(elem)] = 1
    name = series.name or ""
    columns = ["%s[%s]" % (name, level) for level in levels]
    df = pd.DataFrame(design_matrix, index=series.index, columns=columns)
    return df
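The per-row loop can be avoided by letting pandas.Categorical assign an integer code to each level and then indexing into an identity matrix. This is a sketch of that alternative, not the answer's own code: factor_design is a hypothetical name, and treating missing values as all-zero rows is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd


def factor_design(series):
    """Indicator design matrix for a categorical pandas.Series (no intercept)."""
    cat = pd.Categorical(series)
    codes = np.asarray(cat.codes)
    # Row i of the identity matrix is the indicator vector for level i.
    indicators = np.eye(len(cat.categories))[codes]
    # Missing values get code -1; give them an all-zero row instead of
    # silently picking the last level.
    indicators[codes == -1] = 0.0
    name = series.name or ""
    columns = ["%s[%s]" % (name, level) for level in cat.categories]
    return pd.DataFrame(indicators, index=series.index, columns=columns)


design = factor_design(pd.Series(['a', 'b', 'a'], name='g'))
print(design)
```

Because pandas.Categorical sorts the levels it finds, the column order is stable from run to run.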