python pandas standardize column for regression

Posted 2019-07-27 09:29

Question:

I have the following df:

Date       Event_Counts   Category_A  Category_B
20170401      982457          0           1
20170402      982754          1           0
20170402      875786          0           1

I am preparing the data for a regression analysis and want to standardize the column Event_Counts so that it is on a scale similar to that of the category columns.

I use the following code:

from sklearn import preprocessing
df['scaled_event_counts'] = preprocessing.scale(df['Event_Counts'])

While I do get this warning:

DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.
  warnings.warn(msg, _DataConversionWarning)

it seems to have worked; there is a new column. However, it contains negative numbers such as -1.3.

What I thought the scale function does is subtract the mean from the number and divide it by the standard deviation for every row; then add the min of the result to every row.

Does it not work that way for pandas? Or should I use the normalize() function or StandardScaler() instead? I wanted the standardized column to be on a scale of 0 to 1.

Thank You

Answer 1:

I think you are looking for sklearn.preprocessing.MinMaxScaler, which lets you scale a feature to a given range.

So in your case it would be:

# fit_transform expects 2D input, so select the column with double brackets
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
df['scaled_event_counts'] = scaler.fit_transform(df[['Event_Counts']]).ravel()

To scale the entire df:

scaled_df = scaler.fit_transform(df)
print(scaled_df)
[[ 0.          0.99722347  0.          1.        ]
 [ 1.          1.          1.          0.        ]
 [ 1.          0.          0.          1.        ]]
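
For reference, min-max scaling maps each value x to (x - min) / (max - min). The sketch below rebuilds the example frame from the question (the variable names manual and from_sklearn are only illustrative) and computes the scaled column by hand next to MinMaxScaler, so the two can be compared:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Example data copied from the question
df = pd.DataFrame({
    "Date": [20170401, 20170402, 20170402],
    "Event_Counts": [982457, 982754, 875786],
    "Category_A": [0, 1, 0],
    "Category_B": [1, 0, 1],
})

# Min-max scaling by hand: (x - min) / (max - min)
col = df["Event_Counts"]
manual = (col - col.min()) / (col.max() - col.min())

# The same thing through scikit-learn; the scaler wants a 2D input,
# hence the double brackets, and ravel() flattens the (n, 1) result
scaler = MinMaxScaler(feature_range=(0, 1))
from_sklearn = scaler.fit_transform(df[["Event_Counts"]]).ravel()

df["scaled_event_counts"] = from_sklearn
print(manual.round(4).tolist())  # [0.9972, 1.0, 0.0]
```

Both approaches should agree up to floating-point rounding; the hand-written version is handy when you only need one column and do not want to keep a fitted scaler around.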


Answer 2:

preprocessing.scale standardizes each feature (column) by subtracting its mean and dividing by its standard deviation. So,

scaled_event_counts = (Event_Counts - mean(Event_Counts)) / std(Event_Counts)

The int64-to-float64 warning arises because the mean is generally a floating-point number, so the integer column has to be converted to float before the mean can be subtracted.

You will have negative numbers in the scaled column because standardization centers the data at zero: every value below the original mean becomes negative.
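
As a rough check of the formula above, the following sketch standardizes the three Event_Counts values from the question by hand and compares the result with preprocessing.scale. The names counts, z_manual and z_sklearn are made up for this example. Note that scikit-learn uses the population standard deviation (ddof=0), whereas pandas' std() defaults to ddof=1, so ddof has to be set explicitly to get matching numbers.

```python
import pandas as pd
from sklearn import preprocessing

# Event_Counts values copied from the question
counts = pd.Series([982457, 982754, 875786], name="Event_Counts")

# Standardize by hand: subtract the mean, divide by the standard deviation.
# ddof=0 selects the population standard deviation used by scikit-learn.
z_manual = (counts - counts.mean()) / counts.std(ddof=0)

# Casting to float64 up front avoids the implicit int64 -> float64 conversion
z_sklearn = preprocessing.scale(counts.to_numpy(dtype="float64"))

print(z_manual.tolist())
print(z_sklearn.tolist())

# The result has mean ~0 and standard deviation ~1, so the values
# below the original mean come out negative.
print(z_manual.mean(), z_manual.std(ddof=0))
```

If the goal is a 0 to 1 range rather than zero mean and unit variance, this is the wrong transformation and min-max scaling is the one to use.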