Python Pandas Dataframe: Normalize data between 0.

2020-07-25 10:39发布

问题:

I am trying to bound every value in a dataframe between 0.01 and 0.99

I have successfully normalised the data between 0 and 1 using: .apply(lambda x: (x - x.min()) / (x.max() - x.min())) as follows:

df = pd.DataFrame({'one' : ['AAL', 'AAL', 'AAPL', 'AAPL'], 'two' : [1, 1, 5, 5], 'three' : [4,4,2,2]})

df[['two', 'three']].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

df

Now I want to bound all values between 0.01 and 0.99

This is what I have tried:

def bound_x(x):
    if x == 1:
        return x - 0.01
    elif x < 0.99:
        return x + 0.01

df[['two', 'three']].apply(bound_x)

​ df

But I receive the following error:

ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index two')

回答1:

There's an app, err clip method, for that:

import pandas as pd
df = pd.DataFrame({'one' : ['AAL', 'AAL', 'AAPL', 'AAPL'], 'two' : [1, 1, 5, 5], 'three' : [4,4,2,2]})    
df = df[['two', 'three']].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
df = df.clip(lower=0.01, upper=0.99)

yields

    two  three
0  0.01   0.99
1  0.01   0.99
2  0.99   0.01
3  0.99   0.01

The problem with

df[['two', 'three']].apply(bound_x)

is that bound_x gets passed a Series like df['two'] and then if x == 1 requires x == 1 be evaluated in a boolean context. x == 1 is a boolean Series like

In [44]: df['two'] == 1
Out[44]: 
0    False
1    False
2     True
3     True
Name: two, dtype: bool

Python tries to reduce this Series to a single boolean value, True or False. Pandas follows the NumPy convention of raising an error when you try to convert a Series (or array) to a bool.



回答2:

So I had a similar problem where I wanted customized normalization in that I regular percentile of datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define it other than my sample, or a different midpoint, or whatever! So i built a custom function (used extra steps in the code here to make it as readable as possible):

def NormData(s,low='min',center='mid',hi='max',insideout=False,shrinkfactor=0.):    
    if low=='min':
        low=min(s)
    elif low=='abs':
        low=max(abs(min(s)),abs(max(s)))*-1.#sign(min(s))
    if hi=='max':
        hi=max(s)
    elif hi=='abs':
        hi=max(abs(min(s)),abs(max(s)))*1.#sign(max(s))

    if center=='mid':
        center=(max(s)+min(s))/2
    elif center=='avg':
        center=mean(s)
    elif center=='median':
        center=median(s)

    s2=[x-center for x in s]
    hi=hi-center
    low=low-center
    center=0.

    r=[]

    for x in s2:
        if x<low:
            r.append(0.)
        elif x>hi:
            r.append(1.)
        else:
            if x>=center:
                r.append((x-center)/(hi-center)*0.5+0.5)
            else:
                r.append((x-low)/(center-low)*0.5+0.)

    if insideout==True:
        ir=[(1.-abs(z-0.5)*2.) for z in r]
        r=ir

    rr =[x-(x-0.5)*shrinkfactor for x in r]    
    return rr

This will take in a pandas series, or even just a list and normalize it to your specified low, center, and high points. also there is a shrink factor! to allow you to scale down the data away from 0 and 1 (I had to do this when combining colormaps in matplotlib:Single pcolormesh with more than one colormap using Matplotlib) So you can likely see how the code works, but basically say you have values [-5,1,10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, our "10" is treated as a 7 effectively) with a midpoint of 2, but shrink it to fit a 256 RGB colormap:

#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:

#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]

So now "2" which is closest to the center, defined as "1" is the highest value.

Anyways, I thought my issue was very similar to yours and this function could be useful to you.