How do I map df column values to hex color in one go?

Posted 2020-04-26 04:40

Question:

I have a pandas dataframe with two columns. One of the columns values needs to be mapped to colors in hex. Another graphing process takes over from there.

This is what I have tried so far. Part of the toy code is taken from here.

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

import seaborn as sns

# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan 

# Try to map values to colors in hex
# # Taken from here 
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)

df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df

This outputs the dataframe with a new some_value_color column that holds RGBA tuples rather than hex strings.

How do I convert the 'some_value' df column values to hex in one go? Ideally using sns.cubehelix_palette(light=1).

I am not opposed to using something other than matplotlib

Thanks in advance.

Answer 1:

You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

import seaborn as sns

# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan 

# Try to map values to colors in hex
# # Taken from here 
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)

df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
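The question asked for sns.cubehelix_palette(light=1), while the code above uses viridis. A minimal sketch of swapping the colormap follows. It assumes seaborn's as_cmap=True option to get a matplotlib colormap back, and the helper name to_hex_or_none and the sample values are illustrative only. Missing values are left without a color here, which is a design choice of this sketch rather than part of the answer.

```python
import numpy as np
import pandas as pd
import matplotlib.colors as mcolors
import seaborn as sns

# Small sample frame with one missing value (illustrative data)
df = pd.DataFrame({"some_value": [0.0, 5.0, 10.0, 21.0, np.nan]})

# as_cmap=True makes seaborn return a matplotlib colormap instead of a list of colors
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
norm = mcolors.Normalize(vmin=0, vmax=21, clip=True)

def to_hex_or_none(value):
    # Hypothetical helper: skip missing values instead of coloring them
    if pd.isna(value):
        return None
    return mcolors.to_hex(cmap(norm(value)))

df["some_value_color"] = df["some_value"].map(to_hex_or_none)
print(df)
```

With light=1 the palette starts at white for the smallest value and darkens toward the largest.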


Efficiency

The above method is easy to use, but may not be very efficient. In the following, let's compare some alternatives.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

def create_df(n=10):
    # Create dataframe
    df = pd.DataFrame(np.random.randint(0,21,size=(n, 2)), 
                      columns=['some_value', 'another_value'])
    # Add a nan to handle realworld
    df.iloc[-1] = np.nan
    return df

The following is the solution from above. It applies the conversion to the dataframe row by row, which is quite inefficient.

def apply1(df):
    # map values to colors in hex via
    # matplotlib to_hex by pandas apply
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values), 
                                       vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)

    df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
    return df

That's why we might choose to calculate the values into a numpy array first and just assign this array as the newly created column.

def apply2(df):
    # map values to colors in hex via
    # matplotlib to_hex by assigning numpy array as column
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values), 
                                       vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    a = mapper.to_rgba(df['some_value'])
    df['some_value_color'] =  np.apply_along_axis(mcolors.to_hex, 1, a)
    return df

Finally, we may use a lookup table (LUT) created from the matplotlib colormap and index it with the normalized data. Because this solution needs to create the LUT first, it is rather inefficient for dataframes with fewer entries than the LUT has colors, but it pays off for large dataframes.

def apply3(df):
    # map values to colors in hex via
    # creating a hex lookup table and indexing it with the normalized data
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values), 
                                       vmax=np.nanmax(df['some_value'].values), clip=True)
    lut = plt.cm.viridis(np.linspace(0,1,256))
    lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
    a = (norm(df['some_value'].values)*255).astype(np.int16)
    df['some_value_color'] = lut[a]
    return df
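One caveat with apply3: the NaN row is cast to an integer before indexing, and the result of that cast is not well defined in NumPy, so the color assigned to missing values is arbitrary. A possible NaN-aware variant is sketched below. The function name apply3_nan_safe and its parameters are hypothetical, and it keeps the same scaling as apply3 (normalized value times the last LUT index).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

def apply3_nan_safe(df, column="some_value", n_colors=256):
    # Hypothetical variant of apply3; name and parameters are illustrative
    values = df[column].to_numpy(dtype=float)
    valid = ~np.isnan(values)

    norm = mcolors.Normalize(vmin=np.nanmin(values),
                             vmax=np.nanmax(values), clip=True)

    # Build the hex lookup table once from the colormap
    rgba = plt.cm.viridis(np.linspace(0, 1, n_colors))
    lut = np.array([mcolors.to_hex(c) for c in rgba])

    # Index the LUT only with valid values; missing rows stay None
    colors = np.full(len(values), None, dtype=object)
    idx = (np.asarray(norm(values[valid])) * (n_colors - 1)).astype(np.intp)
    colors[valid] = lut[idx]

    df[column + "_color"] = colors
    return df

demo = pd.DataFrame({"some_value": [0.0, 10.5, 21.0, np.nan]})
print(apply3_nan_safe(demo))
```

This assumes the column contains at least one non-missing value, since np.nanmin and np.nanmax cannot produce a usable range otherwise.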

Compare the timings

Let's take a dataframe with 10000 rows.

df = create_df(10000)

  • Original solution (apply1)

    %timeit apply1(df)
    2.66 s per loop
    
  • Array solution (apply2)

    %timeit apply2(df)
    240 ms per loop
    
  • LUT solution (apply3)

    %timeit apply3(df)
    7.64 ms per loop
    

In this case the LUT solution is roughly 350 times faster than the original solution (2.66 s vs. 7.64 ms per loop).
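%timeit is an IPython magic, so it is only available in IPython or Jupyter. Outside of those, a comparable measurement could be sketched with the standard timeit module. The functions rowwise, lut_based and best_time below are simplified stand-ins written for this sketch, not the exact apply1 and apply3 from above, and the absolute numbers will differ by machine and library version.

```python
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# Illustrative data without missing values, smaller than the 10000 rows above
rng = np.random.default_rng(0)
series = pd.Series(rng.integers(0, 21, size=1000).astype(float))
norm = mcolors.Normalize(vmin=series.min(), vmax=series.max(), clip=True)
cmap = plt.cm.viridis

def rowwise(s):
    # One normalization and one to_hex call per element
    return s.apply(lambda x: mcolors.to_hex(cmap(norm(x))))

def lut_based(s):
    # Build a 256-entry hex table once, then index it with the scaled data
    lut = np.array([mcolors.to_hex(c) for c in cmap(np.linspace(0, 1, 256))])
    idx = (np.asarray(norm(s.to_numpy())) * 255).astype(np.intp)
    return pd.Series(lut[idx], index=s.index)

def best_time(func, repeat=3):
    # Best of several single runs, in seconds
    return min(timeit.repeat(lambda: func(series), number=1, repeat=repeat))

t_rowwise = best_time(rowwise)
t_lut = best_time(lut_based)
print(f"row-wise: {t_rowwise * 1e3:.1f} ms, LUT: {t_lut * 1e3:.1f} ms")
```

Taking the minimum over a few repeats reduces the influence of unrelated background load on the comparison.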