可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Simply speaking, how to apply quantile normalization on a large Pandas dataframe (probably 2,000,000 rows) in Python?

PS. I know that there is a package named rpy2 which could run R in subprocess, using quantile normalize in R. But the truth is that R cannot compute the correct result when I use the data set as below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how to apply quantile normalization following steps in https://en.wikipedia.org/wiki/Quantile_normalization.

I found a piece of code in Python declaring that it could compute the quantile normalization:

import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
preprocessCore = importr('preprocessCore')


matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array( Rnormalized_matrix)

The code works fine with the sample data used in the code, however when I test it with the data given above the result went wrong.

Since ryp2 provides an interface to run R in python subprocess, I test it again in R directly and the result was still wrong. As a result I think the reason is that the method in R is wrong.

回答1:

Using the example dataset from Wikipedia article:

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667

回答2:

Ok I implemented the method myself of relatively high efficiency.

After finishing, this logic seems kind of easy but, anyway, I decided to post it here for any one feels confused like I was when I couldn't googled the available code.

The code is in github: Quantile Normalize

回答3:

Possibly more robust to use the median on each row rather than mean (based on code from Shawn. L):

def quantileNormalize(df_input):
    df = df_input.copy()
    #compute rank
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    #rank = sorted_df.mean(axis = 1).tolist()
    rank = sorted_df.median(axis = 1).tolist()
    #sort
    for col in df:
        # compute percentile rank [0,1] for each score in column 
        t = df[col].rank( pct=True, method='max' ).values
        # replace percentile values in column with quantile normalized score
        # retrieve q_norm score using calling rank with percentile value
        df[col] = [ np.nanpercentile( rank, i*100 ) if ~np.isnan(i) else np.nan for i in t ]
    return df

回答4:

The code below gives identical result as preprocessCore::normalize.quantiles.use.target and I find it simpler clearer than the solutions above. Also performance should be good up to huge array lengths.

import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """

    target_sorted = np.sort(target)

    return target_sorted[x.argsort().argsort()]

Once you have a pandas.DataFrame easy to do:

quantile_normalize_using_target(df[0].as_matrix(),
                                df[1].as_matrix())

(Normalizing the first columnt to the second one as a reference distribution in the example above.)

回答5:

I am new to pandas and late to the question, but I think answer might also be of use. It builds off of the great answer from @ayhan:

def quantile_normalize(dataframe, cols, pandas=pd):

    # copy dataframe and only use the columns with numerical values
    df = dataframe.copy().filter(items=cols)

    # columns from the original dataframe not specified in cols
    non_numeric = dataframe.filter(items=list(filter(lambda col: col not in cols, list(dataframe))))


    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()  

    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()


    result = pandas.concat([norm, non_numeric], axis=1)
    return result

the main difference here is closer to some real world applications. Often you just have matrices of numerical data in which case the original answer is sufficient.

Sometimes you have text based data in there as well. This lets you specify the columns cols of your numerical data and will run quantile normalization on those columns. At the end it will merge back the non-numeric (or not to be normalized) columns from your original data frame.

e.g. if you added some 'meta-data' (char) to the wiki example:

df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index = ['a', 'b', 'c', 'd'])

you can then call

quantile_normalize(t, ['rep1', 'rep2', 'rep3'])

to get

    rep1        rep2        rep3        char
a   5.666667    4.666667    2.000000    gene_a
b   2.000000    2.000000    3.000000    gene_b
c   3.000000    4.666667    4.666667    gene_c
d   4.666667    3.000000    5.666667    gene_d