Question:
Simply put, how can I apply quantile normalization to a large pandas DataFrame (roughly 2,000,000 rows) in Python?
P.S. I know there is a package named rpy2 that can run R in a subprocess, so the quantile normalization could be done in R. However, R does not compute the correct result when I use the data set below:
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05
Edit:
What I want:
Given the data shown above, how can I apply quantile normalization following the steps described at https://en.wikipedia.org/wiki/Quantile_normalization?
I found a piece of Python code that claims to compute the quantile normalization:
import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
preprocessCore = importr('preprocessCore')
matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array( Rnormalized_matrix)
The code works fine with the sample data used in the code; however, when I tested it with the data given above, the result was wrong.
Since rpy2 provides an interface for running R in a Python subprocess, I tested it again directly in R, and the result was still wrong. I therefore think the method in R itself is wrong.
Answer 1:
Using the example dataset from the Wikipedia article:
import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
df
Out:
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8
For each rank, the mean value can be calculated with the following:
rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
rank_mean
Out:
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64
The resulting Series, rank_mean, can then be used as a mapping from the ranks to the normalized values:
df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out:
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
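For a frame with millions of rows, the same rank-mean idea can also be expressed directly in NumPy. The following is a minimal sketch, not the answer author's code: the function name quantile_normalize_columns is hypothetical, it assumes there are no NaN values, and it breaks ties by order of appearance, so tied values receive different normalized values (unlike the method='min' mapping above).

```python
import numpy as np

def quantile_normalize_columns(values):
    """Quantile-normalize each column of a 2-D array (hypothetical helper).

    Assumes no NaNs; ties are broken by order of appearance.
    """
    values = np.asarray(values, dtype=float)
    # order[k, j] is the row index of the k-th smallest value in column j
    order = np.argsort(values, axis=0, kind="stable")
    sorted_values = np.take_along_axis(values, order, axis=0)
    # mean of the k-th smallest values across all columns
    rank_mean = sorted_values.mean(axis=1)
    # inverting the sort permutation gives the zero-based rank of each entry
    ranks = np.argsort(order, axis=0, kind="stable")
    return rank_mean[ranks]

# Wikipedia example: rows A-D, columns C1-C3
example = np.array([[5, 4, 3],
                    [2, 1, 4],
                    [3, 4, 6],
                    [4, 2, 8]])
normalized = quantile_normalize_columns(example)
```

On a DataFrame this could be applied as pd.DataFrame(quantile_normalize_columns(df.to_numpy()), index=df.index, columns=df.columns).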
Answer 2:
OK, I implemented the method myself, and it is relatively efficient.
In hindsight the logic seems fairly simple, but I decided to post it here anyway for anyone who feels as confused as I was when I couldn't find any available code by searching.
The code is on GitHub: Quantile Normalize
Answer 3:
It is possibly more robust to use the median of each row rather than the mean (based on code from Shawn. L):
import numpy as np
import pandas as pd

def quantileNormalize(df_input):
    df = df_input.copy()
    # sort each column independently (NaNs first)
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    # reference distribution: median (instead of mean) across each sorted row
    # rank = sorted_df.mean(axis=1).tolist()
    rank = sorted_df.median(axis=1).tolist()
    for col in df:
        # compute the percentile rank in (0, 1] for each score in the column
        t = df[col].rank(pct=True, method='max').values
        # replace each score with the value of the reference distribution
        # at that percentile; NaNs stay NaN
        df[col] = [np.nanpercentile(rank, i * 100) if not np.isnan(i) else np.nan for i in t]
    return df
Answer 4:
The code below gives the same result as preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the solutions above. Performance should also be good even for very large arrays.
import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """
    target_sorted = np.sort(target)
    return target_sorted[x.argsort().argsort()]
Once you have a pandas.DataFrame, it is easy to do:
quantile_normalize_using_target(df[0].to_numpy(),
                                df[1].to_numpy())
(This normalizes the first column to the second one, which serves as the reference distribution in the example above.)
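The double argsort is the step that does the work here: argsort once gives the sorting order, and argsort of that order gives the zero-based rank of each element. A small worked example, with made-up values chosen only for illustration, shows how the ranks pick values out of the sorted target:

```python
import numpy as np

# made-up illustrative data, not taken from the question
x = np.array([0.3, 0.1, 0.2])
target = np.array([10.0, 30.0, 20.0])

order = x.argsort()       # indices that would sort x: [1, 2, 0]
ranks = order.argsort()   # zero-based rank of each element of x: [2, 0, 1]

# each element of x is replaced by the target value of the same rank
normalized = np.sort(target)[ranks]   # [30., 10., 20.]
```

The largest element of x (0.3) receives the largest target value (30), and so on. Note that tied values in x receive distinct ranks with this approach.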
Answer 5:
I am new to pandas and late to the question, but I think this answer might also be of use. It builds on the great answer from @ayhan:
def quantile_normalize(dataframe, cols, pandas=pd):
    # copy dataframe and only use the columns with numerical values
    df = dataframe.copy().filter(items=cols)
    # columns from the original dataframe not specified in cols
    non_numeric = dataframe.filter(items=list(filter(lambda col: col not in cols, list(dataframe))))
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
    result = pandas.concat([norm, non_numeric], axis=1)
    return result
The main difference here is that this is closer to some real-world applications. Often you just have matrices of numerical data, in which case the original answer is sufficient.
Sometimes you have text-based data in there as well. This function lets you specify the columns cols of your numerical data and runs quantile normalization on those columns only. At the end it merges the non-numeric (or not-to-be-normalized) columns from your original data frame back in.
E.g. if you added some 'meta-data' (char) to the wiki example:
df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index=['a', 'b', 'c', 'd'])
you can then call
quantile_normalize(df, ['rep1', 'rep2', 'rep3'])
to get
       rep1      rep2      rep3    char
a  5.666667  4.666667  2.000000  gene_a
b  2.000000  2.000000  3.000000  gene_b
c  3.000000  4.666667  4.666667  gene_c
d  4.666667  3.000000  5.666667  gene_d
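If listing the numeric columns by hand is inconvenient, they could perhaps be detected with select_dtypes. The sketch below is an assumption-laden variant rather than the answer's function: it treats every numeric column as one to normalize, assumes no NaN values, and computes the rank means with NumPy instead of stack/groupby, while keeping the method='min' tie handling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index=['a', 'b', 'c', 'd'])

# assumption: every numeric column should be normalized
numeric_cols = df.select_dtypes(include='number').columns.tolist()
numeric = df[numeric_cols]

# mean of the k-th smallest values across the numeric columns
rank_mean = np.sort(numeric.to_numpy(dtype=float), axis=0).mean(axis=1)
# zero-based ranks; ties share the lowest rank, as with method='min' above
ranks = numeric.rank(method='min').astype(int).to_numpy() - 1

normalized = pd.DataFrame(rank_mean[ranks], index=df.index, columns=numeric_cols)
# put the untouched (non-numeric) columns back next to the normalized ones
result = pd.concat([normalized, df.drop(columns=numeric_cols)], axis=1)
```

For the example frame this reproduces the table shown above; whether automatic detection is appropriate depends on whether the data contains numeric columns (IDs, for instance) that should be left alone.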