R and Python Give Different Results (Median, IQR,

Posted 2020-05-09 02:55

Question:

I am doing feature scaling on my data, and R and Python are giving me different answers for the scaling. They also disagree on many of the underlying summary statistics:

Median: NumPy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation = 'midpoint'). The built-in statistics package in Python gives the same answer with the following code: statistics.median(X[:, 0]). On the other hand, R gives 14.9632 with this code: median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.

A similar difference occurs when computing the mean of the same data. R gives 13.10936 using the built-in mean() function, while both NumPy and the Python statistics package give 13.097945407088607.

Again, the same thing happens when computing the standard deviation. R gives 7.390328 and NumPy (with ddof = 1) gives 7.3927612774052083. With ddof = 0, NumPy gives 7.3927565984408936.

The IQR also differs. Using the built-in IQR() function in R, the result is 12.3468. Using NumPy with this code: np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), the result is 12.358700000000002.

What is going on here? Why are Python and R always giving different results? It may help to know that my data has 795066 rows and is being treated as an np.array() in Python. The same data is being treated as a matrix in R.

Answer 1:

tl;dr there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board and even in relatively simple computations such as the median, I think the problem is more likely that the values are getting truncated/modified/losing precision somehow in the transfer between platforms.

(This is more of an extended comment than an answer, but it was getting awkwardly long.)

  • you're unlikely to get much farther without a reproducible example; there are various ways to create examples to test hypotheses for the differences, but it's better if you do so yourself rather than making answerers do it.

  • how are you transferring data to/from Python/R? Is there some rounding in the representation used in the transfer? (What do you get for max/min, which should be based on a single number with no floating-point computations? How about if you drop one value to get an odd-length vector and take the median?)

  • medians: I was originally going to say that this could be a function of different ways to define quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than general quantiles, so I'm not sure. The differences you're reporting above seem way too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).

  • IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R.

  • median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.

  • mean/sd: there are some possible subtleties in the algorithm here -- for example, R uses extended precision internally to reduce instability; I don't know whether Python does or not. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:

 > x <- rnorm(1000000,mean=0,sd=1)
 > mean(x)
 [1] 0.001386724
 > sum(x)/length(x)
 [1] 0.001386724
 > mean(x)-sum(x)/length(x)
 [1] -1.734723e-18
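For what it's worth, a rough Python counterpart of the R check above might look like the sketch below. It uses only the standard library. The seed and the variable names are my own choices, not anything from the question, and math.fsum is used here as a carefully rounded reference sum.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed (an assumption), only so the run is repeatable
x = [random.gauss(0.0, 1.0) for _ in range(1_000_000)]

# Built-in sum: how it accumulates floats depends on the Python version.
naive_mean = sum(x) / len(x)
# math.fsum tracks the rounding error of the partial sums.
exact_mean = math.fsum(x) / len(x)
# The standard library's float mean.
stats_mean = statistics.fmean(x)

print(naive_mean - exact_mean)
print(stats_mean - exact_mean)
```

If the differences printed here are many orders of magnitude smaller than the gap between 13.10936 and 13.097945, that would again point away from the summation algorithm and toward the data themselves.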

Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
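Going back to the transfer and quantile points in the list above, here is a minimal sketch of how one might test them on the Python side. The helper names (transfer_checks, iqr) and the sample values are made up for illustration. As far as I can tell, the "inclusive" method of statistics.quantiles matches R's default quantile type 7 (and NumPy's default linear interpolation), while "exclusive" matches R's type 6.

```python
import statistics


def transfer_checks(values):
    """Summaries that involve little or no floating-point arithmetic."""
    ordered = sorted(values)
    n = len(ordered)
    # With an odd count the median is a stored value, so no averaging occurs.
    odd_length = ordered if n % 2 else ordered[:-1]
    return {
        "n": n,
        "min": ordered[0],
        "max": ordered[-1],
        "odd_median": statistics.median(odd_length),
    }


def iqr(values, method):
    """Interquartile range under one of the stdlib quartile definitions."""
    q1, _, q3 = statistics.quantiles(values, n=4, method=method)
    return q3 - q1


sample = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]  # made-up data, not from the question

print(transfer_checks(sample))
print(iqr(sample, "inclusive"))
print(iqr(sample, "exclusive"))
```

Running transfer_checks on the real column in Python and comparing it with length(), min(), max() and an odd-length median() in R should show quickly whether both sides are even looking at the same numbers. If they are, the two iqr() results give a feel for how much the choice of quantile definition alone can move the answer.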