Vectorizing the multivariate normal CDF (cumulativ

2019-04-10 13:08发布

问题:

How can I vectorize the multivariate normal CDF (cumulative density function) in Python?

When looking at this post, I found out that there is a Fortran implementation of the multivariate CDF that was "ported" over to Python. This means I can easily evaluate the CDF for one specific case.

However, I'm having a lot of trouble efficiently applying this function to multiple entries.

Specifically speaking, the function I need to "vectorize" takes 4 arguments:

  • the lower bounds of integration(vector)
  • the upper bounds of integration (vector)
  • the means of the normal random variables (vector)
  • the covariance matrix of the normal random variables (matrix)

But I'm trying to efficiently evaluate this function over a list of 1000+ elements MANY times.

Here is some code to illustrate my problem. In the example below I'm just using random data to illustrate my point.

import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF

np.random.seed(666)

iters = 1000 # number of times the whole dataset will be evaluated
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution

lower = np.random.rand(obs,dim)
upper = lower + np.random.rand(obs,dim)
means = np.random.rand(obs,dim)

# Creates a symmetric matrix - used for the random covariance matrices
def make_sym_matrix(dim,vals):
    m = np.zeros([dim,dim])
    xs,ys = np.triu_indices(dim,k=1)
    m[xs,ys] = vals[:-dim]
    m[ys,xs] = vals[:-dim]
    m[ np.diag_indices(dim) ] = vals[-dim:]
    return m

# Generating the random covariance matrices
covs = []
for i in range(obs):
    cov_vals = np.random.rand((dim^2 - dim)/2+dim)
    cov_mtx = make_sym_matrix(dim,cov_vals)
    covs.append(cov_mtx)
covs = np.array(covs)

# Here is where the trouble starts.
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j],upper[j],means[j],covs[j])
        results.append(this_p)
time_end = time.time()

print(time_end-time_start)

Here I have a dataset with 1500 observations which I'm evaluating 1000 times. In my machine, this takes 6.74399995804 seconds to calculate.

Note that I'm not trying to get rid of the outer for-loop (over i). I just created that to mimic my real problem. The for-loop that I'm really trying to eliminate is the internal one (over j).

The execution time could be vastly reduced if I found a way to efficiently evaluate the CDF over the whole dataset.

I know that the mvnun function was originally written in Fortran (original code here) and "ported" to Python using f2pye as can be seen here.

Can anyone help me out with this? I've started looking at theano, but it seems that the only option I have there is to use the scan function, which might not be much of an improvement either.

Thank you!!!