Performance of NumPy with different BLAS implementations

I'm running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm involves solving a set of linear systems (i.e. a call to numpy.linalg.solve()). I came up with this small benchmark:

import numpy as np
import time

# Create two large random matrices
a = np.random.randn(5000, 5000)
b = np.random.randn(5000, 5000)

t1 = time.time()
# That's the expensive call:
np.linalg.solve(a, b)
print(time.time() - t1)

I've been running this on:

My laptop, a late 2013 MacBook Pro 15" with 4 cores at 2GHz (sysctl -n machdep.cpu.brand_string gives me Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz)
An Amazon EC2 c3.xlarge instance, with 4 vCPUs. Amazon advertises them as "High Frequency Intel Xeon E5-2680 v2 (Ivy Bridge) Processors"

Bottom line:

On the Mac it runs in ~4.5 seconds
On the EC2 instance it runs in ~19.5 seconds

I have tried it also on other OpenBLAS / Intel MKL based setups, and the runtime is always comparable to what I get on the EC2 instance (modulo the hardware config.)

Can anyone explain why the performance on the Mac (with the Accelerate framework) is more than 4x better? More details about the NumPy / BLAS setup on each machine are provided below.

Laptop setup

numpy.show_config() gives me:

atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

EC2 instance setup

On Ubuntu 14.04, I installed OpenBLAS with

sudo apt-get install libopenblas-base libopenblas-dev

When installing NumPy, I created a site.cfg with the following contents:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

numpy.show_config() gives me:

atlas_threads_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

The reason for this behavior could be that Accelerate uses multithreading, while the others don't.

Most BLAS implementations honor the environment variable OMP_NUM_THREADS to determine how many threads to use. I believe they use only 1 thread unless explicitly told otherwise. Accelerate's man page, however, makes it sound as if threading is turned on by default; it can be turned off by setting the environment variable VECLIB_MAXIMUM_THREADS.

To determine if this is really what's happening, try

export VECLIB_MAXIMUM_THREADS=1

before calling the Accelerate version, and

export OMP_NUM_THREADS=4

for the other versions.
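
If you would rather run the comparison from Python than from the shell, something like the following might do. It is only a sketch: the helper names (env_with_threads, time_solve, compare_thread_settings) are made up for illustration, the matrix size is reduced to keep runs short, and it sets both variables at once on the assumption that each library simply ignores the one it does not know. The benchmark runs in a child process so that the variables are already in place when NumPy is loaded.

```python
import os
import subprocess
import sys

# Benchmark body run in a child process, so the thread-count variables are
# already in the environment before NumPy (and its BLAS) is loaded.
BENCHMARK = """
import time
import numpy as np
a = np.random.randn({n}, {n})
b = np.random.randn({n}, {n})
t0 = time.time()
np.linalg.solve(a, b)
print(time.time() - t0)
"""

# The two variables named in this answer.
THREAD_VARS = ("OMP_NUM_THREADS", "VECLIB_MAXIMUM_THREADS")


def env_with_threads(num_threads, base_env=None):
    """Return a copy of the environment with both thread variables set."""
    env = dict(os.environ if base_env is None else base_env)
    for name in THREAD_VARS:
        env[name] = str(num_threads)
    return env


def time_solve(num_threads, n=2000):
    """Run the benchmark in a fresh interpreter and return the elapsed seconds."""
    out = subprocess.check_output(
        [sys.executable, "-c", BENCHMARK.format(n=n)],
        env=env_with_threads(num_threads),
    )
    return float(out.decode().strip())


def compare_thread_settings(thread_counts=(1, 4), n=2000):
    """Print one timing per thread count."""
    for count in thread_counts:
        print("%d thread(s): %.2f s" % (count, time_solve(count, n)))
```

Calling compare_thread_settings() on each machine prints one timing per setting. If the threading explanation is right, the 1-thread Accelerate run should move toward the EC2 number, and the 4-thread OpenBLAS run toward the Mac number.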

Regardless of whether this is really the reason, it's a good idea to always set these variables when you use BLAS, so that you control what is going on.
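
One way to do that from inside a script is sketched below. The function name pin_blas_threads is hypothetical, and the sketch assumes the libraries read these variables when they are first loaded, so it has to run before the first import of NumPy. It uses setdefault so that a value already exported in the shell takes precedence.

```python
import os

# The two variables named in this answer.
THREAD_VARS = ("OMP_NUM_THREADS", "VECLIB_MAXIMUM_THREADS")


def pin_blas_threads(num_threads, environ=os.environ):
    """Set each thread variable unless it already has a value.

    Returns a dict with the value that is in effect for each variable.
    """
    in_effect = {}
    for name in THREAD_VARS:
        # setdefault keeps any value the user exported in the shell.
        in_effect[name] = environ.setdefault(name, str(num_threads))
    return in_effect
```

A script would call pin_blas_threads(4) at the very top, then import numpy, and could print the returned dict alongside its timings so that every benchmark result records the thread settings it ran with.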