Improve performance of a for loop in Python (possi

I want to improve the performance of the for loop in this function.

import numpy as np
import random

def play_game(row, n=1000000):
    """Play the game! This game is a kind of random walk.

    Arguments:
        row (int[]): row index to use in the p matrix for each step in the
                     walk. The length of this array is the same as n.

        n (int): number of steps in the random walk
    """
    p = np.array([[ 0.499,  0.499,  0.499],
                  [ 0.099,  0.749,  0.749]])
    X0 = 100
    Y0 = X0 % 3
    X = np.zeros(n)
    tempX = X0
    Y = Y0

    for j in range(n):
        tempX = X[j] = tempX + 2 * (random.random() < p.item(row.item(j), Y)) - 1
        Y = tempX % 3

    return np.r_[X0, X]

The difficulty lies in the fact that the value of Y is computed at each step based on the value of X and that Y is then used in the next step to update the value for X.

I wonder if there is some numpy trick that could make a big difference. Using Numba is fair game (I tried it but without much success). However, I do not want to use Cython.

A quick observation tells us that there is a data dependency between iterations in the function. There are different kinds of data dependencies; the one here is an indexing dependency, that is, the data selected at any iteration depends on the calculations of the previous iteration. This dependency seemed difficult to trace between iterations, so this post isn't really a vectorized solution. Rather, we will try to pre-compute as many of the values used within the loop as possible. The basic idea is to do the minimum amount of work inside the loop.

Here's a brief explanation of how we can proceed with the pre-calculations and thus arrive at a more efficient solution:

Given the relatively small shape of p, from which rows are selected based on the input row, you can pre-select all those rows from p with p[row].
For each iteration, you are calculating a random number. You can replace this with a random array that is set up before the loop, so those random values are precalculated as well.
Based on the values precalculated so far, you can compute the step direction (+1 or -1) and the next column index for every row of p[row] and every possible current column. Note that these column indices form a large ndarray containing all possible column indices, and inside the loop only one of them is chosen per iteration. Using the per-iteration column index, you increment or decrement the running value (which starts at X0) to get the per-iteration output.
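To make the shapes in these three steps concrete, here is a minimal sketch on a tiny input. The four-step `row` and `randarr` values are made up purely for illustration, and `probs` is just a helper name for `p[row]`.

```python
import numpy as np

# Hypothetical tiny inputs, chosen only to keep the arrays readable.
p = np.array([[0.499, 0.419, 0.639],
              [0.099, 0.749, 0.319]])
row = np.array([0, 1, 1, 0])                   # one row index of p per step
randarr = np.array([0.45, 0.20, 0.90, 0.60])   # one random draw per step

# Step 1: pre-select the rows. Shape (n, 3): one probability per possible Y.
probs = p[row]

# Step 2: compare each precomputed draw with all three candidates at once.
# The result is +1 or -1 for every (step, Y) pair.
signvals = 2 * (randarr[:, None] < probs) - 1

# Step 3: the next Y for every (step, current Y) pair.
col_idx = (signvals + np.arange(3)) % 3

print(probs.shape)   # (4, 3)
print(signvals)
print(col_idx)
```

Only one entry per row of `signvals` and `col_idx` is actually used at run time, namely the one in the column given by the current Y.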

The implementation would look like this -

randarr = np.random.rand(n)
p = np.array([[ 0.499,  0.419,  0.639],
              [ 0.099,  0.749,  0.319]])

def play_game_partvect(row,n,randarr,p):

    X0 = 100
    Y0 = X0 % 3

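    # Step direction (+1 or -1) for every step and every possible Y, shape (n, 3)
    signvals = 2*(randarr[:,None] < p[row]) - 1
    # Next value of Y for every step and every possible current Y
    col_idx = (signvals + np.arange(3)) % 3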

    Y = Y0
    currval = X0
    out = np.empty(n+1)
    out[0] = X0
    for j in range(n):
        currval = currval + signvals[j,Y]
        out[j+1] = currval
        Y = col_idx[j,Y]

    return out
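The lookup `col_idx[j,Y]` can replace `currval % 3` because `(currval + s) % 3` equals `((currval % 3) + s) % 3`, so the next Y depends only on the current Y and the step direction. Here is a small sanity check of that claim; the seed and the size `n = 1000` are arbitrary choices, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed, for repeatability only
n = 1000
p = np.array([[0.499, 0.419, 0.639],
              [0.099, 0.749, 0.319]])
row = rng.integers(0, 2, n)
randarr = rng.random(n)

signvals = 2 * (randarr[:, None] < p[row]) - 1
col_idx = (signvals + np.arange(3)) % 3

currval, Y = 100, 100 % 3
mismatches = 0
for j in range(n):
    currval += signvals[j, Y]
    next_Y = col_idx[j, Y]
    # The table lookup should agree with recomputing Y from the running value.
    mismatches += int(next_Y != currval % 3)
    Y = next_Y

print(mismatches)   # 0
```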

For verification against the original code, you would have the original code modified like so -

def play_game(row,n,randarr,p):
    X0 = 100
    Y0 = X0 % 3
    X = np.zeros(n)
    tempX = X0
    Y = Y0
    for j in range(n):
        tempX = X[j] = tempX + 2 * (randarr[j] < p.item(row.item(j), Y)) - 1
        Y = tempX % 3
    return np.r_[X0, X]

Please note that since this modified version precomputes the random values, it should already give you a good speedup over the code in the question.

Runtime tests and output verification -

In [2]: # Inputs
   ...: n = 1000
   ...: row = np.random.randint(0,2,(n))
   ...: randarr = np.random.rand(n)
   ...: p = np.array([[ 0.499,  0.419,  0.639],
   ...:               [ 0.099,  0.749,  0.319]])
   ...: 

In [3]: np.allclose(play_game_partvect(row,n,randarr,p),play_game(row,n,randarr,p))
Out[3]: True

In [4]: %timeit play_game(row,n,randarr,p)
100 loops, best of 3: 11.6 ms per loop

In [5]: %timeit play_game_partvect(row,n,randarr,p)
1000 loops, best of 3: 1.51 ms per loop

In [6]: # Inputs
   ...: n = 10000
   ...: row = np.random.randint(0,2,(n))
   ...: randarr = np.random.rand(n)
   ...: p = np.array([[ 0.499,  0.419,  0.639],
   ...:               [ 0.099,  0.749,  0.319]])
   ...: 

In [7]: np.allclose(play_game_partvect(row,n,randarr,p),play_game(row,n,randarr,p))
Out[7]: True

In [8]: %timeit play_game(row,n,randarr,p)
10 loops, best of 3: 116 ms per loop

In [9]: %timeit play_game_partvect(row,n,randarr,p)
100 loops, best of 3: 14.8 ms per loop

Thus, we are seeing a speedup of roughly 7.7x to 7.8x in both cases, not bad!
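Since the question allows Numba, one possible next step is to compile only the remaining sequential loop and keep the precomputation in NumPy. The following is a sketch under assumptions, not something timed for this post: `walk` and `play_game_jit` are hypothetical names, Numba is assumed to be installed, and the `try`/`except` fallback simply runs the same loop as plain Python when it is not.

```python
import numpy as np

try:
    from numba import njit            # optional dependency, assumed available
except ImportError:
    def njit(func):                   # fallback: leave the function uncompiled
        return func

@njit
def walk(signvals, col_idx, x0):
    """Run only the sequential part of the walk over the precomputed tables."""
    n = signvals.shape[0]
    out = np.empty(n + 1)
    out[0] = x0
    cur = x0
    y = x0 % 3
    for j in range(n):
        cur += signvals[j, y]
        out[j + 1] = cur
        y = col_idx[j, y]
    return out

def play_game_jit(row, randarr, p, x0=100):
    # Same precomputation as in play_game_partvect.
    signvals = 2 * (randarr[:, None] < p[row]) - 1
    col_idx = (signvals + np.arange(3)) % 3
    return walk(signvals, col_idx, x0)

# Example call with arbitrary inputs.
rng = np.random.default_rng(0)
n = 10000
row = rng.integers(0, 2, n)
randarr = rng.random(n)
p = np.array([[0.499, 0.419, 0.639],
              [0.099, 0.749, 0.319]])
result = play_game_jit(row, randarr, p)
print(result[:5])
```

If you try this, time it on a second call, because the first call includes the compilation cost.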