Performance of numpy array masking in Cython

Published 2019-05-18 19:47

Question:

As a follow-up to this question here (thanks MSeifert for your help), I ran into the problem that I have to mask a numpy array new_values with an index array new_vals_idx before passing the masked array on to update val_dict.

I tried to apply the array masking to the solutions proposed in MSeifert's answer to the old post, but the performance is not satisfactory.
The arrays and dicts I used for the following examples are:

import numpy as np
val_dict = {'a': 5.0, 'b': 18.8, 'c': -55/2}
for i in range(200):
    val_dict[str(i)] = i
    val_dict[i] = i**2

keys = ('b', 123, '89', 'c')  # dict keys to update
new_values = np.arange(1, 51, 1) / 1.0  # array with new values which has to be masked
new_vals_idx = np.array((0, 3, 5, -1))  # masking array
valarr = np.zeros((new_vals_idx.shape[0]))  # preallocation for masked array
length = new_vals_idx.shape[0]

To make my code snippets easier to compare with my old question, I'll stick to the function naming of MSeifert's answer. These are my attempts to get the best performance out of Python/Cython (the other answers were left out because their performance was too poor):

def old_for(val_dict, keys, new_values, new_vals_idx, length):
    for i in range(length):
        val_dict[keys[i]] = new_values[new_vals_idx[i]]
%timeit old_for(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.6 µs per loop

def old_for_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length):
    valarr = new_values[new_vals_idx]
    for i in range(length):
        val_dict[keys[i]] = valarr[i]
%timeit old_for_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length)
# 100000 loops, best of 3: 2.33 µs per loop

def new2_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length):
    valarr = new_values[new_vals_idx].tolist()
    for key, val in zip(keys, valarr):
        val_dict[key] = val
%timeit new2_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length)
# 100000 loops, best of 3: 2.01 µs per loop

Cython functions:

%load_ext cython
%%cython
import numpy as np
cimport numpy as np
cpdef new3_cy(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef Py_ssize_t i
    cdef double val  # this gives about 10 µs speed boost compared to directly assigning it to val_dict
    for i in range(length):
        val = new_values[new_vals_idx[i]]
        val_dict[keys[i]] = val
%timeit new3_cy(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.38 µs per loop

cpdef new3_cy_mview(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef Py_ssize_t i
    cdef int[:] mview_idx = new_vals_idx
    cdef double [:] mview_vals = new_values
    for i in range(length):
        val_dict[keys[i]] = mview_vals[mview_idx[i]]
%timeit new3_cy_mview(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.38 µs per loop

# NOT WORKING:
cpdef new2_cy_mview(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef double [new_vals_idx] masked_vals = new_values
    for key, val in zip(keys, masked_vals.tolist()):
        val_dict[key] = val

cpdef new2_cy_mask(dict val_dict, tuple keys, double[:] new_values, valarr, int[:] new_vals_idx, Py_ssize_t length):
    valarr = new_values[new_vals_idx]
    for key, val in zip(keys, valarr.tolist()):
        val_dict[key] = val

The Cython functions new3_cy and new3_cy_mview do not seem to be considerably faster than old_for. Passing valarr to avoid array construction inside the function (as it is going to be called several million times) even seems to slow it down.
Masking in new2_cy_mask with the new_vals_idx array in Cython gives me the error: 'Invalid index for memoryview specified, type int[:]'. Is there any type like Py_ssize_t for arrays of indexes?
Trying to create a masked memoryview in new2_cy_mview gives me the error 'Cannot assign type 'double[:]' to 'double [__pyx_v_new_vals_idx]''. Is there even something like masked memoryviews? I wasn't able to find information on this topic...
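Just to illustrate what I am looking for: since typed memoryviews apparently cannot be indexed with another array, the only workaround I can think of is an explicit gather loop into a preallocated buffer. This is only a minimal sketch of the idea; the name gather_cy and the Py_ssize_t[:] index type are my own guesses, not something taken from the old answer:

%%cython
cpdef gather_cy(double[:] new_values, Py_ssize_t[:] new_vals_idx, double[:] out):
    # gather new_values[new_vals_idx] element by element into the
    # preallocated buffer 'out' instead of using fancy indexing
    cdef Py_ssize_t i
    for i in range(new_vals_idx.shape[0]):
        out[i] = new_values[new_vals_idx[i]]

# Py_ssize_t[:] expects a matching integer dtype, so the index array
# may need a cast to np.intp first:
gather_cy(new_values, new_vals_idx.astype(np.intp), valarr)

The point is only to avoid creating a temporary array at all; new3_cy above already does the same thing fused with the dict update, so this is just meant to show what I mean by "masking without fancy indexing".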

Comparing the timing results with those from my old question, I guess that the array masking is the step taking up most of the time. And since it is most likely already highly optimized in numpy, there is probably not much to be done about it. But the slow-down is so large that there must (hopefully) be a better way to do it.
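One way to check this guess (just how I would measure it, no results included here) is to time the fancy-indexing step on its own and compare it with the full functions above:

# time only the masking / fancy-indexing step in isolation
%timeit new_values[new_vals_idx]
# and the four scalar lookups it replaces, for comparison
%timeit [new_values[i] for i in new_vals_idx]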
Any help is appreciated! Thanks in advance!

Answer 1:

One thing you can do with the current construction is turn off bounds checking (if it's safe to!). It won't make a huge difference, but it gives some incremental performance.

%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef new4_cy(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
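    # note: wraparound(False) is only safe as long as new_vals_idx never
    # contains negative indices; the example new_vals_idx above ends with -1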
    cdef Py_ssize_t i
    cdef double val  # this gives about 10 µs speed boost compared to directly assigning it to val_dict
    for i in range(length):
        val = new_values[new_vals_idx[i]]
        val_dict[keys[i]] = val

In [36]: %timeit new3_cy(val_dict, keys, new_values, new_vals_idx, length)
1.76 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [37]: %timeit new4_cy(val_dict, keys, new_values, new_vals_idx, length)
1.45 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)