Iterating through a 2D array in PyCUDA

Posted 2020-05-10 17:56

Question:

I am trying to iterate through a 2D array in PyCUDA, but I end up with repeated array values. A small random integer array works as expected, but when I pass an image to the same kernel I see the same values printed over and over again.

Here is my code:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

img = np.random.randint(20, size = (4,5))
print "Input array"
print img
img_size=img.shape
print img_size

#nbytes determines the number of bytes for the numpy array
img_gpu = cuda.mem_alloc(img.nbytes)
#Copies the memory from CPU to GPU
cuda.memcpy_htod(img_gpu, img)


mod = SourceModule("""
#include <stdio.h>
__global__ void AHE(int *a, int row, int col)
{
int i = threadIdx.x+ blockIdx.x* blockDim.x;
int j = threadIdx.y+ blockIdx.y* blockDim.y;
if(i==0 && j ==0)
printf("Output array ");
if(i <row && j < col)
{
    printf(" %d",a[j + i*col]);
}
}
""")

col = np.int32(img.shape[-1])
row = np.int32(img.shape[0])
func = mod.get_function("AHE")
func(img_gpu, row, col, block=(32,32,1))
img_ahe = np.empty_like(img)
cuda.memcpy_dtoh(img_ahe, img_gpu)

Now, when I replace the random integer array with an image converted to a numpy array, I end up with this:

img = cv2.imread('Chest.jpg',0)
img_size=img.shape
print img_size

#nbytes determines the number of bytes for the numpy array
img_gpu = cuda.mem_alloc(img.nbytes)
#Copies the memory from CPU to GPU
cuda.memcpy_htod(img_gpu, img)

mod = SourceModule("""
#include <stdio.h>
__global__ void AHE(int *a, int row, int col)
{
int i = threadIdx.x+ blockIdx.x* blockDim.x;
int j = threadIdx.y+ blockIdx.y* blockDim.y;
if(i==0 && j ==0)
printf("Output array ");
if(i <row && j < col)
{
    printf(" %d",a[j + i*col]);
}
}
""")
#Gives you the number of columns
col = np.int32(img.shape[-1])
row = np.int32(img.shape[0])
func = mod.get_function("AHE")
func(img_gpu, row, col, block=(32,32,1))
img_ahe = np.empty_like(img)
cuda.memcpy_dtoh(img_ahe, img_gpu)

Answer 1:

The problem here is that the image you are loading doesn't have pixel values stored as signed integers. This modification of your example works more as expected:

import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
import cv2 

import pycuda.autoinit

img = cv2.imread('Chest.jpg',0)
img_size=img.shape
print img_size
print img.dtype

#nbytes determines the number of bytes for the numpy array
img_gpu = cuda.mem_alloc(img.nbytes)
#Copies the memory from CPU to GPU
cuda.memcpy_htod(img_gpu, img)

mod = SourceModule("""
#include <stdio.h>
__global__ void AHE(unsigned char *a, int row, int col)
{
int i = threadIdx.x+ blockIdx.x* blockDim.x;
int j = threadIdx.y+ blockIdx.y* blockDim.y;
if(i==0 && j ==0)
printf("Output array ");
if(i <row && j < col)
{
    int val = int(a[j + i*col]);
    printf(" %d", val);
}
}
""")
#Gives you the number of columns
col = np.int32(img.shape[-1])
row = np.int32(img.shape[0])
func = mod.get_function("AHE")
func(img_gpu, row, col, block=(32,32,1))
img_ahe = np.empty_like(img)
cuda.memcpy_dtoh(img_ahe, img_gpu)

When run, the code emits this:

$ python image.py 
(681, 1024)
uint8
Output array  244 244 244 244 244 244 244 244 244 244 244 244 244 244 244 244 244 244 245 245 245 246 246 246 246 246 246 246 246 246 246 246 244 244 244 244 244 244 244 244 245 245 245 245 245 245 245 245 244 244 245 245 245 246 246 246 

[Output clipped for brevity]

Note the dtype of the image: uint8. Your code is attempting to treat that stream of unsigned 8-bit values as 32-bit integers. Technically this should produce a runtime error on a full image, because the kernel reads past the end of the image when it consumes 4 bytes per pixel instead of 1. However, you don't see this because you only launch a single block, and your input image is presumably at least four times larger than the 32 x 32 block you run.
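As a minimal sketch only (the grid computation below is my addition, not part of the original answer), covering the whole image would mean sizing a grid from the image dimensions, given that this kernel maps the x thread index to rows and the y thread index to columns:

block = (32, 32, 1)
#Round the image dimensions up so that every pixel falls inside some block
grid = (int((int(row) + block[0] - 1) // block[0]),
        int((int(col) + block[1] - 1) // block[1]),
        1)
func(img_gpu, row, col, block=block, grid=grid)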

Incidentally, PyCUDA is extremely good at managing and enforcing type safety for CUDA calls, but your code neatly defeats every mechanism by which PyCUDA could detect a type mismatch in the kernel call. PyCUDA includes an excellent GPUArray class, and you should familiarise yourself with it. Had you used a GPUArray instance here, you would have got a type-mismatch runtime error pointing you at the exact source of the problem the first time you tried to run the code.
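For illustration, a minimal sketch of the GPUArray route, assuming the corrected kernel and the func, row and col objects from the answer code above; the dtype and shape travel with the device array:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import cv2

img = cv2.imread('Chest.jpg',0)        #uint8 greyscale image
img_gpu = gpuarray.to_gpu(img)         #dtype and shape are kept with the device array
print img_gpu.dtype, img_gpu.shape     #e.g. uint8 (681, 1024)

#A GPUArray can be passed to the kernel call in place of the raw mem_alloc
#allocation, and copied back with .get() instead of memcpy_dtoh
func(img_gpu, row, col, block=(32,32,1))
result = img_gpu.get()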