Two values can't be printed correctly (python3

There is an array, I'll do some calculation with it in GPU.

Before my calculation, I should get the subsets of this array.

When I print the subsets, find two values are not right.

The code is as follows:

import os,sys,time
import pandas as pd
import numpy as np

from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = (3,1)  
tpb = (2,2)  

@cuda.jit
def calcu_TE(D,TE):
    gw = cuda.gridDim.x

    bx = cuda.blockIdx.x

    tx = cuda.threadIdx.x
    bw = cuda.blockDim.x
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y

    c_num = D.shape[0]
    #print(c_num)
    c_index = bx
    while c_index<c_num*c_num:
        c_x = int(c_index/c_num)
        c_y = c_index%c_num
        if c_x==c_y:
            TE[0] = 0.0
        else:
            X = D[c_x,:]
            Y = D[c_y,:]
            if bx==1 :
                print('c_index,bx,tx,ty,X: ',c_index,bx,tx,ty,'  ',X[0],X[1],X[2],X[3],X[4],X[5],X[6],X[7],X[8],X[9])
                print('c_index,bx,tx,ty,Y: ',c_index,bx,tx,ty,'  ',Y[0],Y[1],Y[2],Y[3],Y[4],Y[5],Y[6],Y[7],Y[8],Y[9])
            #print('c_index,bx,tx,ty,Y: ',c_index,bx,tx,ty,Y[0],Y[1],Y[2],Y[3],Y[4],Y[5],Y[6],Y[7],Y[8],Y[9])
            h = tx
            if h==0:
                Xi = X[1:]
                Xi1 = X[:-1]
                Yi = Y[1:]
                if bx==1 :
                    print('bx,tx,ty: ',bx,tx,ty,'\n Xi',Xi[0],Xi[1],Xi[2],Xi[3],Xi[4],Xi[5],Xi[6],Xi[7],Xi[8],
                          '\n Xi1',Xi1[0],Xi1[1],Xi1[2],Xi1[3],Xi1[4],Xi1[5],Xi1[6],Xi1[7],Xi1[8],
                          '\n Yi',Yi[0],Yi[1],Yi[2],Yi[3],Yi[4],Yi[5],Yi[6],Yi[7],Yi[8])
        c_index +=gw


D = np.array([[ 0.42487645,0.41607881,0.42027071,0.43751907,0.43512794,0.43656972,0.43940639,0.43864551,0.43447691,0.43120232],
              [2.989578,2.834707,2.942902,3.294948,2.868170,2.975180,3.066900,2.712719,2.835360,2.607334]], dtype=np.float32)
TE = np.empty([1,1])
print('D: ',D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dTE = cuda.to_device(TE, stream)
    calcu_TE[bpg, tpb, stream](dD,dTE)

The output is:

D:  [[ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
   0.43940639  0.43864551  0.43447691  0.43120232]
 [ 2.98957801  2.83470702  2.94290209  3.2949481   2.86817002  2.97517991
   3.06690001  2.71271896  2.83536005  2.6073339 ]]
c_index,bx,tx,ty,X:  1 1 0 0    0.424876 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477 0.431202
c_index,bx,tx,ty,X:  1 1 1 0    0.424876 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477 0.431202
c_index,bx,tx,ty,X:  1 1 0 1    0.424876 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477 0.431202
c_index,bx,tx,ty,X:  1 1 1 1    0.424876 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477 0.431202
c_index,bx,tx,ty,Y:  1 1 0 0    2.989578 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 2.835360 2.607334
c_index,bx,tx,ty,Y:  1 1 1 0    2.989578 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 2.835360 2.607334
c_index,bx,tx,ty,Y:  1 1 0 1    2.989578 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 2.835360 2.607334
c_index,bx,tx,ty,Y:  1 1 1 1    2.989578 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 2.835360 2.607334

bx,tx,ty:  1 0 0
 Xi 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477 0.431202
 Xi1 0.424876 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477
 Yi 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 0.000000 18949972373983835000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000
bx,tx,ty:  1 0 1
 Xi 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477 0.431202
 Xi1 0.424876 0.416079 0.420271 0.437519 0.435128 0.436570 0.439406 0.438646 0.434477
 Yi 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 0.000000 18949972373983835000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000

It's so strange.

Yi should be Yi 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 2.835360 2.607334.

But it was printed Yi 2.834707 2.942902 3.294948 2.868170 2.975180 3.066900 2.712719 0.000000 18949972373983835000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000.

There are two values wrong.

I do't know why it happened. Is there anything I ignored?

This would appear to be a problem with the way the Numba compiler is producing code for the very long print statement in your kernel, and nothing to do with the correctness of your kernel. If you change the code like this (i.e. make the print statement shorter):

@cuda.jit
def calcu_TE(D,TE):
    gw = cuda.gridDim.x

    bx = cuda.blockIdx.x

    tx = cuda.threadIdx.x
    bw = cuda.blockDim.x
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y

    c_num = D.shape[0]
    c_index = bx
    while c_index<c_num*c_num:
        c_x = int(c_index/c_num)
        c_y = c_index%c_num
        if c_x==c_y:
            TE[0] = 0.0
        else:
            X = D[c_x,:]
            Y = D[c_y,:]
            if bx==1 :
                print('c_index,bx,tx,ty,X: ',c_index,bx,tx,ty,'  ',X[0],X[1],X[2],X[3],X[4],X[5],X[6],X[7],X[8],X[9])
                print('c_index,bx,tx,ty,Y: ',c_index,bx,tx,ty,'  ',Y[0],Y[1],Y[2],Y[3],Y[4],Y[5],Y[6],Y[7],Y[8],Y[9])
            h = tx
            if h==0:
                Xi = X[1:]
                Xi1 = X[:-1]
                Yi = Y[1:]
                if bx==1 :
                    print('bx,tx,ty,Yi:',bx,tx,ty,'  ',Yi[0],Yi[1],Yi[2],Yi[3],Yi[4],Yi[5],Yi[6],Yi[7],Yi[8])
        c_index +=gw

You should find that Yi is printed correctly. In general, relying on print statements to instrument kernels in CUDA is a rather poor idea, and often you will only confuse yourself by doing so, as in this case.