Multicore in NASM Windows: threads execute randomly

Posted 2019-09-21 20:39

Question:

I have code in NASM (64 bit) in Windows to run four simultaneous threads (each assigned to a separate core) on a four-core Windows x86-64 machine.

The threads are created in a loop. After the threads are created, the code calls WaitForMultipleObjects to coordinate them. The thread function is Test_Function (see code below).

Each thread (core) executes Test_Function across a large array. The first core starts at data element zero, the second core starts at 1, the third core starts at 2, the fourth core starts at 3, and each core increments by four (e.g., 0, 4, 8, 12).

In Test_Function I created a small test program that writes one of the input data values to the location corresponding to its startbyte, to verify that I have successfully created four threads and they return the correct data.

Each thread should write the stride value (32), but the test shows that the four fields are filled in randomly, with some fields showing as zero. If I repeat the test multiple times, I see there is no consistency to which fields will have the value 32 (the others always show as 0). That could be a side effect of WaitForMultipleObjects, but I haven't seen anything in the docs to confirm that.

Also, WaitForMultipleObjects waits on the ThreadHandles returned by CreateThread; when I examine the ThreadHandles array, it always shows like this: 268444374, 32, 1652, 1584. Only the first element looks like the size of a handle, the others do not look like handle values.

One possibility is that the two parameters passed on the stack may not be in the correct locations:

mov rax,0
mov [rsp+40],rax            ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax            ; ThreadID

According to the docs, the thread ID argument (lpThreadId) should be a pointer rather than the value of ThreadCount. When I change the line to mov rax,ThreadCount (the pointer value), the program crashes. When I change it to:

mov rax,0
mov [rsp+32],rax            ; use default creation flags
mov rax,ThreadCount
mov [rsp+40],rax            ; ThreadID

now it reliably processes the first thread, but not threads 2-4.

So the bottom line is the threads are being created but they execute randomly, with some threads not executing at all, in no particular order. When I change the CreateThread parameters (as shown above) the first thread executes, but not threads 2-4.

Here is the test code showing the relevant parts. If a reproducible example is needed, I can prepare one.

Thanks for any ideas.

Init_Cores_fn:
; EACH OF THE CORES CALLS Test_Function AND EXECUTES THE WHOLE PROGRAM.  
; WE PASS THE STARTING BYTE (0, 8, 16, 24) AND THE "STRIDE" = NUMBER OF CORES.  
; ON RETURN, WE SYNCHRONIZE ANY DATA.  ON ENTRY TO EACH CORE, SET THE REGISTERS

; Populate the ThreadInfo array with vars to pass
; ThreadInfo: length, startbyte, stride, vars into registers on entry to each core
mov rdi,ThreadInfo
mov rax,ThreadInfoLength
mov [rdi],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Register Vars
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10

mov rbp,rsp ; preserve caller's stack frame
sub rsp,56 ; Shadow space

; _____

label_0:

mov rdi,ThreadInfo
mov rax,[FirstByte]
mov [rdi+8],rax ; 0, 8, 16, or 24

; _____
; Create Threads

mov rcx,0               ; lpThreadAttributes (Security Attributes)
mov rdx,0               ; dwStackSize
mov r8,Test_Function        ; lpStartAddress (function pointer)
mov r9,ThreadInfo       ; lpParameter (array of data passed to each core)

mov rax,0
mov [rsp+40],rax            ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax            ; ThreadID

call CreateThread

; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[FirstByte]
mov [rdi+rcx],rax

mov rax,[FirstByte]
add rax,8
mov [FirstByte],rax

mov rax,[ThreadCount]
add rax,1
mov [ThreadCount],rax

mov rbx,4
cmp rax,rbx
jl label_0

; _____
; Wait

mov rcx,rax         ; number of handles
mov rdx,ThreadHandles       ; pointer to handles array
mov r8,1                ; wait for all threads to complete
mov r9,1000         ; milliseconds to wait

call WaitForMultipleObjects

; _____

;[ Code HERE to do cleanup if needed after the four threads finish ]

mov rsp,rbp
jmp label_900

; __________________
; The function for all threads to call

Test_Function:

; Populate registers
mov rdi,rcx
mov rax,[rdi]
mov r15,[rdi+24]
mov rax,[rdi+8] ; start byte
mov r13,[rdi+40]
mov r12,[rdi+48]
mov r10,[rdi+56]
xor r11,r11
xor r9,r9
pxor xmm15,xmm15
pxor xmm15,xmm14
pxor xmm15,xmm13

; Now test it - BUT the first thread does not write data
mov rcx,[rdi+8] ; start byte
mov rax,[rdi+16] ; stride
cvtsi2sd xmm0,rax
movsd [r15+rcx],xmm0
ret

Answer 1:

I solved this problem, and here is the solution. Raymond Chen alluded to it in the comments above before urging me to use a higher-level language, but I didn't understand it until today. I am posting it as an answer so that it is easy to find for anyone who hits the same problem in assembly language (or any other language), because Raymond's comment (which I just upvoted) is now buried among the other comments.

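The problem was the ThreadInfo array, which is passed as the fourth parameter to CreateThread (lpParameter, in r9 under the Windows x64 calling convention). Each core must have its own separate copy of ThreadInfo. In my original code all four threads received a pointer to the same array, and the creation loop overwrote the StartByte element (at rdi+8) on every iteration, so a thread could read a StartByte value that was meant for a later thread. In my application, the data in ThreadInfo are all the same except for StartByte, so I created a separate ThreadInfo array for each core (ThreadInfo1, 2, 3, and 4) and pass each thread a pointer to its own array.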

I implemented it in my application as a call to the following dup function but it could be implemented other ways as well:

DupThreadInfo:
mov rdi,ThreadInfo2
mov rax,8
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
; _____

mov rdi,ThreadInfo3
mov rax,0
mov [rdi],rax       ; length (number of vars into registers plus 3 elements)
mov rax,16
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10

mov rdi,ThreadInfo4
mov rax,0
mov [rdi],rax       ; length (number of vars into registers plus 3 elements)
mov rax,24
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
ret
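
For more than a handful of cores, the near-identical blocks in DupThreadInfo could probably be generated in a loop instead. The following is only a minimal sketch of that idea, not the code from my application: it copies a template ThreadInfo into one private block per thread and then patches StartByte. The names ThreadInfos, Results, INFO_QWORDS and Worker are hypothetical, Results stands in for the array that r15 points to, and Worker writes the stride as an integer rather than as a double to keep the example short. It assumes assembling with nasm -f win64 and linking against kernel32.

```nasm
; Sketch only: per-thread copies of ThreadInfo built in a loop.
; ThreadInfos, Results, INFO_QWORDS and Worker are hypothetical names.
default rel
global main
extern CreateThread
extern WaitForMultipleObjects
extern ExitProcess

NUM_THREADS equ 4
INFO_QWORDS equ 8                       ; length, startbyte, stride, 5 vars
INFO_SIZE   equ INFO_QWORDS * 8

section .bss
ThreadInfo    resq INFO_QWORDS                ; template, filled once
ThreadInfos   resq NUM_THREADS * INFO_QWORDS  ; one private copy per thread
ThreadHandles resq NUM_THREADS
Results       resq NUM_THREADS                ; stand-in for the data array

section .text
main:
    push rbx                    ; rbx, rsi, rdi are callee-saved on Win64
    push rsi
    push rdi
    sub  rsp, 48                ; 32 shadow + 2 stack args, rsp stays 16-aligned

    ; Fill the template once (StartByte is patched per thread below)
    mov  qword [ThreadInfo], INFO_QWORDS        ; length
    mov  qword [ThreadInfo+16], NUM_THREADS*8   ; stride (32 here)
    lea  rax, [Results]
    mov  [ThreadInfo+24], rax                   ; output base pointer

    xor  ebx, ebx               ; rbx = thread index
.create:
    imul rax, rbx, INFO_SIZE
    lea  rdi, [ThreadInfos]
    add  rdi, rax               ; rdi = this thread's private block
    mov  r9, rdi                ; lpParameter (4th argument)
    lea  rsi, [ThreadInfo]
    mov  ecx, INFO_QWORDS
    rep  movsq                  ; copy the template into the private block
    mov  rax, rbx
    shl  rax, 3
    mov  [r9+8], rax            ; StartByte = index * 8 (0, 8, 16, 24)

    xor  ecx, ecx               ; lpThreadAttributes = NULL
    xor  edx, edx               ; dwStackSize = 0 (default)
    lea  r8, [Worker]           ; lpStartAddress
    mov  qword [rsp+32], 0      ; 5th argument: dwCreationFlags = 0
    mov  qword [rsp+40], 0      ; 6th argument: lpThreadId = NULL
    call CreateThread
    lea  rdx, [ThreadHandles]
    mov  [rdx+rbx*8], rax       ; save the handle

    inc  rbx
    cmp  rbx, NUM_THREADS
    jb   .create

    mov  ecx, NUM_THREADS       ; nCount
    lea  rdx, [ThreadHandles]   ; lpHandles
    mov  r8d, 1                 ; bWaitAll = TRUE
    mov  r9d, 0xFFFFFFFF        ; dwMilliseconds = INFINITE (assumption; the question used 1000)
    call WaitForMultipleObjects

    xor  ecx, ecx
    call ExitProcess            ; does not return

Worker:                         ; rcx = pointer to this thread's own block
    mov  rax, [rcx+8]           ; StartByte
    mov  rdx, [rcx+16]          ; stride
    mov  r8,  [rcx+24]          ; output base pointer
    mov  [r8+rax], rdx          ; write stride into this thread's slot
    xor  eax, eax               ; thread exit code 0
    ret
```

Because each thread receives the address of a block that the loop never touches again, it no longer matters when the thread actually starts running relative to the loop.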

Because all data in the ThreadInfo arrays are the same except the second element, a more efficient way to do this would be to pass a 2-element array where the first element is the StartByte and the second element is a pointer to the static ThreadInfo array. That's especially important when we are working with more than four cores because the DupThreadInfo section would be needlessly long. That solution would avoid a call, but I haven't implemented that yet.
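
As a rough illustration of that two-element idea (again, I have not implemented it in my application, so treat this as an untested sketch under stated assumptions): each thread gets a 16-byte block holding its StartByte and a pointer to the single shared ThreadInfo, which is only read after the threads start. ThreadArgs, Results and Worker are hypothetical names, Results stands in for the array that r15 points to, and the value is written as an integer for brevity. It assumes nasm -f win64 and linking against kernel32.

```nasm
; Sketch only: per-thread {StartByte, pointer to shared ThreadInfo} pairs.
; ThreadArgs, Results and Worker are hypothetical names.
default rel
global main
extern CreateThread
extern WaitForMultipleObjects
extern ExitProcess

NUM_THREADS equ 4

section .bss
ThreadInfo    resq 8                  ; shared, read-only once threads start
ThreadArgs    resq NUM_THREADS * 2    ; one {StartByte, pointer} pair per thread
ThreadHandles resq NUM_THREADS
Results       resq NUM_THREADS        ; stand-in for the data array

section .text
main:
    push rbx                    ; rbx is callee-saved on Win64
    sub  rsp, 48                ; 32 shadow + 2 stack args, rsp stays 16-aligned

    ; Shared data, written once before any thread exists
    mov  qword [ThreadInfo+16], NUM_THREADS*8   ; stride (32 here)
    lea  rax, [Results]
    mov  [ThreadInfo+24], rax                   ; output base pointer

    xor  ebx, ebx               ; rbx = thread index
.create:
    mov  rax, rbx
    shl  rax, 4                 ; 16 bytes per pair
    lea  r9, [ThreadArgs]
    add  r9, rax                ; lpParameter = this thread's pair
    mov  rax, rbx
    shl  rax, 3
    mov  [r9], rax              ; element 0: StartByte = index * 8
    lea  rax, [ThreadInfo]
    mov  [r9+8], rax            ; element 1: pointer to shared ThreadInfo

    xor  ecx, ecx               ; lpThreadAttributes = NULL
    xor  edx, edx               ; dwStackSize = 0 (default)
    lea  r8, [Worker]           ; lpStartAddress
    mov  qword [rsp+32], 0      ; 5th argument: dwCreationFlags = 0
    mov  qword [rsp+40], 0      ; 6th argument: lpThreadId = NULL
    call CreateThread
    lea  rdx, [ThreadHandles]
    mov  [rdx+rbx*8], rax       ; save the handle

    inc  rbx
    cmp  rbx, NUM_THREADS
    jb   .create

    mov  ecx, NUM_THREADS       ; nCount
    lea  rdx, [ThreadHandles]   ; lpHandles
    mov  r8d, 1                 ; bWaitAll = TRUE
    mov  r9d, 0xFFFFFFFF        ; dwMilliseconds = INFINITE (assumption)
    call WaitForMultipleObjects

    xor  ecx, ecx
    call ExitProcess            ; does not return

Worker:                         ; rcx = pointer to this thread's pair
    mov  rax, [rcx]             ; StartByte
    mov  rdx, [rcx+8]           ; pointer to the shared ThreadInfo
    mov  r8,  [rdx+16]          ; stride
    mov  r9,  [rdx+24]          ; output base pointer
    mov  [r9+rax], r8           ; write stride into this thread's slot
    xor  eax, eax               ; thread exit code 0
    ret
```

The per-thread storage here is 16 bytes regardless of how many shared variables ThreadInfo carries, at the cost of one extra pointer dereference in the thread function.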