Assembly program crashes on call or exit

Debugging my code in VS2015, I get to the end of the program. The registers are as they should be, however, on call ExitProcess, or any variation of that, causes an "Access violation writing location 0x00000004." I am utilizing Irvine32.inc from Kip Irvine's book. I have tried using call DumpRegs, but that too throws the error.

I have tried using other variations of call ExitProcess, such as exit and invoke ExitProcess,0 which did not work either, throwing the same error. Before, when I used the same format, the code worked fine. The only difference between this code and the last one is utilizing the general purpose registers.

include Irvine32.inc

.data
        ;ary        dword   100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
        ary     dword   -24, 1, -5, 30, 35, 81, 94, 143, 0

.code
main PROC
                                ;ESI will be used for the array
                                ;EDI will be used for the array value
                                ;ESP will be used for the array counting
                                ;EAX will be used for the accumulating sum
                                ;EBX will be used for the average
                                ;ECX will be used for the remainder of avg
                                ;EBP will be used for calculating remaining sum
        mov     eax,0           ;Set EAX register to 0
        mov     ebx,0           ;Set EBX register to 0
        mov     esp,0           ;Set ESP register to 0
        mov     esi,OFFSET ary  ;Set ESI register to array
sum:    mov     edi,[esi]       ;Set value to array value
        cmp     edi,0           ;Check value to temination value 0
        je      finsum          ;If equal, jump to finsum
        add     esp,1           ;Add 1 to array count
        add     eax,edi         ;Add value to sum
        add     esi,4           ;Increment to next address in array
        jmp     sum             ;Loop back to sum array
finsum: mov     ebp,eax         ;Set remaining sum to the sum
        cmp     ebp,0           ;Compare rem sum to 0
        je      finavg          ;Jump to finavg if sum is 0
        cmp     ebp,esp         ;Check sum to array count
        jl      finavg          ;Jump to finavg if sum is less than array count
avg:    add     ebx,1           ;Add to average
        sub     ebp,esp         ;Subtract array count from rem sum
        cmp     ebp,esp         ;Compare rem sum to array count
        jge     avg             ;Jump to avg if rem sum is >= to ary count
finavg: mov     ecx,ebp         ;Set rem sum to remainder of avg

        call ExitProcess
main ENDP
END MAIN

Registers before call ExitProcess

EAX = 00000163 EBX = 0000002C ECX = 00000003 EDX = 00401055
ESI = 004068C0 EDI = 00000000 EIP = 0040366B ESP = 00000008
EBP = 00000003 EFL = 00000293 

OV = 0 UP = 0 EI = 1 PL = 1 ZR = 0 AC = 1 PE = 0 CY = 1

mov esp,0 sets the stack pointer to 0. Any stack instructions like push/pop or call/ret will crash after you do that.

Pick a different register for your array-count temporary, not the stack pointer! You have 7 other choices, looks like you still have EDX unused.

In the normal calling convention, only EAX, ECX, and EDX are call-clobbered (so you can use them without preserving the caller's value). But you're calling ExitProcess instead of returning from main, so you can destroy all the registers. But ESP has to be valid when you call.

call works by pushing a return address onto the stack, like sub esp,4 / mov [esp], next_instruction / jmp ExitProcess. See https://www.felixcloutier.com/x86/CALL.html. As your register-dump shows, ESP=8 before the call, which is why it's trying to store to absolute address 4.

Your code has 2 sections: looping over the array and then finding the average. You can reuse a register for different things in the 2 sections, often vastly reducing register pressure. (i.e. you don't run out of registers.)

Using implicit-length arrays (terminated by a sentinel element like 0) is unusual outside of strings. It's much more common to pass a function a pointer + length, instead of just a pointer.

But anyway, you have an implicit-length array so you have to find its length and remember that when calculating the average. Instead of incrementing a size counter inside the loop, you can calculate it from the pointer you're also incrementing. (Or use the counter as an array index like ary[ecx*4], but pointer-increments are often more efficient.)

Here's what an efficient (scalar) implementation might look like. (With SSE2 for SIMD you could add 4 elements with one instruction...)

It only uses 3 registers total. I could have used ECX instead of ESI (so main could ret without having destroyed any of the registers the caller expected it to preserve, only EAX, ECX, and EDX), but I kept ESI for consistency with your version.

.data
        ;ary        dword   100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
        ary     dword   -24, 1, -5, 30, 35, 81, 94, 143, 0

.code
main PROC
;; inputs: static ary of signed dword integers
;; outputs: EAX = array average, EDX = remainder of sum/size
;;          ESI = array count (in elements)
;; clobbers: none (other than the outputs)

                                ; EAX = sum accumulator
                                ; ESI = array pointer
                                ; EDX = array element temporary

        xor     eax, eax        ; sum = 0
        mov     esi, OFFSET ary ; incrementing a pointer is usually efficient, vs. ary[ecx*4] inside a loop or something.  So this is good.
sumloop:                       ; do {
        mov     edx, [esi]
        add     edx, 4
        add     eax, edx        ; sum += *p++  without checking for 0, because + 0 is a no-op

        test    edx, edx        ; sets FLAGS the same as cmp edx,0
        jnz     sumloop         ; }while(array element != 0);
;;; fall through if the element is 0.
;;; esi points to one past the terminator, i.e. two past the last real element we want to count for the average

        sub     esi, OFFSET ary + 4  ; (end+4) - (start+4) = array size in bytes
        shr     esi, 2          ; esi = array length = (end-start)/element_size

        cdq                     ; sign-extend sum into EDX:EAX as an input for idiv
        idiv     esi            ; EAX = sum/length   EDX = sum%length

        call ExitProcess
main ENDP

I used x86's hardware division instruction, instead of a subtraction loop. Your repeated-subtraction loop looked pretty complicated, but manual signed division can be tricky. I don't see where you're handling the possibility of the sum being negative. If your array had a negative sum, repeated subtraction would make it grow until it overflowed. Or in your case, you're breaking out of the loop if sum < count, which will be true on the first iteration for a negative sum.

Note that comments like Set EAX register to 0 are useless. We already know that from reading mov eax,0. sum = 0 describes the semantic meaning, not the architectural effect. There are some tricky x86 instructions where it does make sense to comment about what it even does in this specific case, but mov isn't one of them.

If you just wanted to do repeated subtraction with the assumption that sum is non-negative to start with, it's as simple as this:

;; UNSIGNED division  (or signed with non-negative dividend and positive divisor)
; Inputs: sum(dividend) in EAX,  count(divisor) in ECX
; Outputs: quotient in EDX, remainder in EAX  (reverse of the DIV instruction)
    xor    edx, edx                 ; quotient counter = 0
    cmp    eax, ecx
    jb     subloop_end              ; the quotient = 0 case
repeat_subtraction:                 ; do {
    inc    edx                      ;   quotient++
    sub    eax, ecx                 ;   dividend -= divisor
    cmp    eax, ecx
    jae    repeat_subtraction       ; while( dividend >= divisor );
     ; fall through when eax < ecx (unsigned), leaving EAX = remainder
subloop_end:

Notice how checking for special cases before entering the loop lets us simplify it. See also Why are loops always compiled into "do...while" style (tail jump)?

sub eax, ecx and cmp eax, ecx in the same loop seems redundant: we could just use sub to set flags, and correct for the overshoot.

    xor    edx, edx                 ; quotient counter = 0
    cmp    eax, ecx
    jb     division_done            ; the quotient = 0 case
repeat_subtraction:                 ; do {
    inc    edx                      ;   quotient++
    sub    eax, ecx                 ;   dividend -= divisor
    jnc    repeat_subtraction       ; while( dividend -= divisor doesn't wrap (carry) );

    add    eax, ecx                 ; correct for the overshoot
    dec    edx
division_done:

(But this isn't actually faster in most cases on most modern x86 CPUs; they can run the inc, cmp, and sub in parallel even if the inputs weren't the same. This would maybe help on AMD Bulldozer-family where the integer cores are pretty narrow.)

Obviously repeated subtraction is total garbage for performance with large numbers. It is possible to implement better algorithms, like one-bit-at-a-time long-division, but the idiv instruction is going to be faster for anything except the case where you know the quotient is 0 or 1, so it takes at most 1 subtraction. (div/idiv is pretty slow compared to any other integer operation, but the dedicated hardware is much faster than looping.)

If you do need to implement signed division manually, normally you record the signs, take the unsigned absolute value, then do unsigned division.

e.g. xor eax, ecx / sets dl gives you dl=0 if EAX and ECX had the same sign, or 1 if they were different (and thus the quotient will be negative). (SF is set according to the sign bit of the result, and XOR produces 1 for different inputs, 0 for same inputs.)